---
title: "Outcome models with ebalance weights"
author: "Jens Hainmueller"
date: "`r format(Sys.Date(), '%B %Y')`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Outcome models with ebalance weights}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5,
  dpi = 96
)
set.seed(20260505)
```

This vignette shows how to plug `ebalance()` weights into common
outcome models. The pattern is almost always the same:

1. Build the weights with `ebalance()`.
2. Attach them to the data via `weights(fit)`.
3. Pass them to a downstream regression with `weights = w`.
4. Use a heteroskedasticity-robust variance estimator for inference.

```{r setup}
library(ebal)
```

## A simulated example

```{r data}
set.seed(20260505)
n <- 1000
X <- data.frame(
  x1 = rnorm(n),
  x2 = rbinom(n, 1, 0.4),
  x3 = rnorm(n)
)
# Selection on x1, x2: treatment probability rises with x1 and is higher when x2 = 1
ps <- plogis(0.6 * X$x1 + 1.2 * X$x2 - 0.5)
treat <- rbinom(n, 1, ps)
# True ATT = +2; outcome depends on x1 and x2 too
y <- 1 + 0.7 * X$x1 + 0.5 * X$x2 + 2 * treat + rnorm(n, sd = 1)
df <- data.frame(treat = treat, y = y, X)

# Naive ATT estimate: raw difference in means
mean(df$y[df$treat == 1]) - mean(df$y[df$treat == 0])
```

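The naive difference in means is biased upward: treated units have
higher `x1` and `x2` on average, and both raise `y`. The raw covariate
means show the imbalance: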

```{r raw-balance}
rbind(
  treated = colMeans(X[treat == 1, ]),
  control = colMeans(X[treat == 0, ])
)
```
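
Raw means are hard to compare across covariates on different scales.
A minimal sketch of a standardized mean difference, scaled by the
treated-group standard deviation (a common choice for the ATT). The
helper `smd()` and the toy vectors are hypothetical and not part of
`ebal`:

```{r smd-sketch}
# Hypothetical helper (not part of ebal): difference in group means,
# divided by the standard deviation of the treated group.
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

# Toy data, independent of the simulation above
toy_smd_treat <- c(1, 1, 1, 1, 0, 0, 0, 0)
toy_smd_x     <- c(2, 4, 6, 8, 1, 3, 5, 7)
toy_smd       <- smd(toy_smd_x, toy_smd_treat)
toy_smd
```

On the simulated data, `sapply(X, smd, treat = treat)` would return one
value per covariate.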

## Fit

```{r fit}
fit <- ebalance(treat ~ x1 + x2 + x3, data = df)
df$w <- weights(fit)
```
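
A quick way to confirm that weights balance the covariates is to
compare treated means with weighted control means. The sketch below
uses base R's `weighted.mean()`; `balance_table()` is a hypothetical
helper, and the toy weights are hand-picked rather than produced by
`ebalance()`:

```{r balance-sketch}
# Hypothetical helper (not part of ebal): weighted covariate means by group.
balance_table <- function(X, treat, w) {
  rbind(
    treated = sapply(X, function(x) weighted.mean(x[treat == 1], w[treat == 1])),
    control_weighted = sapply(X, function(x) weighted.mean(x[treat == 0], w[treat == 0]))
  )
}

# Toy data: the control weights are chosen so that both rows agree
toy_bal_X <- data.frame(a = c(1, 1, 1, 0, 0, 0, 1, 1))
toy_bal_t <- c(1, 1, 1, 1, 0, 0, 0, 0)
toy_bal_w <- c(1, 1, 1, 1, 0.5, 0.5, 1.5, 1.5)
toy_bal   <- balance_table(toy_bal_X, toy_bal_t, toy_bal_w)
toy_bal
```

With the fit above, `balance_table(df[, c("x1", "x2", "x3")], df$treat, df$w)`
should show near-identical rows if the balancing converged.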

## Weighted `lm()` with robust SEs

The point estimate comes straight from a weighted regression:

```{r weighted-lm}
mod <- lm(y ~ treat, data = df, weights = w)
coef(mod)
```
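
With only an intercept and a binary treatment on the right-hand side,
the weighted least-squares coefficient on `treat` is the difference in
weighted group means. A small check on made-up data (the data frame
`toy_wm` and its weights are illustrative, not `ebalance()` output):

```{r wmean-sketch}
toy_wm <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0),
  y     = c(5, 6, 7, 1, 2, 6),
  w     = c(1, 1, 1, 0.5, 1.5, 1)
)

# Coefficient from the weighted regression
toy_wm_coef <- unname(coef(lm(y ~ treat, data = toy_wm, weights = w))["treat"])

# Difference in weighted means, computed directly
toy_wm_diff <- with(
  toy_wm,
  weighted.mean(y[treat == 1], w[treat == 1]) -
    weighted.mean(y[treat == 0], w[treat == 0])
)

all.equal(toy_wm_coef, toy_wm_diff)
```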

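Default `lm()` standard errors are not valid here: `lm()` treats
`weights` as known precision (inverse-variance) weights, not as
balancing weights. Use a heteroskedasticity-robust estimator such as
`sandwich::vcovHC()` (or `sandwich::vcovCL()` if you have a clustering
variable):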

```{r robust-se, eval = requireNamespace("sandwich", quietly = TRUE) && requireNamespace("lmtest", quietly = TRUE)}
library(sandwich)
library(lmtest)
coeftest(mod, vcov = vcovHC(mod, type = "HC1"))
```
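
For readers who want to see what the robust estimator computes, here is
a base-R sketch of the HC1 sandwich for a weighted `lm()` fit with
strictly positive weights. The helper `hc1_wls()` is hypothetical and
the toy data are made up. The result should agree with
`sandwich::vcovHC(model, type = "HC1")` up to numerical error, but treat
it as an illustration rather than a replacement:

```{r hc1-sketch}
# Hypothetical helper (not part of ebal or sandwich).
hc1_wls <- function(model) {
  Xm <- model.matrix(model)
  w  <- weights(model)               # prior weights passed to lm()
  e  <- residuals(model)             # response residuals, y - fitted
  n  <- nrow(Xm)
  k  <- ncol(Xm)
  bread <- solve(crossprod(Xm, w * Xm))  # (X'WX)^{-1}
  meat  <- crossprod(Xm * (w * e))       # X' diag(w^2 e^2) X
  (n / (n - k)) * bread %*% meat %*% bread
}

toy_hc <- data.frame(
  treat = c(1, 1, 1, 1, 0, 0, 0, 0),
  y     = c(3.1, 2.4, 4.0, 3.3, 1.2, 0.7, 1.9, 1.1),
  w     = c(1, 1, 1, 1, 0.5, 0.5, 1.5, 1.5)
)
toy_hc_mod <- lm(y ~ treat, data = toy_hc, weights = w)
toy_hc_se  <- sqrt(diag(hc1_wls(toy_hc_mod)))
toy_hc_se
```

Like `vcovHC()`, this treats the weights as fixed.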

The `treat` coefficient should be near the true ATT of 2. The
ebal-balanced control group is a much better counterfactual than the
raw control group.

## Adding regression adjustment (doubly-robust)

Including covariates on the right-hand side gives a doubly-robust
estimator: the ATT coefficient is consistent if *either* the weighting
or the outcome model is correctly specified.

```{r dr}
mod_dr <- lm(y ~ treat + x1 + x2 + x3, data = df, weights = w)
coef(mod_dr)["treat"]
```

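Because entropy balancing equates the weighted control means of `x1`,
`x2`, and `x3` with the treated means, `treat` is orthogonal to these
covariates in the weighted sample, so this should match `mod`'s
coefficient up to the balancing tolerance. In real data, regression
adjustment is a useful hedge when balance is only approximate or the
outcome model includes terms that were not balanced.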
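
A toy check of how covariate adjustment interacts with exact mean
balance, using hand-picked weights rather than `ebalance()` output (the
data frame `toy_dr` and its values are made up for illustration). The
weighted control mean of `x` equals the treated mean, and the `treat`
coefficient is the same with and without `x`:

```{r orthogonality-sketch}
toy_dr <- data.frame(
  treat = c(1, 1, 1, 1, 0, 0, 0, 0),
  x     = c(1, 1, 1, 0, 0, 0, 1, 1),
  y     = c(3.1, 2.4, 4.0, 1.3, 1.2, 0.7, 1.9, 1.1),
  w     = c(1, 1, 1, 1, 0.5, 0.5, 1.5, 1.5)
)

# Weighted mean of x is 0.75 in both groups
with(toy_dr, c(
  treated = weighted.mean(x[treat == 1], w[treat == 1]),
  control = weighted.mean(x[treat == 0], w[treat == 0])
))

toy_dr_plain <- coef(lm(y ~ treat,     data = toy_dr, weights = w))["treat"]
toy_dr_adj   <- coef(lm(y ~ treat + x, data = toy_dr, weights = w))["treat"]
all.equal(toy_dr_plain, toy_dr_adj)
```

Adding a term that was not balanced (an interaction or a square, say)
can move the coefficient.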

## Survey-style inference

If you're already in the `survey` package world, ebalance weights
slot in as the `weights` argument of `svydesign()`:

```{r survey, eval = requireNamespace("survey", quietly = TRUE)}
library(survey)
des <- svydesign(ids = ~1, weights = ~w, data = df)
svymod <- svyglm(y ~ treat, design = des)
summary(svymod)$coefficients["treat", , drop = FALSE]
```

The survey-package SEs are also robust to the weighting; they
typically agree with `sandwich::vcovHC(..., "HC1")` to a few percent
on cross-sectional data.

## Trimming if weights blow up

Sometimes the entropy-balancing weights have a heavy right tail
(`max(w) / mean(w)` in the dozens). Compare the effective sample size
and maximum weight ratio before and after `ebalance.trim()`:

```{r trim}
library(generics)
glance(fit)[, c("ess_control", "max_weight_ratio_control")]

trimmed <- ebalance.trim(fit)        # automatic minimization
glance(trimmed)[, c("ess_control", "max_weight_ratio_control")]
```

`ebalance.trim()` returns an object with the same shape, so
`weights(trimmed)` and the downstream regression code are unchanged.
The trimmed fit relaxes balance slightly to keep the max weight
ratio low.
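
The two diagnostics are easy to compute by hand from any weight vector.
The sketch below assumes that the effective sample size is Kish's
$(\sum w)^2 / \sum w^2$ and that the weight ratio is `max(w) / mean(w)`
as described above. Whether `glance()` uses exactly these definitions
is an assumption, and both helpers are hypothetical:

```{r ess-sketch}
# Hypothetical helpers (not part of ebal)
kish_ess         <- function(w) sum(w)^2 / sum(w^2)
max_weight_ratio <- function(w) max(w) / mean(w)

toy_ess_w <- c(0.5, 0.5, 1.5, 1.5)
kish_ess(rep(1, 4))            # equal weights: ESS equals the sample size
kish_ess(toy_ess_w)            # unequal weights: ESS falls below 4
max_weight_ratio(toy_ess_w)
```

Applied to the control rows, for example `kish_ess(df$w[df$treat == 0])`,
this gives a quick sense of how much information the weighted control
group retains.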

## Choice of estimand for inference

Everything above runs under the default `estimand = "ATT"`. For ATE
or ATC, the only thing that changes is `weights(fit)` (it returns
nontrivial weights for *both* groups under ATE). The `lm()` /
`svyglm()` syntax is identical. See
`vignette("estimands", package = "ebal")` for the comparison.