set.seed(123456789)
N <- 200
a <- 2
b <- rnorm(N,mean=2,sd=3)
# this creates variation in the slope with an average
# effect of 2.
x0 <- rep(0,N) # creates a vector of zeros
x1 <- rep(1,N)
u <- rnorm(N)
y <- a + b*cbind(x0,x1) + u
# y is a matrix, [a + u, a + b + u]
# rep creates a vector by repeating the first number by the
# amount of the second number.
Bounds Estimation
Introduction
In the first three chapters we estimated, or attempted to estimate, a single value of interest. This chapter considers situations where we are either unable or unwilling to estimate a single value for the policy parameter of interest. Instead, the chapter considers cases where we are limited to estimating a range of values. We are interested in using the data to estimate the bounds on the policy parameter of interest.
It is standard practice in econometrics to present the average treatment effect (ATE). This estimand provides the policy maker with the average impact of the policy if everyone were to receive the policy. That is, if everyone changes from not attending college to attending college, the ATE predicts what would happen. I would give an example, but I can’t think of one. In general, policies do not work like this. Consider the policy of making public state colleges free. Such a policy would encourage more people to attend college, but a bunch of people already attend college and a bunch of people will not attend college even if it is free. What does the ATE tell us will happen to those who are newly encouraged to go to college? Not that much.
If attending college has the same effect on everyone, then the ATE provides useful information. If everyone has the same treatment effect, the average must be equal to the treatment effect. The difficulty arises when different people get different value from going to college. That is, the difficulty always arises.
This chapter considers two cases. In the first case, the data allows the ATE to be estimated, but we would prefer to know the distribution of the policy effect. In general, we cannot estimate this distribution. We can, however, bound it. These bounds are based on a conjecture of the great Soviet mathematician, Andrey Kolmogorov. The chapter explains how the Kolmogorov bounds work and when they provide the policy maker with useful information. These bounds are illustrated by analyzing a randomized controlled trial on the effect of “commitment savings” devices.
In the second case, the data does not allow the ATE to be estimated. Or more accurately, we are unwilling to make the non-credible assumptions necessary to estimate the ATE. The Northwestern econometrician, Charles Manski, argues that econometricians are too willing to present estimates based on non-credible assumptions. Manski shows that weaker but more credible assumptions often lead to a range of estimates. He suggests that presenting a range of estimates is better than providing precisely estimated nonsense. The chapter presents Manski’s natural bounds and discusses how assumptions can reduce the range of estimates of the policy effect. The chapter illustrates these ideas by estimating whether more guns reduce crime.
Potential Outcomes
You have been tasked by the AOC 2028 campaign to estimate the likely impact of a proposal to make state public universities tuition free.1 Your colleague is tasked with estimating how many more people will attend college once it is made free. You are to work out what happens to the incomes of those who choose to go to college now that it is free. You need to estimate the **treatment effect** of college.
Model of Potential Outcomes
Consider a simple version of the problem. There are two possible outcomes. There is the income the individual receives if they attend college (\(y_i(1)\)) and the income they would receive if they did not attend college (\(y_i(0)\)).
\[ y_{i}(x_i) = a + b_i x_i + \upsilon_i \tag{1}\]
where \(y_i\) is individual \(i\)’s income, \(x_{i} \in \{0, 1\}\) is whether or not individual \(i\) attends college and \(\upsilon_i\) represents some unobserved characteristic that also affects individual \(i\)’s income. The treatment effect is represented by \(b_i\) and this may vary across individuals.
We are interested in determining the treatment effect for each individual \(i\).
\[ b_i = y_i(1) - y_i(0) \tag{2}\]
This is the difference between the two possible outcomes for each individual.
Simulation of Impossible Data
Imagine you have access to the impossibly good data set created by the code at the start of this chapter (actually, just an impossible data set). The data provides information on the simulated individual’s outcome (\(y\)) for both treatments (\(x=0\) and \(x=1\)). This is equivalent to knowing an individual’s income for both the case where they went to college and the case where they did not go to college. These counterfactual outcomes are called potential outcomes (Rubin 1974).
par(mfrow=c(2,1)) # creates a simple panel plot
par(mar=c(2,4,0.5,0.5)) # adjusts margins between plots.
plot(density(y[,1]),type="l",lwd=4,xlim=range(y),
ylab="density",main="")
lines(density(y[,2]),lwd=2)
abline(v=colMeans(y),lwd=c(4,2))
legend("topright",c("No College","College"),lwd=3,lty=c(1,2))
plot(ecdf(y[,1]), xlim=range(y),main="",do.points=FALSE,
lwd=4,xlab="y")
lines(ecdf(y[,2]),lwd=2,do.points=FALSE)
# ecdf empirical cumulative distribution function.

Figure 1 presents the density functions, means and cumulative distribution functions of the two potential outcomes for the simulated data. The figure suggests that individuals generally have better outcomes when \(x=1\). Let \(y\) be income and \(x=1\) be college attendance. Do you think this is evidence that people earn more money because they attend college? The mean of the distribution of income for those attending college is much higher than the mean of the distribution of income for those not attending college. Assuming that this simulated data represented real data, should AOC 2028 use these results as evidence for making college free?
A concern is that the two distributions overlap. Moreover, the cumulative distribution functions cross. There may be individuals in the data who are actually better off if \(x=0\). The average college attendee earns more than the average non-college attendee, but some may earn less if they go to college. We can determine whether this occurs by looking at the joint distribution of potential outcomes. We will see that the crossing observed in Figure 1 implies that some individuals are better off if \(x=0\) while others are better off if \(x=1\).
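Because the impossible data contain both potential outcomes for every individual, we can look at the joint distribution directly. A minimal sketch, using the y matrix created at the start of the chapter:
# plot each individual's two potential outcomes against each other
plot(y[,1], y[,2], xlab="y(0): no college", ylab="y(1): college")
abline(a=0, b=1, lwd=2) # 45-degree line
# points below the line are individuals who earn less if they attend
mean(y[,2] - y[,1] < 0) # share with a negative treatment effect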
Distribution of the Treatment Effect
Equation 2 defines a treatment effect that may vary across individuals. If it does, then it has a distribution. Figure 2 presents the density and cumulative distribution function of the difference between the outcome if the individual attended college and the outcome if the individual did not. The distribution shows that the treatment effect varies across individuals. It is heterogeneous. Moreover, the effect of college may either increase or decrease income, depending on the individual.
par(mfrow=c(2,1)) # creates a simple panel plot
par(mar=c(2,4,0.5,0.5)) # adjusts margins between plots.
plot(density(y[,2]-y[,1]),type="l",lwd=3,main="")
abline(v=0,lty=2,lwd=3)
plot(ecdf(y[,2]-y[,1]),main="",do.points=FALSE,lwd=3,
xlab="y")
abline(v=0,lty=2,lwd=3)

Average Treatment Effect
The average treatment effect (ATE) holds a special position in econometrics and statistics. A possible reason is that it measures the average difference in potential outcomes. That’s actually pretty neat given that outside of our impossible data we cannot observe the difference in potential outcomes. How can we measure the average of something we cannot observe?
ATE and Its Derivation
# mean of the difference vs difference of the means
mean(y[,2]-y[,1]) == mean(y[,2]) - mean(y[,1])
[1] TRUE
The mean of the difference is equal to the difference of the means. We cannot observe the difference in the treatment outcomes. But, we can observe the outcomes of each treatment separately. We can observe the mean outcomes for each treatment. This neat bit of mathematics is possible because averages are linear operators.
We can write out the expected difference in potential outcomes by the Law of Total Expectations.
\[ \mathbb{E}(Y_1 - Y_0) = \int_{y_0} \int_{y_1} (y_1 - y_0) f(y_1 | y_0) f(y_0) d y_1 d y_0 \tag{3}\]
where \(Y_x\) refers to the outcome that occurs if the individual receives treatment \(x\). It is the potential outcome for \(x\).
The rest follows from manipulating the conditional expectations.
\[ \begin{array}{ll} \mathbb{E}(Y_1 - Y_0) & = \int_{y_0} \left(\int_{y_1} y_1 f(y_1 | y_0) d y_1 - y_0 \right) f(y_0) d y_0\\ & = \int_{y_0} \left(\int_{y_1} y_1 f(y_1 | y_0) d y_1 \right) f(y_0) d y_0 - \int_{y_0} y_0 f(y_0) d y_0\\ & = \int_{y_1} y_1 f(y_1) d y_1 - \int_{y_0} y_0 f(y_0) d y_0\\ & = \mathbb{E}(Y_1) - \mathbb{E}(Y_0) \end{array} \tag{4}\]
Rubin (1974) presents the derivation in Equation 4. He points out that if we can estimate each of the average potential outcomes then we have an estimate of the average treatment effect.
But can we estimate the average potential outcome?
ATE and Do Operators
To answer this question, it is clearer to switch notation. At the risk of upsetting the Gods of Statistics, I will mix notations from two different causal models. The expected potential outcome if \(X=1\) is assumed to be equal to the expected outcome conditional on \(\mbox{do}(X)=1\) (Pearl and Mackenzie 2018).
\[ \mathbb{E}(Y_1) = \mathbb{E}(Y | \mbox{do}(X) = 1) \tag{5}\]
By “do” we mean that this is the expected outcome if individuals in the data faced a policy which forced the treatment \(X=1\). We are holding all other effects constant when the policy change is made. It is “do” as in “do a policy.”
The notation highlights the fact that the expected potential outcome of a treatment may not be equal to expected outcomes in a particular treatment. In general, \(\mathbb{E}(Y | \mbox{do}(X)=1) \neq \mathbb{E}(Y | X = 1)\), where the second term is observed in the data. The second term is standard notation for the expected outcome among individuals observed in the data with the treatment equal to 1. This is do operator notation for “correlation does not imply causation.”
To see why these numbers are not the same, consider the following derivation. We can write down the expected outcome conditional on the do operator by the Law of Total Expectations. We can write out the average outcome conditional on the policy as the sum of the average outcomes of the policy conditional on the observed treatments weighted by the observed probabilities of the treatments.
\[ \begin{array}{ll} \mathbb{E}(Y | \mbox{do}(X) = 1) & = \mathbb{E}(Y | \mbox{do}(X) = 1, X = 0) \Pr(X = 0) \\ & + \mathbb{E}(Y | \mbox{do}(X) = 1, X = 1) \Pr(X = 1) \end{array} \tag{6}\]
The expected outcome under a policy in which individuals go to college is a weighted sum of the effect of the policy on individuals who currently go to college and the effect of the policy on individuals who currently do not go to college.
We are generally able to observe three of the four numbers on the right-hand side of Equation 6. We observe the probability individuals are allocated to the current treatments. In addition, we assume that \(\mathbb{E}(Y | \mbox{do}(X) = 1, X = 1) = \mathbb{E}(Y | X = 1)\). That is, we assume that the expected outcome for people assigned to a treatment will be the same as if there was a policy that assigned them to the same treatment. The number we do not observe in the data is \(\mathbb{E}(Y | \mbox{do}(X)=1, X = 0)\). We cannot observe the expected outcome conditional on a policy assigning a person to one treatment when they are observed receiving the other treatment. We cannot observe the expected income from attending college for people who do not attend college.
ATE and Unconfoundedness
We can estimate the average treatment effect if we are willing to make the following assumption.
Assumption 1 (Unconfoundedness) \(\mathbb{E}(Y | \mbox{do}(X) = x, X = x) = \mathbb{E}(Y | \mbox{do}(X) = x, X = x')\)
Assumption 1 states that the expected outcome of the policy does not vary with the treatment observed in the data. Under the assumption, there is no information content in the fact that one group attends college and one group does not. This assumption may be reasonable if we have data from an ideal randomized controlled trial. For most other data, including many randomized controlled trials, the assumption may not be credible.
The assumption implies we can substitute the unknown expected value with the known expected value.
\[ \begin{array}{ll} \mathbb{E}(Y | \mbox{do}(X) = 1) & = \mathbb{E}(Y | \mbox{do}(X) = 1, X = 0) \Pr(X = 0) \\ & + \mathbb{E}(Y | \mbox{do}(X) = 1, X = 1) \Pr(X = 1)\\ & = \mathbb{E}(Y | \mbox{do}(X) = 1, X = 1) \Pr(X = 0) \\ & + \mathbb{E}(Y | \mbox{do}(X) = 1, X = 1) \Pr(X = 1)\\ & = \mathbb{E}(Y | X = 1) \end{array} \tag{7}\]
The implication is that we can estimate the average of the potential outcomes for each treatment. Thus we can estimate the average difference in potential outcomes. Said differently, unconfoundedness allows us to estimate the average treatment effect.
ATE and Simulated Data
X <- runif(N) < 0.3 # treatment assignment
Y <- (1-X)*y[,1] + X*y[,2] # outcome conditional on treatment
Consider a change to our simulated data to make it look more like an actual data set. In the new data we only see one outcome and one treatment for each individual. However, if we can make the unconfoundedness assumption then we can estimate the average treatment effect. Our new data satisfies the assumption because the assignment to treatment is random.
mean(Y[X==1]) - mean(Y[X==0])
[1] 2.432335
In the data the true average treatment effect is 2. Our estimate is 2.43.
What changes could you make to the simulated data that would increase the accuracy of the estimate?2
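One answer, as a rough sketch (these are assumed changes, not the only ones): simulate more individuals and move the assignment probability toward 50/50, both of which shrink the standard error of the difference in means. New object names (N2, y2, and so on) are used so the original simulated data are not overwritten.
N2 <- 20000 # many more simulated individuals
b2 <- rnorm(N2, mean=2, sd=3)
u2 <- rnorm(N2)
y2 <- a + b2*cbind(rep(0,N2), rep(1,N2)) + u2
X2 <- runif(N2) < 0.5 # balanced treatment assignment
Y2 <- (1-X2)*y2[,1] + X2*y2[,2]
mean(Y2[X2==1]) - mean(Y2[X2==0]) # should typically land much closer to 2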
Kolmogorov Bounds
There are policy questions where the ATE provides a useful answer, but it is often provided as a statistic of convenience. In the data generated above, many simulated individuals are better off under treatment \(x = 1\). But not everyone is better off. It may be useful for policy makers to know something about the joint distribution of potential outcomes or the distribution of the treatment effect.3
We do not have access to the impossible data generated above. We cannot estimate the joint distribution of potential outcomes or the distribution of the treatment effect. However, we can **bound** these distributions.
Kolmogorov’s Conjecture
The Russian mathematician, Andrey Kolmogorov, conjectured that the difference of two random variables with known marginal distributions could be bounded in the following way. Note that I have written this out in a simplified form that looks more like the way it is implemented in R.4
Theorem 1 (Kolmogorov’s Conjecture) Let \(\beta_i = y_{i}(1) - y_{i}(0)\) denote the treatment effect and \(F\) denote its distribution. Let \(F_0\) denote the distribution of outcomes for treatment (\(x=0\)) and \(F_1\) denote the distribution of outcomes for treatment (\(x=1\)). Then \(F^L(b) \le F(b) \le F^U(b)\), where
\[ F^L(b) = \max\{\max_{y} F_1(y) - F_0(y - b), 0\} \tag{8}\]
and
\[ F^U(b) = 1 + \min\{\min_y F_1(y) - F_0(y-b), 0\} \tag{9}\]
Theorem 1 states that we can bound the distribution of the treatment effect even though we only observe the distributions of outcomes under each treatment. You may be surprised to learn how easy these bounds are to implement and how much information they provide about the distribution of the treatment effect.
Kolmogorov Bounds in R
We can use Theorem 1 as pseudo-code for the functions that bound the treatment effect distribution.
FL <- function(b, y1, y0) {
  f <- function(x) -(mean(y1 < x) - mean(y0 < x - b))
  # note the negative sign as we are maximizing
  # (Remember to put it back!)
  a <- optimize(f, c(min(y1,y0),max(y1,y0)))
  return(max(-a$objective,0))
}
FU <- function(b, y1, y0) {
  f <- function(x) mean(y1 < x) - mean(y0 < x - b)
  a <- optimize(f, c(min(y1,y0), max(y1,y0)))
  return(1 + min(a$objective,0))
}
K <- 50
min_diff <- min(y[,1]) - max(y[,2])
max_diff <- max(y[,1]) - min(y[,2])
del_diff <- (max_diff - min_diff)/K
y_K <- min_diff + c(1:K)*del_diff
plot(ecdf(y[,2] - y[,1]), do.points=FALSE,lwd=3,main="")
lines(y_K,sapply(y_K, function(x) FL(x,y[,2],y[,1])),
      lty=2,lwd=3)
lines(y_K,sapply(y_K, function(x) FU(x,y[,2],y[,1])),
      lty=3,lwd=3)
abline(v=0,lty=2,lwd=3)

Figure 3 presents the distribution of the treatment effect for the simulated data as well as the lower and upper bounds. Remember that in normal data we cannot observe the treatment effect, but thanks to the math we can determine its bounds. If you look closely, you will notice that some simulated individuals must be harmed by the treatment. At 0, the lower bound is strictly positive. Of course, we know that in our impossible data some simulated individuals are in fact worse off.
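We can check this claim at zero directly with the FL function above and, because this is the impossible data, compare it to the true share of individuals who are harmed:
FL(0, y[,2], y[,1]) # lower bound on the share with a negative effect
mean(y[,2] - y[,1] < 0) # the true share in the impossible data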
Do ``Nudges’’ Increase Savings?
Researchers in economics and psychology have found that individuals often make poor decisions. They make decisions that are against the individual’s own interest. Given this, can policies or products be provided that “nudge” individuals to make better decisions?
Ashraf, Karlan, and Yin (2006) describe an experiment conducted with a bank in the Philippines. In the experiment some customers were offered “commitment” savings accounts. In these accounts the customer decides upon a goal, such as a target amount or a target date, and can deposit but not withdraw until the goal is reached. Such products may help people with issues controlling personal finances or interacting with household members on financial matters. People offered accounts did not actually have to open an account and many did not.
Ashraf, Karlan, and Yin (2006) use a field experiment to determine the effectiveness of a commitment savings account.5 In the experiment there are three treatment groups; the first group is offered the commitment savings account at no extra cost or savings, the second group is provided information on the value of savings, and the third is a control. Here we will compare the commitment group to the control.
The section uses the data to illustrate the value of Kolmogorov bounds.
Field Experiment Data
We first replicate the findings in Ashraf, Karlan, and Yin (2006). The data is available at https://doi.org/10.7910/DVN/27854.
require(readstata13)
Loading required package: readstata13
# this data set was saved with Stata version 13.
x <- read.dta13("seedanalysis_011204_080404.dta")
Warning in read.dta13("seedanalysis_011204_080404.dta"):
Factor codes of type double or float detected in variables
amount, paid, spaying, value, value1,
value2, value3, value4, value5, lexpen1,
etypica2, fargreen, cosgreen, frabrao1,
costro1, frabrao2, costrvo2, frabrao3,
hsave, pgloan, hgowe, gtime, e4, e10,
dates_month, marketbuy, expensivebuy,
numchild, familyplan, assistfam,
personaluse, recreation, familypurchase,
workout, initiatepeace, schkids
No labels have been assigned.
Set option 'nonint.factors = TRUE' to assign labels anyway.
Warning in read.dta13("seedanalysis_011204_080404.dta"):
Missing factor labels for variables
repay5
No labels have been assigned.
Set option 'generate.factors=TRUE' to generate labels.
index_na <- is.na(rowSums(cbind(x$treatment,
                                x$balchange,x$marketing)))==0
x1 <- x[index_na,]
bal_0 <- x1[x1$treatment==0 & x1$marketing==0,]$balchange
bal_1 <- x1[x1$treatment==1 & x1$marketing==0,]$balchange
# we are just going to look at the people who did not receive
# the marketing information.
# These people are split between those that received
# the account (treatment = 1), and those that did not
# (treatment = 0).
# balchange measures how their balance changed over a year.
lbal_0 <- log(bal_0 + 2169)
lbal_1 <- log(bal_1 + 2169)
# the distribution of balances is very skewed.
mean(bal_1) - mean(bal_0)
[1] 411.4664
The average treatment effect is a 411 peso increase (under US$10 at the time) in savings after 12 months for those offered the commitment accounts. This result suggests that commitment accounts have a significant impact on savings rates. However, it is not clear whether everyone benefits or how much benefit these accounts provide.
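As a quick check on the sampling noise behind that difference, a simple two-sample comparison can be run (a sketch; it ignores the covariates and randomization details that a fuller analysis would account for):
# Welch two-sample t-test for the difference in balance changes
t.test(bal_1, bal_0)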
Bounds on the Distribution of Balance Changes

Figure 4 presents the bounds on the distribution of the treatment effect. The figure shows that there is a small portion of the population that ends up saving a large amount due to the commitment savings device, over 10,000 pesos. It also shows that for a large part of the population the commitment savings may or may not increase savings.
There may even be people who actually end up saving less. Unlike the example above, we cannot show that the fraction must be greater than 0.
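The code behind Figure 4 is not shown, but the bounds can be traced out with the FL and FU functions defined earlier. A sketch, where the grid of candidate treatment effects is my own choice rather than the one used for the figure:
b_grid <- seq(-10000, 20000, length.out=100) # candidate effects in pesos
lower <- sapply(b_grid, function(b) FL(b, bal_1, bal_0))
upper <- sapply(b_grid, function(b) FU(b, bal_1, bal_0))
plot(b_grid, upper, type="l", lty=3, lwd=3, ylim=c(0,1),
     xlab="change in balance (pesos)", ylab="bounds on the cdf")
lines(b_grid, lower, lty=2, lwd=3)
abline(v=0, lty=2, lwd=3)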
Intent To Treat Discussion
One issue with the analysis presented above, and with the main results of Ashraf, Karlan, and Yin (2006), is that they are intent-to-treat estimates. We have estimated the treatment effect of being “assigned” to a commitment account. People are not lab rats. They have free will. In this case, people assigned to the commitment accounts had the choice of whether to open the account or not. Many did not.
Can you calculate the average treatment effect using the instrumental variable approach? Hint: it is much higher. Did you calculate the ATE or the LATE?
More generally, the concern is that we do not know what would happen to the savings of people who were assigned the commitment account but chose not to open it. Did these people know something that we do not?
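For the instrumental variable exercise above, one option is the Wald estimator: the intent-to-treat difference divided by the difference in take-up rates. The sketch below assumes the data contain an indicator for actually opening a commitment account; the column name opened is hypothetical and needs to be replaced with the variable used in the data set. Because only the offered group could open an account, the resulting estimate is a local average treatment effect (LATE) for the compliers rather than the ATE.
# hypothetical take-up indicator: replace `opened` with the actual
# variable recording whether the client opened a commitment account
open_1 <- x1[x1$treatment==1 & x1$marketing==0,]$opened
open_0 <- x1[x1$treatment==0 & x1$marketing==0,]$opened
# Wald estimator: ITT effect scaled by the difference in take-up rates
(mean(bal_1) - mean(bal_0)) /
  (mean(open_1, na.rm=TRUE) - mean(open_0, na.rm=TRUE))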
Manski Bounds
In his seminal paper, Nonparametric Bounds on Treatment Effects, Chuck Manski introduced the idea of set estimation to economics (Manski 1990). Manski argues that many of the assumptions underlying standard econometrics are ad hoc and unjustified. Rather than making such assumptions, Manski suggests presenting results based on assumptions that can be well justified. In many cases, such assumptions do not provide precise estimates.
Manski also points out that the econometrician and the policy maker may have different views on the reasonableness of assumptions. Therefore, the econometrician should present the results ordered from those based on the most reasonable assumptions to those results based on the least reasonable assumptions. This approach to presenting research gives the policy maker a better understanding of the relationship between the assumptions and the policy predictions (Manski and Pepper 2013).
The section presents the bounds approach and illustrates it with simulated data.
Confounded Model
Consider the following confounded version of the model presented above.
\[ y_{i}(x_i) = a + b_i x_i + \upsilon_{1i} \tag{10}\]
where \(y_i\) is individual \(i\)’s income, \(x_{i} \in \{0, 1\}\) is whether or not individual \(i\) attends college and \(\upsilon_{1i}\) represents some unobserved characteristic that also affects individual \(i\)’s income. The treatment effect is represented by \(b_i\) and this may vary across individuals.
This time, the value of the policy variable is also determined by the unobserved characteristic that determines income.
\[ \begin{array}{l} x_i^* = f + c \upsilon_{1i} + d z_i + \upsilon_{2i}\\ \\ x_i = \left \{\begin{array}{ll} 1 & \mbox{ if } x_i^* > 0\\ 0 & \mbox{ otherwise} \end{array} \right. \end{array} \tag{11}\]
where \(x_i^*\) is a latent (hidden) variable that determines whether or not the individual attends college. If the value of the latent value is high enough, then the individual attends college. Importantly, the value of this latent variable is determined by the same unobserved characteristic that determines income. That is, if \(\upsilon_{1i}\) is large and the parameter \(c\) is positive, then \(y_i\) will tend to be larger when \(x_i\) is 1 and lower when \(x_i\) is 0.
Simulation of Manski Bounds
Consider simulated data from a confounded data set.
c <- 2
d <- 4
f <- -1
Z <- round(runif(N))
u_2 <- rnorm(N)
xstar <- f + c*u + d*Z + u_2
X <- xstar > 0 # treatment assignment
Y <- (1-X)*y[,1] + X*y[,2] # outcome conditional on treatment
mean(Y[X==1]) - mean(Y[X==0])
[1] 3.506577
The simulated data illustrates the problem. If we incorrectly assume unconfoundedness and take the difference in means, our estimate of 3.51 is not close to the true value of 2. Try running OLS of \(Y\) on \(X\). What do you get?
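For the OLS suggestion, a one-line check on the confounded simulated data (using the Y and X just created):
# with a single binary regressor, the OLS slope is the difference in means
coef(lm(Y ~ X))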
In economics we call this a selection problem. One solution is to use an instrumental variable estimate to determine \(b\). But what if we don’t have an instrument? What if we don’t believe the assumptions of the IV model are credible given our data? An alternative to making an unreasonable assumption is to bound the value of interest.
Bounding the Average Treatment Effect
The average treatment effect of college is the difference in the expected outcome given a policy of going to college and a policy of not going to college.
\[ ATE = \mathbb{E}(Y | \mbox{do}(X)=1) - \mathbb{E}(Y | \mbox{do}(X) = 0) \tag{12}\]
From above we know it can be written as the difference in expected income when the policy forces everyone to go to college and the expected income when the policy forces everyone not to go to college.
We can write this out via the Law of Total Expectation.
\[ \begin{array}{l} ATE = \mathbb{E}(Y | \mbox{do}(X)=1, X=1) \Pr(X=1) +\\ \mathbb{E}(Y | \mbox{do}(X)=1, X=0) \Pr(X=0)\\ - \left(\mathbb{E}(Y | \mbox{do}(X)=0, X=1) \Pr(X=1) + \right.\\ \left.\mathbb{E}(Y | \mbox{do}(X)=0, X=0) \Pr(X=0) \right) \end{array} \tag{13}\]
Each expectation can be split into the group that attends college and the group that does not attend college. We observe the outcome of the policy that sends the individuals to college for the group that actually goes to college. If we assume that their outcome from the policy is the same as we observe, then we can substitute the observed values into the equation.
\[ \begin{array}{l} ATE = \Pr(X=1) \left(\mathbb{E}(Y | X=1) - \mathbb{E}(Y | \mbox{do}(X)=0, X=1) \right)\\ + \Pr(X=0) \left(\mathbb{E}(Y | \mbox{do}(X)=1, X=0) - \mathbb{E}(Y | X=0)\right) \end{array} \tag{14}\]
We don’t know the outcome of the policy that sends individuals to college for the group that actually does not go to college. Note that I rearranged the equation a little.
We cannot determine the ATE. But we can bound the ATE by replacing the values we cannot observe with values we can observe. Importantly, we know these observed values must be larger (smaller) than the values we cannot observe.
Natural Bounds of the Average Treatment Effect
What is the weakest assumption we could make? An expectation is bounded by the smallest possible value and the largest possible value. An average cannot be smaller than the smallest possible value in the set being averaged. Similarly, the average cannot be larger than the largest possible value in the set being averaged.
The bounds are created by replacing the unknown values with the smallest (largest) values they could be. Let \(\underline{Y}\) represent the lower bound (the lowest possible value) and \(\overline{Y}\) represent the upper bound (the largest possible value). Manski calls these the worst-case bounds, while Pearl uses the term natural bounds.6
Given these values, we can calculate the bounds on the average treatment effect.
\[ \begin{array}{l} \overline{ATE} = (\mathbb{E}(Y | X=1) - \underline{Y}) \Pr(X=1) + (\overline{Y} - \mathbb{E}(Y | X=0)) \Pr(X=0)\\ \\ \underline{ATE} = (\mathbb{E}(Y | X=1) - \overline{Y}) \Pr(X=1) + (\underline{Y} - \mathbb{E}(Y | X=0)) \Pr(X=0) \end{array} \tag{15}\]
Note how the bounds on the ATE are calculated. The maximum on the ATE is denoted by the overline. It is when the first expected outcome is as large as possible and the second expected outcome is as small as possible. Similarly, the minimum on the ATE is when the first outcome is as small as possible and the second outcome is as large as possible. The minimum on the ATE is denoted by the underline.
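Because the same calculation is repeated several times below, it may help to wrap Equation 15 in a small helper function. A sketch; the function name and interface are my own:
# worst-case (natural) bounds on the ATE, following Equation 15
natural_bounds <- function(Y, X, ymin, ymax) {
  p1 <- mean(X==1)
  p0 <- mean(X==0)
  e1 <- mean(Y[X==1])
  e0 <- mean(Y[X==0])
  c(lower = (e1 - ymax)*p1 + (ymin - e0)*p0,
    upper = (e1 - ymin)*p1 + (ymax - e0)*p0)
}
Calling natural_bounds(Y, X, min(Y), max(Y)) on the simulated data should reproduce the two numbers computed below.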
Natural Bounds with Simulated Data
In the simulated data we can use the observed minimum and maximum.
PX1 = mean(X==1)
PX0 = mean(X==0)
EY_X1 = mean(Y[X==1])
EY_X0 = mean(Y[X==0])
minY = min(Y)
maxY = max(Y)
The bounds are calculated by replacing the unknown outcome with the minimum possible value of the outcome and, alternatively, the maximum possible value for the outcome.
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 7.975223
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -5.010368
These bounds are wide. The average treatment effect of \(X\) on \(Y\) is between -5.01 and 7.98. The true value is 2.
Are Natural Bounds Useless?
The bounds presented above are wide and don’t even predict the correct sign for the ATE. What can we take away from this information?
First, if we are unwilling to make stronger assumptions, then the data may simply not help us answer the policy question of interest. Manski calls the willingness to make incredible assumptions in order to get more certain results, the “lure of incredible certitude” (Manski 2018). He argues that this practice reduces the public and the policy maker’s willingness to rely on science and accept new knowledge.
Second, it is not that we don’t learn anything from the data. In this case we learn that the effect of a policy \(\mbox{do}(X) = 1\) cannot have a larger effect than 8. There are cases where this information may be enough for policy makers to seek an alternative. For example, a cost benefit analysis may have suggested that for a policy to be of value, the effect of the policy must be greater in magnitude than 8. In that case, the bounds provide enough information to say that the policy’s benefits are outweighed by its costs.
Third, there may be assumptions and data that are reasonable and allow tighter bounds. Those are discussed more in the following sections.
Bounds with Exogenous Variation
We may be able to tighten the bounds using variation in the data. In particular, we need variation such that the effect of the policy does not change across different subsets of the data, but the bounds do.
Assumption 2 (Level Set) \[ \begin{array}{l} \mathbb{E}(Y | \mbox{do}(X) = 1, Z = z) - \mathbb{E}(Y | \mbox{do}(X) = 0, Z = z)\\ = \mathbb{E}(Y | \mbox{do}(X) = 1, Z = z') - \mathbb{E}(Y | \mbox{do}(X) = 0, Z = z') \end{array} \] for all \(z, z'\).
Assumption 2 is like an instrumental variables assumption. Manski calls it a level-set assumption.7 It states that there exists some observable characteristic such that the average treatment effect does not change as that characteristic changes. Given this property, it is possible to get tighter bounds by estimating the bounds on the average treatment effect for various subsets of the data. Under the assumption, the average treatment effect must lie in the intersection of these bounds. Thus the new bounds are the intersection of the estimated bounds.
\[ \begin{array}{l} \overline{ATE} = \min\{(\mathbb{E}(Y | X=1, Z=1) - \underline{Y}) \Pr(X=1 | Z=1)\\ + (\overline{Y} - \mathbb{E}(Y | X=0, Z=1)) \Pr(X=0 | Z=1),\\ (\mathbb{E}(Y | X=1, Z=0) - \underline{Y}) \Pr(X=1 | Z=0)\\ + (\overline{Y} - \mathbb{E}(Y | X=0, Z=0)) \Pr(X=0 | Z=0)\}\\ \\ \underline{ATE} = \max\{(\mathbb{E}(Y | X=1, Z=1) - \overline{Y}) \Pr(X=1 | Z=1)\\ + (\underline{Y} - \mathbb{E}(Y | X=0, Z=1)) \Pr(X=0 | Z=1),\\ (\mathbb{E}(Y | X=1, Z=0) - \overline{Y}) \Pr(X=1 | Z=0)\\ + (\underline{Y} - \mathbb{E}(Y | X=0, Z=0)) \Pr(X=0 | Z=0)\} \end{array} \tag{16}\]
These are the bounds when the instrument-like variable takes two values (\(Z \in \{0, 1\}\)).
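Using the natural_bounds helper sketched earlier, the level-set bounds are just the intersection of the subset-specific bounds (a sketch; it uses the overall minimum and maximum of the outcome for both subsets, matching the code below):
# worst-case bounds within each value of Z, then intersect
bz0 <- natural_bounds(Y[Z==0], X[Z==0], min(Y), max(Y))
bz1 <- natural_bounds(Y[Z==1], X[Z==1], min(Y), max(Y))
lower_ls <- max(bz0["lower"], bz1["lower"])
upper_ls <- min(bz0["upper"], bz1["upper"])
c(lower_ls, upper_ls)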
Exogenous Variation in Simulated Data
We haven’t used it yet, but there is a variable \(Z\) in the simulated data that is associated with changes in the policy variable but does not directly affect income.8
EY_X1Z1 = mean(Y[X==1 & Z==1])
EY_X1Z0 = mean(Y[X==1 & Z==0])
EY_X0Z1 = mean(Y[X==0 & Z==1])
EY_X0Z0 = mean(Y[X==0 & Z==0])
PX1_Z1 = mean(X[Z==1]==1)
PX1_Z0 = mean(X[Z==0]==1)
PX0_Z1 = mean(X[Z==1]==0)
PX0_Z0 = mean(X[Z==0]==0)
# ATE upper bound
min((EY_X1Z1 - minY)*PX1_Z1 + (maxY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - minY)*PX1_Z0 + (maxY - EY_X0Z0)*PX0_Z0)
[1] 7.049019
# ATE lower bound
max((EY_X1Z1 - maxY)*PX1_Z1 + (minY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - maxY)*PX1_Z0 + (minY - EY_X0Z0)*PX0_Z0)
[1] -4.00698
We see that using the level set restriction we do get tighter bounds, but the change is not very large. What changes could you make in the simulated data to get a larger effect of using the **level set restriction**?
Bounds with Monotonicity
Can the bounds be tighter with some economics? Remember that we observe the cases where \(\mbox{do}(X)=x\) and \(X=x\) match. We don’t observe the cases where they don’t match. However, we can use the observed cases to bound the unobserved cases. Mathematically, there are a couple of options regarding which observed outcomes can be used for the bounds. Which option you choose depends on the economics.
In the simulated data a higher unobserved term is associated with a greater likelihood of choosing treatment \(x = 1\). That is, holding everything else constant, observing someone receiving treatment \(x=1\) means that they will have higher outcomes. This is a monotonicity assumption. In math, the assumption is as follows.
Assumption 3 (Monotonicity) \(\mathbb{E}(Y | \mbox{do}(X)=1, X=1) \ge \mathbb{E}(Y | \mbox{do}(X)=1, X=0)\) and \(\mathbb{E}(Y | \mbox{do}(X)=0, X=0) \le \mathbb{E}(Y | \mbox{do}(X)=0, X=1)\)
Assumption 3 states that observing someone receive treatment \(x=1\) tells us something about their unobserved term. For example, if we hold the treatment the same for everyone, then the people who chose \(x=1\) will have higher expected outcomes. Those who are “selected” into college may have better returns to schooling than the average person. The treatment has monotonic effects on outcomes. We can use this assumption to tighten the bounds on the ATE. In particular, the upper bound can be adjusted down.
\[ \begin{array}{l} \overline{\mathbb{E}(Y | \mbox{do}(X) = 1)} = \mathbb{E}(Y | X = 1)\\ \\ \underline{\mathbb{E}(Y | \mbox{do}(X) = 0)} = \mathbb{E}(Y | X = 0) \end{array} \tag{17}\]
The monotonicity assumption implies that forcing everyone into treatment \(x=1\) cannot lead to better expected outcomes than the outcomes we observe given the treatment. Similarly, forcing everyone into treatment \(x=0\) cannot have a worse expected outcome than the outcomes we observe given the treatment.
\[ \begin{array}{l} \overline{ATE} = (\overline{Y} - \mathbb{E}(Y | X=0)) \Pr(X=0)\\ \\ \underline{ATE} = (\mathbb{E}(Y | X=1) - \overline{Y}) \Pr(X=1) \end{array} \tag{18}\]
Bounds with Monotonicity in the Simulated Data
# ATE upper bound
(maxY - EY_X0)*PX0
[1] 3.76668
# ATE lower bound
(EY_X1 - maxY)*PX1
[1] -3.64774
Imposing Assumption 3 on the simulated data allows us to tighten the bounds. They narrow to \([-3.65, 3.77]\). Remember that the true average treatment effect in the simulated data is 2. The assumption lowers the largest possible value of the treatment effect from about 8 to under 4.
Note that the impact of these assumptions is presented in the order that Manski and Pepper (2013) prefer. We started with the most credible assumption, the natural bounds. Then we added a level-set restriction because we had a variable that satisfied the assumption. Finally, we made the monotonicity assumption.
More Guns, Less Crime?
One of the most controversial areas in microeconometrics is estimating the effect of various gun laws on crime and gun related deaths. To study these effects, economists and social scientists look at how these laws vary across the United States and how those changes in laws are related to changes in crime statistics (Manski and Pepper 2018).
Justice Louis Brandeis said that a “state may, if its citizens choose, serve as a laboratory; and try novel social and economic experiments without risk to the rest of the country.”9 The US states are a “laboratory of democracy.” As such, we can potentially use variation in state laws to estimate the effects of those laws. The problem is that US states are very different from each other. In the current terminology, the states with strong pro-gun laws tend to be “red” states or at least “purple” states. They also tend to have large rural populations.
Between 1980 and 1990, twelve states adopted Right to Carry (RTC) laws. We are interested in seeing how crime fared in those states relative to states that did not adopt those laws. To do this we can look at crime rates from the 1980s and 1990s. A potential problem is that the crack epidemic hit the United States at exactly this time, rising through the 80s and 90s before tailing off. The crack cocaine epidemic was associated with large increases in crime rates in urban areas (Aneja, Donohue III, and Zhang 2011).
This section uses publicly available crime data to illustrate the value of the bounds approach.
Crime Data
The data is downloaded from John Donohue’s website.10 While there is quite a lot of variation in gun laws, the definition of RTC is “shall issue” in the data set used. For crime, we use the rate of aggravated assaults per 100,000 population in each state, averaged over the post-1990 years. The code also calculates the physical size of the state, which is a variable that will be used later.
library(foreign)
# the data is standard Stata format, the library foreign
# allows this data to be imported.
x <- read.dta("UpdatedStateLevelData-2010.dta")
Y <- X <- Z <- NULL
# the loop will create variables by adding to the vectors
for (i in 2:length(unique(x$state))) {
  # length measures the number of elements in the object.
  state = sort(unique(x$state))[i]
  # note the first state is "NA"
  X <- c(X,sum(x[x$state==state,]$shalll, na.rm = TRUE) > 0)
  # determines if a state has an RTC law at some point in time.
  # na.rm tells the function to ignore NAs
  Y <- c(Y,mean(x[x$state==state & x$year > 1990,]$rataga,
                na.rm = TRUE))
  # determines the average rate of aggravated assault for the
  # state post 1990.
  Z <- c(Z,mean(x[x$state==state & x$year > 1990,]$area,
                na.rm = TRUE) > 53960)
  # determines the physical area of the state
  # Small state = 0, large state = 1
  # print(i)
}
Figure 5 shows the histogram for the average aggravated assault rate per state in the post-1990 years. It shows that the rate per 100,000 is between 0 and 600 for the most part.

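The code that generated Figure 5 is not shown; something like the following should produce a similar histogram (the bin count and labels are my choices):
hist(Y, breaks=20, main="",
     xlab="aggravated assaults per 100,000 (state average, post 1990)")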
ATE of RTC Laws under Unconfoundedness
If we assume unconfoundedness, then RTC laws lower aggravated assault. Comparing the average rate of aggravated assault in states with RTC laws to states without RTC laws, we see that the average is lower with RTC laws.
EY_X1 <- mean(Y[X==1])
EY_X0 <- mean(Y[X==0])
EY_X1 - EY_X0
[1] -80.65852
Unconfoundedness is not a reasonable assumption. We are interested in estimating the average effect of implementing an RTC law. We are not interested in the average rate of assaults conditional on the state having an RTC law.
Natural Bounds on ATE of RTC Laws
We cannot observe the effect of RTC laws for states that do not have RTC laws. We could assume that the assault rate lies between 0 and 100,000 (which it does).
PX0 <- mean(X==0)
PX1 <- mean(X==1)
minY <- 0
maxY <- 100000
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 23666.01
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -76333.99
The natural bounds are very, very wide. An RTC policy may lead to assault rates decreasing by about 76,000 or increasing by about 24,000 per 100,000 people.
We can make these bounds tighter by assuming that assault rates of the policy cannot lie outside the rates observed in the data.
minY <- min(Y)
maxY <- max(Y)
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 334.1969
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -624.7655
These bounds are a lot tighter. A policy that introduces RTC for the average state could decrease the assault rate by 625 or increase it by 334. Given that range, it could be that RTC laws substantially reduce aggravated assaults, or it could be that they have little or no effect. They may even cause an increase in aggravated assaults.
Bounds on ATE of RTC Laws with Exogenous Variation
PX1_Z1 <- mean(X[Z==1]==1)
PX1_Z0 <- mean(X[Z==0]==1)
PX0_Z1 <- mean(X[Z==1]==0)
PX0_Z0 <- mean(X[Z==0]==0)
EY_X1Z1 <- mean(Y[X==1 & Z==1])
EY_X1Z0 <- mean(Y[X==1 & Z==0])
EY_X0Z1 <- mean(Y[X==0 & Z==1])
EY_X0Z0 <- mean(Y[X==0 & Z==0])
# ATE upper bound
min((EY_X1Z1 - minY)*PX1_Z1 + (maxY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - minY)*PX1_Z0 + (maxY - EY_X0Z0)*PX0_Z0)
[1] 323.2504
# ATE lower bound
max((EY_X1Z1 - maxY)*PX1_Z1 + (minY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - maxY)*PX1_Z0 + (minY - EY_X0Z0)*PX0_Z0)
[1] -613.3812
We can make a level set assumption. Assume that the instrument-like variable is the physical size of the state. The assumption is that the average treatment effect of implementing an RTC law must be the same irrespective of the physical size of the state. Note that observable outcomes like the assault rate and the proportion of states with RTC laws may vary with the physical size. The assumption is on the average treatment effect which is unobserved.
The bounds are tighter, although not much. RTC laws could reduce aggravated assaults by 613 or increase rates by 323.
Bounds on ATE of RTC Laws with Monotonicity
Would it be reasonable to use the monotonicity assumption above (Assumption 3)?
Let’s assume that states that currently have RTC laws also tend to have lower levels of aggravated assault. Moreover, forcing states that do not currently have RTC laws to adopt them will not reduce their expected aggravated assaults below that level. This is the “negative” of the monotonicity assumption used with the simulated data. We can summarize this with Assumption 4.
Assumption 4 (Monotonicity, version 2) \(\mathbb{E}(Y | \mbox{do}(X)=1, X=1) \le \mathbb{E}(Y | \mbox{do}(X)=1, X=0)\) and \(\mathbb{E}(Y | \mbox{do}(X)=0, X=0) \ge \mathbb{E}(Y | \mbox{do}(X)=0, X=1)\)
Assumption 4 implies the following change to the bounds on the unobserved expectations.
\[ \begin{array}{l} \underline{\mathbb{E}(Y | \mbox{do}(X) = 1)} = \mathbb{E}(Y | X = 1)\\ \\ \overline{\mathbb{E}(Y | \mbox{do}(X) = 0)} = \mathbb{E}(Y | X = 0) \end{array} \tag{19}\]
Plugging these into the bounds on the ATE we have the following bounds on the effect of the RTC laws.
# ATE upper bound
(EY_X1 - minY)*PX1
[1] 184.2203
# ATE lower bound
(minY - EY_X0)*PX0
[1] -75.66166
These bounds are substantially tighter. They suggest that the estimate of the ATE under unconfoundedness is actually at the high end of the possible effect of RTC laws. This is evidence that the unconfoundedness assumption cannot hold. At least, it is inconsistent with the weaker monotonicity assumption.
The results in this section suggest the slogan may be more accurately stated as “more guns, more or less crime.”
Discussion and Further Reading
This chapter argues that it may be better to provide less precise estimates than precise predictions of little value to policy makers.
I strongly believe that the average treatment effect is given way too much prominence in economics and econometrics. ATE can be informative, but it can also badly mislead policy makers and decision makers. If we know the joint distribution of potential outcomes, then we may be able to better calibrate the policy. I hope that Kolmogorov bounds will become a part of the modern econometrician’s toolkit. A good place to learn more about this approach is Fan and Park (2010). Mullahy (2018) explores this approach in the context of health outcomes.
Chuck Manski revolutionized econometrics with the introduction of set identification. He probably does not think so, but Chuck has changed the way many economists and most econometricians think about problems. We think much harder about the assumptions we are making. Are the assumptions credible? We are much more willing to present bounds on estimates, rather than make non-credible assumptions to get point estimates.
Manski’s natural bounds allow the researcher to estimate the potential effect of the policy with minimal assumptions. These bounds may not be informative, but that in and of itself is informative. Stronger assumptions may lead to more informative results but at the risk that the assumptions, not the data, determine the results.
I highly recommend any book by Chuck Manski. However, Manski (1995) is the standard on non-parametric bounds. To understand more about potential outcomes see Rubin (1974). To understand more about **do operators** see Pearl and Mackenzie (2018).
Manski and Pepper (2018) use the bounds approach to analyze the relationship between right-to-carry laws and crime.
References
Footnotes
Alexandria Ocasio-Cortez is often referred to as AOC.↩︎
Some of these changes are discussed in Chapter 1.↩︎
This chapter discusses the second, but the two are mathematically related.↩︎
Field experiments are randomized trials in which people or villages or schools are assigned between trial arms.↩︎
If we don’t know the possible values, we could use the observed values. This assumption may be less “natural” than we may prefer.↩︎
Would the IV estimator discussed in Chapter 3 satisfy Assumption 2?↩︎
Does \(Z\) satisfy the assumptions of an instrumental variable?↩︎
See New State Ice Co. v. Liebmann, 285 U.S. 262 (1932).↩︎