
Endogenous Binary Regressors


* Often we are interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.

* It is not obvious how to simulate data that meet the specifications we desire.

* First let’s think about the standard IV setup.

* y = b0 + xb1 + wb2 + u
* w = g0 + zg1 + v

* With u and v distributed normally and independently of x and z, but possibly correlated with each other (which is what makes w endogenous).

* We know this setup is generally not correct if either y or w is binary.

* When y is binary we might choose to use an MLE probit estimator with the following assumptions:

* P(y=1) = normal(b0 + xb1 + wb2)
* P(w=1) = normal(g0 + zg1)

* However, it is not easy to generate data in this form.

* Instead we will generate data, introducing endogeneity through an unobserved additive error.

clear
set obs 1000

gen z = rnormal()
  label var z "Exogenous instrument"

gen v = rnormal()
  label var v "Unobserved error"

gen wp = normal(-.5 + z*1.25 + 1*v)
  label var wp "Probability that w = 1"

gen w = rbinomial(1, wp)
  label var w "Endogenous variable"

* A probit of w on z should not be expected to return a coefficient of 1.25 on z.
* This is because the additive error term v adds to the implicit unit-variance error of the normal CDF, creating a composite error with a total variance of 1 + 1 = 2.
* The probit estimator normalizes the error variance to 1, so every coefficient is scaled down by the true standard deviation, 2^.5.
* We can find the values the estimator is consistent for by dividing each coefficient by the true standard deviation.
di -.5/(2^.5)  " for the constant"
di 1.25/(2^.5) " for the coefficient on z"

gen x1 = rnormal()
  label var x1 "Exogenous variable"

probit w z
  * The estimates are pretty close to what we expect.
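
* The comparison can also be made formally. The lines below are a minimal sketch, assuming probit w z is still the most recent estimation command. The local names z_target and c_target are introduced here for illustration only.

```stata
* Rescaled targets: true coefficients divided by the composite standard deviation 2^.5
local z_target = 1.25/(2^.5)
local c_target = -.5/(2^.5)

* Wald tests of each probit estimate against its target
test z = `z_target'
test _cons = `c_target'
```

* Failing to reject in both tests is what we would expect if the rescaling argument holds.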

gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)
  * Now we are including the v term in the generation of y in order to introduce an endogenous correlation between w and y.
  * The composite error is -(2^.5)*v plus the implicit unit-variance probit error, so we need to adjust our expected coefficients.
  di "Variance of the unobserved = " 1^2 + (2^.5)^2
  di "Standard deviation = " (1^2 + (2^.5)^2)^.5
  di "Constant coefficient = " 1/(1^2 + (2^.5)^2)^.5
  di "x coefficient = " 1/(1^2 + (2^.5)^2)^.5
  local w_est = 1.5/(1^2 + (2^.5)^2)^.5
  di "w coefficient = " `w_est'

gen y = rbinomial(1, yp)

* First let’s see what happens if we neglect to make any effort at controlling for the endogeneity.
probit y w x1

test w = `w_est'

* Our estimate of the coefficient on w is far too small.

* In order to estimate our relationship consistently we will use the biprobit command.
* This command jointly estimates two probit equations while allowing their unobserved errors to be correlated. The correlation rho is reported through the parameter /athrho, its inverse hyperbolic tangent.
* In this case, the unobserved component of the w equation is the source of endogeneity in the y equation.
* Since w enters y only linearly, it is sufficient that the unobserved portion of w that is correlated with y is, in a sense, controlled for through the joint probit regression.
biprobit (y = w x1) (w = z x1)
test [y]w = `w_est'

* We can test the "endogeneity" of w by testing the significance of /athrho, which appears in this case to be quite significant.

* Not bad: we fail to reject equality between our estimate of the coefficient on w and the true value.
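
* The estimate of /athrho can also be compared with the value implied by the data-generating process. This is a minimal sketch, assuming each probit carries an implicit independent unit-variance error on top of the v terms used above. The local names cov_wy and rho are illustrative.

```stata
* Composite errors: (v + e1) in the w equation, (-(2^.5)*v + e2) in the y equation
* Their covariance is -(2^.5)*Var(v) = -(2^.5), since Var(v) = 1
local cov_wy = -(2^.5)

* Divide by the two standard deviations, 2^.5 and 3^.5, to get the correlation
local rho = `cov_wy'/((2^.5)*(3^.5))

di "Implied rho    = " `rho'
di "Implied athrho = " atanh(`rho')
```

* The implied rho is -1/3^.5, so the estimated /athrho should be negative.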

* This works well with 1,000 observations. It seemed to be extremely ineffective with 100 observations, however.
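
* The small-sample behavior can be examined with a Monte Carlo. This is a sketch rather than part of the original analysis: the program name biprobit_sim, the 200 replications, and the seed are arbitrary choices. Note that simulate replaces the data in memory, so run it before regenerating any data you want to keep.

```stata
capture program drop biprobit_sim
program define biprobit_sim, rclass
    syntax [, obs(integer 1000)]
    * Regenerate the data as above for the requested sample size
    drop _all
    set obs `obs'
    gen z  = rnormal()
    gen v  = rnormal()
    gen x1 = rnormal()
    gen w  = rbinomial(1, normal(-.5 + z*1.25 + v))
    gen y  = rbinomial(1, normal(1 + w*1.5 + x1 - (2^.5)*v))
    * Return missing unless biprobit runs without error
    return scalar b_w = .
    capture biprobit (y = w x1) (w = z x1)
    if _rc == 0 {
        return scalar b_w = _b[y:w]
    }
end

* Collect the estimated coefficient on w across replications with 100 observations
simulate b_w = r(b_w), reps(200) seed(12345): biprobit_sim, obs(100)
summarize b_w, detail
di "Target = " 1.5/(3^.5)
```

* Rerunning with obs(1000) gives a direct comparison of the spread of the estimates at the two sample sizes.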

* What do we do if there are multiple endogenous binary variables?

* Let’s generate similar data:
clear
set obs 1000

gen z1 = rnormal()
gen z2 = rnormal()

gen v1 = rnormal()
gen v2 = rnormal()

gen x1 = rnormal()

gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)
gen w1 = rbinomial(1, wp1)
  label var w1 "Endogenous variable 1"

gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)
gen w2 = rbinomial(1, wp2)
  label var w2 "Endogenous variable 2"

gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)
gen y = rbinomial(1, yp)

* Once again we must adjust our expectation of the coefficients.
* Each target is the coefficient used to generate yp (.1, .7, 1, and .5) divided by the standard deviation of the composite error.
local var_unob = 1 + (.5^.5)^2 + (.5^.5)^2
di "Variance of unobservables = " `var_unob'
  local est_b0 = .1/(`var_unob')^.5
di "Constant coefficient = " `est_b0'
  local est_w1 = .7/(`var_unob')^.5
di "w1 coefficient = " `est_w1'
  local est_w2 = 1/(`var_unob')^.5
di "w2 coefficient = " `est_w2'
  local est_x1 = .5/(`var_unob')^.5
di "x1 coefficient = " `est_x1'

probit y w1 w2 x1
test [y]w1 = `est_w1'
test [y]w2 = `est_w2'
test [y]_cons = `est_b0'
test [y]x1 = `est_x1'

* The majority of estimates are not working well.

* Let’s try doing the joint MLE

* Previously biprobit was sufficient. However, biprobit only allows for two equations.

* Fortunately, there is a user-written command that uses simulation to approximate a multivariate probit.

* Install mvprobit if not yet installed.
* ssc install mvprobit

* The syntax of mvprobit is very similar to that of biprobit.

mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)
test [y]w1 = `est_w1'
test [y]w2 = `est_w2'
test [y]_cons = `est_b0'
test [y]x1 = `est_x1'

* Unfortunately the estimates are still too far from the true values.

* However, they are closer.
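
* Part of the remaining gap may come from simulation noise in the likelihood. To my knowledge mvprobit has a draws() option with a small default (5 draws). The sketch below raises it, with 50 as an arbitrary illustrative value, and assumes the locals est_w1 and est_w2 defined above are still in scope. Expect it to run noticeably slower.

```stata
* Re-estimate with more simulation draws (50 is an illustrative choice, not a recommendation)
mvprobit (y = w1 w2 x1) (w1 = z1 z2 x1) (w2 = z1 z2 x1), draws(50)

* Compare the coefficients on the endogenous variables with their rescaled targets
test [y]w1 = `est_w1'
test [y]w2 = `est_w2'
```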

* Let’s see how the fitted probability line compares with the true one.
predict yp_hat
replace yp_hat = normal(yp_hat)

reg yp_hat yp, nocon
predict yp_hat1_hat

* The if condition on the line plot was cut off in the source. A bound of < 1 is assumed here.
twoway (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat < 1), yscale(range(0 1)) xlabel(0 1) legend(off) title("Predicted probability against true")

* Overall, not a bad fit.

* This might not be sufficient for many applications.

* What happens when one of the endogenous variables is continuous?


Joining the Stata Bloggers aggregator: An opening post

A small subset of posts from this Tumblr log has been reformatted into this longer piece and submitted for inclusion into Francis Smart’s recently created aggregator for Stata blogs, Stata Bloggers. This post is an announcement message that explains the whys, hows and what of that operation.

First, why this blog? I am using Tumblr to run two course companions on health policy and data analysis. The latter, SRQM, is the one that you are reading from. It is named after the course “Statistical Reasoning and Quantitative Methods”, which I co-teach in Paris with Ivaylo Petev.

Why do you teach with Stata? Ivaylo and I chose Stata for practical reasons: we both knew how to use it, the software was available where we teach the course, and we needed software that could be taught to large groups of postgraduate beginners. We also teach an optional course with R.

Choosing a statistical software to work with is never an easy choice, but it has recently been made simple for a large category of users, for which the choice should be R. Anthony Damico is right when he jokingly writes the following:

confidential to sas, spss, stata, and sudaan users: the eighties called. they want their statistical languages back. time to transition to r. :D [Anthony also wrote that if Stata has a better learning curve than R, “so do bicycles with training wheels ;) ”]

R is also marked by the eighties in many ways, but it has indeed made other statistical software rather obsolete. Its ggplot2 library, for instance, is just much better than Stata graphs. Even if you take the time to tweak them with complex code or alternative colors, Stata plots are often ugly.

Yet Stata remains a good choice for those who are learning statistical analysis alongside other things and therefore have limited time to learn programming. It is quick, cheap enough for universities, and copes well with large surveys. It is also easily scriptable and open to user contributions.

For these reasons, Stata has a good user base among academics, especially in sociology, economics and political science. Nate Silver also uses it. There’s great documentation for Stata, in English as well as in different languages, like this page in Lithuanian.

There are more cool things about Stata. The World Bank has an awesome Stata package to download its data. Stata syntax is even supported by a few plain text editors like TextMate, thanks to Tim Beatty and Phil Schumm (now on GitHub), and it might get ported to the Pygments engine used at GitHub and elsewhere.

The real trouble with Stata might actually lie with the overwhelming dominance of “regression quants” in its user base. Regression analysis curbs how you think towards net effects, which is not necessarily what you need. I will probably come back to this in later posts.

Why aggregate this blog? For some time, I have been hoping to connect this blog and course to a larger community of Stata users. Neither has ever been advertised to Statalist, but the blog has gained a small readership, and the course is also public thanks to its hosting as a GitHub repository.

How is this blog aggregated? Blog aggregators work by making use of RSS feeds, which are a handy way to syndicate a website’s content. Most blogging engines offer at least one blog-wide feed. Tumblr also offers hidden tag-specific feeds. This post starts a series that will be tagged stata.


New Project – Stata Blog Aggregator

I have decided to start a Stata blog aggregator. I believe I am the first. You can find my preliminary efforts at http://www.stata-bloggers.com/ The purpose of a blog aggregator is to provide a system by which users are connected with v…


An R-squared for logistic regression, packaged

This morning I checked Paul Allison’s Statistical Horizons blog and found a post on R-squared measures for logistic regression. It introduced me to Tjur’s R-squared by way of an example, which I repackaged below:

// Reference: http://www.statisticalhorizons.com/r2logistic

// program definition
capture prog drop tjur2
program tjur2, rclass
    if !inlist(e(cmd),"logit","logistic") {
        di as err "Tjur's R-squared only works [...]
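
The excerpt above is cut off, so what follows is not the author's tjur2 program. As a minimal sketch of the statistic itself: Tjur's R-squared is the difference between the mean predicted probability among observations with outcome 1 and the mean among observations with outcome 0. The auto dataset, the model, and the names p_hat, mean1 and mean0 are arbitrary illustrative choices.

```stata
* Fit any logit model; this one is purely illustrative
sysuse auto, clear
logit foreign mpg weight

* Predicted probabilities for the estimation sample
predict double p_hat if e(sample), pr

* Mean predicted probability within each outcome group
quietly summarize p_hat if foreign == 1
local mean1 = r(mean)
quietly summarize p_hat if foreign == 0
local mean0 = r(mean)

di "Tjur's R-squared = " `mean1' - `mean0'
```

A value near 0 means the model barely separates the two groups, while a value near 1 means it separates them almost perfectly.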


Interpreting the Control Function Coefficient

* Is the control function coefficient a measure of the direction and size of the bias caused by endogeneity?
* Imagine the endogenous variable w being composed of three components: (1) an endogenous portion, (2) an exogenous portion correlated with z, (3) exogenous…


Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to use restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges. Unlike polynomials, information at one e…


Regression with Endogenous Explanatory Variables

* Imagine you would like to estimate the agricultural production process.
* You have two explanatory variables: rain and the use of hybrid or traditional seeds.
* You are concerned that better-off (in terms of SES) farmers will be more likely to use Hy…


Non-Parametric Regression Discontinuity

* I went to an interesting seminar today by Matias Cattaneo from the University of Michigan.
* He was presenting some of his work on non-parametric regression discontinuity design.
* What he was working on and the concl…


2SLS with multiple endogenous variables

* I am wondering if when using 2SLS you must use a multivariate OLS in the reduced form or if you can just do each individual endogenous variable.
* Let’s see!
clear
set obs 10000
* First generate the instruments
gen z1 = rnormal()
gen z2 = rnormal()
* Now th…


Non-Parametric PDF Fit Test

* This is an idea that I decided to explore before inspecting how others have addressed the problem.
* As noted in my previous post, we cannot use standard independence-based draw reasoning in order to test model fit.
* The following command wil…
