Stata Blogger?

Are you a Stata blogger? Join the aggregator!

Contributed by Francis Smart

Endogenous Binary Regressors



* Often times we are interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.

* It is not obvious how to simulate data that will fit the criteria specifications that we desire.

* First let's think about the standard IV setup.

* y = b0 + xb1 + wb2 + u
* w = g0 + zg1 + v

* With u and v distributed normally and independently of x, w, and z.

* We know this setup is generally not correct if either y is binary or w is binary .

* When y is binary we might choose to use a MLE probit estimator with the following assumptions:

* P(y=1) = normal(b0 + xb1 + wb2)
* P(w=1) = normal(g0 + zg1)

* However, it is not easy to generate data in this form.

* Instead we will generate data introducing endogeneity by use of an unobserved addative error.

clear
set obs 1000

gen z = rnormal()
  label var z "Exogenous instrument"

gen v = rnormal()
  label var v "Unobserved error"

gen wp = normal(-.5 + z*1.25 + 1*v)
  label var wp "Endogenous variable"

gen w = rbinomial(1, wp)

* The above equation should not be expected to estimate a coefficient of 1.25 on the z variable.
* This is because of the addative error term v which contributes to the implicit error of the normal CDF to create an error with a total variance of 1 + 1 = 2
* Thus, when the probit estimator is run it automatically scales the equation to be unit.
* We can discover consistent estimator by rescaling the coefficient on z to the true standard deviation.
di -.5/(2^.5)  " for the constant"
di 1.25/(2^.5) " for the coefficient on z"

gen x1 = rnormal()
  label var x1 "Exogenous variable"

probit w z
  * Pretty close estimates to what we expect.

gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)
  * Now we are including the v term in the generation of y in order to introduce an endogenous correlation between w and y.
  * We need to adjust our estimated coefficients.
  di "Variance of the unobserved =" 1^2 + 2^.5^2
  di "Standard deviation =" (1^2 + 2^.5^2)^.5
  di "Constant coefficient=" 1/(1^2 + 2^.5^2)^.5
  di "x coefficient=" 1/(1^2 + 2^.5^2)^.5
  local w_est =1.5/(1^2 + 2^.5^2)^.5
  di "w coefficient=" `w_est'

gen y = rbinomial(1, yp)

* First let's see what happens if we neglect to make any effort at controlling for the endogeneity.
probit y w x1

test w= `w_est'

* Our estimate of the coefficient on w is way too small

* In order to estimate our relationship in a consistent manner we will use the biprobit command.
* This command effectively estimates two separate probit regressions with the allowance that the unobserved outcomes be correlated with the parameter /athrho.
* In this case, the unobserved component from the w estimation equation is the endogenous component from the y estimation equation.
* Since, w only is entering y linearly it is sufficient that the unobserved portion of w correlated with y is in a sense controlled for through the joint probit regression.
biprobit (y = w x1) (w=z x1)
test w= `w_est'

* We can test the "endogeneity" of w by testing the significance of /athrho.  Which appears in this case to be quite significant.

* Not bad, we fail to reject there being any difference between our estimate of the coefficient on w and the true.

* This is working well though with 1,000 observations.  It seemed to be extremely ineffective with 100 observations however.

* What do we do if there are multiple endogenous binary variables?

* Let's generate similar data:
clear
set obs 1000

gen z1 = rnormal()
gen z2 = rnormal()

gen v1 = rnormal()
gen v2 = rnormal()

gen x1 = rnormal()

gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)
gen w1 = rbinomial(1, wp1)
  label var w1 "Endogenous variable 1"

gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)
gen w2 = rbinomial(1, wp2)
  label var w2 "Endogenous variable 2"

gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)
gen y = rbinomial(1, yp)

* Once again we must adjust our expectation of the coefficients
local var_unob = 1+.5^.5^2+.5^.5^2
di "Variance of unobservables =" `var_unob'
  local est_b0 = 1/(`var_unob')^.5
di "Constant coefficient =" `est_b0'
  local est_w1 = 1/(`var_unob')^.5
di "w1 coefficient coefficient =" `est_w1'
  local est_w2 = .7/(`var_unob')^.5
di "w2 coefficient coefficient =" `est_w2'
  local est_x1 = .5/(`var_unob')^.5
di "x1 coefficient coefficient =" `est_x1'

probit y w1 w2 x1
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* The majority of estimates are not working well.


* Let's try doing the joint MLE

* Previously the biprobit was sufficient.  However, biprobit only allow for a two-way probit.

* Fortunately, there is a user written command that uses simulation to approximate a multivariate probit.

* install mvprobit if not yet installed.
* ssc install mvprobit

* The syntax of mvprobit is very similar to that of biprobit

mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* Unfortunately the estimates are still too far away from the true.

* However, they are closer.

* Let's see how the fitted probability line compares with the true.
predict yp_hat
replace yp_hat = normal(yp_hat)

reg yp_hat yp, nocon
predict yp_hat1_hat

two   (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat<1 p="">      yscale( range(0 1 ) ) xlabel(0 1) legend(off) title(Predicted probability against true)




* Overall, not a bad fit.

* This might not be sufficient for many applications.

* What happens when one of the endogenous variables is continuous?

Contributed by SRQM

Joining the Stata Bloggers aggregator: An opening post

A small subset of posts from this Tumblr log has been reformatted into this longer piece and submitted for inclusion into Francis Smart’s recently created aggregator for Stata blogs, Stata Bloggers. This post is an announcement message that explains the whys, hows and what of that operation.

First, why this blog? I am using Tumblr to run two course companions on health policy and data analysis. The latter, SRQM, is the one that you are reading from. It is named after the course “Statistical Reasoning and Quantitative Methods”, which I co-teach in Paris with Ivaylo Petev.

Why do you teach with Stata? Iavylo and I chose Stata for practical reasons: we both knew how to use it, the software was available where we teach the course, and we needed a software that could be taught to large groups of postgraduate beginners. We also teach an optional course with R.

Choosing a statistical software to work with is never an easy choice, but it has recently been made simple for a large category of users, for which the choice should be R. Anthony Damico is right when he jokingly writes the following:

confidential to sas, spss, stata, and sudaan users: the eighties called. they want their statistical languages back. time to transition to r. :D [Anthony also wrote that if Stata has a better learning curve than R, “so do bicycles with training wheels ;)”]

R is also marked by the eighties in many ways, but it has indeed made other statistical software rather obsolete. Its ggplot2 library, for instance, is just much better than Stata graphs. Even if you take the time to tweak them with complex code or alternative colors, Stata plots are often ugly.

Stata yet remains a good choice for those who are learning statistical analysis next to other things and therefore have limited time to learn programming. It is quick, cheap enough for universities, and copes well with large surveys. It is also easily scriptable and open to user contributions.

For these reasons, Stata has a good user base among academics, especially in sociology, economics and political science. Nate Silver also uses it. There’s great documentation for Stata, in English as well as in different languages, like this page in Lithuanian.

There’s more cool things about Stata. The World Bank has an awesome Stata package to download its data. Its syntax is even supported by a few plain text editors like TextMate, thanks to Tim Beatty and Phil Schumm (now on GitHub), and it might get ported to the Pygments engine used at GitHub and elsewhere.

The real trouble with Stata might actually lie with the overwhelming dominance of “regression quants” in its user base. Regression analysis curbs how you think towards net effects, which is not necessarily what you need. I will probably come back to this in later posts.

Why aggregate this blog? For some time, I have been hoping to connect this blog and course to a larger community of Stata users. Neither have ever been advertised to Statalist, but the blog has gained a small readership, and the course is also public thanks to its hosting as a GitHub repository.

How is this blog aggregated? Blog aggregators work by making use of RSS feeds, which are a handy way to syndicate a website’s content. Most blogging engines offer at least one blog-wide feed. Tumblr also offers hidden tag-specific feeds. This post starts a series that will be tagged stata.

Contributed by Francis Smart

New Project – Stata Blog Aggregator

I have decided to start a Stata blog aggregator.  I believe I am the first.  You can find my preliminary efforts at http://www.stata-bloggers.com/

The purpose of a blog aggregator is to provide a system by which users are connected with various perspectives and expertise from many bloggers.  If you have favorite Stata bloggers who you think would appreciate the additional publicity from such an aggregator, please suggest that they contact me.

For the blogger, there is very little they have to do.  All I need is a feed from their blog.  The aggregation software will do the rest.

Contributed by Gabi Huiber

An R-squared for logistic regression, packaged

This morning I checked Paul Allison's Statistical Horizons blog and found a post on R^2 measures for logistic regression. It introduced me to Tjur's R^2 by way of an example, which I repackaged below:


// Reference: http://www.statisticalhorizons.com/r2logistic

// program definition
capture prog drop tjur2
program tjur2, rclass

if !inlist(e(cmd),"logit","logistic") {
   di as err "Tjur's R-squared only works after logit or logistic."
   exit 498 // Thank you, Nick Cox.
}
tempname yhat
predict `yhat' if e(sample)
local y `e(depvar)'
quietly ttest `yhat', by(`y')
local r2logistic r(mu_2)-r(mu_1)
di "Tjur's R-squared " _col(20) %4.3f `r2logistic'
return local r2logistic `r2logistic'

end

// use case
use "http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta", clear
logistic inlf kidslt6 age educ huswage city exper
tjur2

I'm not sure yet if it's worth saving this program as ado/personal/t/tjur2.ado for my future logistic regression diagnostic needs, but I haven't posted anything Stata-related in too long, so there you have it.

Contributed by Francis Smart

Interpreting the Control Function Coefficient


* Is the control function coefficient a measure of the direction and size of the bias caused by endogeneity?

* Imagine the endogenous variable w being composed of three components: 1 endogenous portion, 2 exogenous portion correlated with z, 3 exogenous portion uncorrelated with z.

clear
set obs 10000

gen z = rnormal()
gen v = rnormal()

gen endogenous = rnormal()
gen exog_with_z = z
gen exog_without_z = rnormal()

gen w = endogenous + exog_with_z + exog_without_z

* Likewise we can think of the error u as composed of both an exogenous portion and an endogenous portion (correlated with part of w)

gen u = endogenous + rnormal()*3

gen y = 1*w + 3*u

reg y w
* We can see that OLS is clearly upward biased

ivreg y (w=z)
* Instrumental variables seems to be working well

* Now for the control function
reg w z

predict v_hat, resid

reg y w v_hat
* I was thinking that the control function coefficient could generally not be interpretted directly the sign of the bias but looking at this simulation it appears I was wrong.

* I will have to do some more thinking on this.

Contributed by Solomon Hsiang

Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to us restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges.  Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes. 

In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.

Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits a spline while accounting for control variables, and plots the partial effect of the specified independent variables (adjusting for the control vars) with confidence intervals (computed via delta method). It's as easy as

plot_rcspline y x

and you get something like


where the "knots" are plotted as the vertical lines (optional).

Help file below the fold.  Enjoy!

Related non-linear plotting functions previously posted on FE:
  1. Boxplot regression
  2. Visually-weighted regression
  3. Watercolor regression
  4. Non-parametric three-dimensional regression
  5. Binned non-parametric regression
  6. Polynomial regression of any degree
S. HSIANG
SHSIANG@PRINCETON.EDU
2/2013
-------------------------------------------------------

SYNTAX

PLOT_RCSPLINE Yvar Xvar [Zvars] [if] , [NKNOTS(integer 5) KNOTPREF(string) BINS(integer 100) SAVE_data_file(string) plotcommand(string) regcommand(string) cilevel(integer 95) PLOTKNOTS NOCI]

-------------------------------------------------------

PLOT_RCSPLINE is designed to show a nonlinear relationship between Xvar and Yvar, adjusted for other variables Zvar1 ... Zvarn that are specified as controls. Confidence intervals are computed by the delta method.

PLOT_RCSPLINE uses a restricted cubic spline approach to estimating the nonlinear relationship between Yvar and Xvar. The spline is estimated with NKNOTS (5 is the default) and the spline variables in Xvar are generated by PLOT_RCSPLINE. One usage of PLOT_RCSPLINE is to generate the spline variables that will be used elsewhere in an analysis, while allowing the user to visually check that the partial response function (in Xvar) is reasonably specified using the spline.

PLOT_RCSPLINE does not automatically save the figure. Use GRAPH EXPORT or GRAPH SAVE afterwards to save the figure.

-------------------------------------------------------

Required arguments:

Yvar - the dependent variable to be plotted
Xvar - the independent variable to be plotted

Optional arguments: 

Zvars - control variables in the model

PLOTKNOTS - plots the knots as vertical lines

NKNOTS - the number of knots used in the spline, default = 5 (the number of variables representing the spline in the regression is one less than NKNOTS)

KNOTPREF() - a prefix used to name the variables generated for the spline

NOCI - suppress the plotting of the confidence intervals

CILEVEL() - Integer specifying the confidence level to be plotted. The default is cilevel(95).

REGCOMMAND() - String that is added to the regression command as an option, eg. "cluster(year)"

BINS() - Number of "bins" that are used to plot the final response funtion. Fewer bins will make the response function look "jagged", more bins will make the response look smoother. Default is 100. 

SAVE_data_file() - String that specifies a path/file to save the dataset of estimates used to plot the response function. These are the values of the response function (with SE) at each bin of the independent variable. If nothing is specified, the data used to plot the results will not be saved.

PLOTCOMMAND() - String arguments that are appending to the final plotting command. Eg. to label the plot with the title "X vs Y" and xtitle "X" type: plot_options("title(X vs Y) xtit(X)").

-------------------------------------------------------

EXAMPLES: 

set obs 1000
gen year = ceil(_n/100)
gen x=4*uniform()
gen z=2*uniform()
gen w=2*uniform()
gen e = 4*rnormal()
gen y= w + 3*z + 4*x - x^2 + .1*x^3 + e

plot_rcspline y x

plot_rcspline y x z w, nknots(7)  noci plotknots bins(20) knotpref(_x_knots_)

plot_rcspline y x z w, nknots(7) plotknots cilevel(90) plotcommand("tit(E[y|x,z,w])") regcommand("cluster(year)")

-------------------------------------------------------

*/

Contributed by Francis Smart

Regression with Endogenous Explanatory Variables


* Imagine you would like to estimate the agricultural production process.

* You have two explanatory variables.  Rain and use of Hybrid or traditional seeds.

* You are concerned that better off (in terms of SES) framers will be more likely to use Hybrid seeds.

* So should you include hybrid as an explanatory variable?

* It explains much of your variation quite well.

* The question boils down to a single question.

* Is the use of hybrid seeds correlated with rainfall?

* This only makes sense of course if people decide to use hybrid seeds if they have some knowledge of season rain in their area and they make the purchase of hybrid seeds based that or if they make the decision of if and when to purchase hybrid seeds based on what they have observed this season so far (and there is some time dependency of rainfall).


* Our model is y = rainfall*br + hybrid*bh + u0


* First let's image the case when hrybid seed choice and rainfall are correlated.

* In that case we can no longer really call even rainfall exogenous.

* That is, rainfall is only exogenous conditional upon controlling for seed choice.

* ie. cov(rainfall, u| hybrid) = 0 because we know cov(rainfall,hrybid) ~= 0

* Why, because excluding hybrid from our model gives us:

* y = rainfall*br + u1    ... where u1 = hyrbid*bh + u0

* So cov(rainfall, u1) = cov(rainfall, u| hybrid) + cov(rainfall, hybrid*bh| hybrid) ~= 0

* This implies that we should include hybrid.

* This is true whenever we have two correlated explanatory variables.  Leaving one out will cause the other explanatory variable to be attributed a portion of the effect of the alternative variable.

* The bias is probably equal to: (x1 and x2 are explanatory variables, both exogenous (when both included) with x2 being correlated with x1.

* The true model: y = x1B1 + x2B2 + u

* Call the linear projection of x1 into x2: L(x2|x1)=A*x1

* Now estimating just y = x1B1 gives us y = x1B1 + L(x2|x1)*B2 + uhat  ... substituting the linear project of x1 on x2 into x2.

* Which reduces to y = x1B1 + A*x1*B2 = (B1 + A*B2)*x1

* The bias is therefore A*B2

* Let's see this in action

*****************
* Two "exogenous" variables

clear
set obs 10000

gen x1 = rnormal()

gen x2 = rnormal() + .5*x1    // A = .5
  * Formulating x2 this way is not important for the above result.
  * Any random draw of x1 and x2 that has them covarying together will produce the same results.
  * Only this way it is easy to see that A=.5

gen u = rnormal()*5

gen y = x1*2 + x2*(-3) + u    // B1 = 2,  B2 = -3,  A = .5

* No problem here
reg y x1 x2

* But we will try to not include x2

reg y x1

* Expected coefficient estimate = B1 + A*B2 =  2 + .5*(-3) = .5

* So pretty close.  In this case the bias is equal to A*B2 = -1.5

**********
* One "exogenous" and one "endogenous" variable

* However, we are concerned that hyrbid (x2) is correlated with our error.

* Should we include it?

* The answer is a little subtle.

* If we include it then we will introduce error

clear
set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(rain + ses))
  label var hybrid "Wether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

* No problem here
reg y rain hybrid ses

* However, we do not have data on ses so:
reg y rain hybrid

* Suddenly big problems.

* But we can still get an unbiased estimate of the effect of rainfall on production in the following manner.
reg y rain

*****************
* Rainfall uncorrelated with hybrid and hybrid endogenous

* However, let's see what happens if hybrid choice is endogenous but uncorrelated with rainfall.

clear
set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(ses))
  label var hybrid "Wether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

reg y rain hybrid
  * The estimate on rainfall is good

reg y rain
  * The estimate on rainfall is not as accurate.  Why?

  * Because less of the unobserved variation is explained.



*****************
* Therefore I suggest taking the following steps:

* 1. Think through if any of your variables are endogenous.
* 2. If they are endogenous, estimate their correlation with the explanatory explanatory variables.
* 3. If there is no significant correlation then include them in the regression.  This will give you more precise estimates of the coefficients on the exogenous variables.  Do not interpret the coefficients on the endogenous variables.  Also, do not interpret the F-test for rejection of the model since you know that your endogenous variables are picking up part of the unobserved error.  If you instead want to test model fit, test the joint significance of all of your exogenous variables together with a Wald test.
* 4. If there is a correlation, then it is not clear what to do.  If you exclude endogenous variables correlated with the exogenous variables then this will introduce endogeneity in the the "exogenous variables".
* 5. However, if you introduce the endogenous variable as a control in the equation then you will introduce a level of bias and inconsistency.  I don't think there are any clear rules asking what to do in these circumstances.

Contributed by Francis Smart

Non-Parametric Regression Discontinuity


* I recently went to an interesting seminar today by Matias Cattaneo from the University of Michigan.

* He was presenting some of his work on non-parametric regression discontinuity design which I found interesting.

* What he was working on and the conclusions of the paper was interesting but even more interesting was a release by him and coauthors of a Stata package that implements RD design for easy

* net install rdrobust, from(http://www-personal.umich.edu/~cattaneo/rdrobust) replace

* Regression discontinuity is a technique which allows identification of a localized effect around a natural or structured policy discontinuity.

* For instance, if you wondering what the effect federal grants have on college attendance, then you may be concerned that just looking at those students who are eligible for federal grants in contrast with those who are not eligible will be problematic because students who are eligible (low income) may different than those who are not eligible for the grant (not low income).

* The RD argument is that if individuals do not, as a response to the grant being available, move their reported income level to become eligible for the grant than those who are near the cut off for the grant and those not near the cut off will be fundamentally very similar.

* This may occur if for instance the income cut-off for the grant is unknown.

* So even if students are systematically under-reporting their income, they are not doing it aware of the actual cut off, so the students sufficiently close, above and below the cut off are arguably the "same" or drawn from the same pool except that one group received the program and another group did not.

* The previous post deals some with assuming a linear structure of the underlying characteristics.

* http://www.econometricsbysimulation.com/2012/08/sharp-regression-discontinuity-example.html

* However, the more interesting case (potentially) may be when we assume a nonlinear response to the income in our dependent variable.

* But before going there let's think about what this method boils down to.

* Like all identification methods in statistics or econometrics when we do not have experimental data, identification of an effect is driven by some exogeneity argument.

* That is, x causes y and is unrelated to u (the error).  In the case when u may be correlated with the error the use an exogenous variable to force the movement in the variable of interest may be sufficient to identify a causal effect.

* In this case, clearly it is not enough to simply see what the average y response (GPA, attendance, graduation rates, whatever) is to a change in grant level because those who receive the grants are systematically different from those who do not.

* However, because the cut off for receiving the grant is unknown, around the cut off the two samples who receive the grant and who do not can arguably be considered the same.

* Thus, we could say that the unknown position of the cut off is the random exogenous variable which near the cut off forces some students into the group that receives the grant and some students into the group that does not.

* Let's imagine some non-parametric relationship between income and performance:

clear

set obs 10000

gen income = 3^((runiform()-.75)*4)
  label var income "Reported Income"

  sum income
gen perf0 = ln(income) + sin((income-r(min))/r(max)*4*_pi)/3 + 3
  label var perf0 "Performance Index - Base"

scatter perf0 income


* Looks pretty non-parametric

* Let's add in some random noise
gen perf1 = perf0 + rnormal()*.5
  label var perf1 "Performance Index - with noise"

scatter  perf1 income

* Using the user written command rcspline, we can see the local average performance as a function of income.

* ssc install rcspline

rcspline perf1 income,  nknots(7) showknots title(Cubic Spline)
* I specify "7" knots which are the maximum allowed in the rcspline command.



* The spline seems to fit the generated data well.

* Now let's add a discontinuity at .5.

gen grant = income&lt;.5
sum grant

* So about 50% of our sample is eligible for the grant.

* Now let's add the grant effect.

* First let's generate an income variable that is centered at the grant cut point.
gen income_center = income-.5

gen perf2 = perf1 + .5*grant - .1*income_center*grant
  * Thus the grant is more effective for students with lower income.
  label var perf2 "Observed Performance"

**** Simulation done: Estimation Start ****

rcspline perf2 income,  knots(.15 .25 .35 .37 .4 .45 .5 .55 .6 .65 .75 .85 1.1 1.25 1.5) title(Cubic Spline)
* This is obviously not the ideal plot and I have had some difficulty finding a command which will generate the plot that I would like.



* However, we can see that there does appear to be "something" going on.

reg perf2 income grant
* We can see that our itial estimate of the effect of the grant is entirely wrong.

* It appears so far that the effect of the grant on performance is actually hindering performance (which we know is false).

* Now, let's try our new command rdrobust

rdrobust perf2 income_center
* The default cut point is at 0.  Thus using income_centered works.

* Though this estimate is negative and thus seems the reverse of what we would expect, it is actually working quite well.

* That is because regression discontinuity is trying to identify the effect of the discontinuity on the outcome variable with the default assumption that at the discontinuity the forcing variable is becoming 1.

* In this case however, the discontinuity is really driving the grant to be equal to zero.

* Thus we must inverse the sign on the rd estimator in order to identify the true effect in this case.

* Alternatively, we could switch the sign of income.

gen nincome_center = income_center*(-1)

rdrobust perf2 nincome_center

* rdrobust is a newly designed command that has some extra bells and whistles that other regression discontinuity commands have as well as some oddities.

* I would suggest also looking to the more official stata command rd (ssc install rd)
rd perf2 nincome_center

* This command is nice because it estimates many bandwidths through the mbw option.

* The default mbw is "100 50 200" which means, use the 100 MSE (mean squared error) minimizing bandwidth, half of it and twice it.

* We can plot our estimates of the treatment effect using a range of bandwidths.

gen effect_est = .
  label var effect_est "Estimated Effect"

gen band_scale = .
  label var band_scale "Bandwidth as a Scale Factor of Bandwidth that Minimizes MSE"


forv i = 1/16 {
  rd perf2 nincome_center, mbw(100 `=`i'*25')
    if `i' ~= 4 replace effect_est = _b[lwald`=`i'*25'] if _n==`i'
    if `i' == 4 replace effect_est = _b[lwald] if _n==`i'
    replace band_scale = `=`i'*25'     if _n==`i'  
}
gen true_effect = .5
  label var true_effect "True effect"

two (scatter effect_est band_scale) (line true_effect band_scale)



* We can see around the 100% MSE bandwidth estimates are fairly steady though they dip a tiny bit.

Contributed by Francis Smart

2SLS with multiple endogenous variables


* I am wondering if when using 2SLS you must use a multivariate OLS in the reduced form or if you can just do each individual endogenous variable.

* Let's see!

clear
set obs 10000

* First generate the instruments
gen z1 = rnormal()
gen z2 = rnormal()

* Now the error.  u1 and u2 are the parts of w1 and w2 correlated with the error u.
gen u1 = rnormal()
gen u2 = rnormal()
gen u3 = rnormal()*3

gen w1 = 1*z1 + .5*z2 + rnormal() + 2*u1
gen w2 = -.5*z1 + 1.5*z2 + rnormal() - 2*u2 + .2*u1

gen u = u1 + u2 + u3
gen y = 4 + w1*1 + w2*1 + u

* We can see because u is correlated with w1 and w2, OLS will be biased and inconsistent.

reg y w1 w2

* Instrumental variable regression could solve this problem

ivreg y (w1 w2 = z1 z2)

* Let's see about our 2SLS (which should be the same as IV)

reg w1 z1 z2
predict w1_hat

reg w2 z1 z2
predict w2_hat

reg y w1_hat w2_hat
* It seems that 2SLS using separate regressors is producing the same results.

* This is probably because in SOLS if you use the same regressors then I think the coefficients are the same but the standard errors may be adjusted for cross equation correlations (I think I recall).

Contributed by Francis Smart

Non-Parametric PDF Fit Test


* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted by my previous post we cannot use standard independence based draw reasoning in order to test model fit.

* The following command will simulate random draws in the distribution being tested and see how closely they fit to exact pdf draws.

* It will then use that information to see if we can reject the null that the observed distribution is the same as the null distribution.

cap: program drop dist_check
program define dist_check, rclass

  * The syntax command parses the input into useful macros used for later calculations.
  syntax  varlist, Tgenerate(numlist >= 100) [Dist(string)]
 
 
  if "`dist'"=="" local dist="normal"
 
  if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
  if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
  if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

  di "Simulate `tgenerate' draws"
 
  * Construct a t-vector in Mata to store estimation results.
  mata: tdist = J(`tgenerate', 1, .)
 
  *******
  * This will randomly draw from the distribution being tested a number of times equal to tgenerate.
  *******
 
  preserve
  qui: drop if `varlist' == .
 
  forv i=1(1)`tgenerate' {
 
    tempvar x zx z p d

    if "`dist'"=="normal" qui: gen `x' = rnormal()

if "`dist'"=="poisson" {
      qui: sum `varlist'
      qui: gen `x' = rpoisson(r(mean))
    }

if "`dist'"=="uniform" qui: gen `x' = runiform()

    sort `x'
    qui: gen `p' = (_n-.5)/_N

if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
 * The invpoisson distribution in Stata is misspecified.  1-p is neccessary.
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: gen `z' = _n/_N
      qui: gen `d' = (`z'-`x')^2
    }
   
qui: reg `d'
    local t = _b[_cons]/_se[_cons]
    mata: tdist[`i', 1]=`t'

drop `x' `z' `p' `d'
cap drop `zx'
  }
 
  * From the above loop we should have a vector of t values.
  * We can use that vector to construct a Confidence Interval used observed when true t values as cutoff points.
 
  mata: tsorted = sort(tdist, 1)

  * One tailed test
  local c90 = floor(`tgenerate'*.90)+1
  mata: st_local("t90", strofreal(tsorted[`c90']))
 
  local c95 = floor(`tgenerate'*.95)+1
  mata: st_local("t95", strofreal(tsorted[`c95']))

  local c99 = floor(`tgenerate'*.99)+1
  mata: st_local("t99", strofreal(tsorted[`c99']))
 
  * Two Tailed
    * 90% CI
  local c90U = floor((`tgenerate'+.5)*.95)+1
  mata: st_local("t90U", strofreal(tsorted[`c90U']))
 
  local c90L = ceil((`tgenerate'+.5)*.05)-1
  mata: st_local("t90L", strofreal(tsorted[`c90L']))
 
    * 95% CI
  local c95U = floor((`tgenerate'+.5)*.975)+1
  mata: st_local("t95U", strofreal(tsorted[`c95U']))

  local c95L = ceil((`tgenerate'+.5)*.025)-1
  mata: st_local("t95L", strofreal(tsorted[`c95L']))

    * 99% CI
  local c99U = floor((`tgenerate'+.5)*.99)+1
  mata: st_local("t99U", strofreal(tsorted[`c99U']))

  local c99L = ceil((`tgenerate'+.5)*.01)-1
  mata: st_local("t99L", strofreal(tsorted[`c99L']))

  ** Now we do the estimation
    tempvar x zx z p d

qui: gen `x' = `varlist'

    sort `x'
    qui: gen `p' = (_n-.5)/_N

  * We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: sum `x'
      qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
      qui: gen `d' = (`z'-`x')^2
    }

* This is the regression of interest.
    qui: reg `d'
    local t = _b[_cons]/_se[_cons]

* Now we compare our t with the ts that were drawn when random variables were drawn from our distribution.
di _newline "Estimated t:  `: di %9.3f `t''" _newline

di "One-way Analysis"
di "  CI   (1%) :    0.000   to `: di %9.3f `t99''"
di "  CI   (5%) :    0.000   to `: di %9.3f `t95''"
di "  CI   (10%):    0.000   to `: di %9.3f `t90''"

    mata: psearch = abs(tsorted:-`t')
mata: a = 1::`tgenerate'
mata: b = psearch :== min(psearch)
    mata: st_local("position", strofreal(sum(b:*a)))
local p1 = 1-`position'/`tgenerate'
local p2 = min(`position'/`tgenerate',1-`position'/`tgenerate')*2

di "One-sided  p:`: di %9.3f `p1''   (`position' out of `tgenerate')"

di _newline "Two-way Analysis"
di "  CI   (1%): `: di %9.3f `t99L''    to `: di %9.3f `t99U''"
di "  CI   (5%): `: di %9.3f `t95L''    to `: di %9.3f `t95U''"
di "  CI  (10%): `: di %9.3f `t90L''    to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"

    return scalar p1 = `p1'
    return scalar p2 = `p2'

  restore
end

* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit to fit "better" pdfs from different distributions.
* Thus the estimated t is smaller for some distributions.
* For this reason I have included a two sided confidence interval.

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
  syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
  clear
  set obs `obs'
  gen x = `rdraw'
  dist_check x, t(`reps') dist(`dist')
end

* Easily generates data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
 
* We should reject at the 10% level for both one sided and two sided confidence intervals.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum
* mean r1_10 and  mean r2_10 are rejection rates for the one-tailed and two-tailed test respectively.

* Becasue the distribution is truelly the null we are rejecting the null only 10% of the time at the 10% level.

* This looks great.

* Now let's see about the test's power to reject the null when the true distribution is not the null.

checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

* I think there might be some slight errors.  I hope this post is useful though I would not recommend using this particular command as there are more effective and better established commands already developed for testing distribution fit assumptions.