Stata Blogger?

Are you a Stata blogger? Join the aggregator!

Contributed by Francis Smart

A Prime Sieve for Stata As Well

* I feel like I have been neglecting Stata since most of my posts have been in R.

* But like speaking different languages it takes effort to switch mental states and programming styles from Stata to R.

* For really, there is a great deal of differences between programming in Stata and in R.

* Both are powerful but excel in different strengths.

* Stata for providing a unified, cutting edge data analysis tool, and R for providing a powerful and flexible tool with a vibrant and growing open source community.

* As a means of comparison I will program up a prime sieve in Stata that accomplishes the same thing as the prime sieve I programmed in my previous post in R:

* http://www.econometricsbysimulation.com/2013/05/my-prime-sieve-homage-to-yitan-zhang.html

* The way a prime sieve works is that it goes through a number line starting at 2 up until some pre-specified maximum.

* In each step the prime sieve removes from the available numbers the first number in the list and keeps that number as a "prime" and removes all remaining numbers in the list divisible by that number.

* Since all smaller primes cannot be divisible by a larger prime, starting from the smallest is the natural way of performing a sieve.

clear

* First lets set out max that we will search for.
set obs 1000

* Now we will set our index
gen i=_n

* Now we generate our global that we will keep our primes in
gen prime=0
replace prime=1 if _n==1

* Sorting will place our primes at the end of the list
sort prime i

* We specify the stopping condition for the sieve that the first element of the list is a prime.
local nextprime = prime[1]

* Loop through all of the elements until our stopping condition is achieved.
while `nextprime'==0 {
  * The first element in our sort will be out prime
  qui replace prime=1 if _n==1

  * Drop anything divisible by that number
  qui drop if i/i[1]==int(i/i[1]) & prime==0

  * Sort to place primes at the end of our list
  sort prime i

  * Set the indicator nextprime to be equal to the prime value of the next value in the list.
  local nextprime = prime[1]

  count if prime==0
}

* Looking good.
hist i, title("Finding the primes for the first 1000 numbers is no sweat") bin(20)



hist i, title("Finding the primes for the first 100,000 is harder") bin(20)



* Unlike with my R post, I did not calculate the primes for the first 1,000,000 numbers.

* The above code took a while with the 100,000.  It probably runs much slower than R but that is to be expected.

* The above program is not the optimal way of writing such routines.  They should ideally be done using Mata.

* However, despite my best efforts I have been unable to find a sufficiently comprehensive text to learn Mata syntax from.

* The result is programs that run inefficiently.  Which usually in not a big deal because programming speed is typically the least important feature of most Stata problems.

* More typically the most important feature is Stata's ability to handle single data sets (such as panel data) well and efficiently.

* I hope you can see by the above code some of the differences between Stata and R.

* Some thing are intuitive and easy in each language while they seem convoluted or unnecessarily challenging in the other.  Pick your poison, but try not to be wed to it.  Some software is good for some things and other good for other thing.  Be ready to change and learn as the need arises.

Contributed by SRQM

R, Stata and matching additional learning costs

Francis Smart recently pointed to an important difference between R and Stata from a teaching perspective, which has to do with the additional learning costs of vectorization in R over the single-dataset orientation of Stata.

Stata makes it easy to manipulate names, or more specifically, variable names, as in a dataset with three variables for social expenditure called party1 party2 party3. This is common to many empirical preprocessed datasets.

    // example
    mvdecode party*, mv(999)

Furthermore, Stata works like an accountant’s book, so all variables belong to a same data object that never needs to be called beyond loading. This naturally suppresses a lot of possibilities, compensated in part by macros and scalars.

    // example
    loc regressors "age sex"

Macros in particular then branch with loops like the forval and foreach commands to allow more complex data processing. At that level of use, the software is flexible enough for most applied data cleaning.

    // example
    forval i = 1/3 {
      replace socx`i' = socx`i' / 10^6
    }

To access matrix notation, the Stata user needs to move to Mata syntax, while R immediately offers the user to manipulate objects through vectorization. Thinking in these terms is more demanding as there are more possibilities for errors, starting with calls to undeclared objects.

I teach both R and Stata. My experience with social science students is that the additional learning costs of R syntax need to be matched with other benefits to become valuable to them. To me, these benefits lie primordially in the more diverse array of data that R allows to access.

Contributed by Francis Smart

Quandl Package – 5,000,000 free datasets at the tip of your fingers!

# Yes, you read that correctly and no Quandl (http://www.quandl.com/) did not pay me anything.

# Quandl is a new database management tool which seeks to become the place to find datasets.  They boast of having over 5x10^6 data sets available though after examining them, I have decided that they are not entirely what everybody might think of as data sets.  That is, each unique indicator is considered an independent data set.  This helps them to seem to have a ginormous quantity of data sets.

# That said, they are not wrong in calling each indicator its own data set since much of their data, like financial data or government data is collected by disjoint teams.  The scope of their ambition is fantastic yet it is doable and frankly someone needed to do it.

# Currently, data seekers can access the Inter-University Consortium for Political and Social Research (IPCSR).  This great resource is composed mostly of cross section and panel data sets which are great for much analysis but IPCSR resricts access to data to member universities.  In addition, the kind of data that Quandl is indexing is a lot of data that would not show up on IPCSR database.  In addition, Quandl is integrating an automated structure that will be self-updating.

# For an example of how Quandl is a good step ahead of the game take a look at this search quiery:

http://www.quandl.com/search/lansing,%20michigan

# In this search, I searched out Lansing, Michigan where I live and returned results of data for the last decade or earlier up to today from sources such as the Federal Reserve and the US Energy Information Administration.

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies?q=Lansing%2C+Michigan&permit%5B0%5D=AVAILABLE

# In constrast when queirying ICPSR, I found a few databases listed but they were historical databases that spanned back generally between 30 and 70 years.  That said both sources could provide valuable information depending upon what I am interested in modeling.

# Quandl is very clever for a number of reasons.  One of these reasons is that they have simultaneously released 8 software packages that can be used in a number of statistical packages such as R, Stata, and Excel.

# In order to demonstrate the use of Quandl I will grab a few data sets from the Lansing quiery drawn from the Federal Reserve.

install.packages("Quandl")
library(Quandl)

# Employment numbers (thousands of people") for Lansing, Michigan
NonFarm = Quandl("FRED/LANS626NAN")
CivLaborForce = Quandl("FRED/LANS626LFN")
PerCapitaIncome = Quandl("FRED/LANS626PCPI")

# Now let's combine the data so that we can related data values.
Labor = merge(NonFarm, CivLaborForce, by="Date")
Combined = merge(Labor, PerCapitaIncome, by="Date")
colnames(Combined) = c("Date", "NonFarm", "CivLaborForce", "PerCapitaIncome")
  # Notice that though our data had many more data points, the default option of merge only keeps data that exists in both data sets.  In this case, it is per capital income that has the least number of data points.

# Let's see if we can predict income as a function of employment:
summary(lm(PerCapitaIncome~NonFarm+CivLaborForce, data=Combined))

# Our naive prediction as a result of this is that as the Civilian Labor Force increases, wages rise.  This is of course a naive example ignoring completely issues of causation and endogeneity not to mention probable random walks and other challenging features of this kind of data.

# The overall take away though, should be "cool", I think.  Maybe this data bank does not provide information currently on many issues of interest to those looking for data.  But it does make things easier and self-updating, which are great features.

Contributed by Kung Hiu

Crack Limited Dependent Variable’s Regression Using Stata

Neophyte usually finds it difficult to crack the problem they meet when they have some data-that is not randomly collected or truncated or censored-to analysis. But in the real world, these kinds of problems always exist.

This post will dig a little deeper in this area by presenting limited dependent variable’s type and relevent Stata cracking commands. The methods are:

Truncated Regression

Censored Regression (Tobit model)

Incidental Trucation Regression (sample selection model, Heckman model)

Treatment Effects Model

Instrumental Variable Regression

1, Truncated Regression

Truncated regression deals with truncated data. Then, what is truncated data? It shows like graphs below:

Left truncation(do neglect the numbers in the middle and on the axises):

20130501210935

 

Right truncation:

20130501210923

 

Two sides truncation:

20130501210858

 

If you meet this kind of data distribution, no matter what the actual specific situation you meet, you can use Stata command “truncreg” to solve the problem.

truncreg y x1 x2 x3, ll(#)  (lower limit, Left truncation)

truncreg y x1 x2 x3, ul(#)  (upper limit, Right truncation)

truncreg y x1 x2 x3, ll(#) ul(#)  (lower and upper limits, Two sides truncation)

# is the value of where the truncation happens.

The # should be chosen carefully because it could affect the regression outcomes. The number of obs that out of the # threshold does not affect the regression outcome.

2, Censored Regression (Tobit model)

Censored regression deals with censored data. Then what is censored data? If the data is censored at the upper limit of 5000, then any value that should be larger than 5000 shows to be 5000 in the dataset.

For this kind of data, we use “tobit” command to analysis.

tobit y x1 x2 x3, ll(#)  (lower limit, Left censored)

tobit y x1 x2 x3, ul(#)  (upper limit, Right censored)

tobit y x1 x2 x3, ll(#) ul(#)  (lower and upper limits, Two sides censored)

The # should be chosen carefully because it could affect the regression outcomes. The number of obs that at the # threshold does affect the regression outcome.

3, Incidental Trucation Regression (sample selection model, Heckman model)

Incidental trucation regression, or sample selection, means the data have some kind of bias. Kinds of bias are elaborated in books of Epidemiology. To correct the bias and get the effect of factors on the whole population, we use “heckman” command.

heckman y x1 x2 x3, select(z1 z2)    /*using MLE method, and the dependent variable of selcetion equation is y*/

heckman y x1 x2 x3, select(z1 z2) twostep    /*using twostep method, and the dependent variable of selcetion equation is y*/

heckman y x1 x2 x3, select(w=z1 z2)    /*using MLE method, and the dependent variable of selcetion equation is w. Variable w won’t show up in the main(second step, if twostep) regression equation, even you select it purposely.*/

Things to remember, any un-observed y should be set into missing.

4, Treatment Effects Model

Treatment effects model is derived from heckman model and mainly used to realize the mission of program evaluation.

It has two merits compared with Heckman model mentioned above: 1, In the heckman model, only part of dependent variables can be observed. 2, In the heckman model, intervention variable, which is the potential dependent variable of selection equation, cannot get into the main regression equation.

The command for this model is “treatreg”:

treatreg y x1 x2 x3 w, treat(w=z1 z2) [twostep]   /*w is usually the intervention variable, which defines the treatment group and control group*/

The rho variable in the output is important, the hypothesis test of it in the bottom of the output tells us whether it’s appropriate to choose this model. (It should be non-zero.)

5, Instrumental Variable Regression

Instrumental variable regression is a method to solve the problem that the independent variable have contemporaneous correlation with error term, when there are some explainary items are not included in the equation. So, this method uses an instrumental variable to solve this problem. This IV is hard to determine because it needs to fit the criteria “the instrument should be correlated with endogenous explanatory variables, but cannot be correlated with the error term”.

Here is the command:

ivregress 2sls/liml/gmm y x1 x2 x3 (x4=iv1 iv2 iv3)  /*x4 is the variable that correlated with the error term*/

ivprobit y x1 x2 x3 (x4=iv1 iv2 iv3), [twostep]   /*using MLE or twostep method*/

Also, it is recognized that because it is hard to find the appropriate instrument, it’s fine to use treatment effects model to insteat IV method.

All the above is the five methods I would like to introduce, later I may revise this post if I find any mistake. If the reader of this article finds any mistake, or has any suggestion, please kindly let me know, thank you!

Contributed by Kung Hiu

Propensity Score Match

This issue is a bit hard, I have been reading a book for several weeks(a short time period every time), and I know the theory behind that is so complex… However, we can solve the complex problem by just several  commands in STATA, so powerful a software, Ha…

First, you should have the STATA software and program adds-on. By

help pscore

install (one of the suits, usually the newest one)

Then, after you load the dataset, you can begin to calculate the PSM.

The example given by ”Microeconometrics: Methods and Applications” used global command to set the global variables, if you don’t like it, you can just jump this global part and use real numbers or words names in where there is a $.

global breps 200      /*set 200 to variable breps, and indicate the times bootstrap process will repeat*/

global vars name1 name2 …namen  /*just to save efforts of repeating writing the variable names*/

Next, we come to the core processors of PSM calculating. Pscore and PSM regression.

Pscore:

pscore TREAT $vars, pscore(myscore) comsup blockid(myblock) numblo(5) level(0.005) logit

or

pscore TREAT $vars, pscore(myscore) blockid(myblock) numblo(5) level(0.005) logit

TREAT is the var which indicats where the obs is a treated or a controlled. To perform regression on it, is based on the theory that all the obervations should have the same probability to be treated or controlled. In the other word, it should be a independent variable.

Pscore, blockid, numblo, level, logit(the default is probit model), means to get pscore, block indicator, set number of blocks, and significant level when selecting varialbes, also, by the method of logit model.

The comsup, which is the sole difference between the two command, is used to set whether to use the common supported part, the part that both group, treat and control, share.

PSM regression:

set seed 10101
attnd RE78 TREAT $vars, comsup boot reps($breps) dots logit

set seed 10101
attr RE78 TREAT $vars, comsup boot reps($breps) dots logit radius(0.001)

set seed 10101
atts RE78 TREAT, pscore(myscore) blockid(myblock) comsup boot reps($breps) dots

set seed 10101
attk RE78 TREAT $vars, comsup boot reps($breps) dots logit

The four groups of commands all perform PSM regression, but use different method, specifically, Nearest neighbor matching, Radius matching for Radius=0.001, Stratification Matching, Kernel Matching, respectively.

Boot reps($breps) is showing that we are using bootstrap method and repeat $breps times. Dots is interesting dots in the screen if you like to see. Attention should be taken that different methods using different numbers of observations.

 

Ok, all above is a simple, short command to perform PSM. Enjoy your research!

 

Contributed by Kung Hiu

Unveil the truth of DID

Have to say, that, DID, which stands for differnce in difference, one of commonly used economitrica method in program evaluation, is such an easy method.

I have been deceived by it for such a long time…

Here is the command of applying DID in STATA:

regress EARNS Tdyear2 TREAT dyear2

OLS regresssion, with some dummy variables. Among the command, EARNS stands for the outcome you want to measure, and TREAT and dyear2 stand for the treat and time, respectively. The KEY variable Tdyear2, is the multiplication of TREAT and dyear2, which, therefore distinguish the treat group on the second year with all the other 3 parts of the observations.

So, the paramater of Tdyear2, is the treat effect.

If you use the command of “regress EARNS Tdyear2 TREAT dyear2, robust”, then you can calculate heteroskedastic-robust standard errors.

Things to remember:

DID的基本假设:
1, 共同趋势假设:假设不同组别的时间效应都是相同的。
2, 假定两组的合成部分在变动前后均是稳定的。

Contributed by SRQM

“By using Excel, which was never designed for scientific research, they institutionalized mouse…”

“By using Excel, which was never designed for scientific research, they institutionalized mouse clicks and other untraceable actions into a scientific workflow, which must be avoided since it makes explaining to others (and to oneself) how to replicate the findings next to impossible and too easily introduces inadvertent mistakes.”

-

Period. The replication was carried with R, and additional analysis (easily found online) was done with Stata.

Victoria Stodden at What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine — The Monkey Cage

Contributed by Francis Smart

The effect of non-convergence on MLE estimates


* Maximum likelihood proceedures have become widely used to solve a variety of econometric problems.

* Unfortunately there is no guarantee that these proceedures will yeild a single solution which satisfies the convergence criteria of the maximizing function.

* This might occur for reasons difficult to detect such as localy flat spots or discontinuous areas.

* Maximization proceedures are usually evaluated based on their 1. efficiency (speed of convergence) or 2. on their robustness at detecting optimal values.

* The problem is that sometimes in simulation we need to limit the time a MLE proceedure takes in attempting to find a solution.

* What effect does that limitation result in and what do we do with estimates that result from non-convergence?
* 1. Keep them or 2. throw them out

* This simulation will explore both options

* In this simulation we will fall back on the widely used estimator which is equivalent when the standard errors are not structurally estimated to the OLS estimator.

* This is the "normal" regression estimator.  IE the MLE maximization that allows for linearly modeled heteroskedasticity.
cap program drop myNormalReg
program define myNormalReg
  args lnlk xb sigma2
  qui replace `lnlk' = -ln(sqrt(`sigma2'*2*_pi)) - ($ML_y-`xb')^2/(2*`sigma2')
end

* First let's generate a sample data set
clear
set obs 300

* I am going to try to make the problem hard to solve by including both addative and multiplicative error.
gen u = (runiform()-.5)
  * I made this error small because actually when the error is small it is harder to estimate the variance of the error.
  * It takes a little work with simulations to generate data which does not converge.

gen v1 = rnormal()
gen x1 = runiform()-.5

gen v2 = rnormal()
gen x2 = runiform()-.5

gen y = 3 + (1+v1)*x1 + (1+v2)*x2 + u

reg y x1 x2

ml model lf myNormalReg (reg: y=x1 x2) (sigma2:)
ml maximize

* This is the more efficient model because it is explicitly modeling the error.
gen x1_2 = x1^2
gen x2_2 = x2^2

ml model lf myNormalReg (reg: y=x1 x2) (sigma2: x1_2 x2_2)
ml maximize

* It seems that typically this model frequently converges.

* Let's see if we can't dilute the maximization:
gen x1x2 = x1*x2

gen x1abs = abs(x1)
gen x2abs = abs(x2)

ml model lf myNormalReg (reg: y=x1 x2 x1_2 x2_2 x1x2 x1abs x2abs) (sigma2: x1 x2 x1_2 x2_2 x1x2 x1abs x2abs)
ml maximize, iterate(100)

* It looks to me about half the time I run this code it does not converge within a 100 iterations.

* Now lets specify our functions we will use to test differences in results depending upon our method of dealing with convergence.

* This program is just a condensation of the above code.
cap program drop sim_converge
program define sim_converge

  clear
  set obs 300

  gen u = (runiform()-.5)

  gen v1 = rnormal()
  gen x1 = runiform()-.5

  gen v2 = rnormal()
  gen x2 = runiform()-.5
  gen y = 3 + (1+v1)*x1 + (1+v2)*x2 + u
  gen x1_2 = x1^2
  gen x2_2 = x2^2

  ml model lf myNormalReg (reg: y=x1 x2) (sigma2: x1_2 x2_2)
  ml maximize

  gen x1x2 = x1*x2

  gen x1abs = abs(x1)
  gen x2abs = abs(x2)

  ml model lf myNormalReg (reg: y=x1 x2 x1_2 x2_2 x1x2 x1abs x2abs) (sigma2: x1 x2 x1_2 x2_2 x1x2 x1abs x2abs)
  ml maximize, iterate(`1')
  * The only difference is that iterate is specified by the user.
end

* Leaving the first argument blank will not specify a maximum convergence iteration.
sim_converge
sim_converge 50

* Let's first define what we would like to save from the MLE.
* Yes, I am going to use a forbidden global :)
gl savelist ic=e(ic)
  * e(ic) is the macro in which the number of iterations used is saved.

 foreach i in reg sigma2 {
   foreach v in x1 x2 x1_2 x2_2 x1x2 x1abs x2abs {
   gl savelist $savelist `i'`v'=[`i']_b[`v']
   }
 }

 * Let's see what our savelist looks like:
 di "${savelist}"

 * looking pretty good.

simulate ${savelist} , rep(100) seed(32): sim_converge 50
tab ic
* In my simulation 51 times the MLE did not converge by the 50th iteration.

* Let's see if there are systematic differences between estimates.
sum if ic==50
sum if ic<50 p="">  * We can see that if the estimator did converge then it is much more precise (smaller sd) than in the cases when it did not converge.
 
  * The mean estimates of regx1 and regx2 and sigmax1_2 and sigmax2_2 are much closer to 1 which is the true parameter values.
 
* Let's try it again setting convergence at a higher bar:
simulate ${savelist} , rep(100) seed(32): sim_converge 250
tab ic
sum if ic==250
sum if ic<250 p="">
* Raising the max iteration does not lead to any of the observations converging.

* This is problematic because we want to know if there is a systematic difference in the draws for the estimates which converged and those that did not.

* By the results so far we might be tempted just to include the results of the iterations that did converge.

* First off let's see if the estimates that converged quickly are better or worse than those that converged more slowly.

recode ic (1/14=0) (15/49=1), gen(grp)

bysort grp: sum regx1 regx2
anova regx1 grp if grp<250 p="">* It seems there is no detectable difference in the means for those observations that converged more quickly than 15 iterations than those that converged more slowly.

* This implies, assuming the results are generalizable that truncating the simulation to only the results that converge might produce unbiased estimates.

* We should run the simulation again with more repetitions in order to confirm this.

simulate ${savelist} , rep(500) seed(32): sim_converge 50
  tab ic
 
/* tab ic

      e(ic) |      Freq.     Percent        Cum.
------------+-----------------------------------
         10 |          1        0.20        0.20
         11 |         18        3.63        3.83
         12 |         24        4.84        8.67
         13 |         48        9.68       18.35
         14 |         57       11.49       29.84
         15 |         50       10.08       39.92
         16 |         24        4.84       44.76
         17 |         10        2.02       46.77
         18 |         18        3.63       50.40
         19 |         10        2.02       52.42
         20 |          4        0.81       53.23
         21 |          6        1.21       54.44
         22 |          1        0.20       54.64
         23 |          3        0.60       55.24
         24 |          4        0.81       56.05
         25 |          2        0.40       56.45
         26 |          3        0.60       57.06
         29 |          1        0.20       57.26
         32 |          1        0.20       57.46
         33 |          1        0.20       57.66
         50 |        210       42.34      100.00
------------+-----------------------------------
      Total |        496      100.00
*/

  sum if ic==50
  sum if ic<50 p="">
  recode ic (1/14=0) (15/49=1), gen(grp)

  bysort grp: sum regx1 regx2

/*
-> grp = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       regx1 |       148    .9918629    .1076053    .732878   1.296374
       regx2 |       148    .9807507     .112081    .684716   1.447506

-----------------------------------------------------------------------------------------
-> grp = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       regx1 |       138    .9967074    .1183514   .7492483   1.318624
       regx2 |       138    .9913069    .1161812   .6711526   1.407439

-----------------------------------------------------------------------------------------
-> grp = 50

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       regx1 |       210     .814391    .3268973   .0518116   1.590542
       regx2 |       210    .8106436    .3278075   .0707162   1.775239

*/
  anova regx1 grp if grp<50 p="">
/*
                         Number of obs =     286     R-squared     =  0.0005
                           Root MSE      = .112917     Adj R-squared = -0.0031

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  .001675983     1  .001675983       0.13     0.7172
                         |
                     grp |  .001675983     1  .001675983       0.13     0.7172
                         |
                Residual |  3.62106459   284  .012750227  
              -----------+----------------------------------------------------
                   Total |  3.62274057   285   .01271137  
*/

* We can see that even when the sample size is larger (about 140 per iteration group) there is no discernable difference between those draws that converge before the first 15 iterations and those that converge after.

* If we assume that for those draws that do not converge within 250 draws are sampling from the same probability of convergence distribution then this evidence suggests that rate of convergence is independent of the actual estimates and therefore it might be safe to exclude observations in which convergence did not occur.


* Now finally what we might interested in is seeing how our estimates change (for the values in which convergence is achieved) as we include more and more estimates by way of increasing our threshold of max iterations.

* How do we do this?

* Well, this might take a little while but we basically loop through the simulation saving the results it terms of means and standard deviations from each run.

forv i = 15(5)35 {
  simulate ${savelist} , rep(100) seed(32): sim_converge `i'
  sum regx1 if ic<`i'
    global mean_x1_`i'=r(mean)
    global var_x1_`i'=r(sd)^2
  sum regx2 if ic<`i'
    global mean_x2_`i'=r(mean)
    global var_x2_`i'=r(sd)^2
}
* By the way this is a highly redundant and inefficient method.

clear
set obs 5
gen mean_x1 = .
  label var mean_x1 "Mean of x1 estimates"
gen mean_x2 = .
  label var mean_x2 "Mean of x2 estimates"
gen var_x1 = .
  label var var_x1 "Variance of x1 estimates"
gen var_x2 = .
  label var var_x2 "Variance of x2 estimates"

gen i = .
  label var i "Max # iterations"

* Save the results as variables
forv i = 1(1)5 {
  local ii = 10+`i'*5
  replace mean_x1 = ${mean_x1_`ii'} if _n==`i'
  replace mean_x2 = ${mean_x2_`ii'} if _n==`i'
  replace var_x1  = ${var_x1_`ii'}  if _n==`i'
  replace var_x2  = ${var_x2_`ii'}  if _n==`i'
  replace i = `ii' if _n==`i'
}

two (connected mean_x1 i, msize(large) lwidth(thick)) ///
    (connected mean_x2 i, msize(large) lwidth(thick)), name(means, replace)
two (connected var_x1  i, msize(large) lwidth(thick)) ///
    (connected var_x2  i, msize(large) lwidth(thick)), name(vars,  replace)

graph combine means vars, col(1) title("Estimates are insensitive to speed of convergence")



* The take away seems to be that it is safe (at least in this simulation) to exclude from your analysis simulations that did not converge in the specified iteration count.

* This simulation also suggests that it is not ideal to include in your results MLE estimates from iterations in which no convergence was achieved.

Contributed by SRQM

From my student files.













From my student files.

Contributed by SRQM

A shorter lookfor command, in five lines of code

One thing that I like about Stata is the possibility to write quick wrappers for commands that get things done. The code below is an example that I wrote to search for variables in less keystrokes than lookfor (which cannot be abbreviated). I also wanted to get numbers of observations at the same time.

    cap pr drop find
    program find, rclass
        qui lookfor `*'
        if "`r(varlist)'" != "" codebook `r(varlist)', c
    end

This returns the output of the codebook command with the compact option instead of the standard lookfor output, which is based on the describe command. The variable labels are less readable, and long variable names still get abbreviated (such a strange idea). Yet it works, and returns the N.

I tried to call the program look, but that is still taken by a deprecated command, lookup, which has been superseded by search. As for the single letter l, it is taken by the list command, so that will not work either. The same holds for q, and ? is sadly not a valid program name.