Stata Blogger?

Are you a Stata blogger? Join the aggregator!

Contributed by SRQM

Stata questions at Stack Overflow

Many Stata users know about Statalist, the central mailing-list for Stata matters. But this short post is meant to advertise another great place for questions and answers about Stata code: Stack Overflow, a member of larger family of Q & A websites that include Stack Exchange, Math Overflow and many more.

Stack Overflow has a Stata tag that can be followed through RSS, the syndication standard that Google is effectively leading to its death this year (the feed is actually Atom-formatted, which makes no difference). A few Statalisters have already joined the Stack Overflow community, and most questions get answers in 48 hours or less.

A recent example of a Stata question at Stack Overflow is this query about automating DTA (Stata format) to TSV (tab-separated values) conversion. There are plenty of similar questions, especially about data management, with Stata. Questions on modelling are rather asked at Statalist as far as I can tell.

Hope to see you there!

Contributed by Francis Smart

ML and initial values


* This post explores the setting of initial values with the ML command.

* There are several reasons you may be interested in doing this.

* 1. You are having difficulty with your data converging and you think that
*        you are having a problem with your initial values being unfeasible.
* 2. You may believe that you might already have reasonable estimates and
*        therefore you believe you might gain efficiency by plugging in those
*        initial values.
* 3. You are interested in using Stata's approximation algorithm to compute
*        standard errors.

* Let's first start with a simple data set.
clear
set obs 1000

gen x = runiform()*2

gen u = rnormal()*5

gen y = 2 + 2*x + u
* x predicts y

graph twoway (lfitci y x) (scatter y x), title("X predicts reasonably Y")



* Let's say we are interested in estimating the three parameters in our data.
* coefficient on x and the constant as well as the variances of the error (25)
* We can use the maximum likelihood function myNormalReg to solve this problem
* simultaneously.

cap program drop myNormalReg
program define myNormalReg
  args lnlk xb sigma2
  qui replace `lnlk' = -ln(sqrt(`sigma2'*2*_pi)) - ($ML_y-`xb')^2/(2*`sigma2')
end

ml model lf myNormalReg (reg: y = x) (sigma2:)
  ml max
* Our estimates look reasonable.  We iterated 4 times and converged quickly.
 
* This is our benchmark.

* We also know that OLS is equivalent to the above MLE except it does not directly estimate sigma
*    though it does give a more precise estimate of the coefficients.
reg y x

* So let's try the previous MLE using our OLS estimates as starting values
ml model lf myNormalReg (reg: y = x) (sigma2:)
  ml init reg:x = `=_b[x]'
  ml init reg:_cons = `=_b[_cons]'
  ml max

* Unfortunately, perhaps this simple setup is too easy to solve.
* There is no noticable difference in convergence rates using the OLS estimates as starting values.

* Alternatively let's see if things improve if we use the true values.
ml model lf myNormalReg (reg: y = x) (sigma2:)
  ml init reg:x = 2
  ml init reg:_cons = 2
  ml init sigma2:_cons = 25
  ml max
 
* The results are pretty much the same as above.

* Finally, let's imagine that we would like to estimate the standard errors on the true parameter
*    values as if ml had converged on the true.  This might be useful if you have two methods for
*    maximizing an equation one that converged but did not give valid standard errors and one that
*    did not converged but gave valid standard errors.
ml model lf myNormalReg (reg: y = x) (sigma2:)
  ml init reg:x = 2
  ml init reg:_cons = 2
  ml init sigma2:_cons = 25
  ml max, iter(0)

Contributed by SRQM

Now on GitHub: A Stata bundle for TextMate

One month ago, I mentioned Phil Schumm’s Stata bundle for TextMate. His code has just moved to GitHub. TextMate is a free and open source code and text editor for Mac OS X. This short announcement is to encourage anyone with knowledge of GitHub repositories and TextMate bundles to give him a hand.

P.S. There is more than one copy of the bundle flying around GitHub at the moment, but the one cited in this post is the one being actively developed. Thanks to Phil Schumm for making that point clearer in the README file of his bundle.

Contributed by Francis Smart

asim command


* After using the msim command from a previous post, I decided I did not like having many different variables for each simulation run.

* So I have recoded an alternative command called asim

* This command allow subsequent simulate commands to be run, saving the old data in memory appended to the current data.
cap program drop asim
program define asim
  * This will allow the user to specify a value to assign to the asim variable which is generated after the simulation is run.
  gettoken before after : 0 , parse(":")
  local simulation = subinstr("`after'", ":", "", 1)

  tempfile tempsave
  cap save `tempsave'
  `simulation'

  * This will append in the new simulation data to the old data set.
  gen asim = "`before'"
  cap append using `tempsave'
end


* Let's write a nice little program that we would like to simulate.
cap program drop simCLT
program define simCLT

  clear
  set obs `1'
  * 1 is defined as the first argument of the program sim

  * Let's say we would like to see how many observations we need for the central limit theorem (CLT) to hold for a bernoulli distribution.
  gen x = rbinomial(1,.25)

  sum x
end

* So let's see first how the simulate command works initially.


simulate, rep(200): simCLT 100

* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)

* But instead let's try our new command!
clear
* Clear out the old results
asim 0100: simulate, rep(200): simCLT 100

asim 0200: simulate, rep(200): simCLT 200

* Looks good!

asim 0400: simulate, rep(200): simCLT 400

asim 1000: simulate, rep(200): simCLT 1000

asim 10000: simulate, rep(200): simCLT 1000

asim 100000: simulate, rep(200): simCLT 1000

* Standardize the individual means of each run so as to more easily compare them with each other.
bysort asim: egen sd_mean = sd(mean)
bysort asim: egen mean_mean = mean(mean)

gen std_mean = (mean-mean_mean)/sd_mean

hist std_mean, kden by(asim)
* We can see that this generate data is much easier to use than that using many different variables.

* It looks a little funny with some of the numbers having zeros in front.  However, this was the best way to do it given that the asim variable is text.

Contributed by Francis Smart

My not so Brief Stata Formatting Guide

* I write this as a short guide though I do not always stick to it. This post was inspired by a thoughtful discussion on the linkedin Stata-Users group.

* The number one rule is: Always, always comment your code!

* See my reasons:
* http://www.econometricsbysimulation.com/2012/07/7-reasons-to-comment-your-code.html

* I prefer using <*> before a comment as the primary method of commenting.
* And <//> before my lines when in mata.

* I only like using /* and */ when I have a large amount of comments.

* I think it is useful to add comments after commands like.
clear // remove data in memory

* Though with long commands I think it is hard to read.  For example:
twoway (scatter  length gear_ratio) (scatter  foreign mpg_price) (scatter  price mpg) ///
  , title("This is a useless and meaningless graph") // Graphs length against gear_ratio

* -----------------------------SAMPLE DOCUMENT----------------------------
/***** Title of Do File

Description of do file.  This might have several paragraphs for which
I reccomend hard breaking the lines since Stata does not have word wrap

*********************** Section 0: Initialization **********************

If your do file is very long and has multiple sections consider including
an index.

Index
1. Parameter declaration
2. Input/clean data, generate temporary data
3. Manipulate variables
4. Generate summary statistics
5. Generate estimates
6. Delete temporary data/variables

I might also consider including a variable glossory at the begging of your do file.

For example:
cntgdp: Country GPD
cntgdp2: Country GPD demeaned
year: Year
nsrvy: Number of Survey Wave

As for naming variables, I would suggest not letting variables get longer
than six letters and two numbers long.

For example the variables might mean:
dstgdpp: disctrict gross domestic product per capita, nominal currency of that year.
dstgdppcchgyr00: disctrict gross domestic product per capita change from year 2000.
It might seem like a good idea to write it this way but it is really confusing to
try to read especially since stata will start truncating variables.

I would suggest instead naming variables something like this instead:
dstgdp1: disctrict gross domestic product per capita change from year 2000.
dstgdp0: disctrict gross domestic product per capita, nominal currency of that year.

Have two places you define the variables.
The variable glossory at the begginning of your document and the label that you
give your variables.
*/
******************** End Section 0: Initialization **********************


****************** Section 1: Parameter declaration *********************

* Often times you might find it useful to specify globals or locals that help you
* control your analysis when you run your file.

* For example:
global exmin = 1
* When set to 1 minorities will be excluded from the analysis.

global ppp = 0
* When set to 1 purchasing power parody will be used instead of GDP per capita.

* Of course you will need to code up within the analysis what the globals actually do.


* Speficy a working directory.  This can be done with the "cd".

* Personally I don't think this is suffcient.
* Often I am loading multiple data files from multiple directories.

* I prefer using globals specified in the parameter section.
* This allows users to have slightly or largely different file organization,
* Yet still be able to run your analysis.

* For example:
* Use globals to specify directories of interest
* Read directory
global rdir = "C:/data_files/my_project/original_data/"
* Save directory
global sdir = "C:/my_project/modified_data/"

**************** End Section 1: Parameter declaration *********************



****************** Section 2: Input/clean data *********************

* When you load in data.  Always first load it then save a copy of it somewhere else.

* Load original data:
sysuse auto, clear

* Save data to new directory where it will never accidently overwrite your original data
save "${sdir}auto.dta", replace

*************** End Section 2: Input/clean data *********************



****************** Section 3: Manipulate variables *********************

* Always give your variables labels when you define them.
gen mpg_price = mpg*price
  label var mpg_price "Miles Per Gallon times Price"
* Uses spaces to help denote commands which are secondary.
* Never use tabs instead of spaces
* because
* they
* are
* hard
* to
* read
* and can
* substantially
  * decrease
  * your
  * page space
  * Also, your code may
  * look different with different
  * programs.
  * This is very annoying.
  * I stuck a
  * lot of spaces to
  * simulate the 
  * Stata editor.
* Always explain why you do things.
drop if foreign == 1
  * We only are interested in domestic cars (for example).

* When doing any kind of looping also use indentation:
forv i = 1(1)10 {
  * When using forvalues never do i = 1/10 instead of i = 1(1)10 which are equivalent.
  * But i = 1/10 notation can cause problems when using macros.

  * It is very improtant to indent.
  if (`i' == 3) {
    * Do something

* I am displaying filler text when i==3 and only then
di "Filler"

* This will display i squared when i==3 (which is obviously 9)
di `i'^2


  }
  * End if
}
* End forv i loop

* Also, indent when commands go on multiple lines in length.
twoway (scatter  length gear_ratio) (scatter  foreign mpg_price) (scatter  price mpg) ///
  , title("This is a useless and meaningless graph")

* This can made commands much easier to read.

*************** End Section 3: Manipulate variables *********************

 * Also, take a look at some of the comments below.  There have been some very thoughtful contributions by Stata users.

Contributed by Francis Smart

MSimulate Command

CLT is a powerful property

* Stata has a wonderfully effective simulate function that allows users to easily simulate data and analysis in a very rapid fashion.

* The only drawback is that when you run it, it will replace the data in memory with the simulated data automatically.

* Which is not a big problem if you stick a preserve in front of your simulate command.

* However, you may want to run sequential simulates and keep the data form all of the simulations together rather than only temporarily accessed.

* Fortunately we can accomplished this task by writing a small program.

cap program drop msim
program define msim

  * Gettoken will split the arguments fed into msim into those before colon and those after.
  gettoken before after : 0 , parse(":")

  * I really like this feature of Stata!

  * First let's strip the colon.  The 1 is important since we want to make sure to only remove the first colon.
  local simulation = subinstr("`after'", ":", "", 1)

  * Now what I propose is that the argument in `before' is used as an extension for names of variables created by the simulate command.

  * First let's save the current data set.


  * Generate an id that we will later use to merge in more data
  cap gen id = _n

  * Save the current data to a temporary location
  tempfile tempsave
  save `tempsave'

  * Now we run the simulation which wipes out the current data.
  `simulation'

  * First we will rename all of the variables to have an extension equal to the first argument
  foreach v of varlist * {
    cap rename `v' `v'`before'
  }

  * Now we need to generate the ID to merge into

  cap gen id = _n

  merge 1:1 id using `tempsave'
  * Get rid of the _merge variable generated from the above command.
  drop _merge
end


* Let's write a nice little program that we would like to simulate.
cap program drop simCLT
program define simCLT

  clear
  set obs `1'
  * 1 is defined as the first argument of the program sim

  * Let's say we would like to see how many observations we need for the central limit theorem (CLT) to make the means of a bernoulli distribution look normal.  Remember, so long as the mean and variance is defined the generally central limit theorem will eventually force any random distribution of means to approximate a normal distribution as the number of observations gets large.
  gen x = rbinomial(1,.25)

  sum x
end

* So let's see first how the simulate command works initially

simulate, rep(200): simCLT 100



* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)

hist mean, kden
* The mean is looking good but not normal

* Now normally what we need to do would be to run simulate again with a different argument.

* But instead let's try our new command with 200!


* But instead let's try our new command!
clear
* Clear out the old results
msim 100: simulate, rep(200): simCLT 100

msim 200: simulate, rep(200): simCLT 200

* Looks good!

msim 400: simulate, rep(200): simCLT 400

msim 1000: simulate, rep(200): simCLT 1000

msim 10000: simulate, rep(200): simCLT 10000

msim 100000: simulate, rep(200): simCLT 100000

msim 1000000: simulate, rep(200): simCLT 100000

* The next two commands can take a little while.
msim 10000000: simulate, rep(200): simCLT 1000000

msim 100000000: simulate, rep(200): simCLT 10000000

sum mean*
* We can see that the standard deviations are getting smaller with a larger sample size.

* How is the histograms looking?
foreach v in 100 200 400 100 1000 10000 100000 1000000 10000000 {
  hist mean`v', name(s`v', replace)  nodraw  title(`v') kden

}

graph combine s100 s200 s400 s1000 s1000 s10000 s100000 s1000000 s10000000 ///
  , title("CLT between 100 and 10,000,000 observations")


* We can see that the distribution of means approximates the normal distribution as the number of draws in each sample gets large.

* This is one of the fundamental findings of statistics and pretty awesome if you think about it.

Contributed by Solomon Hsiang

Now on Stata-bloggers…

Francis Smart has set up Stata-bloggers, a new blog aggregator for Stata users (modeled after R-bloggers). FE will be contributing there, but there's lots of other goodies worth checking out from more prolific bloggers.

Everyone say "Thank you, Francis."

Contributed by SRQM

Plotting survey data: A wrapper for catplot

The previous post and the one before that mentioned that plotting survey data, which often contains ordinal or low-dimensional nominal data, can take many Stata options. I have started working on a wrapper for Nick Cox’s catplot command to bring down the code to one-line commands that produce graphs like the following examples:

svyplot marital, ymax(60)

The example above is close to the default catplot output with one variable. With two variables, I have tried to implement degrading colors as shown in the work of the Oxford Internet Institute:

svyplot health race, asc red ymax(60)

The wrapper uses reds or blues (default) for the color gradients, which can be ascending (default) or descending. The ymax option controls the height of the graph, which is 100 by default, in order to fit stacked bars:

svyplot happy polviews, des stack angle(25) scheme(burd3)

The graph above uses the BuRd scheme. It shows the data that was used to claim that the Tea Party members are the happiest Americans — which is false, as you can see by plotting the full data.

svyplot inequal3 race, asc hor stack scheme(burd5)

This final example shows stacked horizontal bars. The wrapper code probably won’t behave well with recast(dot) and three-variable arrangements, even though both are supported.

Contributed by SRQM

Example plots with country-level data

The previous post mentioned the BuRd theme and ColorBrewer. Here are some possible uses of both in a series of plots with cross-sectional country-level data. The code uses pooled WDI estimates for fertility and real GDP per capita as measured by the World Bank, and then adds UN region names to the data.

// package dependencies

ssc install wbopendata
ssc install kountry

// WDI data

wbopendata, indicator(SP.DYN.TFRT.IN; NY.GDP.PCAP.PP.CD) year(2008:2010) long clear
collapse ///
    (mean) fr = sp_dyn_tfrt_in /// 
    (mean) gdpc = ny_gdp_pcap_pp_cd ///
    , by(countrycode)

// geo indicators

kountry countrycode, from(iso3c) geo(un)
encode GEO, gen(region)

Geographical regions make it easy to plot the data over small multiples. I also often find it useful to look at a mosaic plot to diagnose how seriously missing data puts representativeness to threat in the sample.

// package dependencies

ssc install splineplot

// small multiples

gr hbox gdpc, over(region, sort(1) des) mark(1, ms(i) mlab(countrycode) mlabp(0)) name(boxes, replace)

hist fr, bin(4) by(region, total) name(bins, replace)

// missing data

gen full = !mi(fr, gdpc)
spineplot full region

Going a bit further with regression results, a variety of graphs can be useful for running diagnostics. The first one shown below is a LOESS fit across the residuals against the fitted values, and the second one is an example of weighted markers where the error term is shown along the linear fit.

// residuals

gen loggdpc = ln(gdpc)
reg fr loggdpc

predict r, resid
predict yhat

// residuals-versus-fitted values, plus LOESS

sc r yhat, mlab(countrycode) yline(0) ms(i) mlabp(0) || lowess r yhat, ///
    name(residuals_loess, replace)

// linear fit with residually weighted points

sc fr loggdpc if abs(r) &gt; .3 [w = abs(r)], ms(O) mc(gs14) mfc(gs12) || ///
    lfit fr loggdpc || ///
    sc fr loggdpc, ms(i) mlab(countrycode) mlabc(gs6) mlabp(0) legend(off) ///
    name(residuals_rvf, replace)

Last, a map of the residuals can also be informative if there is suspicion of spatial dependence in the error term:

// package dependencies

ssc install spmap

// map of residuals (caution with intervals)

merge 1:1 countrycode using world-d, keep(match master) gen(mapmerge)
spmap r using world-c, id(_ID) clmethod(boxplot) ///
  fcolor(RdYlGn) ndocolor(gs12) ndfcolor(gs14) ocolor(none ..) ///
  legstyle(1) legend(ring(1) pos(3)) ///
  name(residuals_map, replace)

Country-level data is an ideal candidate for plot tweaks such as using marker labels instead of observations. With survey data, there would be more work to do at the level of the data itsef, and text labels would have to be taken from aggregate measures like relative frequencies or averages, which makes it more complex to plot the data quickly and efficiently.

Contributed by SRQM

Plotting with the BuRd scheme

Alternative Stata graph schemes got briefly mentioned in the opening post when I linked to the BuRd scheme, my own realization in that domain. Solomon Hsiang recently published his own scheme, which he uses to plot the neat graph functions that he codes for both Matlab and Stata. This post explains a bit further what I have been trying to achieve with the BuRd scheme.

Update: the BuRd scheme is now available from GitHub and from Stata, with the following command:

ssc install scheme-burd, replace

Stata graphs are not the nicest part of the software. What Stata wins on making it possible to recode or regress a set of variables in one line, it loses when it comes to making the look of a plot a bit cleaner or simply more elegant. Stata graph syntax is rather usable, but the default schemes are rarely satisfactory. Here, for instance, are a few default families applied to a scatterplot:

sysuse lifeexp, clear
gr drop _all

local l "s2color s2mono s1color s1mono sj economist"

foreach s of local l {
    sc lexp safewater, ti(`s') scheme(`s') name(`s')
}

gr combine `l', row(2) name(dots)
gr export dots.png, replace

The default schemes have a few undesirable issues, like perpendicular reading on the y-axis, that have been fixed in the Economist-like scheme. That scheme also wins a few more points on its discrete color selection, which is hard-coded in Stata’s color styles:

sysuse lifeexp, clear
gr drop _all

local l "s2color s2mono s1color s1mono sj economist"

gen x = lexp^3
xtile q = x, nq(10)

foreach s of local l {
    gr bar x, over(q) asyvars legend(row(1)) ti(`s') scheme(`s') name(`s', replace)
}

gr combine `l', row(2) name(bars)
gr export bars.png, replace

The issue finally gets to become a real problem when it makes a common visualization of survey data, which is generally full of lowly-dimensional ordinal data like 4-point scales, more difficult than it should ever be:

sysuse lifeexp, clear
gr drop _all

local l "s2color s2mono s1color s1mono sj economist"

gen x = lexp^3
xtile q = x, nq(10)

qui tab region, gen(r_)

local l "s2color s2mono s1color s1mono sj economist"

foreach s of local l {
    gr bar r_*, over(q, sort(1)) stack percent ///
    legend(row(1)) ti(`s') scheme(`s') name(`s', replace)
}

gr combine `l', row(2) name(stacks)
gr export stacks.png, replace

My own take consists in a scheme, burd, that uses some toned-down colors from ColorBrewer and offers a range of diverging scales colored from blue to red tints. The scheme was tested on the common types of plots below, and there are more demo plots at its wiki page.

set scheme burd, perm

sysuse lifeexp, clear
gr drop _all

sc lexp safewater, name(dots)

gen x = lexp^3
xtile q = x, nq(10)

gr bar x, over(q) asyvars legend(row(1)) name(bars)

qui tab region, gen(r_)

gr bar r_*, over(q, sort(1)) stack percent ///
    legend(row(1)) scheme(burd3) name(stacks) 

hist lexp, normal name(hist)
tw sc lexp safewater || lfit lexp safewater, name(lfit)
gr mat lexp safewater popgrowth, name(mat)

gr combine dots bars stacks hist lfit mat, row(2)
gr export burd.png, replace

The scheme uses the default ‘sharper’ graph settings used in Edwin Leuven’s own schemes, which are based on Svend Juul’s lean schemes and on ColorBrewer selections of discrete colors. Another implementation of ColorBrewer is in Maurizio Pisati’s spmap package, and yet another take on Stata graphs is Ulrich Atz’s scheme_tufte package, which mimicks Tufte-like plots.

Ideally, it should be possible to go much further with Stata graphs, and some users are already doing it: Stata News reported no so long ago about the work that was done at the Oxford Internet Institute to produce elegant survey plots from Stata. It should also be mentioned that recent versions of Stata offer more graph features, like margin plots, so future graphic improvement can be hoped for.