
R, Stata and matching additional learning costs

Francis Smart recently pointed to an important difference between R and Stata from a teaching perspective, which has to do with the additional learning costs of vectorization in R over the single-dataset orientation of Stata.

Stata makes it easy to manipulate names, or more specifically, variable names, as in a dataset with three variables for social expenditure called party1 party2 party3. This naming pattern is common in preprocessed empirical datasets.

    // example
    mvdecode party*, mv(999)
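To make the wildcard concrete, here is a minimal sketch on a toy dataset. The values typed in below are invented for illustration and are not from any real dataset; only the party1 party2 party3 names and the 999 code come from the example above.

```stata
* Hypothetical toy data: the values are made up for illustration.
clear
input party1 party2 party3
1   999 3
999 2   1
2   3   999
end

* party* expands to party1 party2 party3, so a single command
* recodes 999 to system missing (.) in all three variables.
mvdecode party*, mv(999)

* Each variable should now hold exactly one missing value.
misstable summarize party*
```

The same wildcard works with most commands that accept a variable list, which is why consistently named variables are convenient in Stata.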

Furthermore, Stata works like an accountant’s book, so all variables belong to the same data object, which never needs to be named once it is loaded. This naturally rules out a lot of possibilities, a limitation that is compensated in part by macros and scalars.

    // example
    loc regressors "age sex"

Macros in particular then combine with loops like the forval and foreach commands to allow more complex data processing. At that level of use, the software is flexible enough for most applied data cleaning.

    // example
    forval i = 1/3 {
        replace socx`i' = socx`i' / 10^6
    }
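As a rough sketch of how a macro and a loop fit together, the lines below store a list of variables in a local macro, reuse it in a regression, and then loop over it. The auto dataset ships with Stata; the choice of price, weight and length, and the _k suffix, are assumptions made for this illustration only.

```stata
* Hypothetical example: dataset and variable choices are assumptions.
sysuse auto, clear

* Store a list of variable names in a local macro.
local regressors "weight length"

* The macro expands to its contents wherever it is quoted.
regress price `regressors'

* Loop over the same list to create a rescaled copy of each variable.
foreach v of local regressors {
    generate double `v'_k = `v' / 1000
}
```

Local macros only exist while the do-file (or selection) that defines them is running, so these lines need to be executed together rather than one by one.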

To access matrix notation, the Stata user needs to move to Mata syntax, while R immediately lets the user manipulate objects through vectorization. Thinking in these terms is more demanding, as there are more possibilities for errors, starting with calls to undeclared objects.
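For readers who have never left the dataset view, the following is a minimal sketch of what moving to Mata looks like. It assumes the auto dataset that ships with Stata; the object names X and m and the scalar name mean_price are hypothetical and chosen only for this illustration.

```stata
* Hypothetical example: dataset, variables and names are assumptions.
sysuse auto, clear

mata:
// Copy two variables into a named Mata matrix (74 rows, 2 columns).
X = st_data(., ("price", "mpg"))

// One vectorized call returns the column means as a 1 x 2 row vector.
m = mean(X)

// Pass the first mean back to Stata as a scalar.
st_numscalar("mean_price", m[1])
end

display scalar(mean_price)
```

Unlike in the dataset view, X and m have to exist before they are used: calling an object that was never created is exactly the kind of error mentioned above.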

I teach both R and Stata. My experience with social science students is that the additional learning costs of R syntax need to be matched with other benefits to become valuable to them. To me, these benefits lie primarily in the more diverse array of data that R gives access to.


“By using Excel, which was never designed for scientific research, they institutionalized mouse…”

“By using Excel, which was never designed for scientific research, they institutionalized mouse clicks and other untraceable actions into a scientific workflow, which must be avoided since it makes explaining to others (and to oneself) how to replicate the findings next to impossible and too easily introduces inadvertent mistakes.”

Period. The replication was carried out with R, and additional analysis (easily found online) was done with Stata.

Victoria Stodden at What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine — The Monkey Cage


From my student files.



A shorter lookfor command, in five lines of code

One thing that I like about Stata is the possibility to write quick wrappers for commands that get things done. The code below is an example that I wrote to search for variables in fewer keystrokes than lookfor (which cannot be abbreviated). I also wan…


Stata questions at Stack Overflow

Many Stata users know about Statalist, the central mailing list for Stata matters. But this short post is meant to advertise another great place for questions and answers about Stata code: Stack Overflow, a member of a larger family of Q & A websites…


Now on GitHub: A Stata bundle for TextMate

One month ago, I mentioned Phil Schumm’s Stata bundle for TextMate. His code has just moved to GitHub. TextMate is a free and open source code and text editor for Mac OS X. This short announcement is to encourage anyone with knowledge of GitHub repositories and TextMate bundles to give him a hand.

P.S. There is more than one copy of the bundle flying around GitHub at the moment, but the one cited in this post is the one being actively developed. Thanks to Phil Schumm for making that point clearer in the README file of his bundle.


Plotting survey data: A wrapper for catplot

The previous post and the one before that mentioned that plotting survey data, which often contains ordinal or low-dimensional nominal data, can require many Stata options. I have started working on a wrapper for Nick Cox’s catplot command to bring the code down to one-line commands that produce graphs like the following examples:

svyplot marital, ymax(60) 

The example above is close to the default catplot output with one variable. With two variables, I have tried to implement graded colors as shown in the work of the Oxford Internet Institute:

svyplot health race, asc red ymax(60) 

The wrapper uses reds or blues (default) for the color gradients, which can be ascending (default) or descending. The ymax option controls the height of the graph, which is 100 by default, in order to fit stacked bars:

svyplot happy polviews, des stack angle(25) scheme(burd3) 

The graph above uses the BuRd scheme. It shows the data that was used to claim that the Tea Party members are the happiest Americans — which is false, as you can see by plotting the full data.

svyplot inequal3 race, asc hor stack scheme(burd5) 

This final example shows stacked horizontal bars. The wrapper code probably won’t behave well with recast(dot) and three-variable arrangements, even though both are supported.


Example plots with country-level data

The previous post mentioned the BuRd theme and ColorBrewer. Here are some possible uses of both in a series of plots with cross-sectional country-level data. The code uses pooled WDI estimates for fertility and real GDP per capita as measured by the Wo…


Plotting with the BuRd scheme

Alternative Stata graph schemes were briefly mentioned in the opening post when I linked to the BuRd scheme, my own attempt in that area. Solomon Hsiang recently published his own scheme, which he uses to plot the neat graph functions that he codes for both Matlab and Stata. This post explains a bit further what I have been trying to achieve with the BuRd scheme.

Update: the BuRd scheme is now available from GitHub and from Stata, with the following command:

ssc install scheme-burd, replace 

Stata graphs are not the nicest part of the software. What Stata gains by making it possible to recode or regress a set of variables in one line, it loses when it comes to making a plot look a bit cleaner or simply more elegant. Stata graph syntax is rather usable, but the default schemes are rarely satisfactory. Here, for instance, are a few default families applied to a scatterplot:

    sysuse lifeexp, clear
    gr drop _all
    local l "s2color s2mono s1color s1mono sj economist"
    foreach s of local l {
        sc lexp safewater, ti(`s') scheme(`s') name(`s')
    }
    gr combine `l', row(2) name(dots)
    gr export dots.png, replace

The default schemes have a few undesirable features, like y-axis labels that read perpendicular to the rest of the text, that have been fixed in the Economist-like scheme. That scheme also wins a few more points on its discrete color selection, which is hard-coded in Stata’s color styles:

    sysuse lifeexp, clear
    gr drop _all
    local l "s2color s2mono s1color s1mono sj economist"
    gen x = lexp^3
    xtile q = x, nq(10)
    foreach s of local l {
        gr bar x, over(q) asyvars legend(row(1)) ti(`s') scheme(`s') name(`s', replace)
    }
    gr combine `l', row(2) name(bars)
    gr export bars.png, replace

The issue becomes a real problem when it makes a common visualization of survey data, which is generally full of low-dimensional ordinal data like 4-point scales, more difficult than it should ever be:

    sysuse lifeexp, clear
    gr drop _all
    local l "s2color s2mono s1color s1mono sj economist"
    gen x = lexp^3
    xtile q = x, nq(10)
    qui tab region, gen(r_)
    foreach s of local l {
        gr bar r_*, over(q, sort(1)) stack percent ///
            legend(row(1)) ti(`s') scheme(`s') name(`s', replace)
    }
    gr combine `l', row(2) name(stacks)
    gr export stacks.png, replace

My own take consists of a scheme, burd, that uses some toned-down colors from ColorBrewer and offers a range of diverging scales colored from blue to red tints. The scheme was tested on the common types of plots below, and there are more demo plots at its wiki page.

    set scheme burd, perm
    sysuse lifeexp, clear
    gr drop _all
    sc lexp safewater, name(dots)
    gen x = lexp^3
    xtile q = x, nq(10)
    gr bar x, over(q) asyvars legend(row(1)) name(bars)
    qui tab region, gen(r_)
    gr bar r_*, over(q, sort(1)) stack percent ///
        legend(row(1)) scheme(burd3) name(stacks)
    hist lexp, normal name(hist)
    tw sc lexp safewater || lfit lexp safewater, name(lfit)
    gr mat lexp safewater popgrowth, name(mat)
    gr combine dots bars stacks hist lfit mat, row(2)
    gr export burd.png, replace

The scheme uses the default ‘sharper’ graph settings used in Edwin Leuven’s own schemes, which are based on Svend Juul’s lean schemes and on ColorBrewer selections of discrete colors. Another implementation of ColorBrewer is in Maurizio Pisati’s spmap package, and yet another take on Stata graphs is Ulrich Atz’s scheme_tufte package, which mimics Tufte-like plots.

Ideally, it should be possible to go much further with Stata graphs, and some users are already doing it: Stata News reported not so long ago on the work that was done at the Oxford Internet Institute to produce elegant survey plots from Stata. It should also be mentioned that recent versions of Stata offer more graph features, like margin plots, so further improvements to graphics can be hoped for.


Joining the Stata Bloggers aggregator: An opening post

A small subset of posts from this Tumblr log has been reformatted into this longer piece and submitted for inclusion into Francis Smart’s recently created aggregator for Stata blogs, Stata Bloggers. This post is an announcement message that explains the whys, hows and whats of that operation.

First, why this blog? I am using Tumblr to run two course companions on health policy and data analysis. The latter, SRQM, is the one that you are reading from. It is named after the course “Statistical Reasoning and Quantitative Methods”, which I co-teach in Paris with Ivaylo Petev.

Why do you teach with Stata? Ivaylo and I chose Stata for practical reasons: we both knew how to use it, the software was available where we teach the course, and we needed software that could be taught to large groups of postgraduate beginners. We also teach an optional course with R.

Choosing statistical software to work with is never easy, but the choice has recently been made simple for a large category of users, for whom it should be R. Anthony Damico is right when he jokingly writes the following:

confidential to sas, spss, stata, and sudaan users: the eighties called. they want their statistical languages back. time to transition to r. :D [Anthony also wrote that if Stata has a better learning curve than R, “so do bicycles with training wheels ;) ”]

R is also marked by the eighties in many ways, but it has indeed made other statistical software rather obsolete. Its ggplot2 library, for instance, is just much better than Stata graphs. Even if you take the time to tweak them with complex code or alternative colors, Stata plots are often ugly.

Stata nonetheless remains a good choice for those who are learning statistical analysis next to other things and therefore have limited time to learn programming. It is quick, cheap enough for universities, and copes well with large surveys. It is also easily scriptable and open to user contributions.

For these reasons, Stata has a good user base among academics, especially in sociology, economics and political science. Nate Silver also uses it. There’s great documentation for Stata, in English as well as in different languages, like this page in Lithuanian.

There are more cool things about Stata. The World Bank has an awesome Stata package to download its data. Stata syntax is even supported by a few plain text editors like TextMate, thanks to Tim Beatty and Phil Schumm (now on GitHub), and it might get ported to the Pygments engine used at GitHub and elsewhere.

The real trouble with Stata might actually lie with the overwhelming dominance of “regression quants” in its user base. Regression analysis curbs how you think towards net effects, which is not necessarily what you need. I will probably come back to this in later posts.

Why aggregate this blog? For some time, I have been hoping to connect this blog and course to a larger community of Stata users. Neither has ever been advertised on Statalist, but the blog has gained a small readership, and the course is also public thanks to its hosting as a GitHub repository.

How is this blog aggregated? Blog aggregators work by making use of RSS feeds, which are a handy way to syndicate a website’s content. Most blogging engines offer at least one blog-wide feed. Tumblr also offers hidden tag-specific feeds. This post starts a series that will be tagged stata.
