Francis Smart recently pointed to an important difference between R and Stata from a teaching perspective, which has to do with the additional learning costs of vectorization in R over the single-dataset orientation of Stata.
Stata makes it easy to manipulate names, or more specifically variable names, as in a dataset with three variables for social expenditure called party1, party2 and party3. This naming pattern is common in preprocessed empirical datasets.
// example
mvdecode party*, mv(999)
Furthermore, Stata works like an accountant’s book: all variables belong to the same data object, which never needs to be referred to by name once it is loaded. This naturally rules out a lot of possibilities, a limitation compensated in part by macros and scalars.
// example
loc regressors "age sex"
Macros in particular then combine with loops, through the forval and foreach commands, to allow more complex data processing. At that level of use, the software is flexible enough for most applied data cleaning.
// example
forval i = 1/3 {
    replace socx`i' = socx`i' / 10^6
}
To access matrix notation, the Stata user needs to move to Mata syntax, while R immediately lets the user manipulate objects through vectorization. Thinking in these terms is more demanding because there are more possibilities for errors, starting with calls to undeclared objects.
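As a rough illustration of that contrast, here is a minimal R sketch of the same rescaling that the forval loop performs in Stata. The data frame `d` and its values are made up for the example, and the socx1 to socx3 column names are only assumed to match the Stata variables.

```r
# Hypothetical data: three spending variables named like the Stata ones.
d <- data.frame(
  socx1 = c(1e6, 2e6),
  socx2 = c(3e6, 4e6),
  socx3 = c(5e6, 6e6)
)

# Vectorized rescaling: no loop, but the data frame has to be named explicitly.
cols <- paste0("socx", 1:3)
d[cols] <- d[cols] / 10^6

# Forgetting the data frame is a typical beginner error: socx1 is a column
# of d, not an object of its own, so R cannot find it.
result <- tryCatch(mean(socx1), error = function(e) conditionMessage(e))
print(result)
```

The vectorized line is shorter than the Stata loop, but the student has to keep track of which object holds which variables, which is exactly where the extra learning cost comes from.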
I teach both R and Stata. My experience with social science students is that the additional learning costs of R syntax need to be matched by other benefits to become worthwhile to them. To me, these benefits lie primarily in the more diverse array of data that R gives access to.