Working with "large" data in Stata
The single best resource I’ve found for working with large data in Stata are these notes from Mauricio Cáceres Bravo, the creator of Gtools.
Below are some notes from my own experience at the South African National Treasury Secure Data Facility (NT-SDF):
- When using a shared resource (e.g. a server):
- Make sure you
clear
your dataset when your analysis is finished running, if your dataset is large relative to the amount of memory available on the server. Holding large datasets in memory while not actively using them is a sin! - Be considerate if you’re running computationally/memory intensive tasks which use excessive shared resources. Inefficient code in this case is not just a tax on your time, but a tax on all other users’ time too. You may want to consider running the analysis on a subsample of the data, and then running the full analysis over a weekend or night when other users are less affected (see
help sample
, note theby()
option).
- Make sure you
-
Use the Gtools suite of commands (
help gtools
) instead of the native Stata equivalents. These includegegen
,gcollapse
,gduplicates
,glevelsof
,gunique
,gdistinct
and more, which are all essentially (much) faster versions of the original Stata commands and simply replace the original commands in your code. It’s worth checking their website or reading the help file for each command to get maximum benefits (there are oftengtools
-specific additions/options you can use), but this is not necessary. - If running fixed effects regressions, especially with many fixed effects, use Sergio Correia’s
reghdfe
. To get the computational benefits make sure the fixed effects are specified inside theabsorb()
option. In the (very few) cases wherereghdfe
is inappropriate, you might want to useareg
.xtreg, fe
is extremely slow with many fixed effects and large datasets.- If you need to estimate a large number of fixed effects (or something else) for use in subsequent analysis – save them in a dataset for use later, don’t re-estimate the whole model each time!
- If you need to estimate a large number of fixed effects (or something else) for use in subsequent analysis – save them in a dataset for use later, don’t re-estimate the whole model each time!
-
For quantiles:
fastxtile
,fasterxtile
,gquantile
- Familiarise yourself with Stata variable storage types. You do not want to hold variables in inefficient storage types which take up unnecessary memory, but on the other hand precision issues can also bite in large datasets if you are not careful in how ID variables are created and managed.
- Avoid strings as much as possible: they take up a huge amount of memory if you have many observations and/or the strings are long. It is often better to
encode
strings and save them as categorical numeric variables with value labels. - ID variables can also be converted to numeric variables, saving space and helping with commands like
xtset
, but typically not viaencode
. In this case usegegen group
(help gegen
). But take care when doing this: for ID variables it is essential that you declare the data type to retain precision. Often a “long” data type is sufficient, e.g.gegen long firmID = group(stringID)
but check digits of accuracy here; you may need to declare the type as “double”, at the expense of using more memory again. Note that the numeric ID variable will only be unique within the dataset it is created. Do not use for merging or after appending data. - In general it is useful to declare the data type when creating variables if you know what is appropriate. E.g. for a dummy variable:
gen byte female = gender == 2 if !mi(gender)
. - If you are using dates, using Stata’s native date functions (
help datetime
) can offer substantial improvements over using strings or other workarounds.
- Avoid strings as much as possible: they take up a huge amount of memory if you have many observations and/or the strings are long. It is often better to
-
ALWAYS
compress
your data before saving. This recasts variables into the smallest data type which preserves precision. -
if
conditions are slow. If you are repetitively performing analysis on a subset of the data, it is often better to subset the data beforehand withkeep
(which can be combined with frames,preserve
/restore
, tempfiles if the subset is only used temporarily). In general you may experience substantial speed improvements by reducing the dataset to what you need for analysis rather than running the analysis on a large dataset with extraneous variables/observations. -
merge
can be slow, but a substantial part of this is the internalsort
used to merge. If you consistently sort your data in a particular way before saving, and merge on the same hierarchy of variables, your merges will be quicker. And don’tsort
needlessly! -
Avoid
gsort
, which can be incredibly slow. If you need a descending sort, it is quicker to generate a negative version of the variable you’re sorting on, and sort on that. -
In general if you have to use a very slow command (e.g
reshape
) try to do it once and save the resulting dataset (or just a subset for amerge
) for re-use rather than re-running the time-intensive command again and again every time you need that particular dataset. -
codebook
is extremely slow. For numeric variables, a much quicker option to get a sense of a variable isinspect
. If you want the number of unique values of a variable, usegdistinct
orgunique
.gstats
provides a number of useful commands for quick summary statistics (help gstats
). -
Loading a large dataset takes time. Frames typically provide a speed improvement over
preserve
/restore
or using tempfiles. -
recode
is slow; a more cumbersonereplace if...
workflow is usually better in terms of computation time. But your programming time also matters!