Datasets

Links to various public data sources, and to datasets created by me and my coauthors.


South African microdata

The single best source for South African microdata for academic use, particularly for household surveys, is the DataFirst Data Portal. Their catalogue also includes data from a variety of other African countries.


Statistics South Africa

The national statistical office, Statistics South Africa, collects and produces an invaluable array of South African data, but their website is difficult to navigate. A few particularly useful pages:


Matched employer-employee tax administrative data

This data can be accessed in person at the National Treasury Secure Data Facility (NT-SDF) in Pretoria. Access to the data and retrieval of output are strictly controlled to preserve anonymity and to comply with relevant legislation: see the SA-TIED website for details.

  • The datasets are very large, the underlying tax data is not collected with researchers in mind, and the data documentation is incomplete. This data is incredibly rich, but is not easy to work with. You must be very comfortable with Stata, R or Python to conduct productive research with this data; the NT-SDF is not a place to learn these languages as you go.
  • I have provided some tips on working with large data in Stata, based on my experiences at the NT-SDF.
  • The NT-SDF administrators may be able to direct you to research assistants based in Pretoria who can run analysis for you. Keep in mind, though, that the data is complex to work with, so it is a good idea to retain close oversight and to run many checks on the data as you go.
  • Carefully reviewing the available documentation before you start is highly beneficial. More detailed documentation and discussion is available for sub-components of the data, which may be more disaggregated than the firm level (e.g. worker-level employment data). See references in the combined panel guide.
  • Data updates are available here; more detail is provided here.
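
As a rough illustration of the advice above about large files and running checks as you go, here is a minimal Python sketch that streams a file row by row instead of loading it into memory, and counts basic data problems while aggregating. The column names (`firm_id`, `tax_year`, `wage_bill`) and the sample rows are hypothetical and are not taken from the NT-SDF data; the function name is also just an illustrative choice.

```python
import csv
import io
from collections import defaultdict


def aggregate_streaming(lines, key_col, value_col):
    """Sum value_col by key_col one row at a time, counting basic data problems.

    `lines` can be any iterable of text lines (an open file, for example),
    so the full dataset never has to fit in memory.
    """
    totals = defaultdict(float)
    problems = {"missing_key": 0, "bad_value": 0}
    n_rows = 0
    for row in csv.DictReader(lines):
        n_rows += 1
        key = (row.get(key_col) or "").strip()
        if not key:
            # Rows without an identifier cannot be assigned to any group.
            problems["missing_key"] += 1
            continue
        try:
            value = float(row[value_col])
        except (KeyError, TypeError, ValueError):
            # Empty or non-numeric values are counted, not silently dropped.
            problems["bad_value"] += 1
            continue
        totals[key] += value
    return dict(totals), problems, n_rows


# Hypothetical sample data standing in for a much larger file.
SAMPLE = """firm_id,tax_year,wage_bill
A1,2015,100.0
A1,2016,120.5
B2,2015,
,2015,50.0
B2,2016,80.0
"""

totals, problems, n_rows = aggregate_streaming(
    io.StringIO(SAMPLE), "firm_id", "wage_bill"
)
print(totals)
print(problems, n_rows)
```

Reporting the problem counts next to the totals makes it easier to notice when an unexpectedly large share of rows is being excluded.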

South African industry codes and cross-walks

Tables from my report with Amina Ebrahim, Industry classification in the South African tax microdata, are available below and at the NT-SDF. If you use these tables, please cite the report.


South African COVID-19 data

Electricity supply/generation and load shedding (rolling blackouts)

Geospatial data
  • The 2011 South African Census data and shapefiles are available from DataFirst.
    • The lowest Small Area Layer (SAL) shapefiles do not include polygons for areas where a Small Area (SA) contains 10 or fewer individuals. Helene Verhoef kindly provided me with additional Stats SA shapefiles which include these (empty) polygons. Please acknowledge Stats SA if you use them.
    • Adrian Frith’s census mapping website is very useful if you are working with this data.
    • Adrian Frith also provided a small dataset linking 2001 to 2011 municipality geographies, which one can use to construct a cross-walk (context).
  • The SALDRU YouthExplorer is also a very useful source for geolocated South African data. Apart from youth-focused socio-economic statistics, it also provides downloadable point data for the coordinates of “service points” such as schools, police stations, post offices, healthcare facilities, SASSA offices, and a variety of other facilities.
  • South African police station coordinates with their boundaries are available from SAPS.
  • The Copernicus Climate Data Store is an authoritative source of weather data.
  • The Gridded Population of the World (GPW) data from SEDAC is very useful if you need the spatial distribution of human population across a continuous raster surface, rather than one defined by administrative boundaries.
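
For the municipality cross-walk mentioned above, one common approach is to turn a table of links between old and new geographies into allocation weights, and then use those weights to reallocate a variable measured on the old boundaries. The Python sketch below assumes each link carries some overlap measure (population or area, for instance); this is an assumption for illustration and not a description of the structure of the linked dataset. The codes `OLD1`, `NEW1` and so on, and both function names, are hypothetical.

```python
from collections import defaultdict


def build_crosswalk(links):
    """Turn (old_code, new_code, overlap) links into allocation weights.

    Returns {old_code: {new_code: weight}}, where the weights for each
    old code sum to 1.
    """
    overlap_by_old = defaultdict(lambda: defaultdict(float))
    for old, new, overlap in links:
        overlap_by_old[old][new] += overlap

    crosswalk = {}
    for old, parts in overlap_by_old.items():
        total = sum(parts.values())
        if total <= 0:
            raise ValueError(f"No positive overlap recorded for {old!r}")
        crosswalk[old] = {new: value / total for new, value in parts.items()}
    return crosswalk


def reallocate(values_old, crosswalk):
    """Distribute values on old geographies across new geographies."""
    values_new = defaultdict(float)
    for old, value in values_old.items():
        for new, weight in crosswalk[old].items():
            values_new[new] += value * weight
    return dict(values_new)


# Hypothetical links: OLD1 is split between NEW1 and NEW2,
# while OLD2 falls entirely inside NEW2.
links = [
    ("OLD1", "NEW1", 300.0),
    ("OLD1", "NEW2", 100.0),
    ("OLD2", "NEW2", 500.0),
]
crosswalk = build_crosswalk(links)
values_2011 = reallocate({"OLD1": 80.0, "OLD2": 20.0}, crosswalk)
print(crosswalk)
print(values_2011)
```

Proportional allocation like this suits counts and totals; rates and averages generally need a different treatment, such as reallocating the numerator and denominator separately.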

Other