Datasets
Links to various public data sources, and datasets created by myself and coauthors.
South African microdata
The single best source for South African microdata for academic use, particularly for household surveys, is the DataFirst Data Portal. Their catalogue also includes data from a variety of other African countries.
Statistics South Africa
The national statistical office, Statistics South Africa, collects and produces an invaluable array of South African data, but their website is difficult to navigate. A few particularly useful pages:
- Historic CPI series, CPI archive, PPI archive
- The historic headline CPI data is trapped inside a PDF: I wrote a small script to extract and reshape the data into long .csv and .dta files, available at my GitHub repo.
- Aidan Horn scrapes and publishes this and more disaggregated non-historic data.
- ‘Interactive data’ webpage, which links to the Nesstar interface for downloading microdata, and to economic time series in machine-readable formats.
- QLFS, QES and GDP archives
- Archives of various monthly series, mainly production and sales data: Mining, Manufacturing, Electricity, Retail trade, Wholesale trade, Food and beverages, Motor trade, Tourist accommodation, Tourism and Migration, Import/Export Unit Value Indices
- Input-output (see also here and here) and Supply and use (see here for 1998-2005) tables
- Mid-year population estimates
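As an illustration of the "reshape into long" step mentioned for the historic CPI series, here is a minimal sketch using only the Python standard library. It is not the script from my repo. It assumes the table has already been extracted from the PDF into a wide layout with one row per year and one column per month; that layout, the column names (`year`, `Jan`, `Feb`, ...), the function name `wide_to_long`, and the numbers are all hypothetical and made up for the example.

```python
import csv
import io

# Assumed month column names in the wide table (hypothetical layout).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def wide_to_long(rows):
    """Turn one-row-per-year records into one-row-per-month records."""
    long_rows = []
    for row in rows:
        year = int(row["year"])
        for month_number, month_name in enumerate(MONTHS, start=1):
            value = row.get(month_name, "")
            if value in ("", None):
                continue  # skip months that are absent or blank
            long_rows.append(
                {"year": year, "month": month_number, "cpi": float(value)}
            )
    return long_rows


# Made-up illustrative values, NOT real CPI figures.
wide_csv = """year,Jan,Feb,Mar
2001,50.0,50.5,51.0
2002,53.0,,54.0
"""

wide_rows = list(csv.DictReader(io.StringIO(wide_csv)))
long_rows = wide_to_long(wide_rows)

# Write the long table as CSV text (here to a string rather than a file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["year", "month", "cpi"])
writer.writeheader()
writer.writerows(long_rows)
print(out.getvalue())
```

A long layout with one observation per row is usually easier to merge with other monthly series than a year-by-month grid, which is why it is a convenient target format.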
Matched employer-employee tax administrative data
The data can be accessed in person at the National Treasury Secure Data Facility (NT-SDF) in Pretoria. Access to the data and output retrieval are strictly controlled to ensure anonymity is preserved and to comply with relevant legislation: see the SA-TIED website for details.
- The datasets are very large, the underlying tax data is not collected with researchers in mind, and the data documentation is incomplete. This data is incredibly rich, but is not easy to work with. You must be very comfortable with Stata, R or Python to conduct productive research with this data; the NT-SDF is not a place to learn these languages as you go.
- I have provided some tips on working with large data in Stata, based on my experiences at the NT-SDF.
- The NT-SDF administrators may be able to direct you to research assistants based in Pretoria who can run analysis for you, but keep in mind that the data is complex to work with and it is probably a good idea to retain fairly close oversight and run many checks on the data as you go.
- Carefully reviewing the available documentation before you start is highly beneficial. More detailed documentation and discussion is available for sub-components of the data, which may be more disaggregated than the firm level (e.g. worker-level employment data). See references in the combined panel guide.
- Data updates are available here; more detail is provided here.
South African industry codes and cross-walks
Tables from my report with Amina Ebrahim, Industry classification in the South African tax microdata, are available below and at the NT-SDF. If you use these tables, please cite the paper.
- Industry codes: Stats SA SIC 7 | Stats SA SIC 5 | SARS Activity Codes | SARS Profit Codes
- Concordance tables: SIC 5 to SIC 7 | SIC 7 to SIC 5 | Activity Codes to SIC 5 | Profit Codes to SIC 5
- GitHub repo with CSV versions & additional resources
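For readers who have not used a concordance table before, the sketch below shows one way to apply one in Python, using only the standard library. It assumes a two-column CSV with hypothetical column names `sic5` and `sic7`, and the codes in it are invented placeholders, so check the actual files for their real layout. The helper names `load_concordance` and `recode` are also made up for this example. The main point is that a source code can map to more than one target code, so it is worth separating clean matches from ambiguous and unmatched ones rather than silently picking the first match.

```python
import csv
import io
from collections import defaultdict


def load_concordance(csv_text, source_col, target_col):
    """Read a concordance into {source code: [target codes]}."""
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        source, target = row[source_col], row[target_col]
        if target not in mapping[source]:
            mapping[source].append(target)
    return dict(mapping)


def recode(codes, mapping):
    """Split codes into one-to-one matches, one-to-many matches, and misses."""
    matched, ambiguous, unmatched = {}, {}, []
    for code in codes:
        targets = mapping.get(code, [])
        if len(targets) == 1:
            matched[code] = targets[0]
        elif targets:
            ambiguous[code] = targets  # needs a manual or weighted allocation
        else:
            unmatched.append(code)
    return matched, ambiguous, unmatched


# Invented placeholder codes, NOT real SIC codes.
concordance_csv = """sic5,sic7
A1,X10
A2,X20
A2,X21
"""

mapping = load_concordance(concordance_csv, "sic5", "sic7")
matched, ambiguous, unmatched = recode(["A1", "A2", "A9"], mapping)
print("matched:", matched)
print("ambiguous:", ambiguous)
print("unmatched:", unmatched)
```

How to allocate the ambiguous codes depends on the research question, so the sketch only flags them.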
South African COVID-19 data
- Machine-readable South African COVID-19 data, from the Data Science for Social Impact group at the University of Pretoria.
- The NICD is the definitive source for SA COVID-19 data, but last I checked most of their data were trapped inside PDFs.
- Weekly deaths and estimated excess deaths data from the SAMRC.
Electricity supply/generation and load shedding (rolling blackouts)
- National load shedding implementation data from EskomSePush.
- The Eskom Data Portal provides a wide array of detailed time series data. Use the (automated) “Data request form”. Longer time series are available via PAIA request.
- The City of Cape Town Open Data Portal provides detailed load shedding implementation data for city-supplied regions.
- Detailed Eskom monthly plant-level energy availability factors (EAF) (see the “Evidence docket” link) provided by amaBhungane.
Geospatial data
- The 2011 South African Census data and shapefiles are available from DataFirst.
- The lowest Small Area Layer (SAL) shapefiles do not include polygons for areas where a Small Area (SA) contains 10 or fewer individuals. Helene Verhoef kindly provided me with additional Stats SA shapefiles which include these (empty) polygons. Please acknowledge Stats SA if you use them.
- Adrian Frith’s census mapping website is very useful if working with this data.
- Adrian Frith also provided a small dataset linking 2001 to 2011 municipality geographies, which one can use to construct a cross-walk (context).
- The SALDRU YouthExplorer is also a very useful source for geolocated South African data. Apart from youth-focused socio-economic statistics, it also provides downloadable point data for the coordinates of “service points” such as schools, police stations, post offices, healthcare facilities, SASSA offices, and a variety of other facilities.
- South African police station coordinates with their boundaries are available from SAPS.
- The Copernicus Climate Data Store is a definitive source of weather data.
- The Gridded Population of the World (GPW) data from SEDAC is very useful if you need the spatial distribution of human population across a continuous raster surface, rather than one defined by administrative boundaries.
Other
- Catalogue of public data sources courtesy of Open Data South Africa. Their website, with data organised by theme, is easier to navigate than the (overwhelming) catalogue.