Data Sources

Comprehensive guide to the PISA data underlying EduStrat (Educational Stratification in PISA)

EduStrat is designed to study the intergenerational transmission of educational achievement: specifically, how parental characteristics (education, occupational status, and household wealth, captured through the ESCS index) relate to children's performance at age 15 in mathematics, reading, and science across more than 100 countries and economies.

1. The PISA Programme

1.1 Programme Overview

The Programme for International Student Assessment (PISA) is an international survey coordinated by the Organisation for Economic Co-operation and Development (OECD). It has run every three years since 2000, except that the cycle planned for 2021 was postponed to 2022. PISA assesses the extent to which 15-year-old students near the end of compulsory education have acquired the knowledge and skills essential for full participation in modern societies.

Key Facts:

1.2 PISA Assessment Cycles

PISA has been conducted in the following years, with each cycle focusing on one major domain while assessing all three:

Year  Major Domain  Countries/Economies  Students Assessed  Status
2000  Reading       43                   ~265,000           Available in App
2003  Mathematics   41                   ~276,000           Available in App
2006  Science       57                   ~398,000           Available in App
2009  Reading       65                   ~475,000           Available in App
2012  Mathematics   65                   ~510,000           Available in App
2015  Science       72                   ~540,000           Available in App
2018  Reading       79                   ~600,000           Available in App
2022  Mathematics   81                   ~690,000           Available in App

This application includes data from all eight PISA cycles between 2000 and 2022 (2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022), covering 513 country-year combinations across 101+ unique countries/economies.

2. Assessment Framework

2.1 Mathematics Assessment

PISA mathematics assesses students' capacity to formulate, employ, and interpret mathematics in a variety of contexts. It includes reasoning mathematically and using mathematical concepts, procedures, facts, and tools to describe, explain, and predict phenomena.

Mathematics Competencies:

2.2 Reading Assessment

PISA reading literacy assesses students' capacity to understand, use, evaluate, reflect on, and engage with texts in order to achieve goals, develop knowledge and potential, and participate in society.

Reading Competencies:

2.3 Science Assessment

PISA science literacy assesses the ability to engage with science-related issues and with the ideas of science, as a reflective citizen. It includes understanding natural phenomena, designing scientific enquiry, and interpreting evidence.

Science Competencies:

3. Sampling Design

3.1 Target Population

PISA targets students who are between 15 years 3 months and 16 years 2 months at the time of assessment, regardless of their grade level. This age range was chosen because students at this age are approaching the end of compulsory schooling in most OECD countries.

3.2 Two-Stage Stratified Sampling

PISA employs a two-stage stratified sampling design:

  1. Stage 1 - School Sampling: Schools are sampled with probability proportional to size (PPS), where size is the number of eligible 15-year-old students. Typically, 150-200 schools per country are selected.
  2. Stage 2 - Student Sampling: Within each selected school, approximately 35 students are randomly sampled from the complete list of eligible 15-year-old students.
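
The two stages above can be sketched in a few lines. This is a minimal illustration only, assuming systematic PPS selection of schools and simple random sampling of students within each school; the function names, the toy school frame, and the sample sizes are hypothetical, and the stratification used in the operational PISA design is left out.

```python
import random

def pps_systematic(sizes, n_select, rng):
    """Systematic PPS: pick n_select indices with probability proportional to size."""
    total = sum(sizes)
    interval = total / n_select
    start = rng.random() * interval  # random start in [0, interval)
    chosen, cumulative, idx = [], 0, 0
    for k in range(n_select):
        target = start + k * interval
        # Advance to the school whose cumulative size range contains the target.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        chosen.append(idx)
    return chosen

def sample_students(n_eligible, n_target, rng):
    """Simple random sample of student positions within one school."""
    return sorted(rng.sample(range(n_eligible), min(n_eligible, n_target)))

rng = random.Random(42)
# Toy sampling frame (made up): 1,000 schools with 40 to 300 eligible students.
school_sizes = [rng.randint(40, 300) for _ in range(1000)]

schools = pps_systematic(school_sizes, 150, rng)            # Stage 1
students = {s: sample_students(school_sizes[s], 35, rng)    # Stage 2
            for s in schools}
```

A school larger than the sampling interval could be hit more than once by this simple routine; the toy frame avoids that case, and real designs treat such schools separately.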
Sampling Standards:

3.3 Sampling Weights

PISA provides several types of sampling weights to ensure representative estimates:

4. The learningtower R Package

4.1 Package Overview

EduStrat uses data processed through the learningtower R package (Wang et al., 2024), which provides harmonized, analysis-ready PISA data in a consistent format.

learningtower Package Benefits:

4.2 Package Installation

The learningtower package is available on CRAN:

install.packages("learningtower")
library(learningtower)

4.3 Data Access via learningtower

The package provides easy access to PISA data:

# Load all student data for a specific year
data_2018 <- load_student(2018)

# Keep one country by filtering on the country column
data_usa <- subset(data_2018, country == "USA")

# Access codebook
codebook <- load_codebook()

4.4 Package Citation

If you use data from this application, please cite the learningtower package:

Wang, K., Yacobellis, P., Siregar, E., Romanes, S., Fitter, K., Dalla Riva, G. V., Cook, D., Tierney, N., Dingorkar, P., Sai Subramanian, S., & Chen, G. (2024). learningtower: OECD PISA datasets from 2000–2022 in an easy-to-use format (R package version 1.1.0). https://doi.org/10.32614/CRAN.package.learningtower

5. Data Structure in This Application

5.1 JSON Chunk Format

EduStrat pre-generates 513 country-year-specific JSON files (e.g., USA_2018.json) for efficient progressive loading. Each chunk has the following structure (the students array is abbreviated to a single record here):

{
  "country": "USA",
  "year": 2018,
  "n_students": 4838,
  "data_quality": {
    "missing_math": 0,
    "missing_reading": 0,
    "missing_science": 0,
    "missing_escs": 0,
    "complete_cases": 4838
  },
  "students": [
    {
      "year": 2018,
      "country": "USA",
      "school_id": "USA_2018_84000001",
      "student_id": "USA_2018_84000250",
      "mother_educ": "ISCED 3A",
      "father_educ": "ISCED 3A",
      "gender": "male",
      "computer": "yes",
      "internet": "yes",
      "math": 534.751,
      "science": 488.669,
      "stu_wgt": 646.6246,
      "desk": "yes",
      "room": "no",
      "television": "3+",
      "computer_n": "1",
      "car": "3+",
      "book": "11-25",
      "wealth": 0.7125,
      "escs": -0.4811,
      "reading": 590.33
    }
  ]
}
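
As a rough sketch of how a chunk like the one above might be consumed outside the application, the following parses a chunk and checks that its header agrees with its student records. The helper names chunk_filename and validate_chunk are hypothetical, the file naming simply follows the USA_2018.json example, and the inline chunk is cut down to one record with a few fields.

```python
import json

def chunk_filename(country, year):
    """Build a chunk file name following the USA_2018.json pattern."""
    return f"{country}_{year}.json"

def validate_chunk(chunk):
    """Return a list of problems found; an empty list means the chunk is consistent."""
    problems = []
    students = chunk["students"]
    if chunk["n_students"] != len(students):
        problems.append("n_students does not match the number of student records")
    for s in students:
        if s["country"] != chunk["country"] or s["year"] != chunk["year"]:
            problems.append(f"record {s['student_id']} has a different country or year")
    return problems

# Cut-down chunk for illustration (one record, subset of fields).
raw = """
{"country": "USA", "year": 2018, "n_students": 1,
 "students": [{"year": 2018, "country": "USA",
               "student_id": "USA_2018_84000250",
               "math": 534.751, "reading": 590.33, "science": 488.669,
               "escs": -0.4811, "stu_wgt": 646.6246}]}
"""
chunk = json.loads(raw)
problems = validate_chunk(chunk)
print(chunk_filename(chunk["country"], chunk["year"]), problems)
```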

5.2 File Organization

5.3 Metadata File

The application also includes a metadata.json file that catalogs all available data:

{
  "countries": ["ALB", "ARG", "AUS", ...],
  "years": [2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022],
  "variables": {
    "math": "Mathematics achievement score",
    "reading": "Reading achievement score",
    "science": "Science achievement score",
    "escs": "PISA index of economic, social and cultural status",
    ...
  }
}

6. Variable Codebook

All variable names in the JSON data files are taken directly from the learningtower R package (Wang et al., 2024), with the sole exception that read is renamed to reading during chunk generation. The following variables are used by EduStrat in its analyses. The JSON data files also contain additional learningtower fields (wealth, computer, internet, desk, room, book, television, car, computer_n) that are not currently used by the application.

6.1 Outcome Variables

math     Mathematics achievement score (plausible value 1)
reading  Reading achievement score (plausible value 1)
science  Science achievement score (plausible value 1)

Note: PISA uses plausible values to account for measurement error. EduStrat uses the first plausible value (PV1) for each domain. Full analyses should consider all plausible values (5 in earlier cycles, 10 from PISA 2015 onward) and combine results using Rubin's rules.
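
For readers who want to go beyond PV1, the combination step mentioned in the note can be sketched as follows. This is a minimal illustration of Rubin's rules, assuming the same statistic has already been estimated once per plausible value and that each estimate comes with a sampling variance (in PISA these are normally obtained from the replicate weights). The function name and the numbers are made up.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's rules.

    Returns the pooled point estimate and its standard error.
    """
    m = len(estimates)
    point = mean(estimates)               # average of the m estimates
    within = mean(sampling_variances)     # average sampling variance
    between = variance(estimates)         # variance across plausible values (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return point, total_variance ** 0.5

# Made-up example: a country mean estimated with five plausible values.
pv_means = [500.0, 502.0, 498.0, 501.0, 499.0]
pv_sampling_vars = [4.0, 4.0, 4.0, 4.0, 4.0]
point, se = combine_plausible_values(pv_means, pv_sampling_vars)
print(round(point, 2), round(se, 2))
```

Here the between-value variance is 2.5, so the total variance is 4.0 + 1.2 × 2.5 = 7.0, which is larger than the sampling variance alone. Using only PV1 leaves that extra measurement uncertainty out of the standard error.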

6.2 Predictor Variables

escs         PISA index of economic, social and cultural status (standardized to mean 0, SD 1 across OECD countries). Primary predictor variable.
mother_educ  Mother's education level (ISCED classification, e.g., "ISCED 2", "ISCED 3A")
father_educ  Father's education level (ISCED classification)

Note: EduStrat derives a parent_edu variable at runtime by taking the higher of mother_educ and father_educ (converted to years of schooling via ISCED mapping). This serves as the alternative predictor option in the application.
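
The derivation described in the note could look roughly like this. The ISCED-to-years table below is a hypothetical placeholder, since the mapping EduStrat actually uses is not listed here, and falling back to the one parent with a known level is likewise an assumption.

```python
# Hypothetical mapping from ISCED labels to years of schooling (illustration only).
ISCED_YEARS = {
    "ISCED 1": 6,
    "ISCED 2": 9,
    "ISCED 3A": 12,
}

def parent_edu(student):
    """Higher of the two parents' education, in years of schooling.

    Returns None when neither parent has a label found in the mapping.
    """
    years = [ISCED_YEARS.get(student.get(key)) for key in ("mother_educ", "father_educ")]
    known = [y for y in years if y is not None]
    return max(known) if known else None

both = parent_edu({"mother_educ": "ISCED 2", "father_educ": "ISCED 3A"})
one = parent_edu({"mother_educ": "ISCED 2", "father_educ": None})
none = parent_edu({"mother_educ": None, "father_educ": None})
print(both, one, none)
```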

6.3 Demographic Variables

gender Student gender ("male" / "female"). Available as a control variable in regression analyses.

6.4 Sampling Weight

stu_wgt Final student sampling weight (W_FSTUWT). Adjusts for unequal selection probability and non-response. Used in all weighted analyses.
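
To make the role of stu_wgt concrete, here is a small sketch of a weighted mean and a weighted least-squares slope of a score on escs. The function names and the toy numbers are illustrative and are not taken from the application code.

```python
def weighted_mean(values, weights):
    """Mean of values where each observation counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_slope(x, y, weights):
    """Slope and intercept of a weighted least-squares line of y on x."""
    x_bar = weighted_mean(x, weights)
    y_bar = weighted_mean(y, weights)
    num = sum(w * (xi - x_bar) * (yi - y_bar) for xi, yi, w in zip(x, y, weights))
    den = sum(w * (xi - x_bar) ** 2 for xi, w in zip(x, weights))
    slope = num / den
    return slope, y_bar - slope * x_bar

# Toy data: three students with escs, math score, and sampling weight.
escs = [-1.0, 0.0, 1.0]
math_scores = [450.0, 500.0, 550.0]
stu_wgt = [1.0, 2.0, 1.0]

mean_math = weighted_mean(math_scores, stu_wgt)
slope, intercept = weighted_slope(escs, math_scores, stu_wgt)
print(mean_math, slope, intercept)
```

In this toy data the scores lie exactly on a line, so the slope is 50 score points per unit of escs whatever the weights are. With real data the weights shift the estimates toward students who represent more of the population.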

6.5 Identification Variables

country     Country code (ISO 3166-1 alpha-3, e.g., USA, DEU, JPN)
year        PISA assessment year (2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022)
student_id  Unique student identifier
school_id   School identifier

7. Official OECD Data Access

7.1 OECD PISA Data Portal

The official source for all PISA data is the OECD PISA Data Portal:

Main Portal: https://www.oecd.org/pisa/data/

7.2 Cycle-Specific Data

7.3 Technical Documentation

8. Data Quality and Limitations

8.1 Strengths

8.2 Limitations

8.3 Missing Data

PISA data contain missing values due to:

This application uses complete-case analysis by default. Advanced users should consider multiple imputation methods for handling missing data.
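
Complete-case analysis amounts to dropping every student who lacks a value on any variable in the model. A minimal sketch follows, assuming missing values appear as null (None after parsing) or as absent keys; that encoding, the function name, and the records are assumptions made for illustration.

```python
def complete_cases(students, fields):
    """Keep only records that have a non-missing value for every listed field."""
    return [s for s in students if all(s.get(f) is not None for f in fields)]

# Made-up records for illustration.
students = [
    {"math": 534.751, "escs": -0.4811, "gender": "male"},
    {"math": 480.2, "escs": None, "gender": "female"},
    {"math": None, "escs": 0.35, "gender": "female"},
]

kept = complete_cases(students, ["math", "escs"])
dropped_share = 1 - len(kept) / len(students)
print(len(kept), round(dropped_share, 2))
```

Reporting the share of dropped records alongside any estimate helps readers judge how much the complete-case restriction might matter for a given country and year.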

9. Countries Included in This Application

9.1 OECD Countries

The following 38 OECD countries are available (availability varies by year):

Australia (AUS)
Austria (AUT)
Belgium (BEL)
Canada (CAN)
Chile (CHL)
Colombia (COL)
Costa Rica (CRI)
Czech Republic (CZE)
Denmark (DNK)
Estonia (EST)
Finland (FIN)
France (FRA)
Germany (DEU)
Greece (GRC)
Hungary (HUN)
Iceland (ISL)
Ireland (IRL)
Israel (ISR)
Italy (ITA)
Japan (JPN)
Korea (KOR)
Latvia (LVA)
Lithuania (LTU)
Luxembourg (LUX)
Mexico (MEX)
Netherlands (NLD)
New Zealand (NZL)
Norway (NOR)
Poland (POL)
Portugal (PRT)
Slovak Republic (SVK)
Slovenia (SVN)
Spain (ESP)
Sweden (SWE)
Switzerland (CHE)
Turkey (TUR)
United Kingdom (GBR)
United States (USA)

9.2 Partner Countries/Economies

An additional 60+ partner countries and economies are available, including:

Albania (ALB)
Argentina (ARG)
Brazil (BRA)
Bulgaria (BGR)
China (CHN)*
Croatia (HRV)
Hong Kong (HKG)
India (IND)*
Indonesia (IDN)
Jordan (JOR)
Kazakhstan (KAZ)
Macao (MAC)
Malaysia (MYS)
Peru (PER)
Qatar (QAT)
Romania (ROU)
Russia (RUS)
Serbia (SRB)
Singapore (SGP)
Chinese Taipei (TWN)
Thailand (THA)
Uruguay (URY)
Vietnam (VNM)

* Note: Some countries participate through specific regions or provinces rather than nationally

10. References and Further Reading

10.1 Primary Sources

  1. OECD. (2024). PISA 2022 Technical Report. OECD Publishing. https://doi.org/10.1787/01820d6d-en
  2. OECD. (2009). PISA data analysis manual: SPSS (2nd ed.). OECD Publishing. https://doi.org/10.1787/9789264056275-en
  3. Wang, K., Yacobellis, P., Siregar, E., Romanes, S., Fitter, K., Dalla Riva, G. V., Cook, D., Tierney, N., Dingorkar, P., Sai Subramanian, S., & Chen, G. (2024). learningtower: OECD PISA datasets from 2000–2022 in an easy-to-use format (R package version 1.1.0). https://doi.org/10.32614/CRAN.package.learningtower

10.2 Methodological References

  1. Avvisati, F. & Wuyts, C. (2024). The measurement of socio-economic status in PISA. OECD Education Working Papers, No. 321. OECD Publishing. https://doi.org/10.1787/0c5b793c-en
  2. Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31(2–3), 114–128. https://doi.org/10.1016/j.stueduc.2005.05.005

10.3 Documentation in This Application
