Data Sources

Comprehensive guide to the PISA data underlying the Educational Stratification in PISA tool

This application is designed to study the intergenerational transmission of educational achievement: how parental characteristics (education, occupational status, and household wealth, summarized in the ESCS index) relate to children's academic performance at age 15 in mathematics, reading, and science across more than 100 countries and economies.

1. The PISA Programme

1.1 Programme Overview

The Programme for International Student Assessment (PISA) is an international survey coordinated by the Organisation for Economic Co-operation and Development (OECD). It has run every three years since 2000, except that the cycle planned for 2021 was postponed to 2022. PISA assesses the extent to which 15-year-old students near the end of compulsory education have acquired the knowledge and skills essential for full participation in modern societies.

Key Facts:
Launch Year: 2000
Frequency: Every 3 years
Target Population: 15-year-old students
Domains Assessed: Mathematics, Reading, Science (rotating major focus)
Participating Countries/Economies: 81 (as of 2022)
Sample Size: ~265,000 (2000) to ~690,000 (2022) students per cycle

1.2 PISA Assessment Cycles

PISA has been conducted in the following years, with each cycle focusing on one major domain while assessing all three:

Year	Major Domain	Countries/Economies	Students Assessed	Status
2000	Reading	43	~265,000	Available in App
2003	Mathematics	41	~276,000	Available in App
2006	Science	57	~398,000	Available in App
2009	Reading	65	~475,000	Available in App
2012	Mathematics	65	~510,000	Available in App
2015	Science	72	~540,000	Available in App
2018	Reading	79	~600,000	Available in App
2022	Mathematics	81	~690,000	Available in App

This application includes data from all eight PISA cycles between 2000 and 2022 listed above, covering 513 country-year combinations across 101+ unique countries/economies.

2. Assessment Framework

2.1 Mathematics Assessment

PISA mathematics assesses students' capacity to formulate, employ, and interpret mathematics in a variety of contexts. It includes reasoning mathematically and using mathematical concepts, procedures, facts, and tools to describe, explain, and predict phenomena.

Mathematics Competencies:

Mathematical reasoning
Problem formulation and solving
Mathematical modeling
Mathematical argumentation

2.2 Reading Assessment

PISA reading literacy assesses students' capacity to understand, use, evaluate, reflect on, and engage with texts in order to achieve goals, develop knowledge and potential, and participate in society.

Reading Competencies:

Retrieving information
Forming broad understanding
Developing interpretation
Reflecting on and evaluating content and form

2.3 Science Assessment

PISA science literacy assesses the ability to engage with science-related issues and with the ideas of science, as a reflective citizen. It includes understanding natural phenomena, designing scientific enquiry, and interpreting evidence.

Science Competencies:

Explaining phenomena scientifically
Evaluating and designing scientific enquiry
Interpreting data and evidence scientifically

3. Sampling Design

3.1 Target Population

PISA targets students who are between 15 years 3 months and 16 years 2 months at the time of assessment, regardless of their grade level. This age range was chosen because students at this age are approaching the end of compulsory schooling in most OECD countries.
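The age window can be expressed in completed months: 15 years 3 months is 183 months and 16 years 2 months is 194 months. The following is a minimal sketch of an eligibility check under that reading. The function names and the choice of counting completed months against a single reference date are illustrative assumptions, not the OECD's operational definition.

```python
from datetime import date

# 15 years 3 months and 16 years 2 months, expressed in completed months.
MIN_AGE_MONTHS = 15 * 12 + 3  # 183
MAX_AGE_MONTHS = 16 * 12 + 2  # 194


def completed_months(birth: date, reference: date) -> int:
    """Number of whole months elapsed between birth and the reference date."""
    months = (reference.year - birth.year) * 12 + (reference.month - birth.month)
    if reference.day < birth.day:
        months -= 1  # the current month is not yet complete
    return months


def in_pisa_age_window(birth: date, reference: date) -> bool:
    """True if the age at the reference date lies in the 15y3m to 16y2m window."""
    return MIN_AGE_MONTHS <= completed_months(birth, reference) <= MAX_AGE_MONTHS
```

Grade level plays no role in the check, which mirrors the age-based (rather than grade-based) definition of the target population.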

3.2 Two-Stage Stratified Sampling

PISA uses a two-stage stratified sampling design:

Stage 1 - School Sampling: Schools are sampled with probability proportional to size (PPS), where size is the number of eligible 15-year-old students. Typically, 150-200 schools per country are selected.
Stage 2 - Student Sampling: Within each selected school, approximately 35 students are randomly sampled from the complete list of eligible 15-year-old students.
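The two stages can be sketched in a few lines. This is a simplified illustration, not the OECD's sampling software: it uses systematic PPS selection on an unstratified school list, assumes no school is large enough to be selected with certainty, and all names and the synthetic school sizes are hypothetical.

```python
import random


def pps_systematic_sample(school_sizes, n_schools, rng):
    """Stage 1: pick schools with probability proportional to enrolment size."""
    total = sum(school_sizes.values())
    interval = total / n_schools
    if max(school_sizes.values()) >= interval:
        raise ValueError("certainty schools are not handled in this sketch")
    start = rng.random() * interval  # random start in [0, interval)
    targets = iter(start + i * interval for i in range(n_schools))
    target = next(targets, None)
    selected, cumulative = [], 0.0
    for school, size in school_sizes.items():
        cumulative += size
        # A school is selected when a target point falls inside its size range.
        while target is not None and target < cumulative:
            selected.append(school)
            target = next(targets, None)
    return selected


def sample_students(student_ids, n_students, rng):
    """Stage 2: simple random sample of students; take everyone in small schools."""
    if len(student_ids) <= n_students:
        return list(student_ids)
    return rng.sample(student_ids, n_students)


rng = random.Random(2018)  # fixed seed so the sketch is reproducible
# Synthetic frame: 300 schools with 40 to 120 eligible students each.
schools = {f"school_{i:03d}": 40 + 20 * (i % 5) for i in range(300)}
sampled_schools = pps_systematic_sample(schools, 150, rng)
sampled_students = sample_students(list(range(100)), 35, rng)
```

Because larger schools are more likely to be chosen in stage 1 while a fixed number of students is drawn in stage 2, students end up with roughly equal overall selection probabilities, which is the usual motivation for this design.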

Sampling Standards:
Minimum Sample Size: 4,500 students (for small countries, all eligible students may be assessed)
Minimum School Participation Rate: 85%
Minimum Student Participation Rate: 80%
Combined Participation Rate: At least 65% (school rate × student rate)

3.3 Sampling Weights

PISA provides several types of sampling weights to ensure representative estimates:

Student Weight (W_FSTUWT): Adjusts for unequal probability of selection and non-response. Use this for most analyses.
Senate Weight (W_FSENWT): Gives each country equal weight regardless of population size. Use for cross-country comparisons where you want each country to contribute equally.
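Replicate Weights (W_FSTR1-W_FSTR80): 80 replicate weights used for balanced repeated replication (BRR) variance estimation with Fay's adjustment. In the OECD files from PISA 2015 onward they are named W_FSTURWT1-W_FSTURWT80.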
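To show how the final and replicate weights fit together, here is a minimal sketch of a weighted mean with a BRR standard error. It assumes Fay's factor of 0.5, which with 80 replicates gives a divisor of 80 × (1 − 0.5)² = 20. The function names are illustrative, and the three-student example uses made-up numbers with only two replicates to keep the arithmetic visible.

```python
import math


def weighted_mean(values, weights):
    """Mean of values using the given sampling weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def brr_standard_error(values, final_weights, replicate_weights, fay_factor=0.5):
    """BRR standard error of the weighted mean.

    replicate_weights is a list of G weight vectors, one per replicate.
    Variance = sum((theta_g - theta)^2) / (G * (1 - k)^2), with k the Fay factor.
    """
    full_estimate = weighted_mean(values, final_weights)
    n_replicates = len(replicate_weights)
    squared_deviations = sum(
        (weighted_mean(values, rep) - full_estimate) ** 2
        for rep in replicate_weights
    )
    variance = squared_deviations / (n_replicates * (1 - fay_factor) ** 2)
    return math.sqrt(variance)


# Made-up scores and weights, for illustration only.
scores = [400.0, 500.0, 600.0]
final = [1.0, 1.0, 1.0]
replicates = [[1.5, 0.5, 1.0], [0.5, 1.5, 1.0]]
standard_error = brr_standard_error(scores, final, replicates)
```

The same pattern applies to any statistic: compute it once with the final weight, once per replicate weight, and combine the squared deviations.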

4. The learningtower R Package

4.1 Package Overview

This application uses data processed through the learningtower R package (Vaughan et al., 2021), which provides harmonized, analysis-ready PISA data in a consistent format.

learningtower Package Benefits:

Harmonized variable names across assessment cycles
Cleaned and validated data
Consistent country codes (ISO 3166-1 alpha-3)
Simplified access to core variables
Reduced file size compared to raw OECD data

4.2 Package Installation

The learningtower package is available on CRAN:

install.packages("learningtower")
library(learningtower)

4.3 Data Access via learningtower

The package provides easy access to PISA data:

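# Load all student data for a specific year (the year is given as a string)
data_2018 <- load_student("2018")

# load_student() only takes the year, so filter by country after loading
data_usa <- subset(data_2018, country == "USA")

# Browse the package help index for variable descriptions
help(package = "learningtower")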

4.4 Package Citation

If you use data from this application, please cite the learningtower package:

Vaughan, B., Stanke, L., Teng, T., Hyndman, R., & O'Hara-Wild, E. (2021). learningtower: OECD PISA datasets from 2000-2018 in an easy-to-use format (R package version 1.0.1). https://CRAN.R-project.org/package=learningtower

5. Data Structure in This Application

5.1 JSON Chunk Format

This application pre-generates 513 country-year specific JSON files (e.g., USA_2018.json) for efficient progressive loading. Each chunk contains:

{
  "country": "USA",
  "year": 2018,
  "n_students": 4838,
  "data_quality": {
    "missing_math": 0,
    "missing_reading": 0,
    "missing_science": 0,
    "missing_escs": 0,
    "complete_cases": 4838
  },
  "students": [
    {
      "student_id": "USA_2018_00001",
      "math": 498.5,
      "reading": 505.2,
      "science": 502.9,
      "escs": 0.23,
      "gender": "male",
      "age": 15.5,
      ...
    }
  ]
}
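A chunk with this shape can be read and sanity-checked as follows. This is a minimal sketch: the two-record sample is invented for illustration, the function name is hypothetical, and the check that n_students equals the length of the students array is an assumption about the format rather than something the application is known to enforce.

```python
import json

# Invented two-student chunk that follows the structure shown above.
SAMPLE_CHUNK = """
{
  "country": "USA",
  "year": 2018,
  "n_students": 2,
  "students": [
    {"student_id": "USA_2018_00001", "math": 498.5, "escs": 0.23},
    {"student_id": "USA_2018_00002", "math": 512.0, "escs": -0.41}
  ]
}
"""


def parse_chunk(text):
    """Parse one country-year chunk and verify its basic structure."""
    chunk = json.loads(text)
    for key in ("country", "year", "n_students", "students"):
        if key not in chunk:
            raise ValueError(f"chunk is missing required key: {key}")
    # Assumption: n_students counts the records in the students array.
    if chunk["n_students"] != len(chunk["students"]):
        raise ValueError("n_students does not match the number of student records")
    return chunk


chunk = parse_chunk(SAMPLE_CHUNK)
mean_math = sum(s["math"] for s in chunk["students"]) / chunk["n_students"]
```

The unweighted mean here is only a smoke test of the parsed data; substantive estimates should use the sampling weights.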

5.2 File Organization

Location: pisa/data/country-year/
Naming Convention: {COUNTRY}_{YEAR}.json (e.g., USA_2018.json, DEU_2022.json)
Total Files: 513 files (one per country-year combination)
File Size Range: 2-5 MB per file
Total Data Size: ~1 GB
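The naming convention makes it straightforward to build or parse chunk paths. The helper names below are hypothetical, and the pattern assumes three-letter uppercase country codes and four-digit years, as in the examples above.

```python
import re
from pathlib import PurePosixPath

DATA_DIR = PurePosixPath("pisa/data/country-year")
# Assumed pattern: three uppercase letters, underscore, four-digit year.
CHUNK_NAME = re.compile(r"^([A-Z]{3})_(\d{4})\.json$")


def chunk_path(country, year):
    """Build the relative path of the chunk for one country and year."""
    return DATA_DIR / f"{country.upper()}_{year}.json"


def parse_chunk_name(filename):
    """Split a chunk file name into (country code, year)."""
    match = CHUNK_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a country-year chunk name: {filename}")
    return match.group(1), int(match.group(2))
```

For example, `chunk_path("usa", 2018)` yields `pisa/data/country-year/USA_2018.json`, and `parse_chunk_name("DEU_2022.json")` yields `("DEU", 2022)`.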

5.3 Metadata File

The application also includes a metadata.json file that catalogs all available data:

{
  "countries": ["ALB", "ARG", "AUS", ...],
  "years": [2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022],
  "variables": {
    "math": "Mathematics achievement score",
    "reading": "Reading achievement score",
    "science": "Science achievement score",
    "escs": "PISA index of economic, social and cultural status",
    ...
  }
}

6. Variable Codebook

6.1 Achievement Variables

math: Mathematics achievement score (plausible value 1)
reading: Reading achievement score (plausible value 1)
science: Science achievement score (plausible value 1)

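Note: PISA reports achievement as plausible values, which are multiple draws from each student's estimated proficiency distribution, to account for measurement error. This application uses the first plausible value (PV1) for each domain for simplicity. Advanced analyses should use all available plausible values (5 per domain through 2012, 10 per domain from 2015 onward) and combine the results.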
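For readers working with the full OECD files, the standard way to combine results across plausible values is Rubin's rules: run the analysis once per plausible value, average the estimates, and add the between-value variance to the average sampling variance. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's rules.

    estimates: one point estimate per plausible value.
    sampling_variances: the matching sampling variances (e.g. from BRR).
    Returns (combined estimate, combined standard error).
    """
    m = len(estimates)
    point = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return point, total ** 0.5


# Made-up estimates from five plausible values, for illustration only.
estimate, standard_error = combine_plausible_values(
    [500.0, 502.0, 498.0, 501.0, 499.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
)
```

Using only PV1, as this application does, gives an unbiased point estimate but omits the between-value term, so standard errors are somewhat understated.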

6.2 Socioeconomic Status Variables

escs: PISA index of economic, social and cultural status (standardized to mean 0, SD 1 across OECD countries)
wealth: Family wealth index derived from household possessions
books: Number of books at home (categorical: 0-10, 11-25, 26-100, 101-200, 201-500, 500+)
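Because escs is on a standardized scale, a common summary is the score-point change associated with a one-unit increase in ESCS. The sketch below computes that slope by weighted least squares. It is an illustration under stated assumptions, not necessarily the estimator this application uses; the function name and the example values are made up.

```python
def weighted_slope(x, y, weights):
    """Weighted least-squares slope of y on x (single predictor with intercept)."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    covariance = sum(
        w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y)
    )
    spread = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    if spread == 0:
        raise ValueError("x has no variation; the slope is undefined")
    return covariance / spread


# Made-up records lying exactly on the line math = 500 + 40 * escs.
escs = [-1.0, 0.0, 1.0, 2.0]
math_scores = [460.0, 500.0, 540.0, 580.0]
student_weights = [2.0, 1.0, 1.0, 3.0]
gradient = weighted_slope(escs, math_scores, student_weights)
```

In this toy example the gradient is 40 score points per unit of ESCS regardless of the weights, because the data are exactly linear.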

6.3 Parental Education Variables

mother_educ: Mother's education level (ISCED classification)
father_educ: Father's education level (ISCED classification)
parent_edu: Highest parental education (years of schooling)

6.4 Demographic Variables

gender: Student gender (male/female)
age: Student age in years
computer: Has computer at home (yes/no)

6.5 Sampling Weights

w_fstuwt: Final student weight (use for most analyses)
w_fsenwt: Senate weight (equal country weighting)
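One way to understand the senate weight is as the final student weight rescaled so that every country's weights sum to the same constant. The sketch below illustrates that idea. The function name, the target sum of 1000, and the placeholder country codes are assumptions for illustration and may differ from the constant used in the OECD files.

```python
def senate_weights(records, target_sum=1000.0):
    """Rescale final student weights so each country's weights sum to target_sum."""
    totals = {}
    for record in records:
        country = record["country"]
        totals[country] = totals.get(country, 0.0) + record["w_fstuwt"]
    return [
        record["w_fstuwt"] * target_sum / totals[record["country"]]
        for record in records
    ]


# Placeholder countries "AAA" and "BBB" with made-up weights.
records = [
    {"country": "AAA", "w_fstuwt": 10.0},
    {"country": "AAA", "w_fstuwt": 30.0},
    {"country": "BBB", "w_fstuwt": 500.0},
]
rescaled = senate_weights(records)
```

After rescaling, a pooled analysis gives the small and the large country the same total influence, while the relative weights of students within each country are unchanged.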

6.6 Identification Variables

country: Country code (ISO 3166-1 alpha-3, e.g., USA, DEU, JPN)
year: PISA assessment year (2000-2022)
student_id: Unique student identifier
school_id: School identifier (for clustering standard errors)

7. Official OECD Data Access

7.1 OECD PISA Data Portal

The official source for all PISA data is the OECD PISA Data Portal:

Main Portal: https://www.oecd.org/pisa/data/

7.2 Cycle-Specific Data

PISA 2022: https://www.oecd.org/pisa/data/2022database/
PISA 2018: https://www.oecd.org/pisa/data/2018database/
PISA 2015: https://www.oecd.org/pisa/data/2015database/
PISA 2012: https://www.oecd.org/pisa/data/2012database/

7.3 Technical Documentation

PISA 2018 Technical Report: OECD (2019). PISA 2018 technical report. OECD Publishing.
PISA Data Analysis Manual: OECD (2009). PISA data analysis manual: SPSS (2nd ed.). https://doi.org/10.1787/9789264056275-en

8. Data Quality and Limitations

8.1 Strengths

Large, representative samples from each participating country
Standardized assessment framework enabling cross-country comparisons
Rich contextual questionnaires on student background
High-quality sampling design with rigorous quality control
Transparent documentation and public data availability

8.2 Limitations

Cross-sectional design: Data from each cycle are collected at a single point in time, limiting causal inference
Coverage exclusions: Some countries exclude specific populations (e.g., remote regions, special education students)
Response rates: Vary by country; some countries have lower participation rates
Translation effects: Assessments are translated into 40+ languages, potentially affecting comparability
Cultural context: Test items may have different cultural relevance across countries

8.3 Missing Data

PISA data contain missing values due to:

Item non-response (student skipped a question)
Not reached (student ran out of time)
Not administered (question not in student's test booklet)
Questionnaire non-response

This application uses complete-case analysis by default. Advanced users should consider multiple imputation methods for handling missing data.
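Complete-case filtering and the missing-value counts reported in each chunk's data_quality block can be sketched as follows. This assumes missing values appear as null (None in Python) or as absent keys, which is an assumption about the encoding. The function name and the sample records are made up.

```python
REQUIRED = ("math", "reading", "science", "escs")


def complete_cases(students, required=REQUIRED):
    """Return (quality summary, records with no missing required variable)."""
    summary = {
        f"missing_{name}": sum(1 for s in students if s.get(name) is None)
        for name in required
    }
    kept = [
        s for s in students
        if all(s.get(name) is not None for name in required)
    ]
    summary["complete_cases"] = len(kept)
    return summary, kept


# Made-up records: one complete, one with a null, one with an absent key.
students = [
    {"math": 500.0, "reading": 510.0, "science": 495.0, "escs": 0.1},
    {"math": 480.0, "reading": None, "science": 470.0, "escs": -0.3},
    {"math": 520.0, "reading": 530.0, "science": 515.0},
]
quality, kept = complete_cases(students)
```

Complete-case analysis is simple but discards every record with any missing required variable, which is why multiple imputation is suggested for more demanding analyses.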

9. Countries Included in This Application

9.1 OECD Countries

The following 38 OECD countries are available (availability varies by year):

Australia (AUS)

Austria (AUT)

Belgium (BEL)

Canada (CAN)

Chile (CHL)

Colombia (COL)

Costa Rica (CRI)

Czech Republic (CZE)

Denmark (DNK)

Estonia (EST)

Finland (FIN)

France (FRA)

Germany (DEU)

Greece (GRC)

Hungary (HUN)

Iceland (ISL)

Ireland (IRL)

Israel (ISR)

Italy (ITA)

Japan (JPN)

Korea (KOR)

Latvia (LVA)

Lithuania (LTU)

Luxembourg (LUX)

Mexico (MEX)

Netherlands (NLD)

New Zealand (NZL)

Norway (NOR)

Poland (POL)

Portugal (PRT)

Slovak Republic (SVK)

Slovenia (SVN)

Spain (ESP)

Sweden (SWE)

Switzerland (CHE)

Turkey (TUR)

United Kingdom (GBR)

United States (USA)

9.2 Partner Countries/Economies

An additional 60+ partner countries and economies are available, including:

Albania (ALB)

Argentina (ARG)

Brazil (BRA)

Bulgaria (BGR)

China (CHN)*

Croatia (HRV)

Hong Kong (HKG)

India (IND)*

Indonesia (IDN)

Jordan (JOR)

Kazakhstan (KAZ)

Macao (MAC)

Malaysia (MYS)

Peru (PER)

Qatar (QAT)

Romania (ROU)

Russia (RUS)

Serbia (SRB)

Singapore (SGP)

Chinese Taipei (TWN)

Thailand (THA)

Uruguay (URY)

Vietnam (VNM)

* Note: Some countries participate through specific regions or provinces rather than nationally

10. References and Further Reading

10.1 Primary Sources

OECD. (2023). PISA 2022 database. Organisation for Economic Co-operation and Development. https://www.oecd.org/pisa/data/
OECD. (2019). PISA 2018 technical report. OECD Publishing.
Vaughan, B., Stanke, L., Teng, T., Hyndman, R., & O'Hara-Wild, E. (2021). learningtower: OECD PISA datasets from 2000-2018 in an easy-to-use format (R package version 1.0.1). https://CRAN.R-project.org/package=learningtower

10.2 Methodological References

OECD. (2009). PISA data analysis manual: SPSS (2nd ed.). OECD Publishing. https://doi.org/10.1787/9789264056275-en
OECD. (2017). PISA 2015 assessment and analytical framework. OECD Publishing. https://doi.org/10.1787/9789264281820-en

10.3 Documentation in This Application

Methodology Documentation - Statistical methods and formulas
How to Cite - Citation guide for academic use
