Methodology

EduStrat — Educational Stratification in PISA
Statistical Methods and Technical Documentation

This documentation describes the statistical methods used in EduStrat to study the intergenerational transmission of educational achievement—specifically, how parental characteristics (education, occupational status, and household wealth, captured through the ESCS index) relate to children's academic performance at age 15 in mathematics, reading, and science across more than 100 countries worldwide.

1. Data Sources

1.1 PISA Programme

This application uses data from the Programme for International Student Assessment (PISA), a triennial international survey coordinated by the Organisation for Economic Co-operation and Development (OECD). PISA assesses 15-year-old students' proficiency in reading, mathematics, and science, and has been conducted every three years since 2000 (OECD, 2024).

PISA employs a two-stage stratified sampling design. In the first stage, schools are sampled with probability proportional to size. In the second stage, students within schools are randomly selected. This design ensures representative samples of the target population in each participating country while maintaining statistical efficiency (OECD, 2024).

1.2 learningtower R Package

Data are accessed via the learningtower R package (Wang et al., 2024), which provides cleaned and harmonized PISA data from 2000-2022 in an easy-to-use format. The package standardizes variable names across assessment cycles and handles missing data codes according to OECD specifications.

Because learningtower distributes only the final student weight, the country–years for which design-correct (BRR) standard errors are available were re-sourced directly from the OECD Public Use Files to obtain the 80 replicate weights (Section 3.8). Point estimates are identical to the learningtower values; only the replicate weights are added.

1.3 Assessment Cycles

The current implementation includes data from eight PISA cycles:

2. Variables

2.1 Achievement Scores

PISA does not administer a single test to all students. Instead, students receive booklets containing different combinations of test items. To account for this design, PISA uses plausible values – a set of imputed values that represent the range of abilities a student might have demonstrated had they taken the full assessment (OECD, 2009).

Note: The learningtower R package provides only the first plausible value (PV1) for each domain, and EduStrat therefore uses PV1 throughout. Full publication-grade analyses should use all plausible values (5 in earlier cycles, 10 from PISA 2015 onward) and combine results using Rubin's rules (Rubin, 1987). See OECD guidance on analysing the PISA database.

Achievement scores are scaled to have an international mean of 500 and standard deviation of 100 in the first assessment cycle when a domain was the major focus.

2.2 Economic, Social and Cultural Status (ESCS): Measuring Family Background

The ESCS index is PISA's composite measure of family socioeconomic status—the primary variable for studying intergenerational transmission of educational achievement. It captures how parental characteristics relate to children's outcomes through three components:

  1. Highest parental occupation (HISEI): International Socio-Economic Index of Occupational Status, measuring the occupational prestige and economic rewards associated with parents' jobs. Based on the International Standard Classification of Occupations (ISCO).
  2. Highest parental education (PARED): Years of schooling completed by the most educated parent, coded using ISCED (International Standard Classification of Education) levels.
  3. Home possessions (HOMEPOS): An index of household wealth, cultural possessions (books, art), and home educational resources (desk, computer, internet access). This captures both material resources and cultural capital available to the student.

These components are combined using principal components analysis and standardized to have an OECD mean of 0 and standard deviation of 1 (Avvisati & Wuyts, 2024). When analyzing ESCS gradients or quartile gaps, researchers are directly measuring the strength of intergenerational transmission—how strongly these parental characteristics predict their children's academic performance at age 15.

ESCSi = PCA(HISEIi, PAREDi, HOMEPOSi) where: HISEIi = highest parental occupation for student i PAREDi = highest parental education for student i HOMEPOSi = home possessions index for student i

2.3 Sampling Weights

PISA provides multiple types of weights to account for the complex sampling design:

Weight Type Variable Name Purpose
Final Student Weight W_FSTUWT Represents national student population; accounts for selection probability and nonresponse
Senate Weight W_FSENWT Equally weights countries; each country contributes equally regardless of population
Replicate Weights W_FSTURWT1W_FSTURWT80 Used for balanced repeated replication (BRR) standard error estimation

The default analysis uses final student weights (W_FSTUWT) following OECD technical standards (OECD, 2009). In the learningtower package, this variable is named stu_wgt. Point estimates use the weight selected in the interface (student, senate, or unweighted). Where the data files also carry the 80 replicate weights, EduStrat additionally reports design-correct standard errors by balanced repeated replication; see Section 3.8.

3. Statistical Methods

3.1 Weighted Descriptive Statistics

All descriptive statistics account for sampling weights. The weighted mean is calculated as:

μ̂w = (Σ wi · yi) / (Σ wi) where: yi = achievement score for student i wi = sampling weight for student i μ̂w = weighted mean

The weighted variance is:

σ̂²w = [Σ wi · (yi - μ̂w)²] / (Σ wi)

Weighted quantiles are computed by sorting observations by achievement score and finding the value at which the cumulative sum of weights reaches the target percentile times the total weight.

3.2 Distributional Measures

These measures summarize the spread of achievement distributions within countries. While not direct measures of intergenerational transmission, they provide context for understanding achievement variation and can be compared across countries and over time.

Gini Coefficient

The Gini coefficient measures dispersion in achievement distribution, ranging from 0 (all students have identical scores) to 1 (maximum dispersion). When sampling weights are available, EduStrat computes the weighted Gini coefficient using the covariance form on weighted fractional ranks. Observations are sorted in ascending order by score:

G = (2 / μ̂w) · [ Σi wi · (yi − μ̂w) · (Fi − F̄) ] / W where: μ̂w = weighted mean of scores W = Σ wi (total sampling weight) Fi = (Σj<i wj + wi/2) / W (midpoint cumulative weight share) F̄ = Σ wi · Fi / W

This is the standard population Gini estimator; it reduces to the ordinary Gini coefficient when all weights are equal. It is verified against an independent R reference (see Section 5.3).

Coefficient of Variation

CV = σ / μ where σ is standard deviation and μ is mean

Percentile Ratios

The P90/P10 ratio compares the 90th percentile to the 10th percentile:

P90/P10 = Q0.90 / Q0.10

3.3 ESCS Gradient: The Core Measure of Intergenerational Transmission

The ESCS gradient (β) is the central statistic for quantifying intergenerational transmission of educational achievement. It measures how many score points of achievement are associated with a one-unit increase in family socioeconomic status—directly capturing how strongly parental education, occupation, and wealth predict children's academic outcomes. It is the slope from a weighted bivariate regression:

Yi = α + β · ESCSi + εi where: Yi = achievement score (math, reading, or science) ESCSi = family socioeconomic status index β = ESCS gradient (intergenerational transmission coefficient) εi = residual error A larger β indicates stronger intergenerational transmission—children's achievement is more strongly predicted by their parents' characteristics.

The weighted least squares estimator is:

β̂ = [Σ wi · (ESCSi - ESCS̄w) · (Yi - Ȳw)] / [Σ wi · (ESCSi - ESCS̄w)²]

3.4 Regression Models

Pooled OLS

The pooled ordinary least squares model treats all observations as independent:

Yij = β0 + β1·ESCSij + β2·Xij + εij where: Yij = achievement for student i in country j ESCSij = socioeconomic status Xij = vector of control variables (gender, parental education, etc.) εij ~ N(0, σ²)

Fixed Effects Model

The fixed effects model includes country-specific intercepts (αj) to control for all time-invariant country characteristics:

Yijt = αj + γt + β1·ESCSijt + β2·Xijt + εijt where: αj = country fixed effects (j = 1, ..., J countries) γt = year fixed effects (t = 1, ..., T years)

Country fixed effects are implemented using dummy variables with one country as the reference category.

Random Effects Model

The random effects model assumes country intercepts are drawn from a distribution:

Yij = β0 + uj + β1·ESCSij + β2·Xij + εij where: uj ~ N(0, τ²) = random country effect εij ~ N(0, σ²) = individual error Total variance: Var(Y) = τ² + σ²

Random effects are estimated by feasible GLS (quasi-demeaning; Wooldridge, 2010) with group-specific weights:

Yij* = Yij - θj · Ȳj where θj = 1 - √[σ² / (σ² + nj·τ²)]

The variance components are estimated by the Swamy–Arora method: the idiosyncratic variance σ² from the within (fixed-effects) residuals, and the between-country variance τ² from the group-means (between) regression. On unweighted data this reproduces R's plm(model = "random", random.method = "swar") — the ESCS slope and its standard error agree to within ~1e-6 (Section 5.3). When the student weight is in use the same construction is applied with sampling weights as a design-weighted extension.

3.5 Model Selection and Diagnostics

Information Criteria

Model comparison uses information-theoretic criteria that balance goodness-of-fit with parsimony:

Akaike Information Criterion (AIC): AIC = n · log(RSS/n) + 2k Bayesian Information Criterion (BIC): BIC = n · log(RSS/n) + k · log(n) where: n = number of observations k = number of parameters (including intercept) RSS = residual sum of squares

Interpretation: Lower AIC/BIC values indicate better model fit. BIC penalizes model complexity more heavily than AIC (k·log(n) vs 2k), making it more conservative for large samples. AIC tends to select more complex models, while BIC favors parsimony (Burnham & Anderson, 2004).

Residual Diagnostics

Regression assumptions are validated through graphical diagnostics:

Residual vs Fitted Plot: Detects violations of homoscedasticity (constant variance). Under the assumption ε ~ N(0, σ²), residuals should be randomly scattered around zero without systematic patterns. Fanning patterns indicate heteroscedasticity, which may require robust standard errors or variance-stabilizing transformations.

Q-Q Plot (Quantile-Quantile Plot): Tests residual normality by comparing sample quantiles against theoretical normal quantiles. The standardized residuals are computed as:

zi = (ei - ē) / SD(e) where: ei = residual for observation i ē = mean residual (should be ≈ 0) SD(e) = standard deviation of residuals

Points lying close to the 45-degree reference line indicate normally distributed residuals. Systematic deviations (S-curves, fat tails) suggest departures from normality. Theoretical quantiles are computed using the inverse normal CDF approximation (Beasley-Springer-Moro algorithm).

Hausman Specification Test: Tests whether random effects assumptions hold by comparing FE and RE estimates (Hausman, 1978):

H = (β̂FE - β̂RE)' · [Var(β̂FE) - Var(β̂RE)]-1 · (β̂FE - β̂RE) ~ χ²k H0: E(uj | X) = 0 (RE is consistent) H1: E(uj | X) ≠ 0 (only FE is consistent)

If p < 0.05, reject random effects in favor of fixed effects.

3.6 Variance Decomposition

The intraclass correlation coefficient (ICC) partitions total variance into within-country and between-country components. All means and variances in the decomposition are computed using the final student sampling weights (stu_wgt), so that each country's contribution is population-representative:

ICC (ρ) = τ² / (τ² + σ²) where: τ² = between-country variance = Σc (Wc / W) · (Ȳc - Ȳ)² σ² = within-country variance = Varw(Y) - τ² Wc = sum of sampling weights in country c W = total sum of sampling weights across all countries Ȳc = weighted mean of scores in country c Ȳ = overall weighted mean Varw(Y) = overall weighted variance

The ICC represents the proportion of total variance in achievement that is due to differences between countries. An ICC of 0.15 means 15% of variance is between countries and 85% is within countries.

3.7 ESCS Quartile Gaps: Family Background Differences in Achievement

Quartile gaps provide an intuitive measure of intergenerational transmission by comparing achievement between students from different family backgrounds. The Q4–Q1 gap shows the score-point difference between children from the most advantaged (top ESCS quartile) versus least advantaged (bottom ESCS quartile) families:

Gap (Q4-Q1) = ȲQ4 - ȲQ1 Effect Size (Cohen's d) = (ȲQ4 - ȲQ1) / SDpooled where: ȲQ4 = weighted mean for top SES quartile ȲQ1 = weighted mean for bottom SES quartile SDpooled = √[(SDQ4² + SDQ1²) / 2]

Effect sizes are interpreted using Cohen's conventions: small (0.2), medium (0.5), large (0.8) (Cohen, 1988).

3.8 Standard Errors: model-based and BRR

PISA's two-stage stratified, clustered design means that standard errors computed under a simple-random-sampling (SRS) assumption understate the true sampling uncertainty. EduStrat reports two kinds of standard error:

For any statistic θ, the Fay BRR variance is:

VBRR(θ̂) = 1 / (G·(1 − k)²) · Σr=1G (θ̂r − θ̂)² where: θ̂ = estimate using the final weight W_FSTUWT θ̂r = estimate using replicate weight r G = 80 replicate weights k = 0.5 (Fay factor for PISA), so 1 / (G·(1 − k)²) = 1/20

EduStrat's BRR implementation reproduces the intsvy R package exactly (to machine precision) and is verified across three cycles (2015, 2018, 2022); see the repository's VERIFICATION.md. The design effect is substantial and stable: for the Finnish mathematics mean the BRR standard error is 1.7–2.0 times the SRS error in every cycle. Toggling the Sampling Weights control between the student weight and no weights lets users see this difference directly.

Scope and a remaining limitation. Replicate weights are currently included for a documented subset of country–years (re-sourced from the OECD Public Use Files; the pipeline can be extended to others). BRR here captures sampling variance only: because learningtower supplies a single plausible value (PV1), the measurement/imputation component across plausible values is not estimated, and the resulting standard errors are therefore still slightly conservative relative to a full PV + BRR analysis. Users requiring publication-grade inference should combine all plausible values using Rubin's rules (Rubin, 1987) alongside BRR.

4. Assumptions and Limitations

4.1 Causal Inference

Important: PISA is a cross-sectional observational study. All analyses report associations, not causal effects. The ESCS gradient measures how strongly parental characteristics (education, occupation, wealth) predict children's achievement at age 15, but does not prove that family background causes differences in achievement. Unmeasured confounders (e.g., genetic factors, peer effects, school quality, neighborhood characteristics) may explain observed associations. ESCS gradients should be interpreted as descriptive measures of intergenerational transmission patterns, not as causal estimates of the effect of changing family socioeconomic status.

4.2 Missing Data

The current implementation uses listwise deletion for missing values. Students with missing data on any variable in the analysis are excluded. This approach is appropriate when data are missing completely at random (MCAR) but may introduce bias if data are missing at random (MAR) or not at random (MNAR).

OECD recommends multiple imputation for handling missing data in PISA (OECD, 2009), which is not yet implemented in this tool.

4.3 Sampling Design

All analyses account for sampling weights, which adjust for unequal selection probabilities and nonresponse. For country–years that include the replicate weights, EduStrat reports design-correct standard errors by balanced repeated replication (Section 3.8), which capture the stratified, clustered design. Where replicate weights are not available, the model-based standard errors assume simple random sampling and may understate uncertainty; for those cases, publication-quality inference should use BRR or cluster-robust errors in R or Stata.

4.4 Measurement Error

Achievement scores are measured with error. PISA addresses this through plausible values, but the learningtower R package provides only the first plausible value (PV1) for each domain, and EduStrat is therefore constrained to PV1. Full uncertainty quantification requires analyzing all plausible values and combining results using Rubin's rules (Rubin, 1987). See OECD guidance on analysing the PISA database.

4.5 Comparability Across Cycles

While PISA attempts to maintain comparability across assessment cycles through linking procedures, changes in test design, sampling, and country participation may affect trend estimates. Caution is advised when interpreting changes over time, particularly for countries that altered their sampling procedures.

4.6 Generalizability

PISA results are representative of 15-year-old students enrolled in school in participating countries. Results do not generalize to:

5. Software Implementation

5.1 Statistical Libraries

Analyses are implemented in JavaScript using the following libraries:

5.2 Numerical Stability

Regression models use ridge-regularized normal equations to improve numerical stability when computing (X'X)⁻¹:

β̂ridge = (X'WX + λI)⁻¹X'WY where λ = 10⁻¹⁰ · trace(X'WX)

This prevents singular matrix errors when country fixed effects induce near-collinearity.

5.3 Validation

Every statistic the application reports is checked against an independent reference implementation in R, computed on the same PISA chunks. The harness runs the application's own analysis modules in Node and compares the output term by term; the R references use peer-reviewed packages:

All 83 computational checks and 21 replicate-weight checks pass, and a headless-browser test confirms the application runs in a real browser with no errors and renders the BRR standard errors. The verification scripts (pipeline/scripts/04-verify-computations.R, 06-verify-brr.R) and a full report (VERIFICATION.md) are included in the repository so the results can be reproduced. This process also corrected several issues found in development (the weighted Gini formula, the random-effects variance components, and the Hausman variance term).

6. References

Avvisati, F. & Wuyts, C. (2024). The measurement of socio-economic status in PISA. OECD Education Working Papers, No. 321. OECD Publishing. https://doi.org/10.1787/0c5b793c-en
Burnham, K. P. & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2), 261–304. https://doi.org/10.1177/0049124104268644
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Caro, D. H. & Biecek, P. (2021). intsvy: International Assessment Data Manager (R package). Tools for analyzing PISA, TIMSS, PIRLS and PIAAC data with replicate weights and plausible values. https://CRAN.R-project.org/package=intsvy
Croissant, Y. & Millo, G. (2008). Panel data econometrics in R: The plm package. Journal of Statistical Software, 27(2), 1–43. https://doi.org/10.18637/jss.v027.i02
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271. https://doi.org/10.2307/1913827
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133–161. https://doi.org/10.1111/j.1745-3984.1992.tb00371.x
OECD (2009). PISA Data Analysis Manual: SPSS (2nd ed.). OECD Publishing. https://doi.org/10.1787/9789264056275-en
OECD (2024). Programme for International Student Assessment (PISA) Database. Paris: OECD. Available at: https://www.oecd.org/pisa/
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons. https://doi.org/10.1002/9780470316696
Wang, K., Yacobellis, P., Siregar, E., Romanes, S., Fitter, K., Dalla Riva, G. V., Cook, D., Tierney, N., Dingorkar, P., Sai Subramanian, S., & Chen, G. (2024). learningtower: OECD PISA datasets from 2000–2022 in an easy-to-use format (R package, Version 1.1.0). https://doi.org/10.32614/CRAN.package.learningtower
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.
Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31(2–3), 114–128. https://doi.org/10.1016/j.stueduc.2005.05.005
← Back to Application