Working Paper

EduStrat: A Browser-Based Tool for Exploratory Analysis of Educational Stratification in PISA Microdata, 2000–2022

Author: Kevin Schoenholzer1

Affiliation: 1 Institute of Communication and Public Policy, Università della Svizzera italiana (USI), Lugano, Switzerland

Correspondence: kevin.schoenholzer@usi.ch · kevinschoenholzer.com

ORCID: 0000-0001-9892-5869

Status: Draft — February 2026. Not yet peer-reviewed.

Abstract. EduStrat (Educational Stratification in PISA) is a browser-based, open-source tool for exploratory secondary analysis of OECD PISA student microdata across eight assessment cycles (2000–2022). The application enables researchers, instructors, and students to investigate the intergenerational transmission of educational achievement—specifically, how parental education, occupational status, and household wealth relate to the academic performance of 15-year-olds in mathematics, reading, and science across more than 100 countries. EduStrat provides survey-weighted descriptive statistics, distributional summaries, ESCS gradients and quartile gaps that quantify how strongly family background predicts student outcomes, pooled and country-panel regression specifications, variance decomposition, and model diagnostics. All computations run client-side; data are stored as pre-generated country–year JSON subsets loaded on demand. Results are exportable as CSV tables, high-resolution figures (PNG/SVG), and self-contained HTML reports with full analytic metadata. This paper describes the software architecture, statistical functionality, data coverage, and reuse potential. It also documents current limitations, notably the use of a single plausible value per domain and the absence of replicate-weight variance estimation in the interactive layer.

Keywords: PISA, educational inequality, intergenerational transmission, ESCS, survey weighting, regression, variance decomposition, open-source, web application, JavaScript, reproducibility

(1) Overview

Introduction

The Programme for International Student Assessment (PISA) is one of the most widely used data sources for cross-national research on educational achievement (1, 2). Conducted triennially by the OECD since 2000, PISA assesses the reading, mathematics, and science proficiency of 15-year-old students alongside extensive background questionnaires capturing family socioeconomic characteristics. This combination of standardised achievement measures and rich contextual data makes PISA particularly well suited for studying the intergenerational transmission of educational advantage—the extent to which parental education, occupational status, and household resources predict children's academic outcomes (3).

Despite the public availability of PISA microdata, routine analytic tasks remain implementation-heavy for many potential users. The data arise from a complex, stratified two-stage sampling design requiring appropriate weighting; achievement is released as sets of plausible values rather than single test scores; and descriptive results can be sensitive to choices regarding which weights, cycles, countries, and regression specifications are used (1, 4, 5). These technical requirements create a barrier that is particularly high for graduate students, instructors preparing teaching demonstrations, and researchers seeking a rapid first pass through specifications before investing in code-based replication.

EduStrat was developed to lower this barrier. It provides a browser-based interface for a set of recurring first-pass operations in large-scale assessment research: computing survey-weighted descriptive statistics, summarising achievement distributions, estimating ESCS gradients that quantify the strength of intergenerational transmission, comparing pooled and country-panel regression specifications, and decomposing variance into within- and between-country components. The application requires no local installation, runs entirely client-side, and exports all results with embedded analytic metadata to support transparent reporting and downstream replication.

The central substantive question that EduStrat operationalises is: how do parental characteristics—specifically education, occupational status, and household resources, captured by the PISA ESCS index—predict children's educational achievement at age 15, and how does this relationship vary across countries and over time?

[Figure 1: screenshot of the EduStrat application interface]
Figure 1. The EduStrat interface showing modular analysis tabs, country and year selectors, and export tools. Users select countries, PISA cycles, and outcome domains, then navigate analysis modules through the tabbed interface.

Implementation and architecture

EduStrat is implemented as a modular ES6 JavaScript application comprising approximately 7,600 lines of code organised into five functional layers: core infrastructure (state management, data loading, statistical utilities), analysis modules (descriptive statistics, regression, variance decomposition, diagnostics), visualisation modules (built on Plotly.js), an export system (CSV, PNG/SVG, HTML reports), and user-interface components. An R-based preprocessing pipeline (four scripts, approximately 650 lines) generates the data files that the application consumes. All source code is publicly available under the MIT licence.

Data architecture

Rather than loading a monolithic cross-cycle microdata file, EduStrat stores data as 513 country–year JSON files (one per country–cycle combination) and loads only the subsets requested by the user. This reduces typical browser memory consumption to 30–40 MB for a working selection, compared with approximately 1.25 GB for the full dataset. A metadata catalogue (metadata.json) indexes all available country–cycle combinations and variable definitions, enabling the interface to present valid selections before any data are loaded.
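The on-demand loading pattern can be sketched as follows. This is a minimal illustration, not EduStrat's actual loader: the file-naming scheme and function names are assumptions, and the real implementation lives in the js/core/ layer.

```javascript
// Sketch of on-demand country–year chunk loading with a simple cache.
// Only the chunks the user selects are ever fetched, which is what keeps
// browser memory bounded to the working selection.

const chunkCache = new Map();

// Build the path for a country–cycle chunk, e.g. "data/CHE_2018.json".
// This naming scheme is an assumption for illustration.
function chunkPath(country, year) {
  return `data/${country}_${year}.json`;
}

// Load a chunk once and cache it; repeated requests for the same
// country–cycle pair reuse the cached object instead of refetching.
// `fetchJson` is injected so the loader can be tested without a network.
async function loadChunk(country, year, fetchJson) {
  const key = `${country}-${year}`;
  if (!chunkCache.has(key)) {
    chunkCache.set(key, await fetchJson(chunkPath(country, year)));
  }
  return chunkCache.get(key);
}
```

In the application itself, `fetchJson` would wrap `fetch()` and the metadata catalogue would supply the list of valid country–cycle pairs before any chunk is requested.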

The underlying data are sourced from the OECD PISA database through the learningtower R package (6), which provides cleaned and harmonised PISA extracts with standardised variable naming across eight assessment cycles (2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022). The processed dataset encompasses 101 participating countries and economies and approximately 3.5 million student observations.

Key variables

The primary outcome variables are PISA achievement scores in mathematics, reading, and science. The primary stratification variable is ESCS (Economic, Social, and Cultural Status), a composite index of family background comprising three components: highest parental education (ISCED classifications), highest parental occupational status (HISEI index), and an index of home possessions and educational resources. ESCS is standardised to an OECD-referenced scale (mean 0, SD 1 for OECD countries), making it directly interpretable as a measure of relative socioeconomic advantage (7). Each student record additionally includes demographic variables, home resource indicators, and the final student weight (W_FSTUWT) and senate weight (W_FSENWT) required for design-consistent estimation.

Statistical functionality

EduStrat implements six analytic modules, each designed as a reusable building block for exploratory analysis of educational stratification:

  1. Descriptive statistics. Survey-weighted means, variances, and quantiles for achievement scores and ESCS, reported at the country–year level with analytical standard errors.
  2. Distributional summaries. Gini coefficients (with Lorenz curves), coefficients of variation, and inter-percentile ratios (P90/P10, P75/P25) characterising within-country achievement inequality.
  3. ESCS gradients. The slope from a survey-weighted regression of achievement on ESCS—the core measure of intergenerational transmission, representing the score-point difference in achievement associated with a one-unit increase in family socioeconomic status. Optional controls include gender and parental education.
  4. Quartile gaps. Q4–Q1 achievement differences (gap between children from the most and least socioeconomically advantaged families) with standardised effect sizes (Cohen's d), providing an intuitive summary of how much family background matters for achievement.
  5. Regression models. Three specifications—pooled weighted OLS, country fixed effects (with optional year fixed effects), and country random effects (quasi-demeaning transformation)—with coefficient tables, fit statistics (R², AIC, BIC), diagnostic plots (residuals versus fitted, Q–Q), and Hausman specification tests.
  6. Variance decomposition. Partitioning of total variance into within- and between-country components with intraclass correlation coefficients (ICC), revealing how much variation in the family background–achievement relationship occurs within versus between education systems.
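The ESCS gradient in module 3 reduces to the slope of a weighted simple regression of achievement on ESCS. A minimal sketch of that computation (illustrative only, not EduStrat's module code) is:

```javascript
// Weighted simple-regression slope: the score-point change in achievement
// associated with a one-unit increase in ESCS.
function escsGradient(score, escs, w) {
  let sw = 0, sx = 0, sy = 0;
  for (let i = 0; i < w.length; i++) {
    sw += w[i];
    sx += w[i] * escs[i];
    sy += w[i] * score[i];
  }
  const mx = sx / sw, my = sy / sw; // weighted means of ESCS and score

  let sxy = 0, sxx = 0;
  for (let i = 0; i < w.length; i++) {
    sxy += w[i] * (escs[i] - mx) * (score[i] - my);
    sxx += w[i] * (escs[i] - mx) ** 2;
  }
  return sxy / sxx; // gradient in score points per ESCS unit
}
```

For example, three equally weighted students at ESCS −1, 0, and 1 scoring 450, 500, and 550 yield a gradient of 50 score points per ESCS unit.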

All descriptive and regression computations apply sampling weights. The application defaults to the final student weight (W_FSTUWT) and optionally supports senate weights (W_FSENWT) for equal-country estimands (1).
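The weighted estimators underlying the descriptive module can be illustrated with a minimal sketch (not the application's actual utility functions):

```javascript
// Survey-weighted mean: each student contributes in proportion to the
// sampling weight (e.g. W_FSTUWT).
function weightedMean(x, w) {
  let sw = 0, sxw = 0;
  for (let i = 0; i < x.length; i++) {
    sw += w[i];
    sxw += w[i] * x[i];
  }
  return sxw / sw;
}

// Survey-weighted variance about the weighted mean (population-style
// normalisation by the sum of weights).
function weightedVariance(x, w) {
  const m = weightedMean(x, w);
  let sw = 0, s2 = 0;
  for (let i = 0; i < x.length; i++) {
    sw += w[i];
    s2 += w[i] * (x[i] - m) ** 2;
  }
  return s2 / sw;
}
```

Switching between W_FSTUWT and W_FSENWT in the interface amounts to passing a different weight vector to estimators of this form.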

Survey weighting and plausible values

Because PISA data arise from a complex sampling design, all of EduStrat's outputs are computed using the appropriate sampling weights. Achievement in PISA is released as plausible values—multiple draws from the estimated proficiency distribution under a latent regression model—which are designed to support valid population inference under matrix sampling (4, 8, 5). Standard practice requires repeating computations across all plausible values and combining estimates following Rubin's rules. For interactivity, EduStrat currently uses the first plausible value per domain and flags this constraint explicitly in all outputs. Users requiring publication-grade inference are expected to replicate selected specifications using all plausible values and appropriate replicate-weight variance procedures in a dedicated statistical environment.
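For users replicating a specification externally, the combination step follows Rubin's rules: average the M per-plausible-value estimates, and add the between-imputation variance (inflated by 1 + 1/M) to the average within-imputation variance. A hedged sketch, with an illustrative function name:

```javascript
// Combine estimates computed separately on M plausible values.
// `estimates` are the M point estimates; `variances` their sampling variances.
function combinePlausibleValues(estimates, variances) {
  const M = estimates.length;
  const qbar = estimates.reduce((a, b) => a + b, 0) / M;        // pooled estimate
  const ubar = variances.reduce((a, b) => a + b, 0) / M;        // within-imputation variance
  const B = estimates.reduce((a, q) => a + (q - qbar) ** 2, 0) // between-imputation variance
            / (M - 1);
  return { estimate: qbar, variance: ubar + (1 + 1 / M) * B };
}
```

This is precisely the step EduStrat defers to a dedicated statistical environment; the interactive layer reports only the first plausible value.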

Export system

A practical design priority is that every analysis should be exportable with sufficient metadata to document and reproduce the analytic choices. EduStrat provides four export channels: CSV tables of numeric results, high-resolution PNG figures, vector SVG figures, and self-contained HTML reports.

All exports encode the selected countries, years, outcome domain, predictor variables, control choices, weighting scheme, and estimator family, so that every result can be traced back to the specification that produced it. This supports the minimum standard for computational reproducibility in exploratory work: analytic choices are recorded and can be re-executed in a more formal environment (9).
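The embedded metadata can be pictured as a small JSON object attached to each export. The field names below are assumptions for illustration; the actual schema is documented in the repository.

```javascript
// Illustrative shape of the analytic metadata embedded in an export.
const exportMeta = {
  countries: ["CHE", "ITA"],        // selected countries (ISO codes)
  years: [2018, 2022],              // selected PISA cycles
  domain: "math",                   // outcome domain
  predictors: ["escs"],             // predictor variables
  controls: ["gender"],             // optional controls
  weight: "W_FSTUWT",               // weighting scheme in use
  estimator: "pooled-wls",          // estimator family
  plausibleValues: "PV1 only",      // single-PV constraint, flagged in all outputs
  generated: new Date().toISOString(),
};
```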

Quality control

Data quality is validated at three stages. First, the R preprocessing pipeline (03-validate-chunks.R) checks each country–year file for structural integrity, variable completeness, plausible ranges, and consistency with the metadata catalogue. Second, the browser-based data loader verifies JSON structure, required fields, and weight availability before admitting any chunk into the analysis workspace. Third, the analysis modules include internal checks: regression routines verify positive-definite design matrices, variance decomposition verifies non-negative components, and all modules flag when effective sample sizes fall below reporting thresholds.
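A loader-side structural check of the kind described in the second stage might look like the sketch below. The field names are illustrative assumptions, not EduStrat's actual chunk schema.

```javascript
// Reject a chunk unless it has a student array and every record carries a
// positive final student weight; the real loader additionally checks JSON
// structure, required fields, and consistency with the metadata catalogue.
function validateChunk(chunk) {
  if (!chunk || !Array.isArray(chunk.students)) return false;
  for (const r of chunk.students) {
    if (typeof r.w_fstuwt !== "number" || r.w_fstuwt <= 0) return false;
  }
  return true;
}
```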

Statistical outputs have been cross-validated against published OECD country statistics for selected country–year combinations (e.g. mean mathematics scores and ESCS gradients for OECD countries in 2018 and 2022), with discrepancies documented in the methodology appendix where they arise from the single-plausible-value constraint or differences in missing-data treatment.

Current limitations

EduStrat's outputs should be interpreted as descriptive associations produced by a transparent, interactive implementation, not as definitive inferential results. Four limitations warrant explicit acknowledgement. First, achievement is represented by a single plausible value per domain, so point estimates may differ slightly from estimates combined across all plausible values. Second, standard errors are analytical rather than replicate-weight-based and may therefore understate design-based uncertainty. Third, the data are harmonised extracts obtained through the learningtower package, so variable availability and missing-data treatment follow that package's conventions rather than the raw OECD files. Fourth, all estimates are descriptive associations; no identification strategy for causal effects of family background is implied.

(2) Availability

Software metadata

Name: EduStrat — Educational Stratification in PISA
Operating system: Platform-independent (browser-based)
Programming language: JavaScript (ES6 modules); R 4.0+ (data preprocessing pipeline)
Runtime dependencies: Modern web browser (Chrome 90+, Firefox 88+, Safari 15+, Edge 90+). No server-side runtime required.
JavaScript libraries: Plotly.js 2.35.2, simple-statistics 7.8.0, jStat 1.9.4, D3.js 7.8.5, PapaParse 5.3.0 (loaded via CDN)
R dependencies (pipeline only): learningtower ≥ 1.1.0, dplyr, jsonlite, tidyr
Data source: OECD PISA Database via the learningtower R package (6)
Licence: MIT
Code repository: github.com/kevisc/edustrat
Archived copy: [Zenodo DOI to be assigned upon deposit]
Live deployment: kevinschoenholzer.com/edustrat/

Documentation

Methodology: formal statistical definitions, formulas, and assumptions (docs/methodology.html)
Data sources: variable provenance, coverage by cycle, and codebook (docs/data-sources.html)
Citation guide: recommended citations in APA, Chicago, MLA, and BibTeX formats (docs/citation.html)
Pipeline README: instructions for regenerating data files from source (pipeline/scripts/README.md)

(3) Reuse potential

EduStrat addresses a recurring need in comparative education research and teaching: rapid, documented, exploratory analysis of PISA microdata without the overhead of configuring a local statistical environment. Its reuse potential spans four domains.

Research exploration

Researchers can use EduStrat as a specification-screening tool. By comparing pooled and panel estimators, toggling weight conventions, and varying country selections, users can assess whether descriptive patterns regarding the relationship between family background and achievement are robust to common analytic alternatives before investing in code-based replication with full plausible-value combination and replicate-weight inference. All results are exportable with analytic metadata, supporting integration into larger reproducible workflows.

Teaching and instruction

EduStrat provides a zero-installation platform for classroom demonstrations of key concepts in quantitative methods and comparative education: why survey weights matter, what plausible values are, how fixed- and random-effects estimators differ, and how ESCS gradients vary across education systems. Instructors can guide students through live comparisons without requiring them to install specialised software or manage large microdata files (1, 5).

Policy briefing and cross-country profiling

The combination of interactive selection, automated computation, and structured export makes EduStrat suitable for producing rapid cross-country profiles of educational stratification. Users can select a year and a set of countries, generate weighted distributional summaries and ESCS gradients, and export compact tables and figures for briefing materials, revealing which education systems show stronger or weaker intergenerational transmission.

Extension and modification

The modular ES6 architecture is designed to accommodate extensions. The analysis layer (js/analysis/) and visualisation layer (js/visualization/) are separated from the data-loading infrastructure (js/core/), so new analytic modules or chart types can be added without modifying the data pipeline. The R preprocessing scripts can be adapted to incorporate additional PISA variables or future assessment cycles as they become available through the learningtower package or directly from the OECD. The MIT licence permits unrestricted modification and redistribution.
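As a hypothetical sketch of what an extension could look like under this separation of layers (the actual registration mechanism in js/core/ may differ, and all names here are invented for illustration):

```javascript
// A registry that maps module names to compute functions operating on the
// already-loaded student rows, so new analyses need not touch data loading.
const analysisModules = new Map();

function registerModule(name, compute) {
  analysisModules.set(name, compute);
}

// Example extension: weighted share of students at or above a score
// threshold, using the same row shape the data layer provides.
registerModule("share-above", (rows, threshold) => {
  let sw = 0, above = 0;
  for (const r of rows) {
    sw += r.w;
    if (r.math >= threshold) above += r.w;
  }
  return above / sw;
});
```

A corresponding visualisation module would consume the registry output without any change to the loading infrastructure, which is the point of the layered design.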

Support

Bug reports, feature requests, and contributions can be submitted through the project's GitHub issue tracker at github.com/kevisc/edustrat/issues. Documentation covering installation, data regeneration, and statistical methods is maintained alongside the source code.

Acknowledgements

The PISA data used in this application are made available by the OECD. The learningtower R package (6), developed by Kevin Wang, Paul Yacobellis, Erika Siregar, and collaborators, provided the harmonised data extracts that underpin the preprocessing pipeline. The author gratefully acknowledges the developers of the open-source JavaScript libraries on which EduStrat depends: Plotly.js, simple-statistics, jStat, D3.js, and PapaParse.

Competing interests

The author declares no competing interests.

Author contributions

K.S. designed and implemented the software, processed the data, wrote the documentation, and drafted this paper.

References

[1] OECD. PISA data analysis manual: SPSS. 2nd ed. Paris: OECD Publishing; 2009. DOI: 10.1787/9789264056275-en

[2] OECD. PISA 2022 Technical Report. Paris: OECD Publishing; 2024. DOI: 10.1787/01820d6d-en

[3] Sirin SR. Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research. 2005; 75(3): 417–453. DOI: 10.3102/00346543075003417

[4] Mislevy RJ, Beaton AE, Kaplan B, Sheehan KM. Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement. 1992; 29(2): 133–161. DOI: 10.1111/j.1745-3984.1992.tb00371.x

[5] Wu M. The role of plausible values in large-scale surveys. Studies in Educational Evaluation. 2005; 31(2–3): 114–128. DOI: 10.1016/j.stueduc.2005.05.005

[6] Wang K, Yacobellis P, Siregar E, Romanes S, Fitter K, Dalla Riva GV, Cook D, Tierney N, Dingorkar P, Sai Subramanian S, Chen G. learningtower: OECD PISA datasets from 2000–2022 in an easy-to-use format [R package, version 1.1.0]. 2024. DOI: 10.32614/CRAN.package.learningtower

[7] Wuyts C. The measurement of socio-economic status in PISA. Paris: OECD; 2024.

[8] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons; 1987. DOI: 10.1002/9780470316696

[9] Peng RD. Reproducible research in computational science. Science. 2011; 334(6060): 1226–1227. DOI: 10.1126/science.1213847

[10] Hanushek EA, Woessmann L. The economics of international differences in educational achievement. In: Hanushek EA, Machin S, Woessmann L, editors. Handbook of the Economics of Education. Vol. 3. Amsterdam: North-Holland; 2011. p. 89–200. DOI: 10.1016/B978-0-444-53429-3.00002-8

[11] Reardon SF. The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In: Duncan GJ, Murnane RJ, editors. Whither Opportunity? Rising Inequality, Schools, and Children's Life Chances. New York: Russell Sage Foundation; 2011. p. 91–116.
