Working Paper
Abstract. EduStrat (Educational Stratification in PISA) is a browser-based, open-source tool for exploratory secondary analysis of OECD PISA student microdata across eight assessment cycles (2000–2022). The application enables researchers, instructors, and students to investigate the intergenerational transmission of educational achievement—specifically, how parental education, occupational status, and household wealth relate to the academic performance of 15-year-olds in mathematics, reading, and science across more than 100 countries. EduStrat provides survey-weighted descriptive statistics, distributional summaries, ESCS gradients and quartile gaps that quantify how strongly family background predicts student outcomes, pooled and country-panel regression specifications, variance decomposition, and model diagnostics. All computations run client-side; data are stored as pre-generated country–year JSON subsets loaded on demand. Results are exportable as CSV tables, high-resolution figures (PNG/SVG), and self-contained HTML reports with full analytic metadata. This paper describes the software architecture, statistical functionality, data coverage, and reuse potential. It also documents current limitations, notably the use of a single plausible value per domain and the absence of replicate-weight variance estimation in the interactive layer.
Keywords: PISA, educational inequality, intergenerational transmission, ESCS, survey weighting, regression, variance decomposition, open-source, web application, JavaScript, reproducibility
The Programme for International Student Assessment (PISA) is one of the most widely used data sources for cross-national research on educational achievement (1, 2). Conducted triennially by the OECD since 2000, PISA assesses the reading, mathematics, and science proficiency of 15-year-old students alongside extensive background questionnaires capturing family socioeconomic characteristics. This combination of standardised achievement measures and rich contextual data makes PISA particularly well suited for studying the intergenerational transmission of educational advantage—the extent to which parental education, occupational status, and household resources predict children's academic outcomes (3).
Despite the public availability of PISA microdata, routine analytic tasks remain implementation-heavy for many potential users. The data arise from a complex, stratified two-stage sampling design requiring appropriate weighting; achievement is released as sets of plausible values rather than single test scores; and descriptive results can be sensitive to choices regarding which weights, cycles, countries, and regression specifications are used (1, 4, 5). These technical requirements create a barrier that is particularly high for graduate students, instructors preparing teaching demonstrations, and researchers seeking a rapid first pass through specifications before investing in code-based replication.
EduStrat was developed to lower this barrier. It provides a browser-based interface for a set of recurring first-pass operations in large-scale assessment research: computing survey-weighted descriptive statistics, summarising achievement distributions, estimating ESCS gradients that quantify the strength of intergenerational transmission, comparing pooled and country-panel regression specifications, and decomposing variance into within- and between-country components. The application requires no local installation, runs entirely client-side, and exports all results with embedded analytic metadata to support transparent reporting and downstream replication.
The central substantive question that EduStrat operationalises is: how do parental characteristics—specifically education, occupational status, and household resources, captured by the PISA ESCS index—predict children's educational achievement at age 15, and how does this relationship vary across countries and over time?
EduStrat is implemented as a modular ES6 JavaScript application comprising approximately 7,600 lines of code organised into five functional layers: core infrastructure (state management, data loading, statistical utilities), analysis modules (descriptive statistics, regression, variance decomposition, diagnostics), visualisation modules (built on Plotly.js), an export system (CSV, PNG/SVG, HTML reports), and user-interface components. An R-based preprocessing pipeline (four scripts, approximately 650 lines) generates the data files that the application consumes. All source code is publicly available under the MIT licence.
Rather than loading a monolithic cross-cycle microdata file, EduStrat stores data as 513 country–year JSON files (one per country–cycle combination) and loads only the subsets requested by the user. This reduces typical browser memory consumption to 30–40 MB for a working selection, compared with approximately 1.25 GB for the full dataset. A metadata catalogue (metadata.json) indexes all available country–cycle combinations and variable definitions, enabling the interface to present valid selections before any data are loaded.
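The on-demand pattern can be sketched as follows. This is an illustrative sketch, not the actual loader in js/core/: the chunk filename scheme, the cache, and the metadata field names are assumptions for exposition.

```javascript
// Sketch of on-demand chunk loading (illustrative; the real loader lives
// in js/core/). Filenames, cache, and metadata fields are assumptions.
const chunkCache = new Map();

async function loadChunk(country, year, baseUrl = 'data') {
  const key = `${country}_${year}`;
  if (chunkCache.has(key)) return chunkCache.get(key); // avoid refetching
  const resp = await fetch(`${baseUrl}/${key}.json`);
  if (!resp.ok) throw new Error(`Chunk not found: ${key}`);
  const chunk = await resp.json();
  chunkCache.set(key, chunk);
  return chunk;
}

// The metadata catalogue lets the interface validate a selection
// before any data are fetched.
function isAvailable(metadata, country, year) {
  return metadata.combinations.some(
    (c) => c.country === country && c.year === year
  );
}
```

Checking availability against the catalogue first is what allows the interface to present only valid country–cycle selections while keeping memory bounded to the chunks actually requested.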
The underlying data are sourced from the OECD PISA database through the learningtower R package (6), which provides cleaned and harmonised PISA extracts with standardised variable naming across eight assessment cycles (2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022). The processed dataset encompasses 101 participating countries and economies and approximately 3.5 million student observations.
The primary outcome variables are PISA achievement scores in mathematics, reading, and science. The primary stratification variable is ESCS (Economic, Social, and Cultural Status), a composite index of family background comprising three components: highest parental education (ISCED classifications), highest parental occupational status (HISEI index), and an index of home possessions and educational resources. ESCS is standardised to an OECD-referenced scale (mean 0, SD 1 for OECD countries), making it directly interpretable as a measure of relative socioeconomic advantage (7). Each student record additionally includes demographic variables, home resource indicators, and the final student weight (W_FSTUWT) and senate weight (W_FSENWT) required for design-consistent estimation.
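Design-consistent descriptives reduce to weighted moments over the student weight. A minimal sketch (function names are illustrative, not EduStrat's API):

```javascript
// Weighted mean and standard deviation using final student weights
// (W_FSTUWT). Function names are illustrative, not EduStrat's API.
function weightedMean(x, w) {
  let sw = 0, sxw = 0;
  for (let i = 0; i < x.length; i++) {
    sw += w[i];
    sxw += x[i] * w[i];
  }
  return sxw / sw;
}

function weightedSD(x, w) {
  const m = weightedMean(x, w);
  let sw = 0, ss = 0;
  for (let i = 0; i < x.length; i++) {
    sw += w[i];
    ss += w[i] * (x[i] - m) ** 2;
  }
  return Math.sqrt(ss / sw); // population-style weighted variance
}
```

Substituting W_FSENWT for W_FSTUWT changes the estimand from a population-proportional summary to one in which each country contributes equally.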
EduStrat implements six analytic modules, each designed as a reusable building block for exploratory analysis of educational stratification. These span survey-weighted descriptive statistics, distributional summaries, ESCS gradients, pooled and country-panel regression, variance decomposition, and model diagnostics.
All descriptive and regression computations apply sampling weights. The application defaults to the final student weight (W_FSTUWT) and optionally supports senate weights (W_FSENWT) for equal-country estimands (1).
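In its simplest form, the ESCS gradient is the slope from a weighted least-squares regression of achievement on ESCS. A minimal bivariate sketch (the actual regression module additionally handles controls and panel specifications):

```javascript
// Weighted simple-regression slope: the bivariate ESCS gradient.
// Illustrative sketch only; the real module supports controls and panels.
function escsGradient(score, escs, w) {
  let sw = 0, sx = 0, sy = 0;
  for (let i = 0; i < w.length; i++) {
    sw += w[i]; sx += w[i] * escs[i]; sy += w[i] * score[i];
  }
  const mx = sx / sw, my = sy / sw;
  let sxy = 0, sxx = 0;
  for (let i = 0; i < w.length; i++) {
    sxy += w[i] * (escs[i] - mx) * (score[i] - my);
    sxx += w[i] * (escs[i] - mx) ** 2;
  }
  return sxy / sxx; // expected score-point change per 1 SD of ESCS
}
```

Because ESCS is standardised to an OECD mean of 0 and SD of 1, the slope reads directly as the expected score-point difference associated with one standard deviation of socioeconomic advantage.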
Achievement in PISA is released as plausible values—multiple draws from the estimated proficiency distribution under a latent regression model—which are designed to support valid population inference under matrix sampling (4, 8, 5). Standard practice repeats each computation across all plausible values and combines the estimates following Rubin's rules. For interactivity, EduStrat currently uses the first plausible value per domain and flags this constraint explicitly in all outputs. Users requiring publication-grade inference should replicate selected specifications using all plausible values and appropriate replicate-weight variance procedures in a dedicated statistical environment.
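The combination step that EduStrat defers to offline replication follows Rubin's rules: the pooled point estimate is the mean across the M plausible-value estimates, and the total variance adds a between-imputation component inflated by (1 + 1/M). A sketch of the standard combination (this is textbook practice, not EduStrat code):

```javascript
// Rubin's rules for combining estimates across M plausible values.
// estimates[m] and variances[m] come from repeating the analysis once
// per plausible value; this is standard practice, not EduStrat code.
function combinePV(estimates, variances) {
  const M = estimates.length;
  const pooled = estimates.reduce((a, b) => a + b, 0) / M;
  const within = variances.reduce((a, b) => a + b, 0) / M;
  const between =
    estimates.reduce((a, e) => a + (e - pooled) ** 2, 0) / (M - 1);
  const total = within + (1 + 1 / M) * between;
  return { estimate: pooled, se: Math.sqrt(total) };
}
```

Using a single plausible value yields an unbiased point estimate on average but understates uncertainty, because the between-imputation term is dropped; this is precisely the constraint the interface flags.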
A practical design priority is that every analysis should be exportable with sufficient metadata to document and reproduce the analytic choices. EduStrat provides four export channels: CSV tables, PNG figures, SVG figures, and self-contained HTML reports.
All exports encode the selected countries, years, outcome domain, predictor variables, control choices, weighting scheme, and estimator family, so that results can be traced to specific specifications. This supports the minimum standard for computational reproducibility in exploratory work: that analytic choices are recorded and can be re-executed in a more formal environment (9).
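An embedded specification record might look as follows. The field names here are assumptions for illustration; the actual schema is documented alongside the export system.

```javascript
// Illustrative specification record attached to an export.
// Field names are assumptions, not EduStrat's actual schema.
function buildExportMetadata(selection) {
  return {
    tool: 'EduStrat',
    exportedAt: new Date().toISOString(),
    countries: selection.countries,    // e.g. ['DEU', 'FRA']
    years: selection.years,            // e.g. [2018, 2022]
    domain: selection.domain,          // 'math' | 'read' | 'science'
    predictors: selection.predictors,  // e.g. ['escs']
    weight: selection.weight,          // 'W_FSTUWT' or 'W_FSENWT'
    estimator: selection.estimator,    // e.g. 'pooled-ols'
    plausibleValue: 1,                 // single-PV constraint, flagged
  };
}
```

A record of this shape is sufficient to re-run the same specification in R or Stata with full plausible-value combination, which is the replication path the paper recommends.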
Data quality is validated at three stages. First, the R preprocessing pipeline (03-validate-chunks.R) checks each country–year file for structural integrity, variable completeness, plausible ranges, and consistency with the metadata catalogue. Second, the browser-based data loader verifies JSON structure, required fields, and weight availability before admitting any chunk into the analysis workspace. Third, the analysis modules include internal checks: regression routines verify positive-definite design matrices, variance decomposition verifies non-negative components, and all modules flag when effective sample sizes fall below reporting thresholds.
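The browser-side stage can be sketched as a predicate over a loaded chunk. The field names below are illustrative, not the real chunk schema:

```javascript
// Minimal structural validation of a loaded country-year chunk.
// Field names are illustrative, not the real chunk schema.
function validateChunk(chunk) {
  const errors = [];
  if (!Array.isArray(chunk.students) || chunk.students.length === 0) {
    errors.push('missing or empty student records');
  } else {
    const required = ['math', 'read', 'science', 'escs', 'w_fstuwt'];
    const first = chunk.students[0];
    for (const f of required) {
      if (!(f in first)) errors.push(`missing field: ${f}`);
    }
    if (chunk.students.some((s) => !(s.w_fstuwt > 0))) {
      errors.push('non-positive or missing weights');
    }
  }
  return { ok: errors.length === 0, errors };
}
```

Rejecting a chunk before it enters the workspace keeps every downstream module able to assume complete fields and positive weights.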
Statistical outputs have been cross-validated against published OECD country statistics for selected country–year combinations (e.g. mean mathematics scores and ESCS gradients for OECD countries in 2018 and 2022), with discrepancies documented in the methodology appendix where they arise from the single-plausible-value constraint or differences in missing-data treatment.
EduStrat's outputs should be interpreted as descriptive associations produced by a transparent, interactive implementation, not as definitive inferential results. Four limitations warrant explicit acknowledgement, most notably the use of a single plausible value per domain and the absence of replicate-weight variance estimation in the interactive layer.
| Item | Detail |
|---|---|
| Name | EduStrat — Educational Stratification in PISA |
| Operating system | Platform-independent (browser-based) |
| Programming language | JavaScript (ES6 modules); R 4.0+ (data preprocessing pipeline) |
| Runtime dependencies | Modern web browser (Chrome 90+, Firefox 88+, Safari 15+, Edge 90+). No server-side runtime required. |
| JavaScript libraries | Plotly.js 2.35.2, simple-statistics 7.8.0, jStat 1.9.4, D3.js 7.8.5, PapaParse 5.3.0 (loaded via CDN) |
| R dependencies (pipeline only) | learningtower ≥ 1.1.0, dplyr, jsonlite, tidyr |
| Data source | OECD PISA Database via the learningtower R package (6) |
| Licence | MIT |
| Code repository | github.com/kevisc/edustrat |
| Archived copy | [Zenodo DOI to be assigned upon deposit] |
| Live deployment | kevinschoenholzer.com/edustrat/ |
| Document | Description | Location |
|---|---|---|
| Methodology | Formal statistical definitions, formulas, and assumptions | docs/methodology.html |
| Data sources | Variable provenance, coverage by cycle, and codebook | docs/data-sources.html |
| Citation guide | Recommended citations in APA, Chicago, MLA, and BibTeX formats | docs/citation.html |
| Pipeline README | Instructions for regenerating data files from source | pipeline/scripts/README.md |
EduStrat addresses a recurring need in comparative education research and teaching: rapid, documented, exploratory analysis of PISA microdata without the overhead of configuring a local statistical environment. Its reuse potential spans four domains.
Researchers can use EduStrat as a specification-screening tool. By comparing pooled and panel estimators, toggling weight conventions, and varying country selections, users can assess whether descriptive patterns regarding the relationship between family background and achievement are robust to common analytic alternatives before investing in code-based replication with full plausible-value combination and replicate-weight inference. All results are exportable with analytic metadata, supporting integration into larger reproducible workflows.
EduStrat provides a zero-installation platform for classroom demonstrations of key concepts in quantitative methods and comparative education: why survey weights matter, what plausible values are, how fixed- and random-effects estimators differ, and how ESCS gradients vary across education systems. Instructors can guide students through live comparisons without requiring them to install specialised software or manage large microdata files (1, 5).
The combination of interactive selection, automated computation, and structured export makes EduStrat suitable for producing rapid cross-country profiles of educational stratification. Users can select a year and a set of countries, generate weighted distributional summaries and ESCS gradients, and export compact tables and figures for briefing materials, revealing which education systems show stronger or weaker intergenerational transmission.
The modular ES6 architecture is designed to accommodate extensions. The analysis layer (js/analysis/) and visualisation layer (js/visualization/) are separated from the data-loading infrastructure (js/core/), so new analytic modules or chart types can be added without modifying the data pipeline. The R preprocessing scripts can be adapted to incorporate additional PISA variables or future assessment cycles as they become available through the learningtower package or directly from the OECD. The MIT licence permits unrestricted modification and redistribution.
Bug reports, feature requests, and contributions can be submitted through the project's GitHub issue tracker at github.com/kevisc/edustrat/issues. Documentation covering installation, data regeneration, and statistical methods is maintained alongside the source code.
The PISA data used in this application are made available by the OECD. The learningtower R package (6), developed by Kevin Wang, Paul Yacobellis, Erika Siregar, and collaborators, provided the harmonised data extracts that underpin the preprocessing pipeline. The author gratefully acknowledges the developers of the open-source JavaScript libraries on which EduStrat depends: Plotly.js, simple-statistics, jStat, D3.js, and PapaParse.
The author declares no competing interests.
K.S. designed and implemented the software, processed the data, wrote the documentation, and drafted this paper.
[1] OECD. PISA data analysis manual: SPSS. 2nd ed. Paris: OECD Publishing; 2009. DOI: 10.1787/9789264056275-en
[2] OECD. PISA 2022 Technical Report. Paris: OECD Publishing; 2024. DOI: 10.1787/01820d6d-en
[3] Sirin SR. Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research. 2005; 75(3): 417–453. DOI: 10.3102/00346543075003417
[4] Mislevy RJ, Beaton AE, Kaplan B, Sheehan KM. Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement. 1992; 29(2): 133–161. DOI: 10.1111/j.1745-3984.1992.tb00371.x
[5] Wu M. The role of plausible values in large-scale surveys. Studies in Educational Evaluation. 2005; 31(2–3): 114–128. DOI: 10.1016/j.stueduc.2005.05.005
[6] Wang K, Yacobellis P, Siregar E, Romanes S, Fitter K, Dalla Riva GV, Cook D, Tierney N, Dingorkar P, Sai Subramanian S, Chen G. learningtower: OECD PISA datasets from 2000–2022 in an easy-to-use format [R package, version 1.1.0]. 2024. DOI: 10.32614/CRAN.package.learningtower
[7] Wuyts C. The measurement of socio-economic status in PISA. Paris: OECD; 2024.
[8] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons; 1987. DOI: 10.1002/9780470316696
[9] Peng RD. Reproducible research in computational science. Science. 2011; 334(6060): 1226–1227. DOI: 10.1126/science.1213847
[10] Hanushek EA, Woessmann L. The economics of international differences in educational achievement. In: Hanushek EA, Machin S, Woessmann L, editors. Handbook of the Economics of Education. Vol. 3. Amsterdam: North-Holland; 2011. p. 89–200. DOI: 10.1016/B978-0-444-53429-3.00002-8
[11] Reardon SF. The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In: Duncan GJ, Murnane RJ, editors. Whither Opportunity? Rising Inequality, Schools, and Children's Life Chances. New York: Russell Sage Foundation; 2011. p. 91–116.