policydatainfrastructure.com · Project Report · April 2026

Policy Data Infrastructure

An open-source pipeline from raw public data to policy-ready deliverables. National scale. Tract-level precision. Raw data first.

DojoGenesis/policy-data-infrastructure · Apache-2.0 · Cruz Morales
01 — Origin

Where This Comes From

Policy Data Infrastructure is an open-source data pipeline that operates at national scale and at the county, tract, block group, and ward level from day one.

The foundation is the Madison Equity Atlas—a 22-layer GIS platform analyzing 125 census tracts in Dane County, Wisconsin. The Atlas produced the statistical methodology, the Python data acquisition core, and the evidence-card framework that generated 70 policy analyses across all 72 Wisconsin counties. PDI builds a Go orchestration layer over that core—a compiled pipeline engine that runs the same analyses faster, at any geographic scope, with a proper API and narrative generation built in.

The Atlas proved what data infrastructure can do. Five Mornings in Madison—five households, five alarm clocks, the same city—proved that when the data is structured right, it produces stories that move people. The Partnership Proposal proved it can build coalitions. The Field Guide proved it can brief decision-makers. PDI generalizes all of this to any county in the country.

02 — Pipeline

The Five-Stage Pipeline

Data moves from public sources to policy-ready deliverables through five stages. Each stage has a clear input contract and output format. The pipeline is a DAG—stages run in concurrent waves, bounded by parallelism settings.
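
A minimal Python sketch of that scheduling model (illustrative only, not the Go engine; the stage names and the `DEPS` graph are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage graph: stage -> stages it depends on.
DEPS = {
    "acquire": [],
    "process": ["acquire"],
    "analyze": ["process"],
    "visualize": ["process"],
    "narrate": ["analyze"],
}

def waves(deps):
    """Group stages into waves: a stage joins a wave once every
    dependency has run in an earlier wave (Kahn-style layering)."""
    done, layers = set(), []
    while len(done) < len(deps):
        wave = sorted(s for s in deps
                      if s not in done and all(d in done for d in deps[s]))
        if not wave:
            raise ValueError("cycle in stage graph")
        layers.append(wave)
        done.update(wave)
    return layers

def run(deps, stage_fn, parallelism=4):
    """Run each wave concurrently, bounded by `parallelism`;
    consuming the map acts as a barrier between waves."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for wave in waves(deps):
            list(pool.map(stage_fn, wave))
```

The barrier between waves is what makes the input/output contracts enforceable: a stage never starts until every upstream output exists.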

1 — Source Acquisition

External data pulled from public APIs into raw storage. The system knows 12 upstream sources, their rate limits, their geographic resolutions, and their update schedules.

Census ACS · TIGER/Line · CDC PLACES · EPA EJScreen · BLS LAUS · USDA Food · HUD CHAS · HRSA HPSA
2 — Processing & Normalization

Raw data cleaned, filtered to the target geography, joined to tract-level geometry via PostGIS, and output as standardized indicator records. Every value carries its GEOID, source metadata, and vintage year. Missing data is null, never a sentinel value—because a zero and an absence are two different truths.
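
A sketch of the missing-data rule in Python (the record shape and the sentinel set are illustrative assumptions, not the pipeline's actual schema):

```python
# Illustrative sentinel set -- ACS, for example, publishes annotation
# values like -666666666 for suppressed estimates (exact membership
# here is an assumption for the sketch).
SENTINELS = {-999, -666666666}

def to_indicator(raw, geoid, source, vintage):
    """Standardize one value: missing data becomes None (SQL NULL),
    never a sentinel and never 0, so absence stays distinct from zero."""
    missing = raw is None or raw in SENTINELS
    return {
        "geoid": geoid,          # 11-digit tract GEOID
        "value": None if missing else float(raw),
        "source": source,
        "vintage": vintage,
    }
```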

3 — Analytical Computation

The statistical engine reads processed indicators and computes derived metrics: z-score normalization, OLS regression, Blinder-Oaxaca decomposition, bootstrap confidence intervals, and piecewise tipping-point detection. The same methodology that produced the Atlas findings—now generalized to any county in the country.
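
Two of those computations are simple enough to sketch. A hedged Python illustration of z-score normalization and a percentile bootstrap confidence interval (function names are ours, not the engine's):

```python
import numpy as np

def z_scores(values):
    """Z-score normalization: (x - mean) / sd, NaN-aware."""
    x = np.asarray(values, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=1)

def bootstrap_ci(values, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, then take
    the alpha/2 and 1 - alpha/2 quantiles of the resampled statistic."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    reps = [stat(rng.choice(x, size=x.size, replace=True))
            for _ in range(n_boot)]
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```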

4 — Visualization

Single-file HTML applications consume the processed data. Leaflet for choropleth maps, Alpine.js for interactivity, Chart.js for statistical displays—all self-contained, no server required.

5 — Communication Layer

This is where the pipeline produces value. Research outputs, grant proposals, partnership materials, and narrative documents that reference real findings. A narrative engine with Go templates that can generate policy deliverables for any county in America, grounded in that county’s actual tract-level data.

03 — Current State

What’s Already Working

70 — Evidence cards across 18 policy categories
1,542 — Census tracts with ICE scores
72 — Wisconsin counties analyzed
2 — SDOH factors discovered (66.5% of variance)
48 — Go source files, 20 test files
9 — Python ingest scripts
12 — REST API endpoints + SSE
30 — Peer-reviewed sources in the research base

The VPS is running at pdi.trespies.dev with PostgreSQL + PostGIS. The project website is live at policydatainfrastructure.com. The pipeline runs end to end. The narrative engine renders. The statistical architecture has been refactored from the ground up.

04 — What We Refactored

Composites Without Validation Were Not a Foundation

The Madison Equity Atlas used the Neighborhood Attendance Risk Index (NARI)—a composite of 8 indicators averaged by percentile rank—to identify “priority tracts.” That approach was useful for a prototype at 125 tracts. PDI replaces it with a research-grounded statistical architecture.

What the research shows

Equal-weighted composites hide more than they reveal. The CDC Social Vulnerability Index—16 variables, equal weights—predicts only 38.9% of COVID case variability. Factor analysis shows 3–4 variables carry most of the variance; the other 12 are correlated noise that inflates apparent precision.

Unstandardized composites collapse to proxies. The Area Deprivation Index, when computed without standardizing variables, is 98.8% explained by just two variables (income and home value)—a "17-variable index" that is functionally a 2-variable proxy.

Rankings are methodologically unstable. Environmental composite index rankings differ by an average of 45 places across alternative weight specifications. If rankings shift substantially under sensitivity analysis, the composite should not be presented as authoritative.

The prototype’s NARI was never tested against an outcome it didn’t contain. Its 8 indicators were selected by intuition, not factor analysis. Its tier cutoffs (80th percentile = “Critical”) were arbitrary. Copying it to national scale without re-validating against real research questions is how interpretability debt accumulates.

The Stiglitz-Sen-Fitoussi Rule

“Weights embed hidden normative choices disguised as technical choices.” The Commission recommended presenting composites alongside a dashboard of raw indicators—so the underlying dimensions remain visible and the composite cannot substitute for them.

05 — New Architecture

Raw Data First, Composites Only When Earned

The refactored architecture has five layers. Each builds on the one below. Composites exist only at the top—computed at query time, never stored as truth.

1 — Raw Indicators: stored in PostgreSQL; every value carries a reliability flag (CV < 0.15 = high, 0.15–0.30 = moderate, > 0.30 = low).
2 — Validated Features: ICE (Krieger 2016), Dissimilarity Index (Massey & Denton 1988), Housing Cost Burden (HUD)—literature-grounded, not invented.
3 — Factor Scores: EFA-derived, named by loading profile. Wisconsin analysis found 2 factors (66.5% of variance): Mental Health / Economic Deprivation and Cardiovascular / Metabolic.
4 — Spatial Analysis: LISA cluster maps, GWR local coefficients, multilevel variance partitioning, SKATER regionalization.
5 — Composite Views: query-time only; geometric-mean aggregation; accompanied by a sensitivity analysis showing ranking stability under ±20% weight perturbation.
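
A Python sketch of the top layer's two ideas—query-time geometric-mean aggregation and ±20% weight-perturbation sensitivity (illustrative only; assumes indicators are already rescaled to positive values):

```python
import numpy as np

def composite(indicators, weights):
    """Weighted geometric mean over positive, rescaled indicators
    (rows = tracts, columns = indicators)."""
    x = np.asarray(indicators, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.exp(np.log(x) @ w)

def rank_stability(indicators, weights, perturb=0.20, n_draws=500, seed=0):
    """Mean absolute rank shift when each weight is independently
    perturbed by up to +/-20%; values near 0 mean rankings are stable."""
    rng = np.random.default_rng(seed)
    base = np.argsort(np.argsort(composite(indicators, weights)))
    shifts = []
    for _ in range(n_draws):
        jitter = rng.uniform(1 - perturb, 1 + perturb, size=len(weights))
        r = np.argsort(np.argsort(composite(indicators,
                                            np.asarray(weights) * jitter)))
        shifts.append(np.abs(r - base).mean())
    return float(np.mean(shifts))
```

Because the composite is computed at query time, the stability number can be shipped next to every ranking it accompanies.
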
What gets replaced

CompositeIndex() with equal weights → ValidatedFeatures() + query-time composites

AssignTiers() with arbitrary cutoffs → LISA cluster classification from actual spatial patterns

NARI as stored score → Named factor scores + ICE as first-class indicators

Tier badges in narratives → Factor profile descriptions (“Economic Distress: 92nd percentile”)

06 — Methods

What Replaces Composites

Each method below is grounded in peer-reviewed literature and tested at the 85,000-tract scale the platform targets.

LISA Cluster Maps

Local Indicators of Spatial Association. Classifies each tract as High-High (concentrated disadvantage), Low-Low, High-Low or Low-High (spatial outliers), or Not Significant. The core equity atlas visual.

Interpretability: 5/5
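
A toy Python sketch of the quadrant classification (the permutation-based significance testing a real LISA requires is omitted, so no unit is labeled "Not Significant" here):

```python
import numpy as np

def lisa_classes(values, W):
    """Classify each unit by its local Moran quadrant. `values` is a
    1-D array; `W` is a row-standardized spatial weights matrix."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    lag = W @ z                         # spatially lagged z-scores
    labels = []
    for zi, li in zip(z, lag):
        if zi > 0 and li > 0:
            labels.append("High-High")
        elif zi < 0 and li < 0:
            labels.append("Low-Low")
        elif zi > 0:
            labels.append("High-Low")   # high value amid low neighbors
        else:
            labels.append("Low-High")   # low value amid high neighbors
    return labels
```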

ICE (Index of Concentration at the Extremes)

Krieger et al. 2016. Measures polarization: (high-income white − low-income POC) / total population. Validated, directional, does not collapse race and income into a dimensionless score.

Interpretability: 5/5
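
The formula is simple enough to sketch directly (Python, illustrative; returning `None` for an empty denominator is our assumption):

```python
def ice(privileged, deprived, total):
    """Index of Concentration at the Extremes:
    (privileged - deprived) / total, ranging from -1 (everyone at the
    deprived extreme) to +1 (everyone at the privileged extreme)."""
    if not total:
        return None   # no population -> no score, never a sentinel
    return (privileged - deprived) / total
```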

Factor Analysis (EFA)

Oblimin rotation on 50+ indicators, with parallel analysis to choose the factor count. Factors are named by loading profile, not by number. Kolak et al. found 4 factors explaining 71% of variance at 72K tracts.

Interpretability: 5/5
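
The factor-count step (Horn's parallel analysis) can be sketched with plain numpy; the rotation and loading-based naming would use a dedicated EFA library, so this shows only the retention rule:

```python
import numpy as np

def parallel_analysis_k(data, n_sims=200, seed=0):
    """Horn's parallel analysis for choosing the factor count: retain
    factors whose correlation-matrix eigenvalues exceed the mean
    eigenvalues of same-shaped random normal data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    real = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        r = rng.standard_normal((n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    return int((real > sims.mean(axis=0)).sum())
```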

Threshold Detection

Segmented regression identifies breakpoints: “Above 35% poverty, diabetes prevalence increases 4x faster.” County-level first, then validate at tract level.

Interpretability: 5/5
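
A grid-search sketch of the breakpoint step in Python (two-segment OLS, illustrative; a production fit would use proper segmented regression with inference):

```python
import numpy as np

def find_breakpoint(x, y, grid=None):
    """Two-segment piecewise fit: scan candidate breakpoints, fit OLS
    lines on each side, keep the split with the lowest total SSE."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if grid is None:
        grid = np.quantile(x, np.linspace(0.1, 0.9, 33))
    best_sse, best_bp = np.inf, None
    for bp in grid:
        sides = (x <= bp, x > bp)
        if min(m.sum() for m in sides) < 3:
            continue   # need enough points to fit each segment
        sse = 0.0
        for mask in sides:
            A = np.column_stack([x[mask], np.ones(mask.sum())])
            coef = np.linalg.lstsq(A, y[mask], rcond=None)[0]
            sse += float(((y[mask] - A @ coef) ** 2).sum())
        if sse < best_sse:
            best_sse, best_bp = sse, float(bp)
    return best_bp
```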

Quantile Regression

For a given poverty rate, what is the 10th/50th/90th percentile of health outcomes? Identifies positive-deviance communities beating expectations—more actionable than cataloguing worst cases.

Interpretability: 4/5
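
A binned approximation in Python conveys the idea (a real analysis would fit pinball-loss quantile regression; this sketch is illustrative):

```python
import numpy as np

def conditional_quantiles(x, y, n_bins=10, qs=(0.1, 0.5, 0.9)):
    """Binned stand-in for quantile regression: within each band of x
    (e.g. poverty rate), the requested percentiles of y. Points on a
    shared bin edge fall into both neighboring bands in this sketch."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x <= hi)
        bands.append((float(lo), float(hi), np.quantile(y[mask], qs)))
    return bands
```

Tracts sitting above the 90th-percentile curve for their poverty band are the positive-deviance candidates.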

Multilevel Variance Partitioning

Tracts nested in counties nested in states. “27% of variation in uninsurance is between states—state Medicaid policy matters as much as local poverty.”

Interpretability: 4/5
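
A crude one-level version of the partition can be sketched in Python (a real analysis would fit a multilevel model with random intercepts; this ANOVA-style decomposition is illustrative):

```python
import numpy as np

def between_group_share(values, groups):
    """Share of total variance carried by group means (e.g. states)
    versus within-group variation -- a crude stand-in for a multilevel
    variance partition."""
    v = np.asarray(values, dtype=float)
    g = np.asarray(groups)
    grand = v.mean()
    between = within = 0.0
    for label in np.unique(g):
        sub = v[g == label]
        between += sub.size * (sub.mean() - grand) ** 2
        within += ((sub - sub.mean()) ** 2).sum()
    return between / (between + within)
```
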
07 — Landscape

Where PDI Fits

Platform              | Open Source | National   | API        | Narrative    | Raw-First
Census Reporter       | Yes         | Yes        | Yes        | No           | Yes
COI 3.0               | Docs only   | Yes        | Download   | No           | Composite
National Equity Atlas | No          | Metro/city | No         | No           | Dashboard
Opportunity Insights  | Code only   | Yes        | Download   | No           | Yes
PolicyMap             | No          | Yes        | No         | No           | Mixed
PDI                   | Yes         | Yes        | REST + SSE | Go templates | Yes

No open-source platform currently offers the full stack: ingestion + statistical computation + API + narrative generation + visualization in one deployable package. PDI is the first to attempt it with a raw-data-first statistical architecture.

08 — Vision

The Purpose Behind This Work

The purpose of PDI is to multiply knowledge and power—to strengthen policy proposals with statistics, raise awareness of issues using data, and connect achievable goals to the communities they would reach.

The vision is that this data infrastructure will produce stories and narratives backed by real facts and figures—not just charts that get filed, but documents that keep organizations in the room after the presentation is over. What Five Mornings did for Madison, PDI can do for any county in the country: turn tract-level indicators into stories that decision-makers cannot ignore.

The infrastructure exists now so that when a campaign asks “what does the data say about food access in Rusk County?” the answer is already computed, the map is already rendered, and the story is already waiting to be told.

What makes this different

Open-source from day one. Apache-2.0. Fork it, deploy it, extend it.

Raw data is the foundation. No unvalidated composites. Every indicator carries a reliability flag. Composites are query-time views, not stored truth.

Research-grounded methods. 30 peer-reviewed sources inform the statistical architecture. Every method traces to published validation.

Narrative generation. The only open-source platform in this landscape that programmatically turns indicator data into policy-ready documents.

PDI is open to contributors, collaborators, and communities that want to build data infrastructure that serves policy, not just measures it.

09 — What's Happened Since

Progress Since the Initial Build

Since the rough draft shipped, the infrastructure has been audited, refactored, and extended. Here is what changed.

Factor Analysis — The Data Named Its Dimensions

Exploratory factor analysis on 1,265 Wisconsin tracts across 12 SDOH indicators produced two factors explaining 66.5% of variance (KMO = 0.833):

Factor 1: Mental Health / Economic Deprivation (38.4%) — poverty rate, mental health prevalence, ICE score, healthcare access. These move together across Wisconsin tracts.

Factor 2: Cardiovascular / Metabolic (28.1%) — high blood pressure, diabetes, physical health, obesity. A separate dimension that does not reduce to poverty.

This confirms the refactor decision: averaging these two dimensions into a single composite would hide the fact that they are independent. A tract can score high on economic deprivation and low on metabolic risk, or vice versa. The composite would tell you neither.

True ICE Scores — No More Approximation

ACS table B19001 (household income by race) now provides the cross-tabulated counts needed for the Index of Concentration at the Extremes—replacing the poverty×race approximation from the initial build. 1,524 of 1,542 WI tracts (98.8%) have true ICE scores ranging from −0.65 (concentrated deprivation) to +0.82 (concentrated privilege).

Narrative Engine Fixed

The narrative rendering chain — which generates Five Mornings documents from tract data — was broken after the refactor because it still referenced the old NARI fields. Fixed: the selector, engine, and all three templates now use ICE and factor profiles. 33 tests pass.

Website Live

The project website is deployed at policydatainfrastructure.com via Cloudflare Pages. It includes a Five Mornings excerpt with sourced statistics, an interactive evidence card explorer, and a six-tab methodology section explaining the statistical architecture. Every number on the site has been audited against source material.

What’s Next

Research Grounding

This document is grounded in four research tracks conducted on April 14, 2026, reviewing 30 peer-reviewed and technical sources across validated composite index methodologies, disaggregated analysis methods, the open-source policy data platform landscape, and scalable spatial statistics. Full research documents and a structured reference list are available in the repository.