Data sources

A short index of where Civitar's data comes from and how to refresh it.

Site briefings (per-site analysis pipeline)

Sourced live at briefing-generation time by the agents in src/agents/:

| Agent | Source | API |
| --- | --- | --- |
| `biodiversity.py` (v3.2) | USFWS IPaC (primary), GBIF + IUCN (context/fallback) | https://ipac.ecosphere.fws.gov/location/api/resources |
| `env_justice.py` | U.S. Census ACS 5-year | api.census.gov (key required) |
| `geospatial.py` | Google Earth Engine (Sentinel-2) | EE Python client |
| `hydrology.py` | USGS National Hydrography Dataset + NASA GRACE-FO | waterservices.usgs.gov, GES DISC |
| `well_levels.py` | USGS NWIS daily values (param 72019) | waterservices.usgs.gov |
| `thermal.py` | NASA MODIS MOD11A2 LST | Google Earth Engine |
| `air_quality.py` | ESA Sentinel-5P TROPOMI | Google Earth Engine |
| `noise_receptors.py` | JRC Global Human Settlement Layer (GHS_POP 2030) | Google Earth Engine |
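
As an illustration of one of these live lookups, here is a minimal sketch of a USGS NWIS daily-values request for parameter 72019 (depth to water level). It is not taken from `well_levels.py`; the function names and the placeholder site number are assumptions, and the agent's actual request may differ.

```python
from urllib.parse import urlencode

NWIS_DV_URL = "https://waterservices.usgs.gov/nwis/dv/"


def build_dv_url(site: str, start: str, end: str, param: str = "72019") -> str:
    """Build a daily-values request URL for one site and one parameter."""
    query = urlencode({
        "format": "json",
        "sites": site,          # USGS site number (placeholder in the example below)
        "parameterCd": param,   # 72019 = depth to water level
        "startDT": start,       # YYYY-MM-DD
        "endDT": end,
    })
    return f"{NWIS_DV_URL}?{query}"


def parse_daily_values(payload: dict) -> list[tuple[str, float]]:
    """Flatten the NWIS JSON response into (date, value) pairs."""
    readings = []
    for series in payload.get("value", {}).get("timeSeries", []):
        for block in series.get("values", []):
            for point in block.get("value", []):
                readings.append((point["dateTime"][:10], float(point["value"])))
    return readings
```

The URL can be fetched with any HTTP client and the decoded JSON passed to `parse_daily_values`; keeping URL construction and parsing separate makes both easy to test without network access.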

Data center sites layer (map ingestion)

The map's nationwide data center (DC) marker layer is sourced from OpenStreetMap (OSM) via the Overpass API. PNNL's IM3 Data Center Atlas is itself derived from OSM, but its files sit behind a web application firewall (WAF) that blocks programmatic download, so we query the same upstream source directly.

Ingestion script: scripts/ingest_osm_datacenters.py
Output: docs/data/dc_sites.json (committed snapshot)
Refresh cadence: weekly is plenty (OSM tagging changes slowly)
Current count: ~1,500 existing US data centers + 6 hand-curated proposed sites
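
For orientation, here is a rough sketch of what an Overpass query for US data centers and the conversion to site records could look like. It is not copied from scripts/ingest_osm_datacenters.py: the tag filters (`telecom=data_center`, `building=data_center`), the endpoint, and the output field names are all assumptions, and the real script and the dc_sites.json schema may differ.

```python
OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # assumed default endpoint

# Assumed tag filters; the real ingestion script may select different tags.
QUERY = """
[out:json][timeout:180];
area["ISO3166-1"="US"][admin_level=2]->.us;
(
  nwr["telecom"="data_center"](area.us);
  nwr["building"="data_center"](area.us);
);
out center tags;
"""


def elements_to_sites(overpass_json: dict) -> list[dict]:
    """Convert Overpass elements into flat site records (hypothetical schema)."""
    sites = []
    for el in overpass_json.get("elements", []):
        # Nodes carry lat/lon directly; ways and relations carry a "center"
        # object when the query ends with `out center`.
        pos = el if el["type"] == "node" else el.get("center", {})
        if "lat" not in pos or "lon" not in pos:
            continue
        tags = el.get("tags", {})
        sites.append({
            "osm_id": f'{el["type"]}/{el["id"]}',
            "lat": pos["lat"],
            "lon": pos["lon"],
            "name": tags.get("name"),
            "operator": tags.get("operator"),
            "status": "existing",
        })
    return sites
```

Sending `QUERY` as the `data` form field of a POST to `OVERPASS_URL` returns the JSON that `elements_to_sites` expects.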

Why OSM instead of PNNL directly

PNNL's processed atlas (DOI 10.57931/3017294) adds validated names, operators, and square-footage estimates beyond raw OSM tags — which is genuinely valuable. But the actual CSV/GPKG files are hosted behind a WAF on im3.pnnl.gov that rejects scripted access, and the OSTI/MSD-LIVE mirrors only carry stub dummy.txt files. To ingest PNNL's processed dataset programmatically we'd need to either (a) reach out to the PNNL research team for direct download access, or (b) manually download via browser and commit a snapshot.

Both are reasonable future steps. For now, querying OSM Overpass directly gets us most of the value with no manual download or correspondence step.

Proposed-DC sourcing

OSM doesn't distinguish proposed from operating facilities. The 6 proposed sites in dc_sites.json are hand-curated from public permitting filings and news coverage. A real proposed-DC pipeline would need to combine:

EIA Form 860 (annual generator inventory, including planned units; an indirect, early grid-load signal)
State PUC dockets (utility load-growth filings often name DC customers)
Local zoning / permit applications (most authoritative, but fragmented across thousands of jurisdictions)
DC trade press — Data Center Dynamics, Bisnow, etc.

This is Phase 3 of the data-coverage roadmap and is out of scope for the current work.

Attribution

OSM data is ODbL-licensed. Required attribution on any UI that displays this data: "© OpenStreetMap contributors (ODbL)". The map's existing Leaflet attribution control carries this string; the footer also references the data lineage.

The ODbL "share-alike" clause applies if Civitar publishes a derived database (e.g., a combined briefings + DC list export). Per-record use within the webapp does not trigger it. If we ever ship a downloadable dataset, we'd need to release it under ODbL or a compatible license.

How to refresh the DC layer

```bash
python3.13 -m scripts.ingest_osm_datacenters
```

Writes docs/data/dc_sites.json. The mockups load this file at runtime via fetch('data/dc_sites.json') — no rebuild needed.

The ingestion script prints a regional sanity check so you can verify coverage hasn't unexpectedly dropped:

```
Regional sanity check:
  Northern Virginia (Loudoun): 305
  Memphis metro: 4
  Phoenix West Valley: 14
  Dallas-Fort Worth: 62
  Greater Atlanta: 25
```
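
If you want to compare these counts against a previous run mechanically, a small helper along these lines could do it. This is a sketch, not part of the ingestion script: the function name and the 25% threshold are assumptions, and the "current" counts in the usage example are made up for illustration.

```python
def flag_drops(previous: dict[str, int], current: dict[str, int],
               max_drop: float = 0.25) -> list[str]:
    """Return regions whose count fell by more than `max_drop` (as a fraction)."""
    flagged = []
    for region, before in previous.items():
        after = current.get(region, 0)  # a missing region counts as zero
        if before > 0 and (before - after) / before > max_drop:
            flagged.append(region)
    return flagged


# Counts from the sample output above, treated as the previous run.
previous = {
    "Northern Virginia (Loudoun)": 305,
    "Memphis metro": 4,
    "Phoenix West Valley": 14,
    "Dallas-Fort Worth": 62,
    "Greater Atlanta": 25,
}
# Hypothetical new run in which one region has collapsed.
current = dict(previous, **{"Northern Virginia (Loudoun)": 150})
print(flag_drops(previous, current))
```

A relative threshold is noisy for very small regions (Memphis metro has only 4 sites), so treat flags on those as a prompt to look rather than as an alarm.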

If these numbers drop dramatically vs. the previous run, OSM tagging may have changed or Overpass may have transient issues — try a different Overpass mirror (overpass.kumi.systems or overpass.osm.ch) before assuming the data has disappeared.
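
Trying mirrors can also be automated. The sketch below walks a list of Overpass endpoints in order and returns the first successful response. It is an assumption-laden illustration rather than the script's actual behavior: the `/api/interpreter` paths, the endpoint order, and the function names are all hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URLs for the main instance and the two mirrors named above.
ENDPOINTS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",
    "https://overpass.osm.ch/api/interpreter",
]


def post_query(endpoint: str, query: str, timeout: float = 200.0) -> dict:
    """POST an Overpass QL query as the `data` form field and decode the JSON reply."""
    body = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(endpoint, data=body, timeout=timeout) as resp:
        return json.load(resp)


def query_with_fallback(query: str, endpoints=ENDPOINTS, fetch=post_query) -> dict:
    """Try each endpoint in order; raise only if every one of them fails."""
    errors = []
    for endpoint in endpoints:
        try:
            return fetch(endpoint, query)
        except (OSError, ValueError) as exc:  # network/HTTP errors, malformed JSON
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all Overpass endpoints failed:\n" + "\n".join(errors))
```

The `fetch` parameter exists so the fallback logic can be exercised with a stub instead of live network calls.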