
Data sources

A short index of where Civitar's data comes from and how to refresh it.

Site briefings (per-site analysis pipeline)

Sourced live at briefing-generation time by the agents in src/agents/:

| Agent | Source | API |
| --- | --- | --- |
| biodiversity.py (v3.2) | USFWS IPaC (primary), GBIF + IUCN (context/fallback) | https://ipac.ecosphere.fws.gov/location/api/resources |
| env_justice.py | U.S. Census ACS 5-year | api.census.gov (key required) |
| geospatial.py | Google Earth Engine (Sentinel-2) | EE Python client |
| hydrology.py | USGS National Hydrography Dataset + NASA GRACE-FO | waterservices.usgs.gov, GES DISC |
| well_levels.py | USGS NWIS daily values (param 72019) | waterservices.usgs.gov |
| thermal.py | NASA MODIS MOD11A2 LST | Google Earth Engine |
| air_quality.py | ESA Sentinel-5P TROPOMI | Google Earth Engine |
| noise_receptors.py | JRC Global Human Settlement Layer (GHS_POP 2030) | Google Earth Engine |
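As an illustration of what one of these live lookups involves, here is a minimal sketch of a USGS NWIS daily-values request for parameter 72019 (depth to water level, feet below land surface). It is not the code in well_levels.py: the function names are hypothetical, the site number in the usage note is a placeholder, and the JSON path in `extract_values` assumes the WaterML-style layout the service returns for `format=json`.

```python
from urllib.parse import urlencode

NWIS_DV_URL = "https://waterservices.usgs.gov/nwis/dv/"


def build_dv_url(site_no: str, start: str, end: str, param: str = "72019") -> str:
    """Return a daily-values request URL for one site and one parameter code."""
    query = urlencode({
        "format": "json",
        "sites": site_no,
        "parameterCd": param,
        "startDT": start,  # ISO dates, e.g. "2024-01-01"
        "endDT": end,
    })
    return f"{NWIS_DV_URL}?{query}"


def extract_values(payload: dict) -> list[tuple[str, float]]:
    """Flatten a daily-values response into (dateTime, value) pairs.

    Assumes the layout value -> timeSeries[] -> values[] -> value[].
    """
    readings = []
    for series in payload.get("value", {}).get("timeSeries", []):
        for block in series.get("values", []):
            for point in block.get("value", []):
                readings.append((point["dateTime"], float(point["value"])))
    return readings
```

Calling `build_dv_url("<site number>", "2024-01-01", "2024-01-31")` yields a URL that can be fetched with any HTTP client; the decoded JSON body is what `extract_values` expects.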

Data center sites layer (map ingestion)

The map's nationwide DC marker layer is sourced from OpenStreetMap via the Overpass API. PNNL's IM3 Data Center Atlas is itself derived from OSM but is gated behind a WAF that blocks programmatic download; we go to the same source directly.

- Ingestion script: scripts/ingest_osm_datacenters.py
- Output: docs/data/dc_sites.json (committed snapshot)
- Refresh cadence: weekly is plenty (OSM tagging changes slowly).
- Current count: ~1,500 existing US data centers + 6 hand-curated proposed sites.
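For readers unfamiliar with Overpass, the sketch below shows roughly what an ingestion of this kind looks like. It is not the contents of scripts/ingest_osm_datacenters.py. The tag filters (`telecom=data_center`, `building=data_center`), the output field names, and the helper names are all assumptions made for illustration; the real script's query and the actual schema of dc_sites.json may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_query(timeout_s: int = 180) -> str:
    """Overpass QL for data-center features inside the US boundary (assumed tags)."""
    return f"""
[out:json][timeout:{timeout_s}];
area["ISO3166-1"="US"][admin_level=2]->.us;
(
  nwr["telecom"="data_center"](area.us);
  nwr["building"="data_center"](area.us);
);
out tags center;
""".strip()


def to_sites(overpass_json: dict) -> list[dict]:
    """Reduce Overpass elements to flat records with one coordinate each."""
    sites = []
    for el in overpass_json.get("elements", []):
        # Nodes carry lat/lon directly; ways and relations carry a "center".
        pos = el if el.get("type") == "node" else el.get("center", {})
        if "lat" not in pos or "lon" not in pos:
            continue
        tags = el.get("tags", {})
        sites.append({
            "osm_id": f'{el["type"]}/{el["id"]}',
            "name": tags.get("name"),
            "operator": tags.get("operator"),
            "lat": pos["lat"],
            "lon": pos["lon"],
        })
    return sites


def fetch_sites(url: str = OVERPASS_URL) -> list[dict]:
    """POST the query to an Overpass endpoint and return parsed site records."""
    body = urlencode({"data": build_query()}).encode()
    with urlopen(url, data=body, timeout=300) as resp:
        return to_sites(json.load(resp))
```

Keeping the parsing step (`to_sites`) separate from the network call makes it possible to test the transformation against a saved Overpass response without hitting the API.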

Why OSM instead of PNNL directly

PNNL's processed atlas (DOI 10.57931/3017294) adds validated names, operators, and square-footage estimates beyond raw OSM tags — which is genuinely valuable. But the actual CSV/GPKG files are hosted behind a WAF on im3.pnnl.gov that rejects scripted access, and the OSTI/MSD-LIVE mirrors only carry stub dummy.txt files. To ingest PNNL's processed dataset programmatically we'd need to either (a) reach out to the PNNL research team for direct download access, or (b) manually download via browser and commit a snapshot.

Both are reasonable future steps. For now, going to OSM Overpass directly gets us 90% of the value with no manual or correspondence step.

Proposed-DC sourcing

OSM doesn't distinguish proposed from operating facilities. The 6 proposed sites in dc_sites.json are hand-curated from public permitting filings and news coverage. A real proposed-DC pipeline would need to combine:

This is Phase 3 of the data-coverage roadmap and not in scope for the current evening's work.

Attribution

OSM data is ODbL-licensed. Required attribution on any UI that displays this data: "© OpenStreetMap contributors (ODbL)". The map's existing Leaflet attribution control carries this string; the footer also references the data lineage.

The ODbL "share-alike" clause applies if Civitar publishes a derived database (e.g., a combined briefings + DC list export). Per-record use within the webapp does not trigger it. If we ever ship a downloadable dataset, we'd need to release it under ODbL or a compatible license.

How to refresh the DC layer

```bash
python3.13 -m scripts.ingest_osm_datacenters
```

Writes docs/data/dc_sites.json. The mockups load this file at runtime via fetch('data/dc_sites.json') — no rebuild needed.

The ingestion script prints a regional sanity check so you can verify coverage hasn't unexpectedly dropped:

```
Regional sanity check:
  Northern Virginia (Loudoun): 305
  Memphis metro: 4
  Phoenix West Valley: 14
  Dallas-Fort Worth: 62
  Greater Atlanta: 25
```

If these numbers drop dramatically vs. the previous run, OSM tagging may have changed or Overpass may have transient issues — try a different Overpass mirror (overpass.kumi.systems or overpass.osm.ch) before assuming the data has disappeared.
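The comparison and mirror fallback described above could be automated along the following lines. This is a sketch under assumptions, not part of the ingestion script: the 50% drop threshold, the function names, and the `/api/interpreter` paths on the mirror hosts are illustrative choices.

```python
import json
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import urlopen

# Hosts are the ones named in this document; the paths are assumed.
OVERPASS_ENDPOINTS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",
    "https://overpass.osm.ch/api/interpreter",
]


def flag_drops(previous: dict[str, int], current: dict[str, int],
               max_drop: float = 0.5) -> list[str]:
    """Return regions whose count fell by more than max_drop (as a fraction)."""
    flagged = []
    for region, before in previous.items():
        after = current.get(region, 0)
        if before > 0 and (before - after) / before > max_drop:
            flagged.append(region)
    return flagged


def query_with_fallback(query: str, endpoints=OVERPASS_ENDPOINTS,
                        timeout_s: int = 300) -> dict:
    """Try each Overpass endpoint in order and return the first JSON response."""
    last_error = None
    for url in endpoints:
        try:
            body = urlencode({"data": query}).encode()
            with urlopen(url, data=body, timeout=timeout_s) as resp:
                return json.load(resp)
        except (URLError, TimeoutError, json.JSONDecodeError) as exc:
            last_error = exc  # remember the failure and move to the next mirror
    raise RuntimeError("all Overpass endpoints failed") from last_error
```

If `flag_drops` returns a non-empty list, rerunning the query through `query_with_fallback` against a different mirror helps separate a transient Overpass problem from a real change in OSM tagging.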