Overview
This pipeline is designed for targeted thematic extraction from OpenStreetMap, especially for early-stage infrastructure, environmental, and access-screening work. It assumes you are pulling specific feature classes rather than mirroring an entire country extract.
Why it matters
The failure mode in many OSM workflows is not the query itself. It is everything after the query:
- raw files saved with inconsistent names
- mixed geometries pushed into the same layer
- duplicates left unresolved
- no record of which tags were treated as valid
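The pipeline addresses these problems by splitting the work into separate stages (request, extract, clean, dedupe, split, export), so each one can be checked and rerun on its own.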
When to use
Use this pattern when:
- you expect to rerun the extraction later
- multiple areas or feature themes will follow the same logic
- you need clean handoff to web maps, analysts, or field teams
Inputs
- area definition such as bbox, polygon, or admin relation
- feature theme and tag strategy
- query text and extraction date
- expected output schema
- destination products such as GeoJSON, GeoPackage, CSV, or vector tiles
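One way to keep these inputs together is a single run configuration object that is checked before any request is sent. The sketch below is illustrative only: the field names, the `validateRunConfig` helper, and the placeholder bbox are assumptions, not part of the pipeline as described.

```javascript
// Hypothetical run configuration. Field names and values are assumptions
// chosen for illustration; the bbox is a placeholder, not a real study area.
const runConfig = {
  area: { type: "bbox", value: [0, 0, 1, 1] },
  theme: "schools-clinics",
  tags: [{ key: "amenity", values: ["school", "clinic"] }],
  queryFile: "queries/schools_clinics.overpassql",
  extractDate: "2026-03-19",
  schema: ["osm_id", "name", "feature_class", "source_date"],
  outputs: ["geojson", "gpkg"],
};

// Returns a list of human-readable problems; an empty list means the
// configuration has every input the pipeline expects.
function validateRunConfig(config) {
  const problems = [];
  const areaTypes = ["bbox", "polygon", "relation"];
  if (!config.area || !areaTypes.includes(config.area.type)) {
    problems.push("area must be a bbox, polygon, or relation");
  }
  if (!config.theme) problems.push("theme is missing");
  if (!Array.isArray(config.tags) || config.tags.length === 0) {
    problems.push("tag strategy is missing");
  }
  if (!config.queryFile) problems.push("query file is missing");
  if (!/^\d{4}-\d{2}-\d{2}$/.test(config.extractDate || "")) {
    problems.push("extractDate must be YYYY-MM-DD");
  }
  if (!Array.isArray(config.schema) || config.schema.length === 0) {
    problems.push("expected output schema is missing");
  }
  if (!Array.isArray(config.outputs) || config.outputs.length === 0) {
    problems.push("no destination products listed");
  }
  return problems;
}
```

Failing early on an incomplete configuration is cheaper than discovering a missing extraction date after the raw file has already been handed off.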
Workflow and method
- Request: Draft the Overpass query and save the query text alongside the output.
- Extract: Export the raw JSON or GeoJSON exactly as returned.
- Clean: Standardize field names, promote essential tags, and remove clearly broken records.
- Dedupe: Resolve obvious duplicates across nodes, ways, and relations or from overlapping pulls.
- Split: Separate outputs by geometry type or analytical class where needed.
- Export: Produce final layers for mapping, analysis, and archival reuse.
Suggested file naming
Use names that encode geography, source, theme, stage, and date.
moz_temane_osm_schools-clinics_raw_2026-03-19.geojson
moz_temane_osm_schools-clinics_clean_2026-03-19.geojson
moz_temane_osm_schools-clinics_final_2026-03-19.gpkg
That naming pattern makes it obvious which file is disposable and which file is a handoff artifact.
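The naming pattern is easy to enforce with a small helper. This is a minimal sketch: the `buildFileName` function and its parameter names are hypothetical, and the stage list simply mirrors the three stages used in the example names above.

```javascript
// Stages taken from the example file names: raw, clean, final.
const STAGES = ["raw", "clean", "final"];

// Builds a name of the form geography_source_theme_stage_date.ext.
// Parts may contain hyphens but not underscores, so the name can be
// split back into its components later.
function buildFileName({ country, area, source, theme, stage, date, ext }) {
  for (const part of [country, area, source, theme]) {
    if (!/^[a-z0-9-]+$/.test(part || "")) {
      throw new Error(`Invalid name part: ${part}`);
    }
  }
  if (!STAGES.includes(stage)) {
    throw new Error(`Unknown stage: ${stage}`);
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date || "")) {
    throw new Error(`Date must be YYYY-MM-DD: ${date}`);
  }
  return `${[country, area, source, theme, stage, date].join("_")}.${ext}`;
}
```

Called with `country: "moz"`, `area: "temane"`, `source: "osm"`, `theme: "schools-clinics"`, `stage: "raw"`, `date: "2026-03-19"`, and `ext: "geojson"`, it reproduces the first example name.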
Pipeline logic
Request
Store the query text in version control or beside the output folder. The query is part of the data provenance.
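If the query cannot live in version control, a sidecar file next to the raw export can carry the same provenance. The sketch below assumes a JSON sidecar; the `buildRunRecord` and `sidecarPath` helpers, the `.query.json` suffix, and the record field names are all hypothetical.

```javascript
// Builds a provenance record that pairs the exact query text with the
// extraction date and the raw file it produced. Field names are assumptions.
function buildRunRecord({ queryFile, queryText, extractedAt, rawOutput }) {
  if (!queryText || !queryText.trim()) {
    throw new Error("Query text is empty; nothing to record");
  }
  return {
    query_file: queryFile,
    query_text: queryText,
    extracted_at: extractedAt,
    raw_output: rawOutput,
  };
}

// Derives the sidecar path by swapping the output's extension for
// ".query.json", so the record sorts next to the file it describes.
function sidecarPath(outputPath) {
  return outputPath.replace(/\.[^./]+$/, "") + ".query.json";
}
```

Serializing the record with `JSON.stringify(record, null, 2)` and writing it to `sidecarPath(rawOutput)` keeps the query and the data together even when the folder is copied elsewhere.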
Extract
Keep one untouched raw export. If you need to rerun cleaning logic later, that file is the baseline.
Clean
Typical cleaning tasks:
- flatten or select tags into stable fields
- add a feature_class
- add a source_date
- standardize geometry validity
- remove empty-name records only if the use case actually requires names
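A cleaning pass along these lines might look like the sketch below. It assumes the raw file is a GeoJSON FeatureCollection whose feature properties hold the OSM tags directly and whose feature `id` carries the OSM identifier; the `classify` rules, the output field names other than `feature_class` and `source_date`, and the function names are illustrative assumptions.

```javascript
// Maps OSM tags to a feature_class. The two rules here are examples for a
// schools-and-clinics theme; a real tag strategy would be project-specific.
function classify(tags) {
  if (tags.amenity === "school") return "school";
  if (tags.amenity === "clinic") return "clinic";
  return null;
}

// Flattens one raw feature into the stable schema. Returns null for records
// that cannot be classified or have no geometry (treated as clearly broken).
function normalizeFeature(feature, sourceDate) {
  const tags = feature.properties || {};
  const featureClass = classify(tags);
  if (featureClass === null || !feature.geometry) return null;
  const name = typeof tags.name === "string" ? tags.name.trim() : "";
  return {
    type: "Feature",
    geometry: feature.geometry,
    properties: {
      osm_id: String(feature.id ?? ""),
      name: name || null, // unnamed records are kept, not dropped
      feature_class: featureClass,
      source_date: sourceDate,
    },
  };
}

function normalizeCollection(collection, sourceDate) {
  const features = collection.features
    .map((feature) => normalizeFeature(feature, sourceDate))
    .filter((feature) => feature !== null);
  return { type: "FeatureCollection", features };
}
```

Unnamed records are kept with a null name, so dropping them remains an explicit, use-case-specific decision rather than a side effect of cleaning. Geometry validity checks are not covered here and would need a geometry library.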
Dedupe
Check for:
- repeated features returned as node and way variants
- identical coordinates and names
- overlapping results from adjacent bbox pulls
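These three checks can be approximated in one pass. The sketch below is a crude heuristic, not a definitive rule set: it assumes features already carry `osm_id`, `name`, and `feature_class` properties (the first two are hypothetical field names), treats a shared name, class, and rounded coordinate as a duplicate, and prefers a polygon over a point when both describe the same place. The rounding precision and the preference order are assumptions to tune per project.

```javascript
// Returns a rough [x, y] for comparison: the point itself, or the mean of
// the outer-ring vertices for a polygon. Other geometry types return null.
function representativePoint(geometry) {
  if (geometry.type === "Point") return geometry.coordinates;
  if (geometry.type === "Polygon") {
    const ring = geometry.coordinates[0].slice(0, -1); // drop closing vertex
    const sum = ring.reduce((acc, [x, y]) => [acc[0] + x, acc[1] + y], [0, 0]);
    return [sum[0] / ring.length, sum[1] / ring.length];
  }
  return null;
}

function dedupe(features, precision = 3) {
  const seenIds = new Set();
  const indexByKey = new Map();
  const kept = [];
  for (const feature of features) {
    const { osm_id: id, name, feature_class: cls } = feature.properties;
    // Same OSM element returned twice, e.g. from overlapping bbox pulls.
    if (id && seenIds.has(id)) continue;
    if (id) seenIds.add(id);
    const point = representativePoint(feature.geometry);
    // Unnamed or unsupported geometries are never merged by this heuristic.
    if (!name || !point) {
      kept.push(feature);
      continue;
    }
    const key = [cls, name.toLowerCase(), point[0].toFixed(precision), point[1].toFixed(precision)].join("|");
    const existingIndex = indexByKey.get(key);
    if (existingIndex === undefined) {
      indexByKey.set(key, kept.length);
      kept.push(feature);
    } else if (kept[existingIndex].geometry.type === "Point" && feature.geometry.type !== "Point") {
      kept[existingIndex] = feature; // polygon replaces its node twin
    }
  }
  return kept;
}
```

Rounding to a grid can miss near-duplicates that straddle a cell boundary, so a visual check of the mapped result is still worthwhile. Recording the precision and the preference rule in the QA notes keeps the dedupe step reproducible.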
Split
Split outputs when different downstream consumers need different packages, for example:
- points for receptor screening
- polygons for settlement or protected-area context
- simplified web-map layers
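Splitting by geometry type is the simplest case and can be sketched as below. The bucket names and the `splitByGeometry` function are hypothetical; splitting by analytical class would follow the same shape with a different key.

```javascript
// Maps each GeoJSON geometry type to an output bucket.
const BUCKET_FOR_TYPE = {
  Point: "points",
  MultiPoint: "points",
  LineString: "lines",
  MultiLineString: "lines",
  Polygon: "polygons",
  MultiPolygon: "polygons",
};

// Returns one FeatureCollection per bucket. Anything unrecognized (missing
// geometry, GeometryCollection) lands in "other" so it is reviewed rather
// than silently dropped.
function splitByGeometry(features) {
  const groups = { points: [], lines: [], polygons: [], other: [] };
  for (const feature of features) {
    const type = feature.geometry ? feature.geometry.type : null;
    const bucket = BUCKET_FOR_TYPE[type] || "other";
    groups[bucket].push(feature);
  }
  const collections = {};
  for (const [bucket, items] of Object.entries(groups)) {
    collections[bucket] = { type: "FeatureCollection", features: items };
  }
  return collections;
}
```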
Export
Produce one clean analytical layer and one publish-ready layer when the styling or schema requirements differ.
Tools and suggested script layout
scripts/
  overpass/
    fetch_osm_overpass.sh
    normalize_osm_geojson.js
    dedupe_osm_features.js
    export_osm_packages.js
data/
  raw/
  interim/
  processed/
queries/
  schools_clinics.overpassql
Recommended command flow:
./scripts/overpass/fetch_osm_overpass.sh queries/schools_clinics.overpassql
node ./scripts/overpass/normalize_osm_geojson.js
node ./scripts/overpass/dedupe_osm_features.js
node ./scripts/overpass/export_osm_packages.js
QA checks
Run QA before you publish or hand off:
- Compare counts by feature class against expectation.
- Inspect a sample of records with and without names.
- Check geometry validity and unexpected geometry types.
- Map the result and look for obvious duplicate clusters.
- Review the area boundary to confirm the pull actually covers the intended geography.
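The count, name, and geometry-type checks are straightforward to automate; the map-based checks are not and still need a person. The sketch below assumes cleaned features with `feature_class`, `name`, and `osm_id` properties (the last two are hypothetical field names), and the expected ranges passed to `compareCounts` would come from local knowledge, not from this code.

```javascript
// Summarizes a cleaned layer: counts by class, named vs unnamed records,
// and any features whose geometry type is not in the allowed list.
function summarize(features, allowedGeometryTypes = ["Point", "Polygon", "MultiPolygon"]) {
  const summary = { total: 0, byClass: {}, named: 0, unnamed: 0, unexpectedGeometry: [] };
  for (const feature of features) {
    summary.total += 1;
    const cls = feature.properties.feature_class || "unclassified";
    summary.byClass[cls] = (summary.byClass[cls] || 0) + 1;
    if (feature.properties.name) summary.named += 1;
    else summary.unnamed += 1;
    const type = feature.geometry ? feature.geometry.type : "none";
    if (!allowedGeometryTypes.includes(type)) {
      summary.unexpectedGeometry.push({ osm_id: feature.properties.osm_id, type });
    }
  }
  return summary;
}

// Compares class counts against expected [min, max] ranges and returns
// a warning string for each class that falls outside its range.
function compareCounts(summary, expected) {
  const warnings = [];
  for (const [cls, [min, max]] of Object.entries(expected)) {
    const count = summary.byClass[cls] || 0;
    if (count < min || count > max) {
      warnings.push(`${cls}: ${count} outside expected range ${min}-${max}`);
    }
  }
  return warnings;
}
```

Writing the summary and any warnings into the run log gives the handoff a record of what was checked, not just what was produced.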
Outputs
The pipeline should end with:
- one or more final cleaned layers
- a saved query file
- a short run log or README note
- QA notes on exclusions, dedupe rules, and known weak coverage
Limitations
This pipeline improves reliability, but it does not solve sparse mapping, uncertain place names, or project-specific interpretation problems. Those still require method choices and sometimes field validation.