OSM extraction pipeline | OpenGeo.tools

Overview

This pipeline is designed for targeted thematic extraction from OpenStreetMap, especially for early-stage infrastructure, environmental, and access-screening work. It assumes you are pulling specific feature classes rather than mirroring an entire country extract.

Why it matters

The failure mode in many OSM workflows is not the query itself. It is everything after the query:

raw files saved with inconsistent names
mixed geometries pushed into the same layer
duplicates left unresolved
no record of which tags were treated as valid

The pipeline addresses this by separating the work into discrete stages, each with its own saved output.

When to use

Use this pattern when:

you expect to rerun the extraction later
multiple areas or feature themes will follow the same logic
you need clean handoff to web maps, analysts, or field teams

Inputs

area definition such as bbox, polygon, or admin relation
feature theme and tag strategy
query text and extraction date
expected output schema
destination products such as GeoJSON, GeoPackage, CSV, or vector tiles

Workflow and method

Request: Draft the Overpass query and save the query text alongside the output.
Extract: Export the raw JSON or GeoJSON exactly as returned.
Clean: Standardize field names, promote essential tags, and remove clearly broken records.
Dedupe: Resolve obvious duplicates across nodes, ways, and relations or from overlapping pulls.
Split: Separate outputs by geometry type or analytical class where needed.
Export: Produce final layers for mapping, analysis, and archival reuse.

Suggested file naming

Use names that encode geography, source, theme, stage, and date.

moz_temane_osm_schools-clinics_raw_2026-03-19.geojson
moz_temane_osm_schools-clinics_clean_2026-03-19.geojson
moz_temane_osm_schools-clinics_final_2026-03-19.gpkg

That naming pattern makes it obvious which file is the untouched baseline, which is an intermediate that can be regenerated, and which is a handoff artifact.
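
As a minimal sketch, the naming convention could be enforced by a small helper rather than typed by hand. The function name buildFileName, its parameter names, and the list of allowed stages are assumptions for illustration, inferred from the three example names rather than taken from any published tool.

```javascript
// Hypothetical helper: assembles a file name from the five parts named above.
// The allowed stages mirror the example file names; adjust them to your own stages.
const ALLOWED_STAGES = ["raw", "clean", "final"];

function buildFileName({ geography, source, theme, stage, date, ext }) {
  if (!ALLOWED_STAGES.includes(stage)) {
    throw new Error(`Unknown stage: ${stage}`);
  }
  // Require an ISO-style date so files sort chronologically by name.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error(`Date must be YYYY-MM-DD: ${date}`);
  }
  return `${geography}_${source}_${theme}_${stage}_${date}.${ext}`;
}

const name = buildFileName({
  geography: "moz_temane",
  source: "osm",
  theme: "schools-clinics",
  stage: "raw",
  date: "2026-03-19",
  ext: "geojson",
});
console.log(name); // moz_temane_osm_schools-clinics_raw_2026-03-19.geojson
```

Rejecting unknown stages and malformed dates up front keeps ad hoc names out of the data folders.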

Pipeline logic

Request

Store the query text in version control or beside the output folder. The query is part of the data provenance.
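
One way to tie an output to its query, sketched here under assumptions, is to write a small provenance record next to the raw export. The function name buildProvenance and the field names are hypothetical, and the bounding box in the sample query is a placeholder rather than a real project extent. The hash lets you confirm later that the saved query file is the one that produced the data.

```javascript
const crypto = require("node:crypto");

// Hypothetical sidecar record; the field names are illustrative, not a standard.
function buildProvenance(queryText, queryFile, extractionDate) {
  return {
    query_file: queryFile,
    // SHA-256 of the exact query text, as a hex string.
    query_sha256: crypto.createHash("sha256").update(queryText, "utf8").digest("hex"),
    extraction_date: extractionDate,
  };
}

// Placeholder Overpass QL; the bbox order is (south, west, north, east).
const queryText = '[out:json];node["amenity"="school"](-21.9,34.9,-21.5,35.3);out;';
const record = buildProvenance(queryText, "queries/schools_clinics.overpassql", "2026-03-19");
console.log(JSON.stringify(record, null, 2));
```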

Extract

Keep one untouched raw export. If you need to rerun cleaning logic later, that file is the baseline.

Clean

Typical cleaning tasks:

flatten or select tags into stable fields
add a feature_class
add a source_date
check geometry validity and repair or drop invalid geometries
remove empty-name records only if the use case actually requires names
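
The cleaning tasks above might look like the following sketch. It assumes each raw feature carries its OSM tags directly in properties and an identifier such as "node/1" in id. That shape depends on the converter you use, so treat it as an assumption. CLASS_RULES, normalizeFeature, and normalizeCollection are hypothetical names. Geometry validity checks are left out because they need a geometry library.

```javascript
// Assumed tag-to-class mapping; replace it with the tag strategy for your theme.
const CLASS_RULES = [
  { key: "amenity", value: "school", featureClass: "school" },
  { key: "amenity", value: "clinic", featureClass: "clinic" },
];

function classify(tags) {
  const rule = CLASS_RULES.find((r) => tags[r.key] === r.value);
  return rule ? rule.featureClass : null;
}

// Returns a feature with a stable schema, or null for records to drop.
function normalizeFeature(feature, sourceDate) {
  const tags = feature.properties || {};
  const featureClass = classify(tags);
  // Drop records with no geometry or no matching class.
  if (!feature.geometry || featureClass === null) return null;
  const hasName = typeof tags.name === "string" && tags.name.trim() !== "";
  return {
    type: "Feature",
    geometry: feature.geometry,
    properties: {
      osm_id: feature.id ?? null,
      name: hasName ? tags.name.trim() : null, // unnamed records are kept
      feature_class: featureClass,
      source_date: sourceDate,
    },
  };
}

function normalizeCollection(collection, sourceDate) {
  const features = collection.features
    .map((f) => normalizeFeature(f, sourceDate))
    .filter((f) => f !== null);
  return { type: "FeatureCollection", features };
}

const raw = {
  type: "FeatureCollection",
  features: [
    { type: "Feature", id: "node/1", geometry: { type: "Point", coordinates: [35.1, -21.7] },
      properties: { amenity: "school", name: " Escola A " } },
    { type: "Feature", id: "node/2", geometry: null, properties: { amenity: "clinic" } },
    { type: "Feature", id: "node/3", geometry: { type: "Point", coordinates: [35.2, -21.6] },
      properties: { amenity: "clinic" } },
  ],
};
const cleaned = normalizeCollection(raw, "2026-03-19");
console.log(cleaned.features.map((f) => f.properties));
```

Keeping the mapping in one table means the record of which tags were treated as valid lives in code that can be versioned with the query.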

Dedupe

Check for:

repeated features returned as node and way variants
identical coordinates and names
overlapping results from adjacent bbox pulls
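
Two of these checks can be sketched with simple keys: a repeated OSM identifier catches overlap between adjacent pulls, and a name plus rounded coordinate catches identical points. The sketch assumes cleaned features with osm_id and name properties. The function name dedupeFeatures and the 5-decimal rounding (roughly one metre) are illustrative choices. Matching a node against a way that represents the same feature needs a spatial test and is not covered here.

```javascript
// Keeps the first occurrence of each feature and drops later duplicates.
function dedupeFeatures(features, precision = 5) {
  const seenIds = new Set();
  const seenKeys = new Set();
  const kept = [];
  for (const f of features) {
    const props = f.properties || {};
    const id = props.osm_id;
    // Same OSM element returned by two overlapping pulls.
    if (id != null && seenIds.has(id)) continue;
    // Name-plus-location key, built only for named points.
    let key = null;
    if (f.geometry && f.geometry.type === "Point" && typeof props.name === "string") {
      const [lon, lat] = f.geometry.coordinates;
      key = `${props.name.toLowerCase()}|${lon.toFixed(precision)}|${lat.toFixed(precision)}`;
    }
    if (key !== null && seenKeys.has(key)) continue;
    if (id != null) seenIds.add(id);
    if (key !== null) seenKeys.add(key);
    kept.push(f);
  }
  return kept;
}

const point = (id, name, lon, lat) => ({
  type: "Feature",
  geometry: { type: "Point", coordinates: [lon, lat] },
  properties: { osm_id: id, name },
});

const deduped = dedupeFeatures([
  point("node/1", "Escola A", 35.1, -21.7),
  point("node/1", "Escola A", 35.1, -21.7), // repeated by an overlapping pull
  point("node/2", "escola a", 35.1, -21.7), // same name and location, different id
  point("node/3", "Posto B", 35.1, -21.7), // same location, different name: kept
]);
console.log(deduped.map((f) => f.properties.osm_id)); // [ 'node/1', 'node/3' ]
```

Record whichever rules you settle on in the QA notes, since a rule that is too aggressive can silently merge distinct facilities.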

Split

Split outputs when different downstream consumers need different packages, for example:

points for receptor screening
polygons for settlement or protected-area context
simplified web-map layers
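
A split by geometry type could be sketched as follows. The grouping into points, lines, and polygons and the name splitByGeometry are assumptions for illustration. Splitting by analytical class would follow the same pattern with a different key.

```javascript
// Maps each GeoJSON geometry type to an output layer family.
const FAMILY = {
  Point: "points",
  MultiPoint: "points",
  LineString: "lines",
  MultiLineString: "lines",
  Polygon: "polygons",
  MultiPolygon: "polygons",
};

// Returns an object of FeatureCollections keyed by family name.
function splitByGeometry(collection) {
  const layers = {};
  for (const f of collection.features) {
    // Null geometries and GeometryCollections land in "other" for review.
    const family = (f.geometry && FAMILY[f.geometry.type]) || "other";
    if (!layers[family]) {
      layers[family] = { type: "FeatureCollection", features: [] };
    }
    layers[family].features.push(f);
  }
  return layers;
}

const mixed = {
  type: "FeatureCollection",
  features: [
    { type: "Feature", geometry: { type: "Point", coordinates: [35.1, -21.7] }, properties: {} },
    { type: "Feature", geometry: { type: "Polygon",
        coordinates: [[[35, -21], [35.1, -21], [35.1, -21.1], [35, -21]]] }, properties: {} },
    { type: "Feature", geometry: null, properties: {} },
  ],
};
const layers = splitByGeometry(mixed);
console.log(Object.keys(layers)); // [ 'points', 'polygons', 'other' ]
```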

Export

Produce one clean analytical layer and one publish-ready layer when the styling or schema requirements differ.

Tools and suggested script layout

scripts/
  overpass/
    fetch_osm_overpass.sh
    normalize_osm_geojson.js
    dedupe_osm_features.js
    export_osm_packages.js
data/
  raw/
  interim/
  processed/
queries/
  schools_clinics.overpassql

Recommended command flow:

./scripts/overpass/fetch_osm_overpass.sh queries/schools_clinics.overpassql
node ./scripts/overpass/normalize_osm_geojson.js
node ./scripts/overpass/dedupe_osm_features.js
node ./scripts/overpass/export_osm_packages.js

QA checks

Run QA before you publish or hand off:

Compare counts by feature class against expectation.
Inspect a sample of records with and without names.
Check geometry validity and unexpected geometry types.
Map the result and look for obvious duplicate clusters.
Review the area boundary to confirm the pull actually covers the intended geography.
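
The count-based checks can be scripted; the visual checks cannot. The sketch below assumes cleaned features with feature_class and name properties, and the expected ranges are placeholders you would set per project. The function names summarize and checkExpected are hypothetical. Geometry validity and map inspection still need a geometry library or a manual look.

```javascript
// Tallies features by class, geometry type, and presence of a name.
function summarize(collection) {
  const summary = { total: 0, byClass: {}, byGeometry: {}, named: 0, unnamed: 0 };
  for (const f of collection.features) {
    const props = f.properties || {};
    const cls = props.feature_class || "unclassified";
    const geom = f.geometry ? f.geometry.type : "none";
    summary.total += 1;
    summary.byClass[cls] = (summary.byClass[cls] || 0) + 1;
    summary.byGeometry[geom] = (summary.byGeometry[geom] || 0) + 1;
    if (props.name) summary.named += 1;
    else summary.unnamed += 1;
  }
  return summary;
}

// Returns one warning per class whose count falls outside its expected range.
function checkExpected(summary, expected) {
  const warnings = [];
  for (const [cls, range] of Object.entries(expected)) {
    const count = summary.byClass[cls] || 0;
    if (count < range.min || count > range.max) {
      warnings.push(`${cls}: ${count} outside expected ${range.min}-${range.max}`);
    }
  }
  return warnings;
}

const pt = (cls, name) => ({
  type: "Feature",
  geometry: { type: "Point", coordinates: [35.1, -21.7] },
  properties: { feature_class: cls, name },
});
const sample = {
  type: "FeatureCollection",
  features: [pt("school", "Escola A"), pt("school", null), pt("clinic", "Posto B")],
};
const summary = summarize(sample);
// Placeholder expectations, not real figures.
const warnings = checkExpected(summary, {
  school: { min: 1, max: 5 },
  clinic: { min: 2, max: 10 },
});
console.log(summary);
console.log(warnings); // [ 'clinic: 1 outside expected 2-10' ]
```

The summary object can be pasted into the run log so later reruns have a baseline to compare against.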

Outputs

The pipeline should end with:

one or more final cleaned layers
a saved query file
a short run log or README note
QA notes on exclusions, dedupe rules, and known weak coverage

Limitations

This pipeline improves reliability, but it does not solve sparse mapping, uncertain place names, or project-specific interpretation problems. Those still require method choices and sometimes field validation.