Overview
This pipeline is for situations where multiple GeoJSON files describe the same class of features but differ in schema, geography, or timing. The objective is not just to merge them. It is to merge them without losing the ability to explain where a record came from and why one duplicate was kept over another.
Why it matters
- Overlapping extractions create duplicate features fast.
- Attribute names drift between files.
- Small schema mismatches become large downstream problems.
- Web maps and analyses both degrade when duplicates are left unresolved.
When to use
Use this pattern when:
- stitching together adjacent bounding-box (bbox) pulls
- combining multiple consultant or project deliveries
- assembling one publish layer from several source runs
- preparing a web-map package from messy interim files
Inputs
- two or more GeoJSON files
- a target schema
- rules for duplicate detection
- optional source priority order
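One way to capture these inputs as data is a small configuration object. Every field name and value below is a hypothetical example, not a fixed format:

```javascript
// Hypothetical merge config: file names, aliases, and rule shapes are illustrative.
const mergeConfig = {
  targetSchema: ["id", "name", "type", "source", "source_date"],
  // Rename drifted source field names onto the target schema.
  fieldAliases: { NAME: "name", facility_name: "name", TYPE: "type" },
  // Duplicate rules, applied in order.
  duplicateRules: [
    { kind: "same-id", field: "id" },
    { kind: "proximity", thresholdMeters: 25 },
  ],
  // Earlier entries win when duplicates conflict.
  sourcePriority: ["survey_2024.geojson", "consultant_a.geojson"],
};

// Priority rank for a source file: lower is preferred,
// unknown sources rank last.
function priorityRank(source) {
  const i = mergeConfig.sourcePriority.indexOf(source);
  return i === -1 ? mergeConfig.sourcePriority.length : i;
}
```

Keeping the rules in one object like this makes the run log easy to produce later: the config itself documents which rules were applied.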
Workflow and method
1. Inventory: record the input files, dates, and schemas.
2. Normalize: standardize field names and geometry types.
3. Append: combine inputs into one staged layer while preserving source metadata.
4. Detect duplicates: use identifier, geometry, name, or proximity logic.
5. Resolve duplicates: keep, merge, or flag records based on a documented rule.
6. Export: produce a final layer and a QA table of what changed.
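The Normalize and Append steps can be sketched in plain JavaScript. The `fieldAliases` map and the `_source` property name are illustrative choices here, not a fixed convention:

```javascript
// Sketch of Normalize + Append: rename drifted fields and tag every
// feature with its source file before staging.
const fieldAliases = { NAME: "name", facility_name: "name" };

function normalizeFeature(feature, sourceFile) {
  const props = {};
  for (const [key, value] of Object.entries(feature.properties || {})) {
    props[fieldAliases[key] || key] = value;
  }
  props._source = sourceFile; // preserve provenance for later QA
  return { ...feature, properties: props };
}

function appendSources(files) {
  // files: [{ name, geojson }] -> one staged FeatureCollection
  const features = files.flatMap(({ name, geojson }) =>
    geojson.features.map((f) => normalizeFeature(f, name))
  );
  return { type: "FeatureCollection", features };
}
```

Tagging each feature with its source at append time is what later lets the QA table explain where every surviving record came from.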
Duplicate logic options
| Rule type | Useful when | Risk |
|---|---|---|
| Same source ID | IDs are stable across files | Fails when IDs changed or were dropped |
| Same normalized name + near-identical geometry | Feature labels are reliable | Can collapse legitimately distinct nearby sites |
| Spatial proximity threshold | Point data is loosely consistent | Can over-merge dense clusters |
| Source priority | One source is clearly authoritative | Can hide useful attributes from lower-priority sources |
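A minimal sketch of the proximity rule for point features, assuming WGS84 `[lon, lat]` coordinates. The naive O(n²) scan is fine for small layers; a spatial index would be needed at scale:

```javascript
// Great-circle distance between two [lon, lat] points, in meters.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const R = 6371000; // mean Earth radius in meters
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Flag index pairs of point features closer than thresholdMeters.
function proximityPairs(features, thresholdMeters) {
  const pairs = [];
  for (let i = 0; i < features.length; i++) {
    for (let j = i + 1; j < features.length; j++) {
      const d = haversineMeters(
        features[i].geometry.coordinates,
        features[j].geometry.coordinates
      );
      if (d <= thresholdMeters) pairs.push([i, j, Math.round(d)]);
    }
  }
  return pairs;
}
```

Emitting candidate pairs rather than merging immediately keeps detection separate from resolution, so the over-merge risk in dense clusters can be reviewed before anything is dropped.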
Suggested file layout
```
data/
  raw/
  staged/
  processed/
scripts/
  merge/
    normalize_geojson.js
    detect_duplicates.js
    resolve_duplicates.js
    export_final_geojson.js
logs/
```
QA checks
- Compare feature counts before and after dedupe.
- Inspect clusters where more than one record was collapsed.
- Confirm geometry type did not drift unexpectedly.
- Spot-check names and source fields in the final output.
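The first two checks can be automated with a small summary helper. This sketch assumes the resolve step stamps each kept feature with a hypothetical `_merged_from` array of source record ids; the property name is illustrative:

```javascript
// QA sketch: compare counts before/after dedupe and list groups
// where more than one record was collapsed into a single feature.
function qaSummary(stagedCount, finalFeatures) {
  const collapsed = finalFeatures.filter(
    (f) => (f.properties._merged_from || []).length > 1
  );
  return {
    stagedCount,
    finalCount: finalFeatures.length,
    removed: stagedCount - finalFeatures.length,
    clustersToReview: collapsed.map((f) => f.properties._merged_from),
  };
}
```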
Outputs
Expected outputs:
- merged staged layer
- final deduplicated layer
- duplicate review table
- short run log describing rules applied
Limitations
No universal duplicate rule works across every layer. Dense settlement points, facilities with shared campuses, and mixed point/polygon representations often require manual review or class-specific logic.