Overview

This pipeline is for situations where multiple GeoJSON files describe the same class of features but differ in schema, geography, or timing. The objective is not just to merge them. It is to merge them without losing the ability to explain where a record came from and why one duplicate was kept over another.

Why it matters

A merged layer that cannot explain its own provenance is difficult to audit. When someone asks why a record was kept, changed, or dropped, the answer should come from the pipeline's QA table and logs, not from memory.

When to use

Use this pattern when:
  - multiple GeoJSON files describe the same class of features;
  - the files differ in schema, geography, or collection timing; and
  - every record in the final layer must be traceable to its source, including why one duplicate was kept over another.

Inputs

Two or more GeoJSON files describing the same class of features, along with whatever is known about each file's collection date and schema (captured during the inventory step).

Workflow and method

  1. Inventory: Record the input files, dates, and schemas.
  2. Normalize: Standardize field names and geometry types.
  3. Append: Combine inputs into one staged layer while preserving source metadata.
  4. Detect duplicates: Use identifier, geometry, name, or proximity logic.
  5. Resolve duplicates: Keep, merge, or flag records based on a documented rule.
  6. Export: Produce a final layer and a QA table of what changed.
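
The normalize and append steps (2–3) can be sketched as a small field-alias pass that also stamps provenance onto each feature. The alias map and property names below are illustrative assumptions, not the pipeline's actual schema:

```javascript
// Sketch of normalize + provenance stamping. The alias map and the
// canonical names (facility_name, source_id) are illustrative only.
const fieldAliases = {
  facility_name: ["name", "NAME", "site_name"],
  source_id: ["id", "ID", "uid"],
};

function normalizeFeature(feature, sourceFile) {
  const props = feature.properties || {};
  const out = {};
  for (const [canonical, aliases] of Object.entries(fieldAliases)) {
    for (const alias of [canonical, ...aliases]) {
      if (alias in props) {
        out[canonical] = props[alias];
        break; // first matching alias wins
      }
    }
  }
  // Preserve provenance so the merge stays explainable (step 3).
  out._source_file = sourceFile;
  return { ...feature, properties: out };
}

const f = {
  type: "Feature",
  properties: { NAME: "Clinic A", id: "17" },
  geometry: { type: "Point", coordinates: [30.1, -1.95] },
};
console.log(normalizeFeature(f, "raw/health_2021.geojson").properties);
// → { facility_name: 'Clinic A', source_id: '17', _source_file: 'raw/health_2021.geojson' }
```

Stamping the source file at normalize time is what makes the later QA table possible: every downstream decision can cite where each record entered the pipeline.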

Duplicate logic options

| Rule type | Useful when | Risk |
| --- | --- | --- |
| Same source ID | IDs are stable across files | Fails when IDs changed or were dropped |
| Same normalized name + near-identical geometry | Feature labels are reliable | Can collapse legitimately distinct nearby sites |
| Spatial proximity threshold | Point data is loosely consistent | Can over-merge dense clusters |
| Source priority | One source is clearly authoritative | Can hide useful attributes from lower-priority sources |
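
As a sketch of the spatial-proximity rule, a naive pairwise pass with a haversine distance is enough for small point layers. The function names and the 50 m default threshold are illustrative assumptions, not recommendations:

```javascript
// Great-circle distance between two [lon, lat] points, in meters.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const R = 6371000; // mean Earth radius in meters
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Naive O(n^2) pass over point features; returns candidate duplicate
// pairs for the resolve step, rather than deleting anything outright.
function findProximityPairs(features, thresholdMeters = 50) {
  const pairs = [];
  for (let i = 0; i < features.length; i++) {
    for (let j = i + 1; j < features.length; j++) {
      const d = haversineMeters(
        features[i].geometry.coordinates,
        features[j].geometry.coordinates
      );
      if (d <= thresholdMeters) pairs.push({ i, j, meters: d });
    }
  }
  return pairs;
}
```

A quadratic scan is fine for a few thousand points; beyond that a spatial index (grid or R-tree) is needed, and dense clusters will over-merge exactly as the table warns, which is why the output is candidate pairs rather than automatic deletions.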

Suggested file layout

data/
  raw/
  staged/
  processed/
scripts/
  merge/
    normalize_geojson.js
    detect_duplicates.js
    resolve_duplicates.js
    export_final_geojson.js
logs/

QA checks

  - Record counts reconcile: the staged count equals the sum of input counts, and the final count equals the staged count minus resolved duplicates.
  - Every merged, dropped, or flagged record appears in the QA table with the rule that was applied.
  - Source metadata survives normalization and append for every final record.

Outputs

Expected outputs:

  - A final merged layer in data/processed/.
  - A QA table listing each duplicate decision and the rule behind it.
  - Logs from each script run in logs/.

Limitations

No universal duplicate rule works across every layer. Dense settlement points, facilities with shared campuses, and mixed point/polygon representations often require manual review or class-specific logic.