1. Use Case & Goals
Allow users to spatially subset large datasets (HRRR, GFS, etc.) by:
Bounding box (lat/lon extent)
Point extraction (single location timeseries)
(Future: polygon masks, shapefiles, etc.)
This reduces download size, speeds up visualization, and aligns with cloud-efficient workflows.
2. Placement in Pipeline
New processor module:
src/zyra/processing/subset_processor.py
Exposed in CLI as:
zyra process subset \
  --input file.grib2 \
  --bbox "-110,35,-100,45" \
  --output subset.grib2
3. Technical Approach
a. GRIB2 input
Use wgrib2 (if available) for subsetting:
wgrib2 file.grib2 -small_grib lonW:lonE latS:latN subset.grib2
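A minimal sketch of how the processor might call wgrib2 only when it is installed. The function names `build_small_grib_cmd` and `subset_with_wgrib2` are hypothetical, and returning `False` to signal "fall back to another backend" is an assumption about how the processor would be structured.

```python
import shutil
import subprocess


def build_small_grib_cmd(src, dst, lon_w, lon_e, lat_s, lat_n):
    """Build the wgrib2 argument list for a bounding-box subset."""
    # -small_grib takes its ranges as "lonW:lonE" and "latS:latN".
    return [
        "wgrib2", src,
        "-small_grib", f"{lon_w}:{lon_e}", f"{lat_s}:{lat_n}", dst,
    ]


def subset_with_wgrib2(src, dst, lon_w, lon_e, lat_s, lat_n):
    """Run wgrib2 if it is on PATH; return False so callers can fall back."""
    if shutil.which("wgrib2") is None:
        return False
    cmd = build_small_grib_cmd(src, dst, lon_w, lon_e, lat_s, lat_n)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return True
```

Passing the command as a list avoids shell quoting problems with negative longitudes.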
Or use cfgrib/xarray:
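import xarray as xr

lonW, latS, lonE, latN = -110, 35, -100, 45  # bbox: lon_min, lat_min, lon_max, lat_max
ds = xr.open_dataset("file.grib2", engine="cfgrib")
# Assumes 1-D latitude/longitude coordinates, latitude stored north to south,
# and longitudes in the same convention as the bbox.
ds_sel = ds.sel(latitude=slice(latN, latS), longitude=slice(lonW, lonE))
ds_sel.to_netcdf("subset.nc")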
b. NetCDF/Zarr input
Use xarray's .sel() directly with bounding-box slices.
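Label-based slicing expects bounds in the same direction as the stored coordinate, so a helper that checks the ordering can make the bounding-box selection less fragile. The sketch below is illustrative only: `bbox_slices` is a hypothetical name, it assumes 1-D coordinates, and the `latitude`/`longitude` keys are an assumption that may not match every NetCDF or Zarr dataset.

```python
def bbox_slices(lats, lons, bbox):
    """Return label-based slices for a bbox given 1-D coordinate values.

    bbox is (lon_min, lat_min, lon_max, lat_max). Each slice follows the
    direction in which the coordinate is stored.
    """
    lon_min, lat_min, lon_max, lat_max = bbox

    def ordered(values, low, high):
        # Descending coordinates need the larger bound first.
        descending = values[0] > values[-1]
        return slice(high, low) if descending else slice(low, high)

    return {
        "latitude": ordered(lats, lat_min, lat_max),
        "longitude": ordered(lons, lon_min, lon_max),
    }
```

With an open dataset `ds`, the result could be passed straight to xarray as `ds.sel(**bbox_slices(ds["latitude"].values, ds["longitude"].values, bbox))`.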
c. Output
Keep the output format consistent with the --output flag (GRIB2, NetCDF, GeoTIFF).
Reuse the convert-format processor where possible.
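One way to keep the output consistent with the --output flag is to infer the format from the file extension. This is a sketch under assumptions: the `FORMATS` table and the `infer_output_format` name are hypothetical, and the actual convert-format processor may use different format names or extensions.

```python
from pathlib import Path

# Hypothetical extension-to-format table; the real converter may differ.
FORMATS = {
    ".grib2": "grib2",
    ".grb2": "grib2",
    ".nc": "netcdf",
    ".tif": "geotiff",
    ".tiff": "geotiff",
}


def infer_output_format(output_path):
    """Map the --output file extension to a format name."""
    suffix = Path(output_path).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None
```

Failing early on an unknown extension avoids running the subset only to discover at write time that the target format is unsupported.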
4. CLI Design
Proposed options:
# --bbox order: lon_min,lat_min,lon_max,lat_max
zyra process subset \
  --input hrrr.grib2 \
  --bbox "-110,35,-100,45" \
  --output colorado.grib2
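The --bbox string needs to be parsed and validated before any data is read. A minimal sketch follows; `parse_bbox` is a hypothetical helper, and rejecting boxes where lon_min is not less than lon_max (which rules out boxes crossing the antimeridian) is a simplifying assumption.

```python
def parse_bbox(text):
    """Parse "lon_min,lat_min,lon_max,lat_max" into a tuple of floats."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox needs four comma-separated numbers")
    lon_min, lat_min, lon_max, lat_max = (float(p) for p in parts)
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError("latitudes must satisfy -90 <= lat_min < lat_max <= 90")
    if lon_min >= lon_max:
        raise ValueError("lon_min must be less than lon_max")
    return lon_min, lat_min, lon_max, lat_max
```

For the Colorado example, `parse_bbox("-110,35,-100,45")` yields the four values in the order lon_min, lat_min, lon_max, lat_max.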
Extensions:
--point lon lat: extract the nearest grid point.
--polygon shapefile.geojson (future).
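Nearest-point extraction for --point can be illustrated on a regular grid with 1-D coordinate arrays. The helper names below are hypothetical, and the sketch assumes plain absolute difference in degrees is an acceptable distance measure; grids with 2-D latitude/longitude coordinates would need a different search.

```python
def nearest_index(values, target):
    """Index of the coordinate value closest to target."""
    return min(range(len(values)), key=lambda i: abs(values[i] - target))


def nearest_grid_point(lats, lons, lon, lat):
    """Return (lat_index, lon_index) for a regular 1-D lat/lon grid."""
    return nearest_index(lats, lat), nearest_index(lons, lon)
```

On datasets with 1-D coordinates, xarray offers the same behavior through `ds.sel(latitude=lat, longitude=lon, method="nearest")`.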
5. Integration with IDX (Future)
For HRRR in AWS S3, subsetting could be done at download time:
Parse the .idx file.
Filter only the records that overlap the bounding box.
Fetch those byte ranges.
This would be a later-phase optimization (Phase 3 in the milestones); start with local subsetting first.
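The parse-and-fetch steps can be sketched as follows. The code assumes index lines in the wgrib2 inventory layout (`message:byte_offset:date:variable:level:forecast:`) with one line per message at a distinct offset; `parse_idx` and `range_header` are hypothetical names, and the rule for choosing which records to keep is left to the caller.

```python
def parse_idx(text):
    """Parse wgrib2-style index lines into records with byte ranges."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(":")
        rows.append({
            "message": fields[0],
            "start": int(fields[1]),
            "variable": fields[3],
            "level": fields[4],
        })
    # Each message ends one byte before the next one starts; the last
    # message runs to the end of the file (end=None).
    for row, nxt in zip(rows, rows[1:]):
        row["end"] = nxt["start"] - 1
    if rows:
        rows[-1]["end"] = None
    return rows


def range_header(record):
    """Format an HTTP Range header value for one record."""
    end = "" if record["end"] is None else record["end"]
    return f"bytes={record['start']}-{end}"
```

Each selected record's `range_header` value could then be sent as the `Range` header of an HTTP GET against the GRIB2 object, so only those bytes are downloaded.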
6. Implementation Steps
Prototype subset_processor.py using xarray + cfgrib for NetCDF/GRIB inputs.
CLI wiring: add cmd_process_subset in cli.py.
Output handling: integrate with existing format converter.
Tests:
Full CONUS HRRR → subset Colorado
Small NetCDF test file
Docs: update CLI docs + examples.
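The CLI wiring step might look roughly like the sketch below. It assumes an argparse-based CLI with nested subcommands, which may not match how cli.py is actually organized; the handler body is a placeholder, and only `cmd_process_subset` is a name taken from this plan.

```python
import argparse


def cmd_process_subset(args):
    """Placeholder handler: report what would be subset."""
    return f"subset {args.input} to bbox {args.bbox} -> {args.output}"


def build_parser():
    parser = argparse.ArgumentParser(prog="zyra")
    commands = parser.add_subparsers(dest="command", required=True)
    process = commands.add_parser("process")
    steps = process.add_subparsers(dest="step", required=True)

    subset = steps.add_parser("subset", help="spatially subset a dataset")
    subset.add_argument("--input", required=True)
    subset.add_argument("--bbox", required=True,
                        help="lon_min,lat_min,lon_max,lat_max")
    subset.add_argument("--output", required=True)
    subset.set_defaults(func=cmd_process_subset)
    return parser


# The bbox is passed as --bbox=VALUE because argparse can mistake a
# separate value that begins with "-" for another option.
args = build_parser().parse_args([
    "process", "subset",
    "--input", "hrrr.grib2",
    "--bbox=-110,35,-100,45",
    "--output", "colorado.grib2",
])
result = args.func(args)
```

If the real CLI is built on argparse, the leading-minus behavior is worth covering in the tests and CLI docs, since western-hemisphere bounding boxes always start with a negative longitude.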
7. Milestones
MVP: Support bounding-box subset for NetCDF & GRIB2 → NetCDF output.
Phase 2: Add GRIB2 → GRIB2 with wgrib2 backend.
Phase 3: Add S3 IDX-aware subsetting (fetch only spatial subset from bucket).
Phase 4: Polygon masking & shapefile support.