Data Depot Explorer

AI-Enhanced Discovery for 45 TB of Weather & Climate Data

A metadata catalog that combines Python-powered crawling, LLM enrichment, and semantic search to make decades of NOAA observation and model data discoverable and accessible.

Eric Hackathorn · NOAA Global Systems Laboratory · ORCID iD: 0000-0002-9693-2093
Datasets: 120+ · Archive: 45 TB · API: STAC v1.1 · Search: AI-Powered
QR code — view this poster online

A Wealth of Data, Ready to Be Unlocked

GSL's /public disk holds decades of operational weather observations, model output, satellite imagery, radar data, and retrospective case studies — a rich archive built and maintained by scientists across the lab. The goal: make that deep institutional knowledge accessible to everyone.

Deep Expertise

Experienced staff carry invaluable knowledge of what’s on the depot. This tool helps capture and share that expertise so new team members can get up to speed faster.

Discovery at Scale

With 120+ datasets across nested directories, even experienced users can miss relevant data. A searchable catalog surfaces connections that browsing alone can’t.

Growing & Evolving

Dozens of formats (netCDF, GRIB, BUFR, HDF5), 6 root directories, datasets spanning 1996–present. Keeping up requires automation.

Built for Scientists

A single-page React app with no build step — designed for fast, intuitive exploration of the depot’s full catalog.

Smart Search

Three modes: keyword matching, semantic similarity via embeddings, and a 40/60 hybrid blend of the two. Autocomplete with instant suggestions.

Ask AI

Natural language Q&A powered by RAG. Ask “What satellite data covers the Gulf of Mexico?” and get grounded answers.

Interactive Map

Natural Earth coastlines with drag-to-select bounding box filtering. See spatial coverage at a glance across all datasets.

Timeline View

Temporal coverage bars for every dataset. Expandable detail showing date ranges, update frequency, and data freshness status.

Keyword Browser

9 organized categories — Observation Type, Model/System, Domain, Phenomenon, and more. Click to filter, combine to narrow.

Dataset Cards

Expandable cards with relevance scores, AI descriptions, related datasets, format breakdowns, and JSON export capability.

Explore the Data Depot

Try: “HRRR” · “aircraft observations” · “satellite data over Gulf of Mexico”

Search across 120+ datasets by keyword, spatial extent, time range, or natural language.

Screenshot: Data Depot Explorer running at localhost:8080

To run the explorer locally:

git clone https://github.com/NOAA-GSL/data-depot.git
cd data-depot
docker compose up -d --build       # starts on :8080

Then open localhost:8080, or browse to localhost:8080/poster/ to view this poster as a live interactive demo.

From Raw Disk to Searchable Catalog

A Python-powered pipeline automatically walks the depot, extracts structured metadata, enriches it with AI, and serves it through a standards-compliant API.

/public Disk (45 TB, 120+ datasets) → Python Crawler (depot_crawler.py) → AI Enricher (metadata_enricher.py) → Search Index (121 .search.json files) → Explorer UI + STAC API

What the Crawler Captures

  • Directory trees with file counts, sizes, and extension breakdowns
  • Date ranges (earliest/latest file timestamps)
  • Update frequency estimation from timestamp intervals
  • NetCDF/GRIB/BUFR header sampling for variable names
  • Documentation files (READMEs, CDL schemas)
  • Domain keyword tagging from a 177+ term vocabulary (a minimal crawl sketch follows this list)
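
depot_crawler.py itself isn't reproduced on the poster; a minimal sketch of the directory-walking step, assuming each dataset is a directory under /public, might look like this:

import os
from collections import Counter
from datetime import datetime, timezone

def crawl_dataset(root: str) -> dict:
    # Walk one dataset directory, collecting the structural metadata
    # listed above: file counts, sizes, extensions, and date range.
    counts, total_bytes = Counter(), 0
    earliest = latest = None
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            try:
                stat = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # unreadable file; skip it
            total_bytes += stat.st_size
            counts[os.path.splitext(name)[1].lower() or "<none>"] += 1
            earliest = stat.st_mtime if earliest is None else min(earliest, stat.st_mtime)
            latest = stat.st_mtime if latest is None else max(latest, stat.st_mtime)

    def to_date(t: float) -> str:
        return datetime.fromtimestamp(t, tz=timezone.utc).date().isoformat()

    return {
        "name": os.path.relpath(root, "/public"),
        "file_count": sum(counts.values()),
        "size_bytes": total_bytes,
        "extensions": dict(counts.most_common()),
        "date_range": [to_date(earliest), to_date(latest)] if earliest else None,
    }

Header sampling and vocabulary tagging would layer additional fields onto a record like this.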

Automated Scheduling

  • Runs as a daily Kubernetes CronJob
  • Incremental mode — only re-crawls changed directories
  • Freshness detection: active, stale, or archive (sketched after this list)
  • Fast structure-only mode with --no-sampling
  • Enricher CronJob runs later in the day
  • Search index files tracked in Git for versioning
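
The thresholds behind active/stale/archive aren't given on the poster; a plausible classification from the newest file timestamp, with assumed cutoffs, is:

from datetime import datetime, timedelta, timezone

def freshness(latest_mtime: datetime,
              stale_after: timedelta = timedelta(days=7),
              archive_after: timedelta = timedelta(days=365)) -> str:
    # Cutoffs are assumptions; the crawler's real thresholds are not published here.
    # Expects a timezone-aware datetime for the dataset's newest file.
    age = datetime.now(timezone.utc) - latest_mtime
    if age <= stale_after:
        return "active"
    if age <= archive_after:
        return "stale"
    return "archive"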

From Raw Metadata to Rich Context

The crawler captures structure; LLMs add understanding. A local Ollama instance analyzes each dataset’s metadata and generates human-quality descriptions, categorization, spatial inference, and scientific use cases.
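
metadata_enricher.py isn't reproduced here; a minimal sketch of one enrichment call against Ollama's standard /api/generate endpoint (default port 11434; the model name below is an assumption) could look like:

import json
import requests

def enrich(crawler_record: dict) -> str:
    # Ask the local LLM for a human-quality description of one dataset.
    prompt = ("Write a concise scientific description of this dataset, then "
              "suggest a category, spatial coverage, and typical use cases:\n"
              + json.dumps(crawler_record, indent=2))
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local API
        json={"model": "llama3",                # assumed model name
              "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]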

Crawler Output

Name: data/acars
Formats: .nc · .cdf · .bin
Size: 19.3 GB · 172,668 files
Date Range: 1996 – present
Keywords: ACARS · aircraft · observation

+ AI Enrichment

AI Description: In-flight aircraft observations (ACARS/AMDAR) providing temperature, wind speed, humidity, and turbulence measurements used for NWP data assimilation and aviation weather forecasting.
AI Category: observation
Spatial Coverage: Global · [-180, -90, 180, 90] (W, S, E, N)
AI Keywords: AMDAR · data assimilation · NWP · aviation · upper-air
Use Cases: Data assimilation for numerical weather prediction · Model validation · Atmospheric monitoring
🔍 Semantic Search

Cosine similarity on embeddings, blended with keyword matching
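
As a sketch of how that blend might be computed (the poster gives the 40/60 split but not which share is keyword, so the weights below are an assumption):

import numpy as np

def hybrid_score(query_terms: set, doc_terms: set,
                 query_vec: np.ndarray, doc_vec: np.ndarray,
                 w_keyword: float = 0.4, w_semantic: float = 0.6) -> float:
    # Keyword part: fraction of query terms found among the document's terms.
    keyword = len(query_terms & doc_terms) / max(len(query_terms), 1)
    # Semantic part: cosine similarity between embedding vectors.
    semantic = float(query_vec @ doc_vec /
                     (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
    return w_keyword * keyword + w_semantic * semantic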

💬 Ask AI

Natural language Q&A with retrieval-augmented generation
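
Server-side, the retrieve-then-generate loop can be as simple as the sketch below (top-k retrieval would reuse the hybrid scoring above; the Ollama endpoint and model name are assumptions, as before):

import json
import requests

def ask(question: str, ranked_datasets: list, k: int = 5) -> str:
    # Ground the answer in the k best-matching dataset records.
    context = "\n\n".join(json.dumps(d) for d in ranked_datasets[:k])
    prompt = ("Answer using only the dataset metadata below.\n"
              f"Question: {question}\n\nMetadata:\n{context}")
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]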

🌎 Spatial Inference

Bounding box and human-readable coverage from context

🏷 Categorization

Automatic classification into 9 dataset categories

STAC-Compliant API with AI Extensions

The catalog exposes a full STAC v1.1.0 API alongside custom AI endpoints — making the depot discoverable by standard geospatial tools while offering intelligent search capabilities beyond what STAC provides.

STAC Core Endpoints

Standards-compliant for interoperability with QGIS, pystac-client, and STAC Browser (see the example after the endpoint list).

  • GET /stac/ Landing page
  • GET /stac/collections All datasets as Collections
  • GET /stac/collections/{id}/items Subdirectory Items
  • GET|POST /stac/search Cross-collection search
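
For example, with pystac-client against a local deployment (the bbox and dates are illustrative):

from pystac_client import Client

client = Client.open("http://localhost:8080/stac/")

# Enumerate every dataset exposed as a STAC Collection
for collection in client.get_collections():
    print(collection.id, collection.title)

# Cross-collection search: items over the Gulf of Mexico during 2023
search = client.search(bbox=[-98, 18, -80, 31],
                       datetime="2023-01-01/2023-12-31",
                       max_items=10)
for item in search.items():
    print(item.id, item.datetime)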

AI-Extended Endpoints

Custom endpoints that go beyond STAC for intelligent discovery (example after the list).

  • GET /api/search/semantic Hybrid keyword + embedding search
  • GET /api/ask?q=... Natural language Q&A
  • POST /api/chat Multi-turn conversation
  • GET /api/ai/status Ollama model availability
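
Nothing beyond plain HTTP is needed on the client side; a sketch with requests (the semantic endpoint's parameter name is an assumption; the poster only documents q for /api/ask):

import requests

BASE = "http://localhost:8080"

# Hybrid keyword + embedding search ('q' parameter name is assumed)
hits = requests.get(f"{BASE}/api/search/semantic",
                    params={"q": "aircraft observations"}).json()

# Grounded natural-language Q&A
answer = requests.get(f"{BASE}/api/ask",
                      params={"q": "What satellite data covers the Gulf of Mexico?"}).json()
print(answer)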

Dataset → STAC Collection  ·  Subdirectory → STAC Item  ·  Custom dsg: extension for freshness status, AI category, contacts, and depot paths.
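
The dsg: field names aren't spelled out on the poster; a hypothetical Collection fragment carrying the four extras it mentions might look like this (all dsg: keys below are illustrative), expressed as a Python dict:

collection_fragment = {
    "type": "Collection",
    "id": "acars",
    "stac_version": "1.1.0",
    # Illustrative dsg: keys; the poster names only the concepts they carry.
    "dsg:freshness": "active",
    "dsg:ai_category": "observation",
    "dsg:contacts": ["<dataset steward>"],
    "dsg:depot_path": "/public/data/acars",
}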

Enabling Automated Workflows

With a standards-compliant API, the Data Depot becomes more than a browsable archive — it’s a programmable data source. The Zyra Editor can treat it as a first-class node in visual data processing pipelines, connecting search and acquisition directly to downstream analysis.

Zyra Editor pipeline showing GSL Data Depot Search and Acquire nodes connected in a visual workflow
GSL Data Depot Search

Queries the Data Depot API by keyword, variable, spatial extent, or time range. Returns matching dataset metadata and access paths for downstream nodes.

GSL Data Depot Acquire

Retrieves data files from a matched dataset. Accepts the search result’s URL, parameters, and a JSONPath selector to extract specific items for downstream processing (sketched below).
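
Outside the visual editor, the same search-then-acquire pattern is a few lines of Python; the JSONPath below assumes a STAC-like ItemCollection response shape, and jsonpath-ng stands in for whatever selector engine Zyra uses:

import requests
from jsonpath_ng import parse

# 1. Search (as a GSL Data Depot Search node would)
results = requests.get("http://localhost:8080/api/search/semantic",
                       params={"q": "HRRR"}).json()

# 2. Select download links via JSONPath; adjust the path to the real
#    response schema of the deployed API.
hrefs = [m.value for m in parse("$.features[*].assets.data.href").find(results)]

# 3. Acquire the first match (as a GSL Data Depot Acquire node would)
if hrefs:
    payload = requests.get(hrefs[0]).content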

Up next: The Zyra Editor presentation explores how visual pipeline building and agent-driven orchestration are transforming scientific data workflows.

QR code — Zyra Editor poster

Tech Stack

Frontend

  • React 18 + Babel Standalone
  • Single HTML file, no build step
  • Natural Earth map rendering
  • Dark/Light theme support
  • Responsive, mobile-friendly

Backend

  • FastAPI + Uvicorn
  • Python 3.12
  • STAC v1.1 router (stac.py)
  • 21 REST endpoints
  • In-memory index (~121 datasets)

Infrastructure

  • Docker + Docker Compose
  • Kubernetes (deployment + CronJobs)
  • Ollama for local LLM inference
  • nomic-embed-text embeddings
  • 82 automated STAC tests