**1. Conference Framing – Agents4Science 2025**

The target venue is the [**Open Conference of AI Agents for Science (Agents4Science 2025)**](https://agents4science.stanford.edu/), which pioneers AI systems as primary authors of scientific papers. According to the call for papers, each submission *"should be primarily authored by AI systems, which are expected to lead the hypothesis generation, experimentation, and writing processes"*, with the AI listed as sole first author [^1]. Human co-authors may assist in oversight roles. The **conference theme** centers on **AI agents as primary scientific authors**, not just tools; novelty is expected beyond "AI helped" – the AI must drive hypothesis formation and reasoning.

Key **submission details** include a deadline of **September 5, 2025 (Anywhere on Earth)** and a **page limit** of 8 pages (excluding references and required statements) using the official LaTeX template [^2]. Submissions are **anonymous** and must be made via OpenReview [^1]. Critically, the conference mandates an **AI Contribution Disclosure** – a checklist documenting what the AI did versus what humans did – reflecting the emphasis on transparency of AI involvement [^1]. In addition, each paper must include a **Responsible AI Statement** (addressing broader impacts and ethical considerations) and a **Reproducibility Statement**, neither of which counts toward the page limit [^1]. These requirements align with broader norms (e.g., the [NeurIPS ethics guidelines](https://neurips.cc/Conferences/2025/ReviewerGuidelines) [^11]) and reinforce the expectation that authors clearly communicate how the AI was used and ensure the work can be independently verified [^1].

**Audience:** The conference audience will be researchers across machine learning, autonomous agents, computational science, and scientific methodology – including those interested in ethics and reproducibility in AI-driven research. The framing of our paper should therefore highlight not only the technical innovation (an AI agent *leading* scientific discovery) but also responsible conduct (e.g., addressing potential pitfalls or biases) and verifiable results. The work should resonate with experts in **ML for science, agent systems, and weather/computational science** by demonstrating a novel, rigorous approach to using AI as a scientific collaborator.

**2. Zyra's Positioning in this Research**

**Project Zyra** (from NOAA's Global Systems Laboratory) will serve as the AI agent framework in our study. It is important to clarify **what Zyra is and is not**. **Zyra is not merely a plotting library**; it is a modular pipeline framework that spans data acquisition, processing, visualization, and dissemination of results [^3]. In other words, Zyra provides an end-to-end workflow for turning raw data into insights: one "plants" data seeds (from the web, model outputs, etc.), **nurtures** them through processing, and **harvests** insights in the form of visualizations and reports [^3]. This design focuses on reproducibility and clarity – *"every workflow can be re-run, shared, and verified"* by others [^3].

Our **unique angle** is to leverage Zyra's visualization capabilities as part of the *reasoning process* rather than treating visualization as a final, passive step. Typically, in scientific workflows, visualization is the end-point, used to communicate results.
Zyra enables us to make visualization an **agentic act of hypothesis generation**: the system can generate plots or animations of model outputs and then actively analyze those visuals (or the underlying data) to propose hypotheses about the model's behavior. In essence, **Zyra treats visualization as a form of reasoning**, not just illustration. For example, given high-resolution weather model outputs (from NOAA's **HRRR** or **RRFS** models), Zyra can produce diagnostic plots (e.g., spatial maps, time loops, cross-sections) and then identify anomalies or patterns in those plots that might indicate model biases or interesting phenomena. Rather than a human analyst "eyeballing" the graphs, Zyra itself will flag features – say, an unusual pattern in forecast reflectivity or a consistent spatial offset in convective initiation – as hypotheses to investigate.

This approach fits well with NOAA GSL's mission. Zyra's ability to ingest and analyze **operational model outputs (HRRR/RRFS)** means it can be applied directly to ongoing NWP (Numerical Weather Prediction) systems, potentially improving how we evaluate and trust these models. GSL is focused on **improving forecast systems and ensuring they are trustworthy and reproducible**. By building reproducible pipelines that automatically surface candidate anomalies for verification, Zyra can support **model verification workflows** at GSL. In practice, Zyra could highlight forecast features (such as an unexpected bias in convective available potential energy, or a systematic under-prediction of reflectivity in certain conditions) and suggest these for further scrutiny. This does not replace traditional verification, but it **augments human experts** by pointing them to potential issues more quickly. All actions Zyra takes (data sources, processing steps, figures generated) are logged, providing full provenance for reproducibility.

In summary, **Zyra's role** in our project is as the *AI Scientist* agent that will autonomously retrieve HRRR/RRFS data, generate visual analyses, form hypotheses about model performance, and even suggest verification tests. We will emphasize this positioning clearly: Zyra is a workflow and reasoning engine that aligns with NOAA's push for trustworthy, reproducible NWP, rather than a generic plotting toolkit.

**3. Research Plan Stages**

To produce a full paper by the deadline, we outline a series of stages.

**Stage 1: Scoping & Literature Review (Now – Early Sept)**

**Scope definition:** We will constrain the study to a few illustrative **case studies using HRRR/RRFS** model outputs. Rather than tackling all possible weather phenomena, we will select 1–2 representative variables or forecast aspects to focus on – for example, hourly maximum **reflectivity** (to examine convective storms) and perhaps **CAPE** (Convective Available Potential Energy) or ensemble spread in the RRFS (to examine uncertainty). These choices give us concrete targets for visualization and hypothesis generation. By limiting scope, we ensure depth of analysis and keep the project manageable.

**Literature review:** We will survey prior work in several relevant areas to ground our approach:

* **Existing verification frameworks in meteorology:** Traditional and modern methods for forecast verification will inform how we evaluate Zyra's findings.
Notable techniques include **object-based methods** such as MODE (*Method for Object-Based Diagnostic Evaluation*) and CRA (*Contiguous Rain Area*), **neighborhood methods** such as the FSS (*Fractions Skill Score*), and other spatial metrics such as **SAL** (*Structure–Amplitude–Location*) [^4]. These methods were developed to address the limitations of point-by-point verification, often inspired by how humans visually compare forecast maps. For instance, *"the CRA technique mimics 'eyeball' verification in order to rigorously quantify visual similarities of forecast and observed rain events"* [^4]. MODE, introduced by Davis et al. (2006), identifies coherent forecast "objects" (e.g., rainfall clusters) and compares their attributes to observed objects – effectively quantifying what a human might see when overlaying forecast and observed fields [^4]. The **Fractions Skill Score (FSS)** is another approach that *"relaxes" spatial matching by comparing the fractional coverage of events (such as rainfall above a threshold) in neighborhoods, rather than exact gridpoint hits* (Roberts and Lean 2008); a compact sketch of this computation appears at the end of this stage. Meanwhile, **SAL** provides a summary of differences in the overall structure, amplitude, and location of precipitation forecasts versus observations [^4]. By reviewing these, we learn where purely *visual* comparison has already been turned into quantitative metrics. This helps us design Zyra's reasoning so that it is not just guessing – it can lean on similar principles (identifying objects, spatial biases, etc.) but with the potential to automate hypothesis generation.

* **Visualization for forecast verification:** We will look for past attempts to use visualization or interactive tools to aid forecast evaluation. One example is an **exploratory visualization tool by Lundblad et al. (2011)** for weather verification, developed with the Swedish Meteorological Institute. They found that *interactive visualization sped up the analysis process and increased flexibility compared to manual methods* [^5]. This suggests that visual analysis can indeed enhance verification when done thoughtfully. However, these past tools relied on human analysts to interpret the visuals; none (to our knowledge) positioned the visualization system itself as an *agent* generating insights. We will document where such approaches succeeded (e.g., making it easier for humans to spot trends) and where they fell short (e.g., scalability, subjectivity). This will help us articulate how Zyra's approach is different (automation plus reproducibility). We will also review operational tools such as **Verif** (an open-source verification package) and recent **dashboard systems (e.g., JIVE)** that NOAA and others use to visualize forecast performance in real time. These show the state of practice that our approach should connect to.

* **"AI as Scientist" and automated discovery:** A key inspiration is recent work on AI agents autonomously conducting research. Notably, *"The AI Scientist"* (Lu et al., 2024) proposed a framework in which an AI system generates research ideas, runs experiments, and even writes papers without human intervention [^6]. This multi-agent system was able to produce ML research papers and simulate a peer review process, reportedly even generating papers that could meet acceptance thresholds at top conferences [^6]. We will examine this work to understand how the AI formulates and evaluates hypotheses. The AI Scientist's success in three ML case studies (diffusion models, language modeling, etc.)
and the fact that it *"visualizes results [and] describes its findings by writing a full scientific paper"* are directly relevant to our aims [^6]. We should also heed its limitations: commentary noted that, while impressive, it had **limited applicability** beyond well-defined ML problems [^6]. In particular, we should ask: what would it take for an "AI Scientist" to tackle an open scientific problem in weather? Our approach with Zyra will adapt some of its ideas (automated hypothesis generation, closing the loop with experiments), but in the more concrete context of verifying a known numerical model. Another relevant project is *"AI Copernicus"*, which used AI to rediscover physical laws – famously, an AI was able to infer that Earth orbits the Sun from data [^7]. That demonstration, while flashy, mainly showed AI rediscovering known science; we aim for Zyra to surface *new*, or at least not immediately obvious, insights about our forecast models.

**Key questions guiding the literature review:**

* *Where have visual reasoning attempts succeeded or failed in NWP?* For example, humans have long done "eyeball verification" by overlaying forecast and observed maps [^12]. Formal efforts like CRA and MODE were created to add rigor to what was previously a subjective visual comparison [^4]. We will document whether there were cases where trusting one's eyes led forecasters astray, or where automated image analysis missed context that a human would catch. Understanding these will shape how we validate Zyra's visual hypotheses (to ensure we are not just automating "eyeballing" without rigor).

* *How can we avoid merely "AI eyeballing" and instead ensure reproducible rigor?* This is crucial for credibility. Our plan is to have Zyra not only point out a pattern, but also record **why** (e.g., "Region X consistently has forecast reflectivity 10 dBZ higher than observed at hour Y" – something quantifiable). We will draw on reproducibility standards: by logging the data and code that lead to each hypothesis, anyone can re-run Zyra's pipeline and verify the anomaly. Literature on *reproducible visual analytics* and any domain standards for verification (such as keeping provenance of data and plots) will be reviewed. The goal is to formulate concrete methods in our prototype to document every step (Zyra's pipeline by design yields such provenance metadata).

By the end of Stage 1, we will have an outline of the **background or related-work section** for the paper, complete with citations to prior work on forecast verification methods, uses of visualization in meteorology, and the emerging concept of AI as an autonomous researcher. This will ground our contributions in context and help demonstrate the novelty: whereas others have either done visualization for human use or AI for other domains, *we propose an AI agent (Zyra) that uses visualization as a tool for scientific reasoning in weather model verification*.
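Before moving to prototyping, it is worth making one of the metrics above concrete. The sketch below implements the neighborhood idea behind the FSS (the standard Roberts and Lean 2008 formulation); the numpy/scipy implementation details are our own illustration, not code taken from any verification package, and would need testing against reference values before use.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def fractions_skill_score(fcst: np.ndarray, obs: np.ndarray,
                          threshold: float, neighborhood: int) -> float:
    """Fractions Skill Score (Roberts and Lean 2008) on a single 2-D field.

    FSS = 1 - mean((Pf - Po)^2) / (mean(Pf^2) + mean(Po^2)),
    where Pf and Po are the fractions of grid points exceeding `threshold`
    within each (neighborhood x neighborhood) window.
    """
    # Fraction of exceedances in each neighborhood = moving average of the binary field.
    pf = uniform_filter((fcst >= threshold).astype(float), size=neighborhood)
    po = uniform_filter((obs >= threshold).astype(float), size=neighborhood)
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    return float(1.0 - mse / mse_ref) if mse_ref > 0 else float("nan")
```

Zyra's heuristics can lean on the same principle: turn a visual impression ("the forecast rain area looks displaced") into a number that can be logged and re-computed by anyone re-running the pipeline.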
**Stage 2: Pipeline Prototyping (September)**

In this phase, we build a functional prototype of the Zyra workflow to carry out a case study. The components will include:

1. **Dataset acquisition:** Using Zyra's Acquisition layer, we will obtain historical cases from NOAA's **HRRR** and/or **RRFS** model runs. This data (likely in GRIB2 or NetCDF format) can be fetched via NOAA's data services (e.g., **NOMADS** servers or the AWS Open Data repositories). *HRRR (High-Resolution Rapid Refresh)* is a 3-km convection-allowing model with hourly updates, and the *RRFS (Rapid Refresh Forecast System)* is its upcoming ensemble-based extension [^8]. We will choose specific dates or events of interest – for example, a severe weather day with strong convection – to provide rich scenarios for analysis. Data acquisition in Zyra can be scripted (e.g., using FTP/HTTP managers for NOMADS or S3 access for AWS) [^3]; see the sketch after this list. We will log which model runs and initialization times are pulled, as part of reproducibility.

2. **Automated Zyra visualization workflow:** Once data is in hand, Zyra will execute a pipeline to **process and visualize** the model output. Concretely, this involves reading in the forecast fields (e.g., 3D reflectivity fields, or 2D CAPE fields over time), applying any processing needed (e.g., computing derived fields or subsetting a region of interest), and generating **visual encodings** of the data. We anticipate creating a few types of visualization:
   * **Spatial maps or loops:** e.g., a map of forecast versus observed reflectivity side by side, or an animation of forecast reflectivity over several hours.
   * **Cross-sections or vertical profiles:** if examining something like CAPE or wind shear.
   * **Ensemble spaghetti plots or probabilistic maps:** if using RRFS ensemble data, to visualize spread.
   Each visualization will be configured through Zyra's Visualization layer (which uses matplotlib, etc.) [^3]. The output could be static images or short videos/GIFs. The key is that these visuals are created in a *consistent, scriptable way* (no manual plotting), so that they can be regenerated exactly.

3. **Hypothesis generation (Zyra as agent):** After producing the visual diagnostics, we will implement a module in Zyra (or an external agent that uses Zyra) to **analyze the results and propose "hypotheses"**. For example, suppose Zyra generates a time loop of forecast reflectivity versus observed radar for a storm event; the agent might detect that the forecast consistently develops storms too early (a timing error), or that the intensity of reflectivity is weaker than observed. The "hypothesis" could be phrased as: *"The model is under-predicting convective intensity in this case"* or *"There is an anomalous area of high CAPE in the forecast that did not lead to observed storms – a potential false-alarm region."* Internally, this can be done by simple algorithmic checks (difference fields, thresholds) or more sophisticated pattern recognition (perhaps using computer vision on the images). At this stage, we will not over-engineer an AI/ML solution for pattern detection – simple heuristics may suffice to flag a notable discrepancy (one such heuristic is sketched after this list). The novelty is that **Zyra, not a human, is making the first suggestion** of what to investigate.

4. **Logging provenance:** Throughout this pipeline, Zyra will record metadata – which data files were used, what processing steps were applied (filters, thresholds), and pointers to the generated figures. This provenance log is crucial for reproducibility, allowing us (and ultimately readers) to trace every result. Zyra's design supports this kind of transparency (workflows are declarative and can be versioned and logged by design) [^3]. We might output a JSON or text report of the run.
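To make the acquisition and heuristic steps concrete, here is a minimal sketch in Python. It is illustrative only: it assumes the publicly documented `noaa-hrrr-bdp-pds` bucket on AWS Open Data and its key layout, uses plain `boto3`/`numpy`/`scipy` rather than Zyra's own acquisition managers, and the 10 dBZ / area / persistence thresholds are placeholders we would tune.

```python
"""Illustrative Stage 2 sketch (not Zyra's actual API).

Assumptions: public `noaa-hrrr-bdp-pds` bucket and key layout; forecast and
observed reflectivity already decoded onto a common (hours, ny, nx) grid.
"""
from pathlib import Path

import boto3
import numpy as np
from botocore import UNSIGNED
from botocore.config import Config
from scipy import ndimage


def fetch_hrrr_surface_file(run_date: str, init_hour: int, fcst_hour: int,
                            out_dir: Path = Path("data")) -> Path:
    """Download one HRRR CONUS surface GRIB2 file from the AWS Open Data bucket."""
    key = (f"hrrr.{run_date}/conus/"
           f"hrrr.t{init_hour:02d}z.wrfsfcf{fcst_hour:02d}.grib2")
    out_dir.mkdir(parents=True, exist_ok=True)
    local = out_dir / key.replace("/", "_")
    # Anonymous (unsigned) access is enough for the public bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("noaa-hrrr-bdp-pds", key, str(local))
    return local


def flag_reflectivity_anomalies(fcst_dbz: np.ndarray, obs_dbz: np.ndarray,
                                bias_threshold: float = 10.0,
                                min_area_km2: float = 1000.0,
                                min_hours: int = 2,
                                cell_area_km2: float = 9.0) -> list[dict]:
    """Flag contiguous regions where forecast reflectivity exceeds observations.

    A region is flagged when the forecast-minus-observed difference exceeds
    `bias_threshold` dBZ over at least `min_area_km2` for `min_hours` in a row
    (placeholder values; 9 km^2 corresponds to one 3-km HRRR grid cell).
    """
    diff = fcst_dbz - obs_dbz
    exceed = diff > bias_threshold                      # (hours, ny, nx) boolean
    persistent = np.zeros(exceed.shape[1:], dtype=int)  # consecutive-hour counter
    hypotheses = []
    for t in range(exceed.shape[0]):
        persistent = np.where(exceed[t], persistent + 1, 0)
        labels, n_regions = ndimage.label(persistent >= min_hours)
        for region in range(1, n_regions + 1):
            mask = labels == region
            area = mask.sum() * cell_area_km2
            if area >= min_area_km2:
                hypotheses.append({
                    "hour_index": t,
                    "area_km2": float(area),
                    "mean_bias_dbz": float(diff[t][mask].mean()),
                    "statement": (f"Forecast reflectivity exceeds observations by "
                                  f">{bias_threshold} dBZ over ~{area:.0f} km^2 "
                                  f"for at least {min_hours} h."),
                })
    return hypotheses
```

In the actual prototype these steps would run inside Zyra's acquisition and processing layers; the point here is only that the flagging criterion is explicit, quantitative, and replayable rather than an unrecorded "eyeball" judgment.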
By the end of Stage 2, we expect to have **mock figures and example hypotheses** generated. For instance, an output could be a figure with an annotation "Zyra highlights this region where forecast reflectivity >> observed," along with a caption that Zyra formulates as a hypothesis (e.g., *possible over-prediction of reflectivity in mountainous regions*). These results will be framed carefully: they are **demonstrations** of the concept, not yet comprehensive verification analyses. We will emphasize that at this prototype stage we are showing what **could** be done with an AI-driven pipeline in a controlled scenario. (Reviewers will be reminded that this is not an operational system yet, but a proof of concept of an AI scientist in meteorology.)

**Stage 3: Hypothesis Validation (Late Sept)**

Once Zyra has proposed one or two hypotheses from the case study, we will attempt a **closed-loop test**: can the AI's hypothesis be checked (at least preliminarily) and the results fed back into the reasoning? This involves a "downstream" verification agent or step:

* We will use **observational data** (such as weather radar reflectivity, satellite observations, or station measurements, depending on the variable) to test the hypotheses. For example, if Zyra hypothesizes *"the model under-forecasted precipitation in region X"*, we would gather the observed precipitation/radar for region X and compare quantitatively. If it hypothesizes *"the model's ensemble spread is too tight"*, we would look at actual forecast errors to see whether the ensemble failed to cover the truth.

* The validation need not be a fully rigorous study (that could be extensive); rather, we will do a lightweight check that demonstrates the principle (see the sketch after this list). For instance, we might compute a simple metric – *observed storm intensity versus forecast intensity in the highlighted case* – confirming that the forecast was, say, 20% weaker in reflectivity than observed, supporting Zyra's claim. If Zyra pointed out an anomaly that turned out not to matter, we would note that too. The idea is to show **the scientific loop in action**: **Visualization → Hypothesis → Test → (New Visualization)**. We can even have Zyra automatically generate a follow-up plot as part of validation – e.g., a scatter plot of forecast versus observed values for the identified region, or a time series of error – to illustrate whether the hypothesis holds.

* This stage will produce a small set of **results that "close the loop."** For example, an outcome could be a chart comparing model and observations for the anomaly period, with an annotation like "Confirmed: forecast reflectivity consistently 10 dBZ lower than radar during 20Z–22Z" or "Hypothesis not supported for other cases (needs further investigation)." This demonstrates a degree of autonomy: the AI not only made an assertion but also tested it. Crucially, we will keep these validation exercises **transparent and reproducible** as well – using Zyra or simple scripts to fetch the needed observations and do the comparison, rather than manual analysis. The point is not to achieve a publishable meteorological finding in itself, but to prove that the *AI scientist workflow* can complete a full cycle (much as The AI Scientist had its system write a paper and even critique itself). In our case, the "critique" is using real data to verify the AI's idea.

By the end of Stage 3, we will have the core content for a **case study section** in the paper: a narrative of what Zyra did (with figures) and how the hypotheses panned out. This will likely be the most illustrative part of the paper, showing readers a concrete example of AI-driven scientific inquiry in weather modeling.
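A lightweight check of this kind can be a few lines of analysis keyed to the hypothesis record produced in Stage 2. The sketch below is illustrative (plain numpy, hypothetical field names, and a ±2 dBZ tolerance chosen only for illustration), not Zyra's validation module:

```python
import numpy as np


def validate_bias_hypothesis(fcst_dbz: np.ndarray, obs_dbz: np.ndarray,
                             region_mask: np.ndarray, hours: slice,
                             claimed_bias_dbz: float,
                             tolerance_dbz: float = 2.0) -> dict:
    """Check a claim such as 'forecast is ~10 dBZ high in region R during hours H'.

    fcst_dbz, obs_dbz: (hours, ny, nx) arrays on a common grid.
    region_mask: (ny, nx) boolean mask of the flagged region.
    hours: the forecast hours the hypothesis refers to.
    """
    fcst_mean = fcst_dbz[hours][:, region_mask].mean(axis=1)  # regional mean per hour
    obs_mean = obs_dbz[hours][:, region_mask].mean(axis=1)
    bias_per_hour = fcst_mean - obs_mean
    mean_bias = float(bias_per_hour.mean())
    supported = abs(mean_bias - claimed_bias_dbz) <= tolerance_dbz
    return {
        "claimed_bias_dbz": claimed_bias_dbz,
        "measured_bias_dbz_by_hour": [float(b) for b in bias_per_hour],
        "measured_mean_bias_dbz": mean_bias,
        "verdict": "supported" if supported else "not supported",
    }
```

The verdict and the per-hour numbers would then feed the follow-up figure and the annotated "Confirmed / not supported" statements described above.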
**Stage 4: Writing & Assembly (Late Sept)**

With the technical work done, we will focus on writing the paper and assembling all components in the required format:

* **Outline and drafting:** We will follow a structure typical of academic papers (and one that aligns with the conference's expectations). Tentatively:
  * **Introduction:** Present the problem (forecast verification needs new approaches; enter AI-as-scientist) and our solution's novelty. We will highlight the gap in current practice (verification is mostly manual or metric-driven, visualization is not fully utilized, and AI is not yet trusted to do science) and how our approach addresses it. We will also explicitly state that an AI agent (Zyra) led the work, fitting the conference theme.
  * **Related Work:** Summarize the literature from Stage 1 – verification methods, The AI Scientist, visual analytics in meteorology, etc. We will position our approach relative to each. For instance: "Unlike prior verification tools, which required human analysis, our system autonomously generates hypotheses," and "Unlike The AI Scientist, which focused on ML benchmarks, we target a real-world scientific domain (weather) with an AI agent."
  * **Methodology:** Describe Zyra's architecture and how we configured it for this study. A diagram of the workflow pipeline (data → process → visualize → hypothesize → validate) will be included. We will detail the case study setup: which data was used and how, what algorithms Zyra employed to analyze the visuals, etc. This section ensures readers could in principle replicate the pipeline.
  * **Case Study Results:** Present the outcomes from Stages 2 and 3. This includes the figures Zyra generated, the hypotheses, and the validation results. We will write it in narrative form, e.g.: "Zyra examined a June 2025 severe weather outbreak using HRRR data. It produced an animation of forecast versus observed reflectivity (Figure X) and immediately identified a discrepancy in storm timing. Specifically, Zyra noted the model initiated convection approximately 1 hour earlier than reality in several cells (marked in Figure X). Hypothesizing a timing bias, Zyra then…" We will likely include at least **two figures**: one illustrating the **visual hypothesis** (perhaps a map or image with annotations showing what Zyra flagged), and another showing the **validation** (perhaps a simple plot or table confirming the error). These figures will be captioned clearly, possibly even showing a "chat" or explanation from Zyra if appropriate (to reinforce the AI-driven aspect).
  * **Discussion:** Interpret the results and discuss broader implications. We will acknowledge limitations (small sample, the simplicity of Zyra's analytics so far, etc.) as well as advantages (the approach is generalizable, and everything is reproducible). We will also connect back to the conference theme: what did having an AI in charge teach us? For example, we might note that the AI found something a human might also have found, but it did so systematically and logged the process – a step toward *auditable science done by machines*. We will also discuss how human oversight remained important (to design the pipeline and to check that the hypotheses made sense), providing a balanced view.
  * **Conclusion/Future Work:** Summarize contributions and point to next steps (e.g., scaling up to more cases, incorporating more advanced AI pattern recognition, or integrating into real forecast-evaluation pipelines).

* **Required statements and checklist:** We will include the **Responsible AI Statement** and **Reproducibility Statement** as required. Much of this content can be drafted in parallel. For Responsible AI, we will emphasize the precautions taken: for example, ensuring the AI does not mislead by presenting only findings with statistical support, and discussing the ethics of crediting AI versus humans. (Since the conference itself is about AI authors, ethics of attribution is key.) We will also mention possible risks: could an AI scientist draw incorrect conclusions from spurious correlations? We mitigate that through human oversight and transparent validation. The **Reproducibility Statement** will note that our **code and data will be released**, and that Zyra's pipelines are openly available (likely via the NOAA-GSL/zyra repository). We will specify what we are doing to enable others to replicate the results (e.g., providing environment details, fixed random seeds if any ML is used, and the provenance logs).

* **AI Contribution Disclosure:** We need to fill out the checklist delineating which parts were AI-driven. In our case, the AI (Zyra) did the hypothesis generation, parts of the experiment (visualization, analysis), and possibly helped draft some text (if we use any language generation for writing). We, as human researchers, did the framing and pipeline development. We will document this carefully; clarity here is critical for credibility [^1], so the final paper will state clearly, e.g.: "Zyra (AI system) selected the case study, generated Figures 2–3, and formed the initial hypotheses, while human co-authors designed the experiment and verified the AI's findings."

* **Paper assembly and formatting:** We will use the provided LaTeX template (which includes the disclosure checklist) [^1]. All figures will be properly anonymized (no NOAA logos or author names on them). We must double-check that nothing in the text or acknowledgments reveals our identity, given that the submission is anonymous. We will likely refer to Zyra without saying "by NOAA GSL" in the paper, to preserve anonymity (Zyra is open source, so it should be acceptable to mention the tool as long as we do not link it directly to us).

By the end of Stage 4, we will essentially have a full **draft of the paper** ready for final checks. The various pieces (figures, statements, references) will be in place. We will circulate it internally for feedback if time permits, to ensure clarity.

**Stage 5: Submission (by Sept 5, 2025)**

Final steps include polishing and submitting:

* We will give the manuscript a thorough edit for coherence, compliance with the length limit (≤8 pages plus references), and formatting. All required sections (abstract, main paper, disclosure checklist, statements) will be present, and references will be formatted per the template. Any supplementary material (perhaps code or an appendix with additional figures) will be prepared as needed, though the conference has not explicitly mentioned separate supplementary material.

* **Anonymity check:** We will remove or generalize any identifying information. For example, instead of "In our lab at NOAA, we…", the paper would say "Zyra was used… (details omitted for anonymity)." We will also be careful not to cite an obscure internal tech report that gives us away.
If we need to cite Zyra's documentation, we may cite it as a URL with care (or as an anonymized source for review) [^1].

* Once everything is set, we will submit via the OpenReview portal [^1]. We will also be prepared to answer any submission-form questions about AI usage; the disclosure in the paper essentially covers this, but the form may ask as well.

* After submission, the paper will undergo a review process in which, interestingly, AI reviewers will be involved [^2]. We may later see the AI agent reviews, which is itself an exciting aspect of this conference. (Should we be accepted, the camera-ready version can then include actual author names and acknowledgments.)

By following these stages, we aim to **meet the deadline confidently** and produce a submission that is novel, rigorous, and aligned with Agents4Science 2025's expectations. The plan is ambitious but feasible with steady progress through September.

**4. Comparable Work and Related Sources**

In developing this project, we draw inspiration from several **similar projects** and take heed of **lessons learned** from previous attempts.

**Similar Projects and Approaches**

* **The AI Scientist (Lu et al., 2024):** This project is a clear precursor in spirit. It presented a *"fully automated open-ended scientific discovery"* system in which an AI agent (or team of agents) iteratively generated ideas, ran experiments, and wrote papers [^6]. It even simulated peer review with an automated reviewer. The AI Scientist showed that, in domains like machine learning, an AI can produce research outputs (including readable papers) largely autonomously. One of its achievements was producing papers that an automated reviewer judged to be above the acceptance threshold of a top conference [^6] – an existence proof of AI-as-author. However, it was constrained to well-defined computational experiments (with clearly quantitative goals and plenty of training data). We are translating some of its ideas to a scientific domain (meteorology) with real-world data. Also, whereas The AI Scientist encompassed the *entire* research cycle (including writing and peer review), our focus is on the hypothesis generation and verification loop within a specific scientific workflow. Still, its modular design (separate modules for idea generation, execution, writing, and reviewing) is something we consider – Zyra could be seen as covering the "experiment execution and analysis" modules of such a system. We also note the **cost and reliability** aspects: The AI Scientist reportedly could generate a paper for under $15 of compute [^6]. Our approach relies on relatively cheap computations (pulling data, making plots) – it is more about intelligent analysis than heavy compute, so cost is not a barrier. The **limitation** noted by commentators was that The AI Scientist's scope was narrow and it might *lack truly novel insights* (the Nature News headline asks *"what can it do?"* and notes *"limited applicability"*) [^6]. This urges us to carefully evaluate whether our AI (Zyra) is contributing real insight or just automating trivial procedures.

* **Automated/ML Approaches in NWP:** The meteorological community has seen rising interest in using machine learning to emulate or improve numerical weather prediction.
*AutoML for NWP* can refer to a few things: one is using ML to tune models or select optimal parameters (which is tangential here), but more relevant is using **ML as surrogates or supplements for physical models**. For example, recent work has shown that deep learning models can post-process deterministic forecasts to generate probabilistic hazard predictions. One study developed a **conditional generative adversarial network (CGAN) to generate an ensemble of synthetic forecasts from a single HRRR run**, feeding those into a CNN to predict severe-weather probabilities [^9]. This essentially created an *ML-based ensemble* that behaved similarly to a traditional ensemble, preserving key correlations from the original HRRR physics [^9]. The result was improved skill in forecasting events such as tornadoes and hail. This is relevant because it demonstrates AI adding value on top of a physics model – analogous to how we want AI (Zyra) to add value on top of traditional verification. As another example, *graph neural networks and FNOs (Fourier Neural Operators)* have been used as faster surrogates for weather models, though primarily at global scales or for specific parameters. These projects show that **AI can either emulate NWP (to speed it up)** or **analyze NWP output for insight**; the latter is closer to our use case. We will cite these works to situate Zyra: we are not trying to replace the weather model with ML, but rather to use AI to interpret the model's output. If their results reveal common patterns of model error that ML picks up, Zyra could be designed to catch similar patterns and explain them. Any **AutoML** efforts that automate model experimentation (such as hyperparameter tuning for weather models, or combining different schemes) may also be worth mentioning, as they reflect the trend of automating parts of the scientific process in meteorology.

* **Visualization-driven forecast evaluation:** Traditional verification research has given us tools like MODE and SAL (discussed above), which indeed sprang from recognizing the value of the "human eye" in pattern comparison. In addition, there have been systems focused on *visual analytics* for meteorology. For example, the *Exploratory Visualization for Weather Verification* tool (Lundblad et al. 2011) enabled users to interactively explore forecast versus observation data and cluster stations by error patterns [^5]. Another is the **JIVE** system (Joint Integrated Verification Experiment) used in the U.S., which provides forecasters with visual dashboards of recent model performance (e.g., plots of errors and biases, updated in real time). These systems underscore that visualization can greatly aid understanding of model performance. However, they rely on human expertise to draw conclusions from the visuals. Our work pushes this further by placing an AI in that loop – essentially *performing* the visual analysis algorithmically. We believe this could catch things human forecasters might miss (especially subtle or multivariate patterns). But we also look to these tools for caution: a human forecaster has contextual knowledge and can intuitively ignore unimportant differences; an AI might lack that context. We may leverage techniques from explainable AI to keep Zyra's focus on meaningful differences (e.g., using thresholds so that it does not flag every tiny discrepancy as an "anomaly").

* **Explainable AI in Weather:** A growing body of work on **interpretable and explainable AI for weather/climate** provides another angle.
Researchers have applied methods such as SHAP (Shapley values), LIME, and saliency maps to understand *why* an ML model made a certain weather prediction [^10] – for instance, showing which input features or regions most influenced a neural network's forecast of extreme rainfall. These are essentially visualization techniques (heatmaps, etc.) aimed at making AI predictions more transparent. However, **they serve interpretability, not scientific discovery**: they help humans trust or understand a model, rather than having the AI itself come up with new hypotheses. In our project, Zyra's visualizations are not explaining an AI's decision; they examine the physical model's output to generate new questions. We occupy a complementary space: instead of "explainable AI," we are doing "AI-driven explanation of a scientific system." Nonetheless, we will incorporate best practices from XAI, such as ensuring **transparency** (Zyra should be able to show why it flagged something, providing a "reason" or criterion, much as saliency maps show important pixels). The ethical considerations from XAI also apply: misleading visuals can misinform, so Zyra must be designed to avoid spurious patterns (or at least to communicate uncertainty when it flags something of borderline significance).

**Lessons Learned from Past Attempts**

From reviewing prior efforts, we note several pitfalls that we aim to avoid:

* **Pitfall 1: Visualization as "post-hoc decoration."** In some projects, visualization was treated as merely a pretty output at the end of a study, without actually informing the science. Such approaches can rightly be dismissed as cosmetic. We must ensure that in our work visualization is *integral* to the reasoning; that is why Zyra's hypothesis generation is directly tied to what it visualizes. We will emphasize in the paper that without the visualization step the insights would not have emerged – it is not just there to make the paper look nice. By demonstrating a case where Zyra's plot *revealed* an issue that was not obvious from the raw numbers alone, we show true value. Additionally, we will document everything quantitatively (where possible) to back up visual observations, so that it is not "just a pretty picture." Reviewers should see that our visual findings are reproducible and lead to actionable verification tests.

* **Pitfall 2: Hypotheses too obvious (lack of novelty).** One risk of an AI system combing through data is that it may "discover" things we already know. For example, it might conclude that "higher CAPE leads to stronger storms" – which, while true, is meteorologically obvious and not a novel insight. If our AI only produces such trivial hypotheses, the contribution will fall flat. We need to demonstrate **serendipity or non-obvious findings**. This might involve focusing on more nuanced model errors (e.g., a subtle diurnal bias, or an error that only shows up in specific terrain). It could also mean having Zyra analyze combinations of fields (perhaps finding an unusual pattern involving both reflectivity and winds). We recall examples like the "AI Copernicus" case, where the AI re-derived a known fact – that Earth orbits the Sun [^7] – which was impressive technically but scientifically just confirmed known knowledge. We want Zyra to ideally catch at least one thing that is not already well documented in the HRRR literature. This is challenging, but even a new combination or framing of an issue would count. We will also phrase hypotheses carefully to avoid tautologies.
If a hypothesis is obvious, we will acknowledge it and position the result as a sanity check (e.g., "Zyra identified a known timing bias, confirming it can catch expected errors; more interestingly, it also pointed out X, which is less widely recognized"). Ensuring some element of surprise or deeper insight will strengthen the paper.

* **Pitfall 3: Overclaiming or trying to "replace" established processes.** When introducing AI into a domain with long-standing practices (like forecast verification), one must be careful not to oversell. We will explicitly state that **Zyra is a support tool, not a replacement for human experts or for rigorous statistical verification**. The goal is to accelerate and enrich the verification process, not to produce a final judgment on model quality. In the past, some automated verification proposals faced skepticism when they appeared to ignore the expertise of forecasters or were not thoroughly validated. We mitigate this by positioning Zyra as *hypothesis-generating*, requiring confirmation (as we indeed do in Stage 3). Our discussion will note that ultimate decisions (e.g., whether a model change is good or bad) would still be made by scientists, but Zyra can provide a strong assist by combing through vast amounts of data quickly. By being humble in our claims, we gain credibility: rather than saying "Zyra *verifies* the model," we say "Zyra *identifies candidates* for further verification." Reviewers will appreciate this realistic framing. Additionally, we will compare Zyra's flagged issues with traditional verification metrics where possible (to show that it is consistent with them or catches something missed by standard scores). This demonstrates that we are not ignoring existing methods but building upon them.

In summary, these lessons instruct us to integrate visualization deeply (not superficially), aim for at least some novel findings, and present our system as a complementary approach rather than a wholesale replacement of current practice. A WMO verification guide reminds us that *even the best automated methods should be used alongside human judgment and understanding* [^12] – we will echo that sentiment.

**5. Best Practices for AI-Led Academic Papers**

Writing a paper in which an AI is essentially the primary researcher requires careful adherence to emerging best practices, both for credibility and for meeting conference policies:

* **Clarity in AI vs. human contributions:** It must be unambiguous which aspects of the work were done by the AI agent and which by humans [^1]. This is not just a formal requirement but crucial for readers to trust the methodology. We will clearly document that Zyra (the AI) performed certain tasks (data analysis, plotting, initial hypothesis generation), and that humans were responsible for oversight, providing the research question, and validating results. This clarity will be reflected in the AI Contribution Disclosure checklist and in the narrative (e.g., "Zyra then did X…" versus "We then guided the analysis to Y…"). Being transparent here also addresses possible skepticism: if something looks too insightful, the reader should know whether it came from the AI or from our own input. The Agents4Science CFP explicitly expects this breakdown [^1].

* **Reproducibility and open science:** Since an AI is involved in producing the results, we need to doubly ensure everything is reproducible (to counter any notion of "black-box magic").
All code (including Zyra itself and any glue code for this experiment) will be made available as open source. We will provide links or a DOI to a repository in the paper (likely in the Reproducibility Statement). Every figure in the paper should be reproducible with the provided workflow and data – we will include or reference **provenance logs** that detail data sources and processing. This aligns with the conference's requirement to include a Reproducibility Statement [^1]. By using Zyra – which itself emphasizes reproducible workflows [^3] – we have a good foundation. In practice, we may release a *"reproducibility package"* containing the exact HRRR/RRFS data files used (or scripts to fetch them), the configuration for Zyra's pipeline, and the output it produced. This allows other researchers (or the reviewers) to rerun our AI-scientist experiment end to end. Such openness will lend credibility and is simply good scientific practice.

* **Transparency of the AI pipeline:** We must avoid a situation where the AI's reasoning is a mystery. Even though Zyra is not a hard-to-interpret learning algorithm (it is a deterministic pipeline we construct), we will document its decision logic. For example, if Zyra flags an anomaly, we will state the criterion (e.g., "a reflectivity error is considered significant if it exceeds 10 dBZ over at least 1,000 km² for more than 2 hours"). This way, the process is not a black box. Additionally, if we use any machine-learning component for pattern detection, we will use explainability tools to illustrate why that component flagged something (though at this point we plan mostly rule-based detection). The general principle is to treat Zyra's "thought process" like any algorithm in a methods section – fully specified. The conference's ethics code (following the NeurIPS guidelines [^11]) also emphasizes transparency [^1]. Being transparent also helps establish norms for AI scientists: we want to set a good example where anyone can scrutinize the AI's work as they would a human's.

* **Ethical and responsible AI framing:** Given the novelty of AI agents doing science, we need to address potential ethical issues. We will include a **Responsible AI Statement** that covers: authorship and credit (we give Zyra "credit" as first author in spirit, but we remain responsible for its actions); biases in analysis (the AI might focus on certain regions or events – is there any bias in the data we showed it?); and the risk of incorrect or misleading conclusions. One risk is that an AI might find a coincidental pattern (a spurious correlation) and propose it as meaningful; unchecked, this could mislead scientific conclusions. Our mitigation is the human validation step and the use of real statistical measures to verify any hypothesis. We will also mention the oversight we provided to Zyra – it did not run completely unsupervised in a high-stakes context. Another ethical aspect is **data usage and privacy**: this is not an issue here, since weather data is public and involves no personal data, but we will note that all data was open and that we follow NOAA data-usage policies. Lastly, we will discuss the broader impacts: if AI like Zyra were widely used, could it replace junior scientists in some tasks? How do we ensure it augments rather than displaces human expertise? These considerations show that we have thought about the societal context of AI-driven research, as required by the conference [^1].
* **Careful positioning of novelty:** We have to strike the right tone in claims about our contribution. It is important not to claim that we have "solved" forecast verification or that our AI is infallible. Instead, we will frame our novelty as **opening a new pathway** for how AI can participate in scientific discovery. The paper will highlight that the *process* is novel (an AI agent actively reasoning on weather data) even if the specific weather findings are preliminary. We anticipate reviewers from both the ML and meteorology sides: ML readers want to see that the application is non-trivial, and meteorology readers want to see that we respect the complexity of the problem. By citing the limitations of The AI Scientist and others (e.g., limited domains, need for human confirmation) [^6], we show awareness of the limits of our novelty. We will emphasize what is in fact novel, for example: *"to our knowledge, this is the first demonstration of an AI agent autonomously guiding the verification of an operational NWP model."* That is a strong statement if true, and we will check the literature to ensure it is accurate. Furthermore, we will state: *"we do not claim the AI finds every issue or replaces metrics, but it provides a novel complementary tool."* This honest assessment will make our claims credible. Essentially, we let the work speak for itself through the case study and avoid hyperbole; if anything, we under-promise and over-deliver.

By following these best practices, we aim for our paper not only to be accepted but also to serve as an exemplar in this nascent area of AI-driven science. The conference organizers themselves are looking to establish norms, and by being meticulous about contributions, reproducibility, transparency, ethics, and modesty in claims, we align with those emerging norms.

**6. Paper Structure and Deliverables**

Finally, we summarize the expected **structure of the paper** and the concrete **deliverables** we will produce on the way to submission.

**Proposed Paper Outline**

1. **Abstract:** A concise summary (~200 words) highlighting that an AI (Zyra) was used to autonomously evaluate a weather model, the key findings (e.g., "Zyra identified a timing bias in the HRRR model and validated it with observations"), and the significance (AI as a new tool for model verification). We will write this last, but it should be compelling to both AI and domain experts.

2. **Introduction:** Introduce the problem of forecast verification and the new opportunity of AI agents in science. Mention the Agents4Science context (AI first author) to set the stage. State our objectives and contributions plainly – e.g., *"This paper presents an autonomous visualization-driven verification agent for numerical weather prediction. To our knowledge, this is the first instance of an AI agent leading the discovery of model-performance insights in an operational forecast system."* Also outline the rest of the paper.

3. **Related Work:** Combine the literature review elements from Stage 1, with subsections or paragraphs on: (i) forecast verification methods, traditional and modern [^4]; (ii) prior uses of visualization in meteorology [^5]; (iii) AI in scientific discovery (The AI Scientist, etc.) [^6]; and (iv) AI/ML in NWP (to note what else AI is being used for and show that our approach is complementary) [^9]. We will ensure all relevant works are cited to show we build on solid ground.

4. **Methods (Zyra Workflow):** Describe Zyra and our extensions.
We will likely include **Figure 1** as a schematic of the workflow: boxes for "Data Acquisition," "Visualization & Analysis by Zyra," "Hypothesis Generation," "Validation," etc., with arrows showing the loop. In the text, we will detail each component and describe the specific case study setup (which model, variables, and dates). This section may also mention implementation details (e.g., computing environment, any custom code around Zyra).

5. **Case Study Results:** This is the heart of the paper, likely with 2–3 figures. **Figure 2** could be an example annotated visualization from Zyra – for instance, a map with a highlighted anomaly. We will describe what Zyra did: "In case X, Zyra produced the following visualization (Fig. 2) and automatically identified the red-circled region as anomalous because…". We will then present **Figure 3**, which might show the result of the hypothesis test (e.g., a time series or bar chart comparing forecast versus observed values in that region). The text will state what the hypothesis was and what the test showed (support or refutation). If we have more than one hypothesis, we might add a small table or a couple of paragraphs for each. We will also report any quantitative metrics computed (for example, "the traditional verification score Y for this case was Z, whereas Zyra's discovery relates to it in this way…" for context). The results section should convincingly demonstrate the AI agent's capability.

6. **Discussion:** Reflect on what the results mean. Did Zyra find something interesting? How might this be used operationally? What are the limitations (e.g., Zyra might miss issues that are not visually obvious, or it might need improvement to filter out noise)? We will tie back to the big picture: how does this illustrate AI as a scientific partner? We may note how much human effort was saved, or how this could scale to many cases (an AI does not get tired of looking at hundreds of maps). We will also address any surprises encountered (if the AI pointed out something we had not considered initially, that would be a serendipitous find worth mentioning).

7. **Conclusion:** Summarize contributions and perhaps include a visionary statement, e.g., *"This work demonstrates a first step toward AI agents that can autonomously scrutinize and improve scientific models. In the future, such agents could continuously monitor operational forecasts, suggesting improvements in real time."* Also thank (in a general way, due to anonymity) any supporting resources, and reiterate that code and data are available.

8. **Statements:** Following the main content, we will include the **Responsible AI Statement** (detailing societal impacts and ethical steps) and the **Reproducibility Statement** (what we have done to enable replication), as required [^1]. These are typically a paragraph each. They will be written in accordance with conference guidelines but tailored to our project (for instance, the Responsible AI Statement might note that the AI's actions were monitored and that the system will not be deployed without further testing).

9. **AI Contribution Disclosure Checklist:** In the camera-ready version this will likely appear as a checklist; for the submission, it may be part of the template. We will ensure it is filled out correctly – e.g., checkboxes for "Hypothesis generation: AI (X); Human (✓)," depending on what fits. This formalizes our earlier notes on the division of contributions.

10. **References:** A complete list of cited works. We have many to include (verification literature, the AI Scientist paper, etc.), which helps show thoroughness.
We will use the bib entries required by the template. We need to be careful if any reference could deanonymize us (e.g., if we cite a NOAA tech report we authored), but we will likely stick to publicly available papers and well-known references. The conference allows concurrent submission, so citing arXiv preprints is acceptable if needed.

**Deliverables Before Submission**

To achieve the above, we plan the following concrete outputs during the project:

* **Workflow diagram:** A schematic illustration of Zyra's pipeline (to be used as Figure 1). We will create this diagram (likely using a drawing tool) showing how data flows from model outputs → Zyra processing → visualization → hypothesis → validation → feedback. This can be prepared once our design is finalized (Stage 2 or 3).

* **Annotated visualization figures (at least 2):** We will generate at least two figures from the case study with annotations. For example, one figure might be a side-by-side of forecast versus observation with an area circled by Zyra indicating an anomaly. Another might be a graph or map highlighting an ensemble bias found by Zyra. We will make sure these figures are clear, with legible annotations or insets explaining what the AI found, and captions that each tell a mini story. These will be included in the paper and also serve as visual proof of concept for presentations.

* **Sample hypotheses proposed by Zyra:** We will document 1–2 example hypotheses that Zyra "proposed." This can be a short text output or description, e.g., "Hypothesis: The HRRR model's convective initiation is systematically early by ~1 hour in the Colorado Front Range." We might include these explicitly in the paper (possibly in a table or quoted in the text); a structured sketch of such a record appears after this list. Having concrete examples strengthens the reader's understanding of exactly what the AI contributed.

* **Validation results:** For each hypothesis, we will have a corresponding validation result (numeric or visual). Deliverables here could be a small table of error statistics, or a plot as mentioned above. Even if not all of them make it into the final paper, we will have them as supplementary material or to reference in the text. These results should demonstrate whether Zyra's suggestions hold water. For instance, if Zyra flagged a bias, the validation might compute the mean error over that region and time window and show a significant bias of magnitude X.

* **Drafted statements (Responsible AI, Reproducibility):** We will prepare these statements in advance, since they require careful wording. We already have an outline for them (as discussed in Stage 4 and Best Practices), and they depend more on methodology and ethical considerations than on final results, so we can draft them early. This ensures we do not rush them at the last minute.

* **Completed AI Contribution Disclosure checklist:** We will fill out the checklist as part of writing. As a deliverable, we consider it "completed" when we have a clear list of which tasks were AI versus human. This can be done once all stages are finished and we reflect on the workflow. It is essentially a summary, but we treat it seriously as a required piece of the submission.

All these deliverables feed directly into the paper. By mid-to-late September, as we compile the LaTeX submission, we will plug in the diagram, figures, hypothesis descriptions, validation findings, and written sections. A final internal review will ensure consistency (e.g., that the introduction promises what the results actually show, and that all references are cited).
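As a purely illustrative example of the "sample hypothesis" and validation deliverables, the record below shows one way such an item could be stored alongside its provenance. The field names, file paths, dates, and the MRMS observation source are our own placeholder assumptions, not Zyra's schema or actual results.

```python
import json

# Illustrative record tying a Zyra-proposed hypothesis to its evidence,
# flagging criterion, data sources, and validation outcome.
# All names and values below are hypothetical placeholders.
hypothesis_record = {
    "id": "H1",
    "statement": ("HRRR convective initiation is systematically early by "
                  "~1 hour in the Colorado Front Range."),
    "criterion": ("forecast-minus-observed reflectivity > 10 dBZ over "
                  ">= 1000 km^2 for >= 2 h"),
    "evidence_figures": ["figures/fig2_reflectivity_anomaly.png"],
    "data_sources": {
        "forecast": "HRRR run (date/cycle recorded at acquisition time, AWS Open Data)",
        "observations": "e.g., MRMS composite reflectivity for the same window",
    },
    "validation": {
        "measured_mean_bias_dbz": None,   # to be filled in by the Stage 3 check
        "verdict": "pending",
    },
}

with open("hypothesis_H1.json", "w") as fh:
    json.dump(hypothesis_record, fh, indent=2)
```

A record like this, emitted for every flagged feature, could double as the provenance log referenced in the Reproducibility Statement.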
**In conclusion**, this research plan provides a roadmap from conceptualization through execution to writing. By following the stages outlined and leveraging Zyra's capabilities, we aim to produce a compelling case study of an AI agent augmenting scientific discovery in weather prediction. The plan keeps us on track to meet the submission deadline and addresses both the innovative and the pragmatic aspects (literature grounding, reproducibility, ethical compliance) needed for a successful paper at Agents4Science 2025. We are excited to bring this to fruition and contribute to the emerging dialogue on AI as a true partner in science.

## References

[^1]: Call for Papers – Open Conference of AI Agents for Science 2025. https://agents4science.stanford.edu/call-for-papers.html
[^2]: Stanford-Initiated First Academic Conference for AI Authors: First Author Must Be an AI. https://eu.36kr.com/en/p/3375685195848200
[^3]: Home – NOAA-GSL/zyra GitHub Wiki. https://github-wiki-see.page/m/NOAA-GSL/zyra/wiki
[^4]: WGNE Forecast Verification Methods. https://wgne.net/bluebook/uploads/2022/sections/BB_22_S10.pdf
[^5]: Lundblad et al. (2011). Exploratory Visualization for Weather Data Verification. https://www.researchgate.net/publication/224255922_Exploratory_Visualization_for_Weather_Data_Verification
[^6]: Lu et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. https://arxiv.org/abs/2408.06292
[^7]: Martin Vetterli on AI Copernicus. https://twitter.com/MartinVetterli/status/1197798877379866625
[^8]: NOAA AWS Open Data Registry. https://registry.opendata.aws/collab/noaa/
[^9]: Generative Ensemble Deep Learning for Severe Weather Prediction. https://arxiv.org/html/2310.06045v2
[^10]: Interpretable ML for Weather and Climate Prediction. https://www.sciencedirect.com/science/article/abs/pii/S1352231024004722
[^11]: NeurIPS Ethics and Reviewer Guidelines. https://neurips.cc/Conferences/2025/ReviewerGuidelines
[^12]: WMO Forecast Verification Guide (CAWCR). https://www.cawcr.gov.au/projects/verification/