The ESS-DIVE Team is looking forward to participating in the 2022 AGU Fall Meeting. Below are several abstracts that we will be presenting at the meeting!
BASIN-3D: Data Synthesis Software for Earth Science Researchers (H12P-0874)
Presenter: Danielle Christianson
Presentation Type: Poster
Session Date and Time: Monday, 12 December 2022; 9:00 AM – 12:30 PM CST / 7:00 AM – 10:30 AM PST
Session Number and Title: H12P: Hydroinformatics and Data Science: Pathways to Support Reproducible Watershed Modeling II Poster
Session URL: https://agu.confex.com/agu/fm22/webprogrampreliminary/Paper1153073.html
Abstract
Danielle S Christianson, Valerie C Hendrix, Catherine Wong, Charuleka Varadharajan, and Deb Agarwal, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
Integration of the diverse data required for Earth science research remains a time-consuming task. While approaches such as data warehousing and federation (brokering) have reduced effort, the difficult work of harmonizing formats, units, and semantics remains for the individual researcher. Often the resulting one-off data synthesis products become outdated as data change. In addition, researchers lack the flexibility to tailor a synthesis to the specific data they need.
In this presentation, we describe BASIN-3D (Broker for Assimilation, Synthesis and Integration of eNvironmental Diverse, Distributed Datasets), a tool we developed to address this gap. BASIN-3D synthesizes diverse data from a variety of remote and/or local data sources in real time without the need for additional storage. The software can be deployed as a Python package in Python-based applications like Jupyter notebooks or as a web service application. Data sources can be configured using a plugin approach that maps their data model and vocabularies to BASIN-3D’s. These mappings enable BASIN-3D to handle translation of query parameters and results into a standardized data model. BASIN-3D then outputs the synthesized results in common data formats (e.g., HDF5, pandas DataFrame, JSON) specified by the researcher.
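The plugin idea described above can be illustrated with a minimal sketch. It does not use BASIN-3D's actual API: the class names, the `water_temperature_degC` vocabulary term, the two data sources, and the `fake_fetch` helper are all hypothetical and exist only to show how per-source mappings could translate a query outward and results back into one standardized form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical mapping from one source variable to a shared vocabulary term."""
    standard_name: str  # variable name in the shared vocabulary
    scale: float        # multiply source values by this ...
    offset: float       # ... then add this to reach the standard unit


class SourcePlugin:
    """Maps one data source's variable names and units to the shared vocabulary."""

    def __init__(self, source_id, mappings):
        self.source_id = source_id
        self.mappings = mappings  # source variable name -> Mapping

    def translate_query(self, standard_names):
        """Translate standardized variable names into the source's own names."""
        reverse = {m.standard_name: src for src, m in self.mappings.items()}
        return [reverse[name] for name in standard_names if name in reverse]

    def translate_results(self, records):
        """Convert (source variable, value) pairs into standardized records."""
        out = []
        for variable, value in records:
            m = self.mappings[variable]
            out.append({
                "source": self.source_id,
                "variable": m.standard_name,
                "value": value * m.scale + m.offset,
            })
        return out


def synthesize(plugins, standard_names, fetch):
    """Query every source in its own terms and merge the standardized results."""
    results = []
    for plugin in plugins:
        source_vars = plugin.translate_query(standard_names)
        raw = fetch(plugin.source_id, source_vars)
        results.extend(plugin.translate_results(raw))
    return results


# Two made-up sources reporting water temperature with different names and units.
source_a = SourcePlugin(
    "source_a", {"temp_f": Mapping("water_temperature_degC", 5 / 9, -160 / 9)}
)
source_b = SourcePlugin(
    "source_b", {"WT": Mapping("water_temperature_degC", 1.0, 0.0)}
)

# Stand-in for remote requests; a real broker would call each source's service.
FAKE_DATA = {"source_a": [("temp_f", 50.0)], "source_b": [("WT", 12.5)]}


def fake_fetch(source_id, source_vars):
    return [(v, x) for v, x in FAKE_DATA[source_id] if v in source_vars]


combined = synthesize([source_a, source_b], ["water_temperature_degC"], fake_fetch)
for record in combined:
    print(record)
```

In this sketch nothing is stored between calls: each request is translated, sent to the sources, and translated back, which is the general shape of a brokering approach.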
We describe the unique features of BASIN-3D using two prototype U.S. Department of Energy use cases. The first is a Python-based application that combines USGS National Water Information System (NWIS) and Daymet time series data to support the study of the effects of disturbances on river water quality with machine learning (ML) techniques. The second is a Django web-based application that integrates time series data from NWIS and project-based databases for access via an online data portal. Finally, we discuss future work and ongoing challenges such as optimization for large data volumes, adding new data sources, and tradeoffs between generalization and customizability.
Community-developed (meta)data reporting formats to enable data reuse in environmental repositories (IN41A-06)
Presenter: Charuleka Varadharajan
Presentation Type: Online Poster Discussion
Session Date and Time: Thursday, 15 December 2022; 8:40 AM – 8:48 AM CST / 6:40 AM – 6:48 AM PST
Session Number and Title: IN41A: Adopting Trustworthy Data Repository Stewardship to Enable Reuse of Data Across Disciplines I Online Poster Discussion
Session URL: https://agu.confex.com/agu/fm22/webprogrampreliminary/Paper1163669.html
Abstract
Charuleka Varadharajan1, Robert Crystal-Ornelas1, Dylan O’Ryan2, Benjamin Bond-Lamberty3, Kathleen Beilsmith4, Kristin Boye5, Madison Burrus1, Shreyas Cholia1, Danielle S Christianson1, Michael Cameron Crow6, Joan E Damerow1, Kim S Ely7, Amy E Goldman8, Susan L Heinz9, Valerie C Hendrix1, Zarine Kakalia1, Kayla Cerise Mathes10, Fianna O’Brien1, Stephanie C Pennington11,12, Emily Robles1, Alistair Rogers7, Maegen Simmonds1,13, Terri Velliquette6, Pamela Weisenhorn14, Jessica Nicole Welch6, Karen Whitenack1 and Deb Agarwal1, (1) Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (2) Lawrence Berkeley National Lab, Berkeley, CA, United States, (3) Pacific Northwest National Laboratory, College Park, MD, United States, (4) Argonne National Laboratory, Chicago, United States, (5) SLAC National Accelerator Laboratory, Stanford Synchrotron Radiation Lightsource, Menlo Park, CA, United States, (6) Oak Ridge National Laboratory, Oak Ridge, TN, United States, (7) Brookhaven National Laboratory, Environmental and Climate Sciences Department, Upton, NY, United States, (8) Pacific Northwest National Laboratory, Biological Sciences, Richland, WA, United States, (9) Oak Ridge National Laboratory, Oak Ridge, United States, (10) Virginia Commonwealth University, Integrative Life Sciences, Richmond, VA, United States, (11) Pacific Northwest National Laboratory, College Park, United States, (12) Joint Global Change Research Institute, College Park, United States, (13) Pivot Bio, Berkeley, United States, (14) Argonne National Laboratory, Argonne, IL, United States
Findable, Accessible, Interoperable, and Reusable (FAIR) principles are intended to enable the reuse of Earth and environmental science data beyond the purpose for which the data were originally collected. One pathway to making data more reusable is for repositories to encourage contributors to organize and publish data that follow established standards and guidelines. However, Earth science data are diverse and multidisciplinary, making it difficult for researchers to determine and use the appropriate standards or formats that apply to their data.
Here, we present 11 reporting formats (instructions, templates, and tools for consistently formatting data) for a diverse set of Earth science (meta)data. These formats were developed through a partnership between the U.S. Department of Energy’s Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) repository and researchers from its science community. They cover a broad range of Earth science (meta)data that includes cross-domain metadata (dataset metadata, location metadata, sample metadata), file-formatting guidelines (file-level metadata, CSV files, terrestrial model data), and domain-specific formats for biological, geochemical, and hydrological data types (amplicon abundance tables, leaf-level gas exchange, soil respiration, water and sediment chemistry, sensor-based hydrologic measurements). We adopted a community consensus process to develop these formats by obtaining extensive input from 247 researchers across 128 institutions. This resulted in a pragmatic set of reporting formats that are based on scientific use cases. We also describe lessons learned from this process and guidelines that communities can use to create new reporting formats that are tailored to their scientific workflows. Such community-developed reporting formats lend themselves to easy adoption, enabling scientific data synthesis and knowledge discovery by making it easier for data contributors to provide (meta)data that are more FAIR.
A FAIR Guided and Community-Oriented Approach to Improving Metadata Quality in a Large Scale Data Repository (IN41A-03)
Presenter: Emily Robles
Presentation Type: Online Poster Discussion
Session Date and Time: Thursday, 15 December 2022; 8:16 AM – 8:24 AM CST / 6:16 AM – 6:24 AM PST
Session Number and Title: IN41A: Adopting Trustworthy Data Repository Stewardship to Enable Reuse of Data Across Disciplines I Online Poster Discussion
Session URL: https://agu.confex.com/agu/fm22/webprogrampreliminary/Paper1199893.html
Abstract
Emily Robles1, Charuleka Varadharajan1, Madison Burrus1, Shreyas Cholia1, Robert Crystal-Ornelas1, Joan E Damerow1, Hesham Elbashandy1, Valerie C Hendrix1, Christopher S. Jones2, Matthew B. Jones3, Zarine Kakalia1, Mario Melara1, Fianna O’Brien1, Peter Slaughter2, Karen Whitenack1 and Deb Agarwal1, (1) Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (2) National Center for Ecological Analysis and Synthesis, Santa Barbara, CA, United States, (3) National Center for Ecological Analysis and Synthesis, DataONE, Santa Barbara, CA, United States
The Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository stores diverse Earth and environmental science data generated by projects funded by the U.S. Department of Energy (DOE). ESS-DIVE strives to publish datasets that adhere to the FAIR principles, which state that data should be findable, accessible, interoperable, and reusable, and to develop practices that enable data providers to improve the quality of data submissions. However, these strategies must also be able to scale with the growth of the repository, apply to a wide range of data types, and be valuable to both data providers and users.
To address these challenges, we have developed metadata requirements for the ESS-DIVE repository through research into dataset best practices, review of journal metadata requirements, and community feedback to ensure that they are useful and applicable across data types. We then developed a two-part, semi-automated dataset review workflow that programmatically verifies whether datasets meet metadata quality requirements before publication. The automated component includes FAIR metadata checks developed by the National Center for Ecological Analysis and Synthesis (NCEAS) that were customized to fit ESS-DIVE’s publishing requirements. Results from the automated checks are compiled into a Metadata Assessment Report, providing instant feedback to data providers that identifies where and how their metadata can be improved. ESS-DIVE reviewers then carry out a manual, content-focused metadata review based on FAIR principles. Finally, reviewers send revision requests and collaborate 1:1 with data providers until their dataset is eligible for publication.
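The automated half of such a workflow can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three checks, their thresholds, and the report layout are hypothetical and are not the NCEAS checks or ESS-DIVE's actual publishing requirements.

```python
import re

# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


# Each check returns (passed, advice). Thresholds are invented for illustration.
def check_title(metadata):
    words = metadata.get("title", "").split()
    return len(words) >= 5, "Use a descriptive title of at least 5 words."


def check_abstract(metadata):
    abstract = metadata.get("abstract", "")
    return len(abstract) >= 100, "Provide an abstract of at least 100 characters."


def check_contact(metadata):
    email = metadata.get("contact_email", "")
    return bool(EMAIL_PATTERN.match(email)), "Provide a valid contact email."


CHECKS = [
    ("title", check_title),
    ("abstract", check_abstract),
    ("contact", check_contact),
]


def assess(metadata):
    """Run every check and compile the outcomes into a simple report."""
    report = []
    for name, check in CHECKS:
        passed, advice = check(metadata)
        report.append(
            {"check": name, "passed": passed, "advice": None if passed else advice}
        )
    return report


def ready_for_manual_review(report):
    """A dataset moves on to the manual review only when all checks pass."""
    return all(item["passed"] for item in report)


draft = {"title": "Soil data", "abstract": "x" * 120, "contact_email": "not-an-email"}
report = assess(draft)
for item in report:
    status = "PASS" if item["passed"] else "FAIL: " + item["advice"]
    print(item["check"], "->", status)
```

Because every failed check carries its own advice, the data provider gets feedback immediately, and reviewers can spend their time on the content-focused review.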
Since implementation, 401 datasets have been reviewed using the semi-automated dataset review workflow. We have found that incorporating automated metadata validation has reduced review time, allowing the publication workflow to scale as the repository grows and freeing up time for reviewers to interact 1:1 with data providers to improve their publication practices. Finally, by tracking all review results, we are able to make transparent, data-based recommendations to our community and continue to improve automation where possible.
Enabling Proper Citation of Individual Objects Across Large Collections of Datasets (IN42B-0338)
Presenter: Deb Agarwal
Presentation Type: Poster
Session Date and Time: Thursday, 15 December 2022; 9:00 AM – 12:30 PM CST / 7:00 AM – 10:30 AM PST
Session Number and Title: IN42B: Adopting Trustworthy Data Repository Stewardship to Enable Reuse of Data Across Disciplines II Poster
Session URL: https://agu.confex.com/agu/fm22/webprogrampreliminary/Paper1188622.html
Abstract
Deb Agarwal1, Martina Stockhause2, Lesley A Wyborn3, Justin James Henry Buck4, James Ayliffe4 and Shelley Stall5, (1) Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (2) German Climate Computing Centre (DKRZ), Hamburg, Germany, (3) Australian National University, Canberra, ACT, Australia, (4) National Oceanography Center, BODC, Liverpool, United Kingdom, (5) American Geophysical Union, Data Leadership, Washington, DC, United States
The recent emphasis on publishing datasets in repositories and making them available has, as hoped, led to many publications and other outputs that combine large numbers of individual datasets. However, this leads to the problem of providing a proper citation for each individual dataset used. We have built an international community of practice involving stakeholders from across the Earth science publication spectrum, from data generators to data repositories to journal publishers, to develop a means of enabling large numbers of individual data components to be cited within a publication. This work was launched through information sessions at AGU 2020, and the community has since been working to develop use cases, explore potential solutions, and understand constraints. We currently have three exemplar use cases: the IPCC report graph attributions, the delivery of British Oceanographic Data Centre data collections to users, and the use of the AmeriFlux/FLUXNET individual datasets as a group. Each of these use cases illuminates a different set of challenges and needs. The temporary name we have given the solution is a reliquary. We have developed ‘cocktail napkin’ examples of the reliquaries that would be needed for each use case. We have also begun work with the existing dataset collection solutions to prototype these reliquaries. An RDA working group on complex data citations is also being proposed to bring together the community and develop best practice guidelines that are acceptable to researchers, publishers, PID infrastructure providers, and repositories and will enable tracing, citation and credit to be given to those who developed/funded each of the individual datasets/data objects within an individual reliquary. This important capability in the data publishing infrastructure is a key element of enabling dataset reusability and trust in the system.
In this talk, we will describe our current status and the next steps in addressing complex data citations. This talk will provide an opportunity to learn about and to help to further enumerate the use cases for complex data citations as well as identify the best practices.
Sample tracking and synthesis needs for exploring ecosystem response to climate and environmental disturbance (IN55A-03)
Presenter: Joan Damerow
Presentation Type: Oral
Session Date and Time: Friday, 16 December 2022; 3:06 PM – 3:14 PM CST / 1:06 PM – 1:14 PM PST
Session Number and Title: IN55A: Global Community Efforts to Make Samples, Specimens, and Sampling Features (As Well As Digital Information About Them) Comply with the FAIR and CARE Principles III Oral
Session URL: https://agu.confex.com/agu/fm22/webprogrampreliminary/Paper1121127.html
Abstract
Joan E Damerow1, Elisha M Wood-Charlson1, Charuleka Varadharajan1, Mikayla Borton2, Eoin Brodie3, Richard S. Canon4, Shreyas Cholia1, Paramvir Dehal5, Zachary Crockett6, Emiley Eloe-Fadrosh7, Ricardo J Eloy Alves8, Kjiersten Fagnan7, Amy E Goldman9, David Hays10, Valerie C Hendrix1, Lee Ann McCue11, Nancy Shiao-Lynn Merino12, Marka Miller4, Chris Mungall13, Supratim Mukherjee10, T.B.K. Reddy10, Patrick Sorensen3, Montana L Smith14, James Stegen15, Pajau Vangay4, Pamela Weisenhorn16, Steven Wilson17 and Deb Agarwal1, (1) Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (2) Colorado State University, Soil and Crop Sciences, Fort Collins, United States, (3) Lawrence Berkeley National Laboratory, Earth and Environmental Sciences Area, Berkeley, CA, United States, (4) Lawrence Berkeley National Laboratory, Berkeley, United States, (5) Lawrence Berkeley National Lab, Berkeley, United States, (6) Oak Ridge National Laboratory, Oak Ridge, United States, (7) Joint Genome Institute, Walnut Creek, CA, United States, (8) Lawrence Berkeley National Laboratory, Climate and Ecosystem Sciences Division, Berkeley, CA, United States, (9) Pacific Northwest National Laboratory, Biological Sciences, Richland, WA, United States, (10) Joint Genome Institute, Berkeley, United States, (11) Pacific Northwest National Lab, Richland, United States, (12) Lawrence Livermore National Laboratory, Livermore, United States, (13) Lawrence Berkeley National Laboratory, Environmental Genomics and Systems Biology, Berkeley, United States, (14) Pacific Northwest National Laboratory, Environmental Molecular Sciences Laboratory, Richland, United States, (15) Pacific Northwest National Laboratory, Richland, United States, (16) Argonne National Laboratory, Argonne, IL, United States, (17)Joint Genome Institute, Berkeley, CA, United States
There is a growing need to understand ecosystem responses to warming, disturbances such as extreme events, and anthropogenic activities. A common workflow for such research is to collect a physical sample (like soil or water), and then send it out for a variety of physical, chemical, and biological analyses. However, the samples are often analyzed in different labs, and the resulting data are stored in different repositories and databases, often with different identifier and metadata practices and requirements. A particular challenge many researchers face is integrating data generated by these multi-pronged analyses, particularly when working across disciplines such as microbiology, hydrology, atmospheric sciences, and geochemistry.
We investigated several case studies to determine a science-focused approach to link related biological (e.g., microbial data) and environmental data (e.g., soil/water properties) generated from analyses of samples across five online data systems that support the U.S. Department of Energy’s Biological and Environmental Research (BER) program. We used project data with assigned persistent identifiers and standard metadata for samples to link related data, as analyzed and published over a period of three years. To do this, we developed a common sample data model across the various stages of data collection and analysis represented across relevant data platforms. This included a focus on a common identifier schema for samples and associated data, validation and harmonization of varying metadata requirements across platforms, and the preservation of data citation and license information.
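As a rough illustration of why a shared sample identifier matters, the sketch below links records from two data systems by a common key. Everything in it is hypothetical: the `SAMPLE-0001` style identifiers, the field names, and the two example systems are invented and do not represent the project's actual sample data model or any real persistent identifier scheme.

```python
from collections import defaultdict


def link_by_sample(*record_sets):
    """Group records from several systems under their shared sample identifier."""
    linked = defaultdict(dict)
    for system_name, records in record_sets:
        for rec in records:
            # Harmonize identifier formatting before matching across systems.
            sample_id = rec["sample_id"].strip().upper()
            linked[sample_id][system_name] = {
                key: value for key, value in rec.items() if key != "sample_id"
            }
    return dict(linked)


# Invented records from two hypothetical data systems.
microbial = [{"sample_id": "sample-0001", "taxon_count": 120}]
environmental = [
    {"sample_id": "SAMPLE-0001 ", "ph": 6.8},
    {"sample_id": "SAMPLE-0002", "ph": 7.1},
]

linked = link_by_sample(("microbial", microbial), ("environmental", environmental))

# Samples that have data in both systems are candidates for joint analysis.
complete = {sid: data for sid, data in linked.items() if len(data) == 2}
print(complete)
```

Without the harmonization step, `sample-0001` and `SAMPLE-0001 ` would be treated as different samples, which is the kind of mismatch a common identifier schema is meant to prevent.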
In the process, we also engaged with national and international organizations, including the Genomic Standards Consortium (GSC), National Center for Biotechnology Information (NCBI), International General Sample Number (IGSN), DataCite, Research Data Alliance (RDA), and Earth Science Information Partners (ESIP) in an effort to coordinate approaches to this challenge.
We present conclusions from interviews with data contributors and users to understand scientific needs for sample tracking and synthesis. In addition, we present results from a review of studies that integrate microbial and environmental data to determine ecosystem responses to climate and other environmental disturbances. We are moving beyond isolated infrastructure for individual data types, towards connected infrastructure that allows sample tracking and data synthesis across multiple data types, institutions, and online data systems. This work has the potential to advance the interdisciplinary study of complex ecosystems and changes over time.