Come meet the ESS-DIVE team at AGU 2019 in San Francisco! We are available during our “Meet the scientist” time slot, 3:30-5 pm on Tuesday, Dec 10th at the Berkeley Lab Booth in the Exhibits Hall.
We will also be giving several oral and poster presentations, listed below.
Designing the ESS-DIVE Data Repository to be Trusted by the Community and FAIR (IN14B-16)
Presenter: Deb Agarwal
Presentation Type: eLightning
Session Date and Time: Monday, 9 December 2019; 16:00 – 18:00
Session Number and Title: IN14B: Best Practices and Realities of Research Data Repositories eLightning
Location: Moscone South; eLightning Theater III
Abstract
The US Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository is still in its early implementation and growth phase. The focus of the repository has been on three areas of development: data access capabilities, standardization of data, and services to support projects providing data to the repository. Our approach is built on user experience methods and relies on extensive discussion with, and participation of, the community in the design and development of capabilities. The priorities of the repository are continually revised and refined based on input from the community. We are following the developments of CoreTrustSeal and FAIR principles for data, and they are targets we hope to achieve in the future.
Our primary near-term goal is to build a repository that is trusted by the community and that is the preferred storage facility for data generated by the DOE Environmental System Science program. We continually strive to ensure our data are easily findable, accessible, interoperable, and reusable (FAIR). Achieving this goal requires a partnership with the data providers to gather the necessary metadata and standardized data. One challenge is that FAIR principles are designed to address the needs of the data user, and largely ignore the needs of the data provider. In this talk we present our repository and our approach to working with data providers to move their data toward FAIR principles. We also discuss the challenges we see in incentivizing the data provider to care about some aspects of FAIR data. Our ESS-DIVE team includes members of project teams that store data in the ESS-DIVE repository, and these dual perspectives give us some insight into the motivations and needs of the data provider. The motivations and priorities of data providers do not always align with the needs of the repository.
Standardizing Metadata Quality Review for an Environmental Data Repository (IN14B-09)
Presenter: Zarine Kakalia
Presentation Type: eLightning
Session Date and Time: Monday, 9 December 2019; 16:00 – 18:00
Session Number and Title: IN14B: Best Practices and Realities of Research Data Repositories eLightning
Location: Moscone South; eLightning Theater III
Abstract
The Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a data repository developed to support earth and environmental science projects funded by the U.S. Department of Energy (DOE), and is part of the DataONE network. One of the challenges ESS-DIVE faces is ensuring that submitted data packages have the thorough metadata necessary to find and use the dataset.
Our goal is to ensure all data packages published on ESS-DIVE have high-quality metadata that meet FAIR data principles. However, extensive metadata quality reviews can involve significant staff time and resources. Therefore, we implement a combination of automated checks to catch issues upon submission, and a manual process for in-depth content reviews requiring domain knowledge. A majority of the automated checks were developed by DataONE and the Arctic Data Center and are designed to assess the findability, accessibility, interoperability and reusability (FAIR-ness) of datasets by checking for the presence of metadata fields and word counts. We are testing this suite of DataONE metadata checks as well as additional checks needed for our community. Automated checks reduce the time needed for manual reviews and provide instant feedback to users, thus expediting the publication process. To standardize the manual review process and provide consistent feedback to dataset authors, we use a checklist form with specific requirements for each metadata element. Completed forms for each dataset enable tracking the quality of datasets before and after review, and the amount of time taken on the review process.
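As a rough illustration of the kind of automated check described above (testing for the presence of metadata fields and for minimum word counts), a minimal Python sketch might look like the following. The field names, thresholds, and the `check_metadata` function are hypothetical and chosen only for this example; they are not the actual DataONE or ESS-DIVE check suite.

```python
# Illustrative sketch only: field names and thresholds are assumptions,
# not the actual DataONE / ESS-DIVE metadata checks.
REQUIRED_FIELDS = ["title", "abstract", "creators", "keywords"]  # hypothetical
MIN_WORDS = {"title": 7, "abstract": 100}  # hypothetical thresholds


def check_metadata(record):
    """Return a list of human-readable issues found in a metadata record."""
    issues = []
    # Presence check: every required field must exist and be non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            issues.append(f"missing field: {field}")
    # Word-count check: applied only to text fields that are present.
    for field, minimum in MIN_WORDS.items():
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            count = len(value.split())
            if count < minimum:
                issues.append(
                    f"{field} too short: {count} words, expected at least {minimum}"
                )
    return issues


example = {
    "title": "Soil moisture data",
    "abstract": "",
    "creators": ["A. Researcher"],
}
for issue in check_metadata(example):
    print(issue)
```

A report like this could be returned to the submitter immediately, leaving only content questions that need domain knowledge for the manual review.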
We have found that the combination of automated quality reports and specific guidance in the review process is an effective approach to improve metadata and reduce manual review time. In addition, data from the completed review forms will allow us to assess whether the automated checks have decreased the manual review time and improved metadata quality.
Addressing Paradigm Shifts and Competing Interests in an Open Science World (IN22C-18)
Presenter: Deb Agarwal
Presentation Type: eLightning
Session Date and Time: Tuesday, 10 December 2019; 10:20 – 12:20
Session Number and Title: IN22C: Open Knowledge Networks and Semantics for Geosciences: Successes and Challenges of Open Science eLightning
Location: Moscone South; eLightning Theater III
Abstract
Open science has the potential to democratize science access in unprecedented ways. However, the change to open science shifts the costs and benefits in subtle ways for the scientists involved. Understanding these shifts will allow us to better address the needs of all parties and increase the amount of open science available. In the case of open publications, there is a shift in the costs of publishing from the reader to the author. This can introduce costs of scholarly output that were not planned into a grant, and it changes the publishers’ business model. In the case of data, there has traditionally been little or no academic credit for data collection. Credit is acquired from authorship of publications using that data. In an open data world, the collector of the data shifts from having exclusive publishing rights to having non-exclusive publishing rights with the hope of more citations of the data. This change requires that the publication of a dataset has similar academic credit to the publication of a paper. These aspects are already understood and are hopefully on track to be solved.
In this talk, we describe 12 years of experience working with data users and providers to move towards open data in biogeosciences. We have addressed this challenge as the US Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository and as members of the data teams for many earth science projects including: AmeriFlux (ameriflux.lbl.gov), FLUXNET (fluxnet.org), NGEE-Tropics (ngee-tropics.lbl.gov/), and Watershed SFA (watershed.lbl.gov). Moving global, diverse communities to open science paradigms requires finding the right incentives and building trust with the community. We discuss the incentives we have used to encourage open data as well as the benefits and pitfalls encountered along the way. Often the motivations for, and barriers to, moving to open science data are not as straightforward as we initially thought and can vary across countries and projects.
Community use of persistent sample identifiers and metadata standards: supporting efficient data management in the field, laboratory, and online (IN32A-05)
Presenter: Joan Damerow
Presentation Type: Oral
Session Date and Time: Wednesday, 11 December 2019; 10:20 – 12:20
Presentation Length: 11:20 – 11:35
Session Number and Title: IN32A: Communities, Tools, and Policies That Enable Integration of Earth, Space, and Environmental Science Data and Cyberinfrastructures II: Tools and Policies
Location: Moscone West; 2018, L2
Abstract
Physical samples are foundational entities for research in earth and environmental sciences; they are not only the basis of individual studies but could also be integrated with other data to inform new and broader-scale questions. Data contributors to the Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) repository often work in large, interdisciplinary teams and send samples to multiple facilities for analyses. This community needs an efficient system for persistent sample identification and tracking that is suitable for the field, laboratory analyses, and online publication.
We are conducting a community pilot test on the use of persistent identifiers for physical samples, specifically International Geo Sample Numbers (IGSNs). Six projects with a variety of sample types are registering samples for IGSNs, standardizing sample collection metadata, and publishing their sample metadata in the System for Earth Sample Registration (SESAR) sample catalog and ESS-DIVE. The purpose of the test is to evaluate the experience of users and to decide on essential standardized metadata for our community. We gathered information for the pilot test through discussions with project teams and documented several components, such as the efficiency of the process (i.e., use of templates, labeling, registering samples, and updating metadata) and any apparent problems. We resolved uncertainties in the use of metadata fields and added standard terms as needed. Throughout the pilot test, we also gathered feedback on desired use cases, which include improvements in data management, advanced search capabilities, the ability to link identifiers, and the ability to integrate and reuse sample data.
The pilot test results will inform community-driven standards and tools for sample identifiers, tracking, and metadata in the ESS-DIVE repository. Our overall goal is to provide practical recommendations for efficient sample data management while also preserving and maximizing the potential value of samples into the future.
Utilizing Diverse Data in Scientific Analysis and Modeling for Water Resource Management (IN51A-01)
Presenter: Charuleka Varadharajan
Presentation Type: Oral
Session Date and Time: Friday, 13 December 2019; 08:00 – 10:00
Presentation Length: 08:00 – 08:15
Session Number and Title: IN51A: Data and Information Services for Interdisciplinary Research and Applications in Earth Science I
Location: Moscone West; 2018, L2
Abstract
The Earth’s water resources are being characterized at unprecedented resolutions due to the growth of sensor networks, remote sensing, and other observational tools. However, our ability to utilize ‘water big data’ for scientific analysis and modeling is still limited for many reasons. First, water data are complex and diverse, making it challenging to integrate and compare across data types. Second, it is difficult to discover and synthesize data across providers, as data and metadata are not provided using standardized formats. Third, real world data often need substantial quality checks and cleaning for scientific use. Finally, data preparation for both mechanistic and data-driven models is not trivial and involves gap filling, transformations, and conversion into formats that can be used by the models.
Here, we present technologies developed for curation, assessment, integration, visualization, and publication of water data for research funded by the U.S. Department of Energy (DOE). The Data Management Framework of the Watershed Function Scientific Focus Area (SFA) consists of cyberinfrastructure that includes (a) a queryable database to store diverse data, (b) scripts in Jupyter notebooks to QA/QC these data, (c) a broker, BASIN-3D, to integrate diverse, distributed, multiscale data into a unified view, and (d) search and access portals to enable data exploration through interactive tools and visualizations. ESS-DIVE is a data repository for DOE-funded environmental data, and is promoting the development of data/metadata standards in partnership with its community to ensure long-term data interoperability and reusability. Data from the SFA and other DOE watershed research efforts are publicly released through ESS-DIVE. Finally, we present our experience with using publicly-available water data from various providers in Colorado and California, and discuss challenges in using data as inputs to deep learning and mechanistic models.
Several federal and state efforts are now prioritizing open water data infrastructure. Future data systems need to enable seamless discovery and access of data from different providers in standardized formats. Also needed are scientific workflows that connect data to models. These advances are needed to provide timely predictions of water availability and quality to stakeholders.
A Community-Centered Approach to Managing Environmental Data in Repositories (IN51F-0690)
Presenter: Charuleka Varadharajan
Presentation Type: Poster
Session Date and Time: Friday, 13 December 2019; 08:00 – 12:20
Session Number and Title: IN51F: Data Integration: Enabling the Acceleration of Science Through Connectivity, Collaboration, and Convergent Science II Posters
Location: Moscone South, Poster Hall
Abstract
The Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a U.S. Department of Energy (DOE) data repository for DOE-sponsored research in earth and environmental sciences. Data stored in ESS-DIVE are highly diverse, spanning many science domains, and encompassing field, experimental, and modeling research across a variety of terrestrial and subsurface ecosystems. ESS-DIVE’s mission is to enable broad access to, and improve the usability of, data stored in the archive. A key objective is to encourage data providers to contribute well-structured, high-quality data, with an intent to enable data users to easily build processing, synthesis, and analysis capabilities for those data.
ESS-DIVE’s approach from its inception has been to partner with its scientific research community to make the process of submitting and using data easier and more rewarding. We engage our user community, which comprises individual DOE projects, DOE cyberinfrastructure groups, and data users, through a variety of means. These include face-to-face meetings during site visits to major data contributors, meetings of an advisory board consisting of the leads of major projects, monthly webinars and online surveys to seek feedback on new features or priorities, and tutorials to train users. We utilize established user-experience research methods to determine user needs and priorities, and have embedded environmental scientists in the ESS-DIVE team to provide domain expertise to guide infrastructure development. We also work with the community to identify, develop and adopt consistent data and metadata standards that are most likely to be suitable for, and used by, researchers submitting data to ESS-DIVE.
Here, we present the story of how ESS-DIVE has engaged its community and evolved to incorporate user needs and priorities, along with the lessons learned through this process. The community-centered approach has so far dramatically increased user interest in ESS-DIVE infrastructure and standards development. We believe this approach will maximize the value of ESS-DIVE datasets into the future to ultimately advance the scientific understanding and prediction of hydro-biogeochemical and ecosystem processes that occur from bedrock through soil and vegetation to the atmospheric interface.
Increasing Efficiency in Data Publication using Semi-Automated Workflow (IN51F-0705)
Presenter: Fianna O’Brien
Presentation Type: Poster
Session Date and Time: Friday, 13 December 2019; 08:00 – 12:20
Session Number and Title: IN51F: Data Integration: Enabling the Acceleration of Science Through Connectivity, Collaboration, and Convergent Science II Posters
Location: Moscone South, Poster Hall
Abstract
With the growing necessity for open access data, researchers are required to play the roles of both data provider and publisher. The Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data archive provides a publishing workflow to increase the accessibility of data produced by earth and environmental science projects funded by the Department of Energy. ESS-DIVE is presented with the challenge of providing a publication workflow that efficiently disseminates diverse datasets that meet FAIR standards.
ESS-DIVE provides quality assessment and data citation services to enhance dataset visibility for researchers. This multi-step, heretofore highly manual process requires considerable staff resources. ESS-DIVE has streamlined the ingest workflow by enabling communication between existing system components, integrating customer service desk software with the data archive. Publication team members can track and access data submissions, manage the documentation of the ingest process, and communicate with data providers through this centralized location. The use of automated metadata checks and the development of a manual review checklist have also dramatically improved the efficiency of the data publication process. Once a review has been satisfactorily completed, datasets are published on ESS-DIVE with a persistent and unique identifier.
While the speed of the publication process depends on the responsiveness of the data provider and the quality of the initial submission, the integration of a semi-automated workflow has dramatically improved not only the efficiency of our data publication process but also its consistency and reliability, bolstering impactful research efforts to address modern environmental challenges. We aim to continue to reduce the time and energy required of environmental scientists to contribute data to their field, and to offer review and support throughout the publishing process.