The ESS-DIVE data repository aims to preserve and provide long-term access to Environmental Systems Science (ESS) research data. Effective data preservation starts with data management planning, and continues through scientific data and metadata collection, processing, submission to the repository, publication, and more.
ESS-DIVE provides project data management resources that support long-term data preservation, including a variety of training opportunities, data standards/reporting formats, and other data submission guidelines. The ESS-DIVE review process then ensures that dataset metadata is complete and enables others to find, access, and interpret data included in the dataset.
The primary copy of published data is stored on production filesystems at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL). Datasets are then managed following best practices for systems administration at NERSC.
Open formats support future data access, even as software changes over time. ESS-DIVE therefore uses and encourages open data and metadata formats wherever possible. Metadata are managed in both the open Ecological Metadata Language (EML) and JSON-LD. We also encourage researchers to provide data in open formats, such as ASCII comma-separated values (CSV) for tabular data. The repository also gives external groups open access to published datasets through the DataONE REST API and the ESS-DIVE API.
ESS-DIVE will update dataset metadata (EML, JSON-LD) as standards evolve. Additionally, ESS-DIVE will maintain and update data/metadata reporting formats as necessary. We will provide guidance on updating file formats when newer reporting format versions become available.
All public datasets in ESS-DIVE are shared openly under one of two data usage licenses: the Creative Commons Attribution 4.0 International license or the CC0 1.0 Universal Public Domain Dedication. Metadata will always be available under the Creative Commons public domain dedication.
ESS-DIVE has worked with the community to develop a number of open data standards and reporting formats suitable for storing Earth and environmental science data. Using community reporting formats helps make data more open, reusable, and readily integrated across datasets and projects to gain new insights and advance scientific outcomes.
Data submitters can edit published datasets at any point. Changes to the data files require approval by the ESS-DIVE publication team before they are made public. Any update to a dataset with a DOI produces a new internal ESS-DIVE identifier for the new version, while the dataset retains the same DOI. The citation provided with a download includes an access date, which can be traced back to the specific version of the data that was downloaded.
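The relationship between a stable DOI, per-version internal identifiers, and the access date in a citation can be illustrated with a minimal sketch. This is not ESS-DIVE's implementation: the `DatasetVersion` class, the `version_for_access_date` function, and the idea of matching on publication date are assumptions made for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class DatasetVersion:
    """One version of a dataset. All field names here are hypothetical."""
    internal_id: str      # changes with every update
    published_on: date    # day this version became public


def version_for_access_date(
    versions: Sequence[DatasetVersion], accessed_on: date
) -> Optional[DatasetVersion]:
    """Return the latest version published on or before the access date.

    Assumption: a version published on the access date itself counts as
    the one that was downloaded. Returns None if nothing was public yet.
    """
    ordered = sorted(versions, key=lambda v: v.published_on)
    publish_dates = [v.published_on for v in ordered]
    index = bisect_right(publish_dates, accessed_on)
    return ordered[index - 1] if index > 0 else None


# Hypothetical history of one dataset whose DOI never changes.
history = [
    DatasetVersion("example-internal-id-v1", date(2021, 3, 1)),
    DatasetVersion("example-internal-id-v2", date(2022, 6, 15)),
]
cited = version_for_access_date(history, date(2021, 12, 31))
```

Here a citation with an access date of 2021-12-31 resolves to the first version, because the second version was not yet public on that day.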
Redundancy and Replication
Redundancy is a fundamental feature of any data repository, helping to ensure materials are continuously available to researchers. ESS-DIVE is hosted at NERSC at LBNL. The DataONE network provides redundancy via replication to network nodes in other geographic regions and automatic self-healing capabilities. DataONE replication enables users to discover and download datasets if the main repository site is unavailable. The ESS-DIVE team also maintains two standby copies of ESS-DIVE: one in the LBNL Information Technology Division data center and the other at the National Center for Ecological Analysis and Synthesis (NCEAS) in Santa Barbara, California, U.S. For short-duration outages of a few hours or less, users are redirected to one of the copies for download, search, and read access. For more extended outages, one of the standby services is made the primary.
A full archival copy is made nightly on the NERSC HPSS tape archive system. Replication is automated and occurs whenever any file in a public dataset changes. Replication ensures that data and metadata remain available even in the case of unplanned local system outages (such as a regional-scale fire or earthquake event), and provides higher-performance access to data from multiple replica sites.
Auditing ensures the integrity of ESS-DIVE data replicas over time. To verify replicas, we leverage the DataONE periodic auditing process, which computes checksum values of stored data files and compares them against values created at initial deposit. Berkeley and NCEAS staff regularly run and verify the auditing process on ESS-DIVE content.
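The checksum comparison described above can be sketched in a few lines of Python. This is an illustrative example only, not the DataONE auditing code: the function names, the manifest layout (file name mapped to the checksum recorded at deposit), and the choice of SHA-256 are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest by streaming the file in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_replica(replica_dir, manifest, algorithm="sha256"):
    """Return the names of files that are missing or fail the checksum test.

    `manifest` is a hypothetical mapping of file name to the checksum
    that was recorded when the file was first deposited.
    """
    failures = []
    for name, expected in manifest.items():
        path = Path(replica_dir) / name
        if not path.is_file() or file_checksum(path, algorithm) != expected:
            failures.append(name)
    return failures


# Small demonstration with a temporary directory standing in for a replica.
with tempfile.TemporaryDirectory() as replica:
    data_file = Path(replica) / "site_data.csv"
    data_file.write_bytes(b"site,temp_c\nA,12.5\n")
    manifest = {"site_data.csv": file_checksum(data_file)}  # "initial deposit"

    clean_result = audit_replica(replica, manifest)      # replica is intact

    data_file.write_bytes(b"site,temp_c\nA,99.9\n")      # simulate corruption
    corrupted_result = audit_replica(replica, manifest)  # file is flagged
```

An empty result means every file in the replica still matches its deposit-time checksum; any name in the list identifies a replica file that would need to be restored from another copy.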
It is extremely difficult to predict and sustain funding for single institutions in the long term. Our replication policy ensures that our data remains available during normal operations. As a member of the DataONE network, the ESS-DIVE repository has options to easily ensure continuity of access to data should support for the archive change. This means that all ESS-DIVE data may be replicated to a suitable DataONE member node as part of a wind-down protocol, ensuring that the data remains available.