DataONE Federation

2022-06-10

DataONE is a community driven project providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data.

DataONE comprises a distributed network of data centers, science networks or organizations. These organizations can expose their data within the DataONE network through the implementation of the DataONE Member Node service interface. In addition to scientific data, Member Nodes can provide computing resources, or services such as data replication, to the DataONE community.

DataONE Service Interface

Access to DataONE data is provided by the DataONE service interface (aka REST API). DataONE maintains a set of user oriented tools called the Investigator Toolkit (ITK) that use this service interface to search, download and upload data to DataONE Member Nodes.

The dataone R package is one of the tools in the ITK.

If more detailed information about the DataONE Federation services is required than is provided by the dataone R package help, then in-depth documentation for the DataONE web services can be found at:

Please see the DataONE Glossary for definitions of some terms used in this document that are used to describe DataONE services and architecture.

For additional overview information about the DataONE Federation, please https://www.dataone.org.

New Features in DataONE Version 2.0

The following features are available via the dataone package, for member nodes that have implemented the DataONE version 2.0 API.

1. Series Identifiers

Each data, metadata, and resource map object in DataONE has a unique identifier, referred to in DataONE documentation as a Persistent Identifier (PID). A PID is associated with one object in DataONE and always refers to the same object, the same set of bytes that are stored on the DataONE network.

A data or metadata object can be updated on a DataONE Member Node by using the R method dataone::updateObject(), so that a new version of the object becomes the active version that is discoverable in searches of DataONE. The older version is still available if the PID is known, but this version will not show up in DataONE searches. In order to download the new version, the new PID must be discovered and specified when the object is downloaded.

With dataone Version 2.0, an additional, optional identifier can be associated with an object, the Series Identifier (SID). Using SIDs the most current version of an object can be obtained, without the need to determine the PID of the latest version. For example, if the SID is specified when the object is downloaded, the most recent version of that object will be downloaded.

2. DataONE User Authentication Using Tokens

Uploading data to DataONE requires that a DataONE user identity be provided. In DataONE Version 1.x, the identity of a user was provided by an X.509 client certificate. DataONE Version 2.0 adds an additional method to provide identity information - an authentication token. An authentication token is an encrypted text string that is provided by DataONE that can be sent to DataONE services to prove a user’s identity.

Authentication tokens can be used with Member Nodes that have been upgraded to the DataONE Version 2.0 Member Node API

The process of providing user identity information either via an X.509 certificate or via an authentication token is referred to as authentication.

Authentication tokens can be obtained from a user’s DataONE account settings web page. To create an authentication token:

  options(dataone_token = "eyJhbGciOiJSUzI1NiJ9...")

(For this example, the typically very long token value has been truncated for brevity, however the entire string must be used for the token to be valid.)

If the authentication token is defined as shown above, it will automatically be used when using methods from the dataone R client.

The authentication token must be safegaurded like a password. Anyone with access to it can access content in DataONE as the user identity contained in the token. Care should be taken to not add this code to any published scripts or documents. This code will expire after a certain time period after which it will be necessary to obtain a new one.

Note that the token shown above is for use in the DataONE production environment. If it is necessary to use authentication in a DataONE test environment, then the steps shown above should be followed by navigating to the following web page to generate the token: https://search.test.dataone.org. If you generate a token from this web page, then the R option name used should be “dataone_test_token”, not “dataone_token”, for example:

options(dataone_test_token = "eyJhbGciOiJSUzI1NiJ9...")

(For this example, the typically very long token value has been truncated for brevity, however the entire string must be used for the token to be valid.)

For an explanation of DataONE environments, please see DataONE environments.

Detailed, technical information about user identities and authentication in DataONE can be viewed at

DataONE Authentication

You can check your token by entering the following R commands:

libary(dataone)
am <- AuthenticationManager()
getTokenInfo(am)

The most important column of the information returned by getTokenInfo() is the expired column. If a token is expired, then it will not provide authentication and must be re-created from the user’s profile page as shown above.

3. Ability To Update System Metadata

Metadata is maintained by DataONE for each object that has been uploaded to it. This SystemMetadata for an object contains information such as the access policy that determines the users that can read or update the data, the data’s format type, how many replicated copies of the data to create, etc.

Member Nodes that have been upgraded to DataONE Version 2.0 Member Node API now the have the ability to update the system metadata of a data object without having to update (replace) the data object itself. So for example, an object can be uploaded to DataONE without having ‘public’ read enabled (the data creator or rightsholder and possibly a specified list of users could have access however). At a later date, the system metadata. could be updated to allow public read.

See help("updateSystemMetadata") for more information.

DataONE Environments

DataONE nodes are separated into several networks or environments. Each environment provides an isolated installations of the DataONE services, such that nodes in one environment do not communicate with nodes in another.

Currently DataONE uses the following environments: production, staging, staging2, sandbox, sandbox2, dev and dev2. All environments except production are test environments that may have different version of software than the released version of the DataONE infrastructure, and may experience service outages as required by the DataONE development team.

Most users will only need to use the production and staging2 environments.

The production environment is the current working release of the DataONE infrastructure and is the environment that supports the operations necessary to fully implement the DataONE system. There is only one production environment.

The staging environment provides an installation of DataONE infrastructure that is a copy of the production environment. It is used to prepare for a new release of the infrastructure by testing the upgrade and software replacement procedures.

The staging2 is a copy of the production environment and can be used by DataONE Member Nodes that are preparing to join the production environment. Staging2 usually has the released version of the DataONE infrastructure, and is more stable that the other test environments.

The sandbox and sandbox2 environments offer more stable environments than the dev environments, and is intended to provide a more stable system where new features or alternative implementations may be evaluated within an environment that is close to a particular release of the DataONE infrastructure.

The dev and dev2 environments are intended for use by DataONE developers and are unstable, with software upgrades and service outages occurring as needed by the development staff.

The dataone R package uses the following abbreviations for the DataONE environments:

DataONE environment dataone R package abbreviation
production PROD
stage STAGING
stage 2 STAGING2
sandbox SANDBOX
sandbox2 SANDBOX2
dev DEV
dev2 DEV2

Each DataONE environment has a CN that maintains a registry of all MNs in that network. The CN can be queried for a list of all MNs in a network, for example, to see the MNs that are currently in the production environment:

cn <- CNode("PROD")
unlist(lapply(listNodes(cn), function(x) x@name))

To see the URL service endpoint for a CN, you can view the endpoint slot:

cn@endpoint

To see the member nodes that are currently in the STAGE2 environment:

cn <- CNode("STAGING2")
unlist(lapply(listNodes(cn), function(x) x@name))