Datasets

From UMIACS
Revision as of 19:05, 7 February 2023 by Mbaney

UMIACS hosts a number of datasets, in read-only mode, on shared filesystems used by some of our SLURM computing clusters. Providing commonly used datasets in a well-defined location avoids duplicate copies elsewhere and so reduces overall storage usage.

Clusters

Requesting a new dataset

You can request a new dataset by contacting staff with a link to the dataset's official download location. Torrents or other peer-to-peer re-hosting are not allowed unless sanctioned by the dataset owners.

  • CML/Vulcan: If the uncompressed/final dataset size is over 100GB, staff will first contact the faculty approver for the cluster to confirm that they approve of using the storage space. If the size is under 100GB, no faculty approval is required. Staff will then check whether the dataset has any terms and conditions that must be agreed to before downloading.
  • Nexus: Please tell us which faculty member you are working with who requires the dataset you are requesting. Then, if the uncompressed/final dataset size is over 50GB, the Director of Computing Facilities must first approve of using the storage space. If the size is under 50GB, no approval is required. Staff will then check whether the dataset has any terms and conditions that must be agreed to before downloading.
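The size thresholds above can be sketched as a small lookup. This is only an illustration for estimating whether a request is likely to need storage approval, not a tool that staff use. The function name `storage_approver` and the lowercase cluster keys are hypothetical, and treating a dataset of exactly the threshold size as not needing approval is an assumption, since the policy only covers "over" and "under".

```python
from typing import Optional

# Size thresholds (GB) and approvers, taken from the policy text above.
THRESHOLD_GB = {"cml": 100, "vulcan": 100, "nexus": 50}
APPROVER = {
    "cml": "faculty approver for the cluster",
    "vulcan": "faculty approver for the cluster",
    "nexus": "Director of Computing Facilities",
}


def storage_approver(cluster: str, size_gb: float) -> Optional[str]:
    """Return who must approve the storage use, or None if no approval is needed.

    size_gb is the uncompressed/final dataset size in GB.
    """
    key = cluster.lower()
    if key not in THRESHOLD_GB:
        raise ValueError(f"unknown cluster: {cluster!r}")
    # Assumption: a size exactly at the threshold does not require approval.
    if size_gb > THRESHOLD_GB[key]:
        return APPROVER[key]
    return None


if __name__ == "__main__":
    for cluster, size in [("Nexus", 60), ("Nexus", 40), ("Vulcan", 150)]:
        print(cluster, size, storage_approver(cluster, size))
```

Storage approval is separate from the terms-and-conditions review, which applies regardless of dataset size.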

If there are no terms and conditions, staff will download and extract the dataset, copy it to the appropriate location for the cluster you requested it for, and let you know when it is available for use on that cluster.

If there are terms and conditions beyond those of well-defined Creative Commons licenses, staff will first need to have them approved by UMD's Office of Research Administration (ORA). Because we will host the dataset in a location accessible to all users of a cluster, not all of whom will have individually agreed to the terms and conditions, having a single person (staff member or not) agree to them is not sufficient to publish the dataset. After approval, staff will perform the same steps as above: download and extract the dataset, copy it to the appropriate location, and let you know when it is ready.

Dataset use

All datasets are read-only to users. Any intermediate data you generate from a dataset must be stored somewhere other than the shared filesystem that hosts the datasets.

Exceptions may be granted if you believe a set of intermediate data generated from a dataset will be useful to a subset of a cluster's users. If you suspect this is the case for some of your generated data, please contact staff. We will follow the procedure above and, once any necessary approvals have been obtained, copy the intermediate data into the shared filesystem.