Datasets

UMIACS hosts a number of datasets in read-only mode on shared filesystems used by our SLURM computing clusters. Keeping publicly accessible datasets in a well-defined location avoids duplicate copies elsewhere and reduces overall storage usage.

Dataset Directories

Each lab-named dataset directory has its own faculty approver and is optimized for use by that lab's compute nodes. /fs/nexus-datasets is optimized for use by the Tron nodes.

CML (/fs/cml-datasets)
- List of datasets -- faculty approver Tom Goldstein
GAMMA (/fs/gamma-datasets)
- List of datasets -- faculty approver Dinesh Manocha
Nexus (/fs/nexus-datasets)
- List of datasets
Vulcan (/fs/vulcan-datasets)
- List of datasets -- faculty approver Abhinav Shrivastava

Requesting a new dataset

You can request a new dataset by contacting staff with a link to the dataset's official download location. Mirrors and torrents/other peer-to-peer re-hosting are not allowed unless sanctioned by the dataset owners.

CML/GAMMA/Vulcan: If the uncompressed/final dataset size is over 100GB, staff will first contact the faculty approver for the cluster to ensure they approve of using the storage space. If the size is under 100GB, no faculty approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
Nexus: Please let staff know which faculty member's research you are working on that requires use of the dataset you are requesting. Then, if the uncompressed/final dataset size is over 50GB, the Director of Computing Facilities must first approve of using the storage space. If the size is under 50GB, no approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
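
The size thresholds above can be summarized in a short Python sketch. This is an illustration only, not a tool that staff use: the function name `storage_approver` and the constants are hypothetical. The text does not say what happens at exactly 100GB or 50GB, so the sketch assumes that only sizes strictly over the threshold need approval.

```python
from typing import Optional

# Thresholds from the text, in GB of uncompressed/final dataset size.
LAB_THRESHOLD_GB = 100    # CML, GAMMA, Vulcan
NEXUS_THRESHOLD_GB = 50   # Nexus

LAB_CLUSTERS = {"cml", "gamma", "vulcan"}


def storage_approver(cluster: str, size_gb: float) -> Optional[str]:
    """Return who must approve the storage use, or None if no approval is needed.

    Assumption: a size exactly equal to the threshold needs no approval;
    the text only covers "over" and "under".
    """
    name = cluster.lower()
    if name in LAB_CLUSTERS:
        return "faculty approver" if size_gb > LAB_THRESHOLD_GB else None
    if name == "nexus":
        if size_gb > NEXUS_THRESHOLD_GB:
            return "Director of Computing Facilities"
        return None
    raise ValueError(f"unknown cluster: {cluster!r}")
```

For example, `storage_approver("Vulcan", 250)` returns `"faculty approver"`, while `storage_approver("Nexus", 40)` returns `None`. Any review of terms and conditions happens separately, regardless of size.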

If there are no terms and conditions, staff will download/extract the dataset, copy it to the appropriate location depending on what filesystem you are requesting it on, and let you know when it is available for use.

If there are terms and conditions beyond well-defined Creative Commons licenses or similar, staff will first need to have them approved by UMD's Office of Research Administration (ORA). The dataset will be hosted by UMIACS as an institution, in a location accessible to all users of a cluster, not all of whom will have individually agreed to the terms and conditions. For that reason, agreement by a single person, whether a UMIACS staff member or not, is not sufficient to host it. After approval by ORA, staff will perform the same steps as above (download/extract, copy to the appropriate location, and let you know when it is ready).

Dataset use

All datasets are read-only to users. Any intermediate data you generate from a dataset must be stored somewhere other than the shared filesystems that host datasets.
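
Because the dataset filesystems are read-only, a job that tries to write its output there will fail. As a minimal sketch, a script could check its output path before starting work. The helper name `is_under_dataset_root` is hypothetical, and the root list simply repeats the directories named on this page. The check is purely lexical: it does not resolve symlinks or `..` components.

```python
from pathlib import PurePosixPath

# Dataset directories listed on this page.
DATASET_ROOTS = (
    "/fs/cml-datasets",
    "/fs/gamma-datasets",
    "/fs/nexus-datasets",
    "/fs/vulcan-datasets",
)


def is_under_dataset_root(path: str) -> bool:
    """Return True if path is a dataset root or lies inside one.

    Lexical comparison only; symlinks and '..' are not resolved.
    """
    p = PurePosixPath(path)
    for root in DATASET_ROOTS:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return True
    return False
```

A script might call `is_under_dataset_root(output_dir)` at startup and stop with a clear error message if it returns `True`, instead of failing partway through with a write error.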

Exceptions may be granted if a set of intermediate data generated from a dataset is likely to be useful to a subset of a cluster's users. If you think this is the case for some of your generated data, please contact staff. Staff will follow the request procedure above and, once any necessary approvals are obtained, copy the intermediate data into the shared filesystem.
