Datasets: Difference between revisions

From UMIACS
No edit summary
(21 intermediate revisions by 2 users not shown)
Line 1: Line 1:
UMIACS hosts a number of datasets in read-only mode on some shared filesystems used by some of our [[SLURM]] computing clusters. The motive behind this is to provide commonly-used datasets in a well-defined location in order to de-duplicate their use elsewhere, thus reducing overall storage usage.
UMIACS hosts a number of datasets in read-only mode on some shared filesystems used by our [[SLURM]] computing clusters. The motivation behind this is to provide publicly accessible datasets in a well-defined location in order to de-duplicate their use elsewhere, thus reducing overall storage usage.


==Clusters==
==Dataset Directories==
* [[CML]]
* [[Nexus/CML | CML]] (<code>/fs/cml-datasets</code>)
** [[CML#Datasets | List of datasets]] -- faculty approver Tom Goldstein
** [https://info.umiacs.umd.edu/datasets/list/?q=CML List of datasets] -- faculty approver Tom Goldstein
* [https://wiki.umiacs.umd.edu/cfar/index.php/Vulcan Vulcan]
* [[Nexus/GAMMA | GAMMA]] (<code>/fs/gamma-datasets</code>)
** [https://wiki.umiacs.umd.edu/cfar/index.php/Vulcan/Storage#Dataset_Storage List of datasets] -- faculty approver Abhinav Shrivastava
** [https://info.umiacs.umd.edu/datasets/list/?q=GAMMA List of datasets] -- faculty approver Dinesh Manocha
* [[Nexus]] (<code>/fs/nexus-datasets</code>)
** [https://info.umiacs.umd.edu/datasets/list/?q=Nexus List of datasets]
* [[Nexus/Vulcan | Vulcan]] (<code>/fs/vulcan-datasets</code>)
** [https://info.umiacs.umd.edu/datasets/list/?q=Vulcan List of datasets] -- faculty approver Abhinav Shrivastava


==Requesting a new dataset==
You can request a new dataset by [[HelpDesk | contacting staff]] with a link to the dataset's official download location (no torrents or other peer-to-peer re-hosting). If the uncompressed dataset size is over 100GB, staff will first contact the faculty approver for the cluster to ensure they approve of using the storage space. If the size is under 100GB, no faculty approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
You can request a new dataset by [[HelpDesk | contacting staff]] with a link to the dataset's official download location. Mirrors and torrents/other peer-to-peer re-hosting are not allowed unless sanctioned by the dataset owners.  
* '''CML/GAMMA/Vulcan''': If the uncompressed/final dataset size is over 100GB, staff will first contact the faculty approver for the cluster to ensure they approve of using the storage space. If the size is under 100GB, no faculty approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
* '''Nexus''': Please let staff know which faculty member's research you are working on that requires use of the dataset you are requesting. Then, if the uncompressed/final dataset size is over 50GB, the [https://www.umiacs.umd.edu/our-experts/staff Director of Computing Facilities] must first approve of using the storage space. If the size is under 50GB, no approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.


If there are no terms and conditions, staff will download/extract the dataset, copy it to the appropriate location depending on what cluster you are requesting it for, and let you know when it is available for use on that cluster.
If there are no terms and conditions, staff will download/extract the dataset, copy it to the appropriate location depending on what filesystem you are requesting it on, and let you know when it is available for use.


If there are terms and conditions, staff will first need to have the dataset's terms and conditions approved by [https://ora.umd.edu/ UMD's Office of Research Administration (ORA)]. Since the dataset will be hosted by us in a location accessible by all users of a cluster, not all of which will have individually agreed to the terms and conditions, having a single person (staff member or not) agree to a set of terms and conditions is not sufficient to publish it. After approval, staff will perform the same steps as mentioned above (download/extract, copy to appropriate location, and let you know when ready).
If there are terms and conditions in excess of well-defined [https://creativecommons.org/about/cclicenses/ Creative Commons licenses] or similar, staff will first need to have the dataset's terms and conditions approved by [https://ora.umd.edu/ UMD's Office of Research Administration (ORA)]. Since the dataset will be hosted by UMIACS as an institution, in a location accessible by all users of a cluster, not all of whom will have individually agreed to the terms and conditions, having a single person, UMIACS staff member or not, agree to a set of terms and conditions is not sufficient to host it. After approval by ORA, staff will perform the same steps as mentioned above (download/extract, copy to appropriate location, and let you know when ready).


==Dataset use==
==Dataset use==
All datasets are read-only to users. Any intermediate data generated from a dataset will need to be stored in a location other than the shared filesystem hosting the dataset and other datasets.


Exceptions may be granted if there is a set of intermediate data generated from a dataset that you believe will be useful to a subset of a cluster's users. If you suspect this is the case for some of your generated data, please [[HelpDesk | contact staff]]. We will reach out to the faculty approver for the cluster (as in the above section) and, upon approval, copy the intermediate data into the shared filesystem.
Exceptions may be granted if there is a set of intermediate data generated from a dataset that you believe will be useful to a subset of a cluster's users. If you suspect this is the case for some of your generated data, please [[HelpDesk | contact staff]]. We will follow the above procedure and upon whatever approvals may be necessary, copy the intermediate data into the shared filesystem.

Latest revision as of 14:15, 24 October 2025

UMIACS hosts a number of datasets in read-only mode on some shared filesystems used by our SLURM computing clusters. The goal is to provide publicly accessible datasets in a well-defined location so that users do not need to keep their own copies elsewhere, thus reducing overall storage usage.

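Dataset Directories

  • CML (/fs/cml-datasets)
      ◦ List of datasets -- faculty approver Tom Goldstein
  • GAMMA (/fs/gamma-datasets)
      ◦ List of datasets -- faculty approver Dinesh Manocha
  • Nexus (/fs/nexus-datasets)
      ◦ List of datasets
  • Vulcan (/fs/vulcan-datasets)
      ◦ List of datasets -- faculty approver Abhinav Shrivastava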

Requesting a new dataset

You can request a new dataset by contacting staff with a link to the dataset's official download location. Mirrors and torrents/other peer-to-peer re-hosting are not allowed unless sanctioned by the dataset owners.

  • CML/GAMMA/Vulcan: If the uncompressed/final dataset size is over 100GB, staff will first contact the faculty approver for the cluster to ensure they approve of using the storage space. If the size is under 100GB, no faculty approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
  • Nexus: Please tell staff which faculty member's research requires the dataset you are requesting. Then, if the uncompressed/final dataset size is over 50GB, the Director of Computing Facilities must first approve of using the storage space. If the size is under 50GB, no approval is required. Staff will then inspect the dataset and see if there are any terms and conditions that must be agreed to before downloading.
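
The size thresholds above can be sketched as a small lookup. This is only an illustration of the policy as worded on this page, not a tool that staff use: the function name and return strings are hypothetical, and because the text says only "over" and "under", the sketch assumes that a dataset of exactly the threshold size needs no approval.

```python
from typing import Optional

# Thresholds in GB, taken from the policy text above.
APPROVAL_THRESHOLD_GB = {
    "CML": 100,
    "GAMMA": 100,
    "Vulcan": 100,
    "Nexus": 50,
}


def required_approver(filesystem: str, size_gb: float) -> Optional[str]:
    """Return who must approve the storage use, or None if no approval is needed.

    `size_gb` is the uncompressed/final dataset size.
    """
    threshold = APPROVAL_THRESHOLD_GB[filesystem]
    # Assumption: the page does not say what happens at exactly the
    # threshold, so this sketch treats it as "no approval required".
    if size_gb <= threshold:
        return None
    if filesystem == "Nexus":
        return "Director of Computing Facilities"
    return "faculty approver for the cluster"
```

For example, `required_approver("Nexus", 60)` returns the Director of Computing Facilities, while `required_approver("CML", 60)` returns `None`. The terms-and-conditions review described below happens regardless of size.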

If there are no terms and conditions, staff will download/extract the dataset, copy it to the appropriate location for the filesystem you requested, and let you know when it is available for use.

If there are terms and conditions in excess of well-defined Creative Commons licenses or similar, staff will first need to have the dataset's terms and conditions approved by UMD's Office of Research Administration (ORA). Because UMIACS hosts the dataset as an institution, in a location accessible to all users of a cluster (not all of whom will have individually agreed to the terms and conditions), it is not sufficient for a single person, whether a UMIACS staff member or not, to agree to the terms and conditions. After approval by ORA, staff will perform the same steps as mentioned above (download/extract, copy to appropriate location, and let you know when ready).

Dataset use

All datasets are read-only to users. Any intermediate data generated from a dataset must be stored in a location other than the shared filesystem that hosts the datasets.
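
A job script can check its output location before writing anything. The following is a minimal sketch, assuming the four dataset directories listed on this page; the helper names are hypothetical, and the check is purely lexical (it normalizes `.` and `..` segments but does not resolve symlinks).

```python
import posixpath
from pathlib import PurePosixPath

# Dataset roots listed on this page; all are read-only to users.
DATASET_ROOTS = (
    "/fs/cml-datasets",
    "/fs/gamma-datasets",
    "/fs/nexus-datasets",
    "/fs/vulcan-datasets",
)


def is_under_dataset_root(path: str) -> bool:
    """Return True if `path` lies inside one of the read-only dataset roots."""
    # Lexical normalization only; symlinks are not followed.
    candidate = PurePosixPath(posixpath.normpath(path))
    for root in DATASET_ROOTS:
        root_path = PurePosixPath(root)
        if candidate == root_path or root_path in candidate.parents:
            return True
    return False


def check_output_dir(path: str) -> str:
    """Reject output locations on a dataset filesystem before a job writes to them."""
    if is_under_dataset_root(path):
        raise ValueError(f"{path} is on a read-only dataset filesystem")
    return path
```

Calling `check_output_dir` on the directory where a job plans to write intermediate data raises `ValueError` if that directory is inside a dataset root, and returns the path unchanged otherwise. Which storage location to use instead depends on your cluster and is not covered by this sketch.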

Exceptions may be granted if there is a set of intermediate data generated from a dataset that you believe will be useful to a subset of a cluster's users. If you suspect this is the case for some of your generated data, please contact staff. We will follow the above procedure and, once any necessary approvals have been obtained, copy the intermediate data into the shared filesystem.