WebCrawling

From UMIACS
Revision as of 14:01, 1 March 2024 by Derek (talk | contribs)

Web crawling, scraping, or otherwise downloading large sets of publicly available information on the Internet (hereafter referred to as just "crawling") should be handled with care.

You should always determine whether the publicly accessible data you plan to crawl is encumbered. Some publicly accessible material is still under copyright, so it is important to understand the restrictions on any data that you crawl from remote sites.

A number of organizations give academic institutions (such as UMIACS) access through IP-based filtering but place restrictions on crawling. The services they provide may also not be built to handle programmatic crawling. If such an organization observes crawling, it may ban the IP addresses or ranges involved, which can then prevent other individuals from using the service from those same addresses or ranges. In the case of UMIACS, bad behavior by a single user on UMIACS systems can result in all access to a service being banned from UMIACS' public IP addresses.

Some examples of databases or sites that may have restrictions on crawling are (not an exhaustive list):

You should never try to evade limitations or restrictions imposed by the site or its owner on a publicly available service that you are looking to crawl.
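
One common way a site states its crawling restrictions is a robots.txt file. The following is a minimal sketch, not UMIACS policy or tooling, of how a crawler might check those rules before fetching a page, using Python's standard `urllib.robotparser` module. The function names, the user agent string `example-crawler`, the `example.org` URLs, and the one-second fallback delay are all hypothetical choices made for illustration. A robots.txt file is only one form of restriction; a site's terms of service or license may impose others that code like this cannot detect.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser: RobotFileParser, user_agent: str,
                 default_seconds: float = 1.0) -> float:
    """Return the site's requested Crawl-delay, or a fallback if none is given.

    The fallback value is an assumption for this sketch, not a recommendation
    from any particular site.
    """
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


if __name__ == "__main__":
    # Hypothetical robots.txt content, used here instead of a network request.
    sample = "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"
    rules = build_parser(sample)
    for url in ("https://example.org/public/page",
                "https://example.org/private/data"):
        print(url, may_fetch(rules, "example-crawler", url))
    print("seconds between requests:", polite_delay(rules, "example-crawler"))
```

In a real crawler, the loop that downloads pages would skip any URL for which `may_fetch` returns `False` and would sleep for at least `polite_delay` seconds between requests to the same site.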