WebCrawling

From UMIACS
Revision as of 19:17, 27 February 2024 by Mbaney (talk | contribs)

Web crawling, scraping, or otherwise downloading large sets of publicly available information on the Internet should be handled with care.

Before you crawl any publicly available data, you should always understand whether it is encumbered. Some material that is publicly accessible is still under copyright, and it is important to understand what restrictions apply to any data that you crawl from remotely accessible sites.

A number of organizations give access to academic institutions (such as UMIACS) through IP-based filtering but restrict crawling. The services they provide may also not be built to handle programmatic crawling. These organizations may ban IP addresses or ranges on which crawling is observed, which prevents other individuals from using the services from the same IP addresses or ranges. In the case of UMIACS, bad behavior on the part of a single user can result in all access to a service being banned from UMIACS' public IP addresses.

Some examples of databases or sites that may have restrictions on crawling are (not an exhaustive list):

You should never try to evade the limitations or restrictions that a publicly available service places on its data when you are looking to scrape it.
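
As one illustration of respecting a site's stated restrictions, the sketch below checks a robots.txt policy before fetching anything. It is a minimal example, not UMIACS policy: the user agent name, the example URLs, the sample robots.txt text, and the one-second default delay are all assumptions made for illustration. A robots.txt file also says nothing about copyright or a service's terms of use, so passing this check does not by itself mean crawling is permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs the policy allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_delay_seconds(parser: RobotFileParser, user_agent: str,
                        default: float = 1.0) -> float:
    """Return the site's requested delay, or an assumed default if none is given."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS)
    candidates = [
        "https://example.org/public/page",
        "https://example.org/private/data",
    ]
    print(allowed_urls(parser, "example-research-bot", candidates))
    print(crawl_delay_seconds(parser, "example-research-bot"))
```

In a real crawler the delay returned by `crawl_delay_seconds` would be passed to `time.sleep` between requests, and any URL filtered out by `allowed_urls` would simply not be requested.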