WebCrawling

From UMIACS
Revision as of 15:02, 21 February 2024 by Derek

Web crawling, scraping or otherwise downloading large sets of publicly available information on the Internet should be handled with care.

You should always determine whether data you plan to scrape, even if it is publicly available, is still encumbered. Some publicly accessible material remains under copyright, and it is important to understand what restrictions apply to any data that you crawl from remotely accessible sites.

A number of sites give access to academic institutions through IP-based filtering but still restrict scraping or crawling. In other cases the service is simply not built to handle programmatic crawling. Often the only option these organizations have against clients that crawl them is to ban IP addresses (or ranges). This can affect other individuals who then try to use these services from UMIACS public IPs and find themselves blocked.

The following are some examples (not an exhaustive list) of databases or sites that may have restrictions:

Users should never try to evade the limitations or restrictions on data within a publicly available service they are looking to scrape.
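One common way a site publishes its crawling restrictions is a robots.txt file. The following is a minimal sketch, not an official UMIACS tool, of checking such a file before fetching a URL and picking a delay between requests. The function name crawl_policy, the user agent "example-bot", the example.org URLs, and the 5 second default delay are all assumptions made for illustration. A robots.txt file does not replace a site's terms of use or license, which still need to be read separately.

```python
import urllib.robotparser

# Assumed conservative fallback when a site states no Crawl-delay.
# This value is illustrative only, not a UMIACS policy.
DEFAULT_DELAY_SECONDS = 5.0


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, delay_seconds) for `url` under the given robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = DEFAULT_DELAY_SECONDS
    return allowed, float(delay)


# Hypothetical robots.txt content, used here instead of a network fetch.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

if __name__ == "__main__":
    for target in ("https://example.org/public/page",
                   "https://example.org/private/data"):
        allowed, delay = crawl_policy(EXAMPLE_ROBOTS, "example-bot", target)
        print(target, "allowed" if allowed else "disallowed", "delay:", delay)
```

In a real crawler, a disallowed URL would simply be skipped, and the returned delay would be passed to time.sleep between requests so the service is not overloaded.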