Whether you call it web scraping, web crawling, or spidering, programmatic data access and extraction is one of many technologies that have outpaced legislation. In the following piece, we provide an introduction to legal theories and concepts that have proven relevant to web scraping activities, as well as some widely accepted guidelines for good bot behaviour.
Web scraping is defined as the extraction and copying of data from a website into a structured format using a computer program. These programs are interchangeably referred to as web scrapers, web crawlers, or bots.
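To make the definition concrete, here is a minimal sketch of such a program, using only Python's standard library. The page content, field names, and CSS classes below are invented for illustration; a real scraper would first fetch the HTML over HTTP (for example with `urllib.request`) before parsing it.

```python
from html.parser import HTMLParser

# Invented example page: a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body><ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul></body></html>
"""

class ProductParser(HTMLParser):
    """Collects name/price pairs into a structured list of dicts."""
    def __init__(self):
        super().__init__()
        self.current = None   # which field we are currently inside, if any
        self.rows = []        # structured output

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("name", "price"):
            self.current = css_class
            if css_class == "name":      # a 'name' span starts a new record
                self.rows.append({})

    def handle_data(self, data):
        if self.current and data.strip():
            self.rows[-1][self.current] = data.strip()
            self.current = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
# → [{'name': 'Widget', 'price': '9.99'}, {'name': 'Gadget', 'price': '24.50'}]
```

The point of the sketch is simply that unstructured markup goes in and structured records come out; production scrapers typically use dedicated libraries rather than hand-rolled parsers.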
There are good bots and bad bots, as well as – we assume – bots of somewhat ambiguous morality. According to one estimate,1 bots account for around half of all internet traffic, and most of that bot traffic is malicious. Bad bots steal competitor content, overload web servers, spam forums, and create phantom baskets on e-commerce websites.
Web scraping should only be used to access data that is publicly available. In other words, the information is not behind a paywall, a firewall, or any other type of code-based restriction. The benefit of automated access is that data collection can take place on a scale – and at a speed – that is not achievable through manual methods.
However, for all its convenience, web scraping usually entails more than just extracting raw data. Information on the web is messy and unstructured; it needs to be deduplicated, filtered, and integrated with one's system of choice before it can be analysed.
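The deduplication and filtering step described above can be sketched as follows. The records, field names, and cleaning rules here are invented examples, not a prescribed pipeline; real scraped data usually needs far more domain-specific handling.

```python
# Invented raw records standing in for freshly scraped, messy data.
raw_records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Widget", "price": "9.99"},   # exact duplicate, common in crawls
    {"name": "Gadget", "price": ""},       # missing value, to be filtered out
    {"name": "Sprocket", "price": "3.25"},
]

def clean(records):
    """Deduplicate on all fields, drop rows with empty values, parse prices."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen or not all(rec.values()):
            continue                        # skip duplicates and incomplete rows
        seen.add(key)
        cleaned.append({**rec, "price": float(rec["price"])})
    return cleaned

print(clean(raw_records))
# → [{'name': 'Widget', 'price': 9.99}, {'name': 'Sprocket', 'price': 3.25}]
```

Only after a pass like this, with duplicates removed and values typed, does the data become usable for analysis.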