What is web scraping for?
If you only use the web through a browser, you’re missing out on a lot of opportunities. While browsers are great for running JavaScript, displaying images, and laying out page elements in a human-readable format (among other things), web scrapers are great at collecting and processing large amounts of data quickly. Instead of viewing one page at a time through the narrow window of a monitor, you can crawl databases that span thousands or even millions of pages at once.
Additionally, web scrapers can go where traditional search engines can’t. A Google search for “cheapest flights to Moscow” will return tons of ads and popular flight search sites. Google only knows what these websites say on their content pages, not the exact results of the various queries entered into the flight search engine. However, a well-designed web scraper can track the cost of a flight to Moscow over time, across different websites, and tell you the best time to buy a ticket.
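As a rough illustration of the price-tracking idea, the sketch below assumes a scraper has already collected fare observations into a list of (date, site, price) tuples. The site names and prices are made-up sample values, and the helper names `cheapest_observation` and `cheapest_by_site` are hypothetical.

```python
from datetime import date

# Hypothetical observations a scraper might have collected over several days:
# (date collected, site name, price). All values are invented for illustration.
observations = [
    (date(2024, 1, 1), "site-a", 420.0),
    (date(2024, 1, 1), "site-b", 455.0),
    (date(2024, 1, 2), "site-a", 398.0),
    (date(2024, 1, 2), "site-b", 380.0),
    (date(2024, 1, 3), "site-a", 410.0),
    (date(2024, 1, 3), "site-b", 402.0),
]


def cheapest_observation(observations):
    """Return the single lowest-priced observation across all sites and days."""
    return min(observations, key=lambda obs: obs[2])


def cheapest_by_site(observations):
    """Return a dict mapping each site to its (date, price) with the lowest price."""
    best = {}
    for day, site, price in observations:
        if site not in best or price < best[site][1]:
            best[site] = (day, price)
    return best


overall = cheapest_observation(observations)
per_site = cheapest_by_site(observations)
```

With real data collected over weeks or months, the same comparison would show which days and which sites tend to offer the lowest fares.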
Scraping or APIs?
You may be asking, “Isn’t data collection what APIs are for?” Yes, APIs can be great if you find one that suits your purposes. They are designed to provide a convenient stream of well-structured data from one computer program to another. You can find APIs for many types of data you might want to use, such as tweets or Wikipedia pages. In general, it is preferable to use an API (if one exists) rather than build a bot to get the same data. However, an API may not exist, or may not be useful for your purposes, for several reasons:
- You are collecting relatively small, finite data sets from a large number of websites that do not share a single API.
- The data you want is small or uncommon, and the creator has not seen fit to create an API for it.
- The source does not have the infrastructure or technical capability to create an API.
- The data is valuable and/or proprietary and is not intended to be widely distributed.
Even when an API exists, its request volume and rate limits, or the data types and formats it provides, may not be sufficient for your purposes.
This is where web scraping comes in. With some exceptions, if you can view the data in a browser, you can access it with a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do almost anything with the data.
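The view, access, store chain can be sketched with nothing but the Python standard library. This is a minimal sketch under stated assumptions, not a production scraper: the `LinkParser` class, the helper functions, the `links` table, and the sample HTML are all hypothetical, and the `fetch` helper is defined but not called, so the example runs without a network connection.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download the raw HTML of a page (no JavaScript is executed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store_links(conn, links):
    """Save (href, label) pairs into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, label TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    conn.commit()


# Sample HTML stands in for the result of fetch("https://...").
SAMPLE_HTML = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'
conn = sqlite3.connect(":memory:")
store_links(conn, extract_links(SAMPLE_HTML))
rows = conn.execute("SELECT href, label FROM links ORDER BY href").fetchall()
```

Once the rows are in a database, they can be queried, joined, and analyzed like any other data set. Pages that build their content with JavaScript are among the exceptions mentioned above, since this kind of script only sees the HTML the server returns.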
Access to virtually unlimited data has many practical applications: market forecasting, machine translation, and even medical diagnostics have all benefited greatly from the ability to extract and analyze data from news sites, translated texts, and health forums, respectively.
Regardless of your field, web scraping almost always provides a way to more effectively direct business practices, improve productivity, or even branch out into an entirely new area.
Source - https://surfsky.io