How to web scrape and data mine any website to Excel
|Web scraping, data mining any website|
|Posting on UpWork|
We're looking for a CSV file / Excel Spreadsheet of all participating stores on this website:
Result
The result of my scraping solution is below.
|Result of web scraping|
How I did it
The solution basically does the following:
- Create a corpus of all the URLs that contain the company data.
- Scrape all the URLs in the corpus and store the data in a file.
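As a rough illustration of the second step, here is a minimal sketch of looping over a corpus and writing one CSV row per page. It is not the code from my solution: `scrape_to_csv`, `fetch`, `extract`, the example.com URLs and the `name`/`city` columns are all hypothetical stand-ins, and the "pages" are fake strings so the sketch runs without touching any website.

```python
import csv
import io
from typing import Callable, Iterable, Sequence, TextIO


def scrape_to_csv(
    corpus: Iterable[str],
    fetch: Callable[[str], str],
    extract: Callable[[str], dict],
    out: TextIO,
    fieldnames: Sequence[str],
) -> int:
    """Fetch every URL in the corpus, extract one record per page, write CSV rows."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for url in corpus:
        page = fetch(url)        # hypothetical: any callable that returns page content
        record = extract(page)   # hypothetical: page-specific parsing logic
        writer.writerow(record)
        count += 1
    return count


# Stand-in data instead of a real website (assumption for illustration only).
fake_pages = {
    "https://example.com/store/1": "Acme Store|Amsterdam",
    "https://example.com/store/2": "Bolt Shop|Utrecht",
}


def fake_extract(page: str) -> dict:
    """Split the fake page content into the two columns we want."""
    name, city = page.split("|")
    return {"name": name, "city": city}


buffer = io.StringIO()
rows_written = scrape_to_csv(
    fake_pages.keys(), fake_pages.__getitem__, fake_extract, buffer, ["name", "city"]
)
```

In a real run, `out` would be a file opened with `newline=""` rather than an in-memory buffer, and the resulting CSV file opens directly in Excel.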
Creating a corpus
Before we can start extracting data from webpages, we need to spider parts of the website to figure out which pages we need to scrape. This spidering gives us a set of URLs that we refer to as a corpus.
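To make the spidering idea concrete, here is a minimal sketch that walks listing pages breadth-first and collects the detail-page URLs into a corpus. It uses only the Python standard library rather than the libraries from my solution, and everything site-specific is an assumption: `build_corpus`, the `is_target` and `is_listing` filters, the URL patterns and the in-memory `site` dictionary are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_corpus(
    start_url: str,
    fetch: Callable[[str], str],
    is_target: Callable[[str], bool],
    is_listing: Callable[[str], bool],
    max_pages: int = 100,
) -> List[str]:
    """Follow listing pages breadth-first and return the target URLs found."""
    corpus: List[str] = []
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link in seen:
                continue
            seen.add(link)
            if is_target(link):
                corpus.append(link)    # a page we want to scrape later
            elif is_listing(link):
                queue.append(link)     # another listing page to spider
    return corpus


# A fake two-page site held in memory (assumption for illustration only).
site = {
    "https://example.com/stores?page=1":
        '<a href="/store/1">A</a><a href="/stores?page=2">next</a>',
    "https://example.com/stores?page=2":
        '<a href="/store/2">B</a><a href="/store/1">A</a>',
}

corpus = build_corpus(
    "https://example.com/stores?page=1",
    fetch=site.__getitem__,
    is_target=lambda u: "/store/" in u,
    is_listing=lambda u: "/stores?page=" in u,
)
```

The `seen` set keeps a URL from being added twice, and `max_pages` caps how many listing pages get fetched so the spider cannot run away on a large site.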
Python3 and Libraries
I built the solution using the Python programming language and a few libraries such as:
- dryscrape. Documentation of this library can be found at https://dryscrape.readthedocs.io/
- Beautiful Soup. Documentation of this library can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
According to its documentation, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
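For a feel of what that looks like in practice, here is a small sketch of Beautiful Soup pulling records out of an HTML fragment. The markup, the `store`, `name` and `city` class names and the sample values are made up for illustration; the real website's structure is different and is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page from the corpus.
html = """
<ul>
  <li class="store"><span class="name">Acme Store</span><span class="city">Amsterdam</span></li>
  <li class="store"><span class="name">Bolt Shop</span><span class="city">Utrecht</span></li>
</ul>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Build one dictionary per <li class="store"> element.
stores = [
    {
        "name": item.find("span", class_="name").get_text(strip=True),
        "city": item.find("span", class_="city").get_text(strip=True),
    }
    for item in soup.find_all("li", class_="store")
]
```

Each dictionary in `stores` maps naturally onto one row of the final spreadsheet.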
In-depth video explaining the code behind the solution
I will not share the code for this solution, because I do not want just anyone to run it and scrape the website in question. However, I did post a two-part video series explaining the code behind the solution in detail, so that fellow coders can see, understand and possibly learn how to data mine a website in this way.