Scraping with Python

Hello World,

There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it. That is where web scraping comes in, so let's talk about it.

To scrape means to take something out of somewhere. Add 'web' to it, and it means taking something out of web resources.

Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial, we’ll focus on extracting data from web pages, but web scraping can be used in a wide variety of situations.

Roadmap for this blog:

- Setting up the tools and libraries

- Basics of the HTML DOM

- Scraping rules

- Coding part

Prerequisite:

- A basic understanding of programming concepts in Python

----------------------------------------------------------

Getting Started with the tools

Okay, so for scraping we will be using Python. Why Python? Because it is easy to read and write, and it has mature libraries for exactly this kind of job.

Everyone uses a different machine: Windows, Mac, or Linux.

Most Linux distributions ship with Python, and older versions of macOS (OS X) included it as well. Open up Terminal and type `python --version` (on many systems the command is `python3 --version`). You should see a version of the form 2.x or 3.x, depending on what your OS provides. If Python is missing, or you are on Windows, please install it through the official website: https://www.python.org/downloads/.
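If you would rather check the version from inside Python, here is a minimal sketch. The function name `describe_python` is made up for this example, and the check assumes you want Python 3, since recent releases of both libraries used in this tutorial target Python 3.

```python
import sys

def describe_python(version_info=sys.version_info):
    """Return a short summary of the given interpreter version."""
    major, minor = version_info[0], version_info[1]
    # str.format is used so this file also runs on an old Python 2 install.
    label = "Python {}.{}".format(major, minor)
    if major < 3:
        return label + " (Python 2 is end-of-life; install Python 3)"
    return label + " (fine for this tutorial)"

if __name__ == "__main__":
    print(describe_python())
```

Save it as a file and run it with the same `python` command you plan to use for scraping, so you know which interpreter you are really getting.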

We are going to use Python as our scraping language, together with two simple and powerful libraries: BeautifulSoup and requests. `requests` makes a call/request to a web server, and BeautifulSoup scrapes the data out of the response/resource.

Next, you can install these libraries by running the following commands in your terminal/command prompt:

pip install beautifulsoup4

pip install requests

We can also keep a `requirements.txt` file that lists our dependencies, and then install everything mentioned in it with a single command:

pip install -r requirements.txt
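To confirm that the install worked, a small check like the one below can help. It is only a sketch: the helper name `missing_modules` is made up for this example. Note that the package installed as `beautifulsoup4` is imported under the name `bs4`.

```python
import importlib.util

# Import names, not pip package names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = ("bs4", "requests")

def missing_modules(names=REQUIRED_MODULES):
    """Return the import names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All scraping libraries are installed.")
```

If something shows up as not installed, the usual cause is that `pip` and `python` point at different Python installations.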

The Basics

Before we start jumping into the code, let’s understand the basics of HTML and some rules of scraping.

You may refer to https://www.w3schools.com/ to learn about HTML tags and CSS selectors.
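To get a feel for how HTML tags nest inside each other, here is a small sketch that prints the outline of a page. It uses only `html.parser` from Python's standard library rather than BeautifulSoup, and the sample HTML and the names `OutlineParser` and `outline` are invented for illustration.

```python
from html.parser import HTMLParser

# A tiny made-up page: <html> contains <body>, which contains <h1> and <p>.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Scraping with Python</h1>
    <p>First paragraph</p>
  </body>
</html>
"""

class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed; a bare <br> would throw the depth off.
        self.depth -= 1

def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    return parser.outline

if __name__ == "__main__":
    for depth, tag, attrs in outline(SAMPLE_HTML):
        print("  " * depth + tag, attrs)
```

The indentation in the printed outline mirrors the tree structure of the page, which is what a scraper walks through when it looks for a particular tag or class.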

Okay, so let's understand the rules of scraping before we scrape anything.

Scraping Rules

1- You should check a website’s Terms and Conditions before you scrape it. Also read the website's robots.txt file, found at /robots.txt, which tells automated clients which paths they may visit.

2- Do not request data from the website too aggressively with your program (this can look like spamming). Make sure your program behaves in a reasonable manner (i.e. at a pace closer to a human visitor).

3- The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
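The first two rules can be partly automated. Below is a rough sketch using `urllib.robotparser` from the standard library. The robots.txt content, the user agent name `my-scraper`, the one-second delay, and the helper names are all assumptions made for this example, not values from any real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_urls(rules, urls, user_agent="my-scraper", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # rule 1: skip paths the site asks robots to avoid
        if not first:
            time.sleep(delay_seconds)  # rule 2: do not hammer the server
        first = False
        yield url
```

Against a live site you would typically call `set_url()` with the site's robots.txt address and then `read()` instead of passing the text in by hand. Keep in mind that robots.txt does not replace the Terms and Conditions; you still need to read those yourself.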

How do we scrape:

1- Make a request to a web source

2- Get the web response

3- Extract the info from the response
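As a preview, the three steps above could look roughly like this with requests and BeautifulSoup. Treat it as a sketch only: the function names, the 10-second timeout, and the choice to collect `<h1>` and `<h2>` headings are assumptions for illustration.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the response body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text

def extract_headings(html):
    """Step 3: collect the text of every <h1> and <h2> in the HTML."""
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]

def scrape_headings(url):
    """All three steps together for a single page."""
    return extract_headings(fetch_html(url))
```

Calling `scrape_headings()` with the address of a page you are allowed to scrape returns its headings as a list of strings. Keeping the fetch and the extraction in separate functions also lets you test the extraction on a saved HTML string without touching the network.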

To be continued in the next part of the blog ....
