
Hands-On Guide To Web Scraping Using Python and Scrapy




Web scraping is a procedure for extracting information from websites. It is done with the help of software known as web scrapers, which automatically load and extract data from websites based on user requirements. Scrapy and Beautiful Soup are among the best-known web scrapers, used for tasks such as extracting reviews from popular websites like Amazon and Zomato for analysis.

Scrapy is an open-source web crawling framework written in Python. Although initially designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. With Scrapy we create our own spiders, and it lets us pick out specific parts of a web page using CSS and XPath selectors.

Here, we will cover the components Scrapy uses for web crawling. Then we will extract the data from a particular site using Python and Scrapy.

How does Scrapy work?

The Engine receives requests from the spiders and delivers them to the Scheduler, which is responsible for tracking the order of requests. The Scheduler sends the next request back to the Engine, which forwards it to the Downloader and receives a response in return. That response is sent back to the spider for processing. Finally, the Engine sends the scraped items to the Item Pipeline, which handles the specific pieces of data we asked the spider to extract.
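As a minimal sketch of that last stage, a hypothetical item pipeline that validates scraped items might look like this (it would also need to be registered under ITEM_PIPELINES in settings.py; the field name is only illustrative):

from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    def process_item(self, item, spider):
        # drop any item that is missing required data
        if not item.get('country_name'):
            raise DropItem('missing country_name')
        return item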


Scrapy bundles everything needed to create a spider, run it, and then save the scraped data with minimal effort.

IDE Used for Scrapy

For this project, use either PyCharm or Visual Studio Code, since the output can be watched in the integrated terminal.

Creating a virtual environment

It is a good idea to set up a virtual environment, as it isolates the project's packages and does not affect other projects on the machine.

To create a virtual environment, first install the venv module (on Debian/Ubuntu):

sudo apt-get install python3-venv

Then create the environment inside the project folder and activate it:

python3 -m venv venv
source venv/bin/activate    # Linux/macOS
.\venv\Scripts\activate     # Windows

Installing Scrapy Library

Installing Scrapy directly with pip can be tricky on Windows 10, so there it is recommended to install it through Anaconda instead.
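With Anaconda, Scrapy is available from the conda-forge channel:

conda install -c conda-forge scrapy

On other platforms, it can be installed directly with pip: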

pip install scrapy

Creating Scrapy project

After installing Scrapy, we need to create a Scrapy project:

scrapy startproject corona

Creating Spider

In Scrapy, we write a Spider that crawls over the site and helps fetch the data. To create one, move to the spiders folder and create a Python file there.

The first step is to name the spider by assigning its name variable, and then give the start URL from which the spider will begin scraping (see the skeleton under Spider Structure below).


Default Folder Structure

Go to the directory where the startproject command was run, and we can see a folder with the name of the project.
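The generated project follows Scrapy's standard layout:

corona/
    scrapy.cfg
    corona/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py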


Spider Structure

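Each spider is a class that subclasses scrapy.Spider. As a minimal sketch, the spider file starts out like this before any parsing logic is added:

import scrapy

class CoronaSpider(scrapy.Spider):
    # unique name used when running the spider
    name = 'corona'
    # URLs the spider starts crawling from
    start_urls = ['https://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # parsing logic will go here
        pass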

Fetching data from a given web page

Our main aim is to get every URL from the site, that is, all the anchor tags. To do this, we add a parse method that extracts the information from the given URL.

To retrieve information from the page, we use selectors, which can be written in either CSS or XPath.
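A quick way to experiment with selectors is the Scrapy shell (scrapy shell "https://www.worldometers.info/coronavirus/"). For example, the two lines below are equivalent ways to grab the text of the first country link, assuming the page keeps its current table structure:

response.css("td a::text").get()       # CSS selector
response.xpath("//td/a/text()").get()  # XPath selector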

Code Implementation

Open the corona.py file in the IDE. We will fetch the data from the URL listed in start_urls. In our project, XPath selectors are used to fetch the data from the Worldometer website. To find the right selectors, right-click an element on the website and inspect it to locate the text and the link. The yield statement emits the items we asked the spider to fetch.

import scrapy

class CoronaSpider(scrapy.Spider):
    name = 'corona'
    # allowed_domains takes domain names only, not full URLs
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # each country appears as an anchor tag inside a table cell
        for country in response.xpath("//td/a"):
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            yield {
                'country_name': name,
                'country_link': link,
            }

Storing the data

Finally, run the spider to get the output in a simple CSV file:

scrapy crawl NAME_OF_SPIDER -o File_Name.csv
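For the spider above, for example (the output file name is arbitrary):

scrapy crawl corona -o countries.csv

Scrapy infers the output format from the file extension, so .json or .xml work the same way.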

Conclusion

In this article, we covered the procedure for fetching data from a particular website using Scrapy and Python. It is striking how easily we can fetch data using web scrapers, though the task becomes an uphill one if a website blocks our IP. Web scrapers are worth exploring further, since scraping is one of the most important steps of data analysis.
