
Hands-On Guide To Web Scraping Using Python and Scrapy




Web scraping is a procedure for extracting information from websites. It is done with the help of software known as web scrapers, which automatically load and extract data from websites based on user requirements. Scrapy and Beautiful Soup are among the best-known web scrapers, used for tasks such as extracting reviews from popular websites like Amazon and Zomato for analysis.

Scrapy is an open-source web crawling framework written in Python. Although initially designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. With Scrapy we create our own spiders, and it lets us pick out specific parts of a web page using CSS and XPath selectors.

Here, we will cover the components Scrapy uses for web crawling. Then we will extract the data from a particular site using Python and Scrapy.

How does Scrapy work?

The Engine receives requests from the spiders and delivers them to the Scheduler, which is responsible for tracking the order of requests. The Scheduler sends the next request back to the Engine, which forwards it to the Downloader and receives a response in return. That response is sent back to the spider for processing. Finally, the Engine sends the scraped items to the Item Pipeline, which handles the specific pieces of data we asked the spider to extract.
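As a minimal sketch of that last stage, a hypothetical item pipeline that validates scraped items might look like this (it would also need to be registered under ITEM_PIPELINES in settings.py; the field name is only illustrative):

from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    def process_item(self, item, spider):
        # drop any item that is missing required data
        if not item.get('country_name'):
            raise DropItem('missing country_name')
        return item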


Scrapy bundles everything needed to create a spider, run it, and then save the scraped data with minimal effort.

IDE Used for Scrapy

For this project, use either PyCharm or Visual Studio Code, since the output can be watched in the integrated terminal.

Creating a virtual environment

It is a good idea to set up a virtual environment, as it isolates the project's packages and does not affect other projects on the machine.

To create a virtual environment, first install the venv module (on Debian/Ubuntu):

sudo apt-get install python3-venv

Then create the environment inside the project folder and activate it:

python3 -m venv venv
source venv/bin/activate    # Linux/macOS
.\venv\Scripts\activate     # Windows

Installing Scrapy Library

Installing Scrapy directly with pip can be tricky on Windows 10, so there it is recommended to install it through Anaconda instead.
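With Anaconda, Scrapy is available from the conda-forge channel:

conda install -c conda-forge scrapy

On other platforms, it can be installed directly with pip: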

pip install scrapy

Creating Scrapy project

After installing Scrapy, we need to create a Scrapy project:

scrapy startproject corona

Creating Spider

In Scrapy, we write a Spider that crawls over the site and helps fetch the data. To create one, move to the spiders folder and create a Python file there.

The first step is to name the spider by assigning its name variable, and then give the start URL from which the spider will begin scraping (see the skeleton under Spider Structure below).


Default Folder Structure

Go to the directory where the startproject command was run, and we can see a folder with the name of the project.
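The generated project follows Scrapy's standard layout:

corona/
    scrapy.cfg
    corona/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py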


Spider Structure

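Each spider is a class that subclasses scrapy.Spider. As a minimal sketch, the spider file starts out like this before any parsing logic is added:

import scrapy

class CoronaSpider(scrapy.Spider):
    # unique name used when running the spider
    name = 'corona'
    # URLs the spider starts crawling from
    start_urls = ['https://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # parsing logic will go here
        pass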

Fetching data from a given web page

Our main aim is to get every URL from the site, that is, all the anchor tags. To do this, we add a parse method that extracts the information from the given URL.

To retrieve information from the page, we use selectors, which can be written in either CSS or XPath.
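A quick way to experiment with selectors is the Scrapy shell (scrapy shell "https://www.worldometers.info/coronavirus/"). For example, the two lines below are equivalent ways to grab the text of the first country link, assuming the page keeps its current table structure:

response.css("td a::text").get()       # CSS selector
response.xpath("//td/a/text()").get()  # XPath selector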

Code Implementation

Open the corona.py file in the IDE. We will fetch the data from the URL listed in start_urls. In our project, XPath selectors are used to fetch the data from the Worldometer website. To find the right selectors, right-click an element on the website and inspect it to locate the text and the link. The yield statement emits the items we asked the spider to fetch.

import scrapy

class CoronaSpider(scrapy.Spider):
    name = 'corona'
    # allowed_domains takes domain names only, not full URLs
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # each country appears as an anchor tag inside a table cell
        for country in response.xpath("//td/a"):
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            yield {
                'country_name': name,
                'country_link': link,
            }

Storing the data

Finally, run the spider to get the output in a simple CSV file:

scrapy crawl NAME_OF_SPIDER -o File_Name.csv
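For the spider above, for example (the output file name is arbitrary):

scrapy crawl corona -o countries.csv

Scrapy infers the output format from the file extension, so .json or .xml work the same way.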

Conclusion

In this article, we covered the procedure for fetching data from a particular website using Scrapy and Python. It is striking how easily we can fetch data using web scrapers, though the task becomes an uphill one if a website blocks our IP. Web scrapers are worth exploring further, since scraping is one of the most important steps of data analysis.
