Hands-On Guide To Web Scraping Using Python and Scrapy

Ankit Das

Web scraping is a procedure for extracting information from websites. It is done with programs known as web scrapers, which automatically load pages and extract data from them according to the user's requirements. Scrapy and Beautiful Soup are among the popular tools for this, used for example to extract reviews from well-known websites such as Amazon and Zomato for analysis.

Scrapy is an open-source web crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. With it we write our own spiders, and it lets us select specific parts of a webpage using CSS and XPath selectors.

Here, we will cover the components that Scrapy uses for web crawling. We will then extract data from a particular site using Python and Scrapy.

How does Scrapy work?

The Engine receives the initial HTTP requests from the spider and delivers them to the Scheduler, which keeps track of the order in which requests should be processed. The Scheduler hands the next request in line back to the Engine. The Engine sends that request to the Downloader and receives a response in return. The response is then passed to the spider for processing. Finally, the Engine sends the items the spider extracted to the Item Pipeline, which handles the specific parts of the data we asked to extract.
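The flow described above can be sketched as a small, synchronous toy model in plain Python. This is not how Scrapy is implemented (the real framework is asynchronous and also passes requests through middlewares); the function `run_engine` and the fake downloader and spider are hypothetical names used only to illustrate the order of the hand-offs.

```python
from collections import deque


def run_engine(start_urls, download, parse, pipeline):
    """Toy model of the flow: scheduler -> downloader -> spider -> pipeline."""
    scheduler = deque(start_urls)   # the Scheduler queues pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice

    while scheduler:
        url = scheduler.popleft()   # Scheduler hands the next request to the Engine
        response = download(url)    # Engine passes it to the Downloader
        for result in parse(url, response):  # the spider processes the response
            if isinstance(result, dict):
                pipeline(result)    # extracted items go to the Item Pipeline
            elif result not in seen:
                seen.add(result)    # newly found URLs go back to the Scheduler
                scheduler.append(result)


# Hypothetical three-page "site": each page maps to the links it contains.
PAGES = {"a": ["b", "c"], "b": ["c"], "c": []}


def fake_download(url):
    return PAGES[url]


def fake_parse(url, links):
    yield {"url": url, "n_links": len(links)}  # one item per page
    yield from links                           # follow-up requests


items = []
run_engine(["a"], fake_download, fake_parse, items.append)
print([item["url"] for item in items])  # ['a', 'b', 'c']
```

Each page is downloaded once, the spider yields both items and new URLs, and only the items reach the pipeline, here a plain list.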



Scrapy comes with features for creating a spider, running it, and then easily saving the scraped data.

IDE Used for Scrapy

For this project, use either PyCharm or Visual Studio, as both let us see the output in the terminal.

Creating a virtual environment

It is a good idea to set up a virtual environment, as it isolates the project's packages and does not affect other projects on the machine.

To create a virtual environment, first install the venv module. On Debian or Ubuntu, use:

sudo apt-get install python3-venv

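Create a folder for the environment and then activate it. The path below follows the Windows layout; on Linux and macOS the activation script is at bin/activate and is loaded with the source command.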

./Scripts/activate
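The step that actually creates the environment folder can also be done from Python with the standard library's venv module, which is what the command `python3 -m venv <folder>` runs. The sketch below is a minimal illustration; the helper name `create_env` and the folder name `scrapy_env` are assumptions, not part of the original setup.

```python
import venv
from pathlib import Path


def create_env(path, with_pip=True):
    """Create a virtual environment at `path` and return its location."""
    env_dir = Path(path)
    # with_pip=True also bootstraps pip inside the new environment
    venv.create(env_dir, with_pip=with_pip)
    return env_dir
```

Calling `create_env("scrapy_env")` should produce a folder containing a pyvenv.cfg file plus a Scripts folder (Windows) or a bin folder (Linux and macOS) that holds the activate script.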

Installing Scrapy Library

Installing Scrapy on Windows 10 can be difficult, so there it is recommended to install it through Anaconda Navigator. Otherwise, install it with pip:

pip install scrapy
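To confirm that the installation worked, one option is to ask Python which version of the package it can see. This is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Scrapy is installed in this environment, else None.
print("Scrapy version:", installed_version("scrapy"))
```

If this prints None inside the activated environment, the install most likely went into a different Python installation.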

Creating Scrapy project

After installing Scrapy, we need to create a Scrapy project:

scrapy startproject corona

Creating Spider

In Scrapy, a spider is a class that crawls a site and fetches information from it. To create one, move to the spiders folder of the project and create a Python file there.

The first thing is to name the spider by assigning a string to its name variable, and then to give the start URL from which the spider will begin scraping.

Default Folder Structure

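The project is created in the directory from which the command was run. On Windows this is usually the user folder on the C drive, so go to the C drive and open the user folder. There will be a folder with the name of the project, which is “corona” if the project was created with the command above.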
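A project generated by scrapy startproject typically contains a scrapy.cfg file next to an inner package of the same name, which holds items.py, middlewares.py, pipelines.py, settings.py and the spiders folder. To check what was generated on your machine, a small helper like the one below can list the layout; `tree_lines` is a hypothetical name and the code uses only the standard library.

```python
from pathlib import Path


def tree_lines(root, indent=""):
    """Return the folder layout under `root` as a list of indented lines."""
    lines = []
    # Sort by name so the listing is stable between runs.
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        if entry.is_dir():
            lines.append(f"{indent}{entry.name}/")
            lines.extend(tree_lines(entry, indent + "    "))
        else:
            lines.append(f"{indent}{entry.name}")
    return lines
```

Running `print("\n".join(tree_lines("corona")))` from the directory that contains the project folder prints one line per file or folder, with nesting shown by indentation.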

Spider Structure

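A spider is a class with a name, a list of start URLs, and a parse method that handles each response, as in the code under Code Implementation.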

Fetching data from a given web page

Our main aim is to get every URL, that is, every anchor tag, from the page. To do this, we define one more method, parse, which extracts the information from the response for the given URL.

To retrieve information from the page, use selectors. These can be either CSS or XPath selectors.
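To get a feel for what an XPath selector picks out, the sketch below runs the expression `.//td/a` against a tiny hand-written table. It deliberately uses the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup, rather than Scrapy's own selectors; the sample markup and the name `extract_links` are made up for illustration. Inside a spider, the comparable calls would be response.xpath("//td/a") or, with CSS, response.css("td > a").

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a real page.
SAMPLE = """
<table>
  <tr><td><a href="country/us/">USA</a></td><td>1</td></tr>
  <tr><td><a href="country/india/">India</a></td><td>2</td></tr>
</table>
"""


def extract_links(markup):
    """Return (text, href) pairs for every <a> directly inside a <td>."""
    root = ET.fromstring(markup.strip())
    return [(a.text, a.get("href")) for a in root.findall(".//td/a")]


print(extract_links(SAMPLE))
# [('USA', 'country/us/'), ('India', 'country/india/')]
```

Cells without an anchor, such as the numeric ones, are skipped, which is the behaviour we want when collecting only names and links.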

Code Implementation

Open the corona.py file in the IDE. We will fetch the data from the URL given in the start_urls list. In our project, an XPath selector is used to fetch the data from the Worldometers website. To work out the selectors, right-click the element on the page and inspect it to find the tags that hold the text and the link. The yield statement returns each item that we asked the spider to fetch.

import scrapy


class CoronaSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl corona
    name = 'corona'
    # allowed_domains takes bare domain names, not full URLs or paths
    allowed_domains = ['www.worldometers.info']
    start_urls = ['http://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # Select every anchor tag that sits directly inside a table cell
        for country in response.xpath("//td/a"):
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            yield {
                'country_name': name,
                'country_link': link,
            }
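The href values extracted this way may be relative to the page rather than full URLs. A minimal sketch of turning them into absolute links with the standard library is shown below; `absolute_link` is a hypothetical helper and the example paths are made up. Inside the spider itself, response.urljoin(link) does the same job against the URL of the response.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.worldometers.info/coronavirus/"


def absolute_link(link, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, link)


print(absolute_link("country/us/"))
# http://www.worldometers.info/coronavirus/country/us/
```

Links that are already absolute pass through unchanged, so the helper is safe to apply to every extracted value.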

Storing the data

Finally, run the spider from the project folder and save the output to a simple CSV file. For the spider above, NAME_OF_SPIDER is corona.

scrapy crawl NAME_OF_SPIDER -o File_Name.csv
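Once the crawl has finished, the CSV file can be loaded back into Python for analysis. The sketch below assumes the file starts with a header row containing the item keys (country_name and country_link for the spider above); the helper name `load_items` and the file name countries.csv are hypothetical.

```python
import csv


def load_items(path):
    """Read the exported CSV and return one dict per scraped item."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_items("countries.csv")` would return a list of dictionaries, each with a country_name and a country_link entry.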

Conclusion

In this article, we have covered the procedure for fetching data from a particular website using Scrapy and Python. It is interesting to see how easily we can fetch data using web scrapers. However, it becomes an uphill task if a website blocks our IP address. Web scrapers are worth exploring further, as collecting data is one of the most important steps of data analysis.

Copyright Analytics India Magazine Pvt Ltd
