Web scraping is the process of extracting information from websites. It is done with software known as web scrapers, which automatically load pages and extract data according to user requirements. Scrapy and Beautiful Soup are two popular web scraping tools, used for example to extract reviews from sites such as Amazon or Zomato for analysis.
Scrapy is an open-source web crawling framework written in Python. Although originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. With Scrapy we create our own spiders, and select specific parts of a web page using CSS or XPath selectors.
Here, we will cover the components Scrapy uses for web crawling. Then we will extract data from a particular site using Python and Scrapy.
How does Scrapy work?
The Engine receives HTTP requests from the spiders and passes them to the Scheduler, which is responsible for ordering the requests. The Scheduler sends the highest-priority requests back to the Engine. The Engine forwards each request to the Downloader and receives a response in return. The response is then passed back to the Spider for processing. Finally, the Engine sends the extracted items to the Item Pipeline, which handles the specific pieces of data we asked to extract.
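That request/response cycle can be illustrated with a toy Python sketch. This is not Scrapy's real internals, just a minimal stand-in using a queue for the Scheduler and a list for the Item Pipeline, with hypothetical URLs:

```python
from collections import deque

# Toy sketch of the Scrapy data flow:
# spider -> engine -> scheduler -> engine -> downloader -> engine -> spider -> pipeline
scheduler = deque()   # Scheduler: queues requests in order
items = []            # Item Pipeline: collects the processed items

def spider_parse(response):
    # Spider callback: turns a downloaded response into items
    yield {"url": response, "status": "parsed"}

# The spider's start request enters the Scheduler first
scheduler.append("http://example.com/page1")

while scheduler:
    request = scheduler.popleft()             # Engine asks Scheduler for the next request
    response = f"response for {request}"      # Downloader fetches and returns a response
    for item in spider_parse(response):       # Engine hands the response to the Spider
        items.append(item)                    # Engine sends yielded items to the Pipeline

print(items)
```

Each real component does far more (deduplication, priorities, middleware), but the loop shows the round trip every request makes.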
Scrapy handles this whole workflow for us: creating a spider, running it, and then easily saving the scraped data.
IDE Used for Scrapy
For this project, use either PyCharm or Visual Studio, as we can see the output in the terminal.
Creating a virtual environment
It is a good idea to create a virtual environment, as it isolates the project and does not affect other projects on the machine.
To create a virtual environment, first install the venv module (on Debian/Ubuntu):
sudo apt-get install python3-venv
Create the environment folder with python3 -m venv venv, then activate it (the command below assumes Windows and that you are inside the environment folder; on Linux/macOS use source venv/bin/activate):
./Scripts/activate
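Put together, the steps on Linux/macOS look like this (the folder name `venv` is our choice):

```shell
# Create a virtual environment in a folder named venv, then activate it
python3 -m venv venv
. venv/bin/activate
# The environment's interpreter is now first on PATH
python -c "import sys; print(sys.prefix)"
```

Anything installed with pip while the environment is active stays inside the venv folder.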
Installing Scrapy Library
Installing Scrapy directly on Windows 10 can be difficult, so it is often recommended to install it through Anaconda. Otherwise, install it with pip:
pip install scrapy
Creating Scrapy project
After installing Scrapy, we need to create a Scrapy project.
scrapy startproject corona
Creating Spider
In Scrapy, we create a Spider that crawls over the site and fetches the data. To make one, go to the spiders folder and create a Python file there.
The first thing is to name the Spider by assigning its name variable, and then give the start URL from which the Spider will begin scraping.
Default Folder Structure
Go to the C drive and open the user folder. We can see that there will be a folder with the name of the project, "corona".
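Inside that folder, scrapy startproject generates a standard layout; for our project (created with scrapy startproject corona) it looks like this:

```
corona/
    scrapy.cfg            # deploy configuration
    corona/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # our spider files go here
```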
Spider Structure
Fetching data from a given web page
Our main aim is to get every URL from the site, that is, all the anchor tags. To do this, we add a parse method that extracts the data from the given URL.
Now, to retrieve information from the page, we use selectors. These selectors can be either CSS or XPath.
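Scrapy evaluates XPath expressions through its own selector objects, but the idea can be sketched with the standard library alone. Here is a minimal example, using a hypothetical table snippet shaped like the one on the target page:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the table markup on the target page (illustrative only)
html = """
<table>
  <tr><td><a href="country/usa/">USA</a></td></tr>
  <tr><td><a href="country/india/">India</a></td></tr>
</table>
"""

root = ET.fromstring(html)
rows = []
# Same idea as Scrapy's //td/a XPath: every anchor inside a table cell
for a in root.findall(".//td/a"):
    rows.append({"country_name": a.text, "country_link": a.get("href")})

print(rows)
```

In Scrapy itself, `response.xpath(...)` plays the role of `findall`, and `.get()` pulls out the matched text or attribute.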
Code Implementation
Open the corona.py file in the IDE. We will fetch the data from the URL listed in start_urls. In our project, XPath selectors are used to fetch the data from the Worldometer website. To build the selectors, right-click an element on the website and inspect it to find the text and link. The yield statement emits the items we asked to fetch.
import scrapy


class CoronaSpider(scrapy.Spider):
    name = 'corona'
    # allowed_domains takes domains only, without the URL path
    allowed_domains = ['www.worldometers.info']
    start_urls = ['http://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # Each //td/a node holds a country name and its detail-page link
        for country in response.xpath("//td/a"):
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            yield {
                'country_name': name,
                'country_link': link,
            }
Storing the data
Finally, run the spider to get the output in a simple CSV file:
scrapy crawl NAME_OF_SPIDER -o File_Name.csv
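Each dict yielded by the spider becomes one CSV row, with the dict keys as column headers. As a sketch of what the exported file looks like downstream (the rows and the file name `countries.csv` below are illustrative, not real crawl output), it can be read back with Python's csv module:

```python
import csv

# Write a couple of illustrative rows in the same shape Scrapy's CSV export uses
with open("countries.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country_name", "country_link"])
    writer.writeheader()
    writer.writerow({"country_name": "USA", "country_link": "country/us/"})
    writer.writerow({"country_name": "India", "country_link": "country/india/"})

# Read it back the way a later analysis step would
with open("countries.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

Scrapy also supports other feed formats (for example `-o file.json`) using the same command shape.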
Conclusion
In this article, we covered how to fetch data from a particular website using Scrapy and Python. It is interesting to see how easily we can fetch data using web scrapers, although it becomes an uphill task if a website blocks our IP. Web scraping is worth exploring further, as data collection is one of the most important steps of data analysis.