One of the most critical assets for data-driven organisations is the set of tools used by their data science professionals. Web crawlers and other web scraping tools are among the tools used to gain meaningful insights. Web scraping allows efficient extraction of data from web services and helps convert raw, unstructured data into a structured whole.
Several tools are available for web scraping, including lxml, Beautiful Soup, MechanicalSoup, Scrapy and Python Requests. Of these, Scrapy and Beautiful Soup are particularly popular among developers.
In this article, we will compare these two web scraping tools, and try to understand the differences between them. Before diving deep into the tools, let us first understand what these tools are.
Scrapy is an open-source, collaborative framework for extracting the data you need from websites in a fast and simple manner. It can also be used to extract data through APIs or as a general-purpose web crawler. In short, Scrapy is an application framework for writing web spiders that crawl websites and extract data from them.
The framework provides a built-in mechanism for extracting data, known as selectors, and can be used for data mining, automated testing and more. At the time of writing, Scrapy supports Python 3.5+ under CPython, and PyPy starting with PyPy 5.9.
Features of Scrapy:
- Scrapy provides built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions
- An interactive shell console for trying out the CSS and XPath expressions to scrape data
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
Scraping With Scrapy
To install Scrapy globally on your system, use pip, the Python package installer. Open your terminal or command prompt and type the following command:
pip install scrapy
To install Scrapy into a conda environment instead, run the following command in your terminal:
conda install -c conda-forge scrapy
The Scrapy shell lets you scrape web pages interactively from the command line.
To open the Scrapy shell, type:
scrapy shell
Scraping with Scrapy Shell
Follow the steps below to start scraping:
1. Open the HTML file in a web browser and copy the URL.
2. In the Scrapy shell, type and execute the following command:
fetch("url")
Replace url with the URL of the HTML file or of any other webpage; the fetch command will download the page locally to your system.
You will see a message similar to this in your console:
[scrapy.core.engine] DEBUG: Crawled (200)
3. Viewing the response
The fetch command stores whatever page or information it fetched in a response object. To view the response object, enter the following command:
view(response)
The console will return True, and the webpage that was downloaded with fetch() will open in your default browser.
4. All the data you need is now available locally. You just need to decide which parts of it you want.
5. Scraping the data: back in the console, print the markup behind the webpage that was fetched earlier, which the response object exposes through its text attribute.
Beautiful Soup is one of the most popular Python libraries for parsing HTML or XML documents into a tree structure from which data can be found and extracted. It features a simple, Pythonic interface and automatic encoding conversion, which make it easy to work with website data.
This library provides simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree, and automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
Features of Beautiful Soup:
- This Python library provides a few simple methods, as well as Pythonic idioms for navigating, searching, and modifying a parse tree
- The library automatically converts incoming and outgoing documents to Unicode and UTF-8, respectively
- This library sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility
Scraping With Beautiful Soup
Installing Beautiful Soup 4
The Beautiful Soup library can be installed with pip using a very simple command, and it is available on almost all platforms. From a Jupyter Notebook cell, it can be installed like this:
!pip install beautifulsoup4
The BeautifulSoup class is imported from the bs4 package, and the parsed document is assigned to an object.
We will use a basic HTML document as the input to parse with Beautiful Soup.
The prettify() method then expands the HTML into its hierarchy, indented by nesting depth.
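The steps above can be sketched as follows. The HTML document here is a hypothetical stand-in for the article's sample, and html.parser is chosen only because it ships with Python; lxml or html5lib could be passed instead if they are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document used only for this illustration.
html_doc = """<html><head><title>Sample story</title></head>
<body><p class="intro">Once upon a time there were
<a href="http://example.com/one" id="link1">one</a> and
<a href="http://example.com/two" id="link2">two</a>.</p></body></html>"""

# Parse the document and assign the resulting tree to an object.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the tree as a string, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```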
Exploring The Parse Tree
To navigate the tree, we can reach tags as attributes of the parsed object, or search it with methods such as find() and find_all().
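A few common navigation calls are sketched below against a made-up document. The variable names and the HTML are assumptions for illustration; the attribute access, find(), find_all(), get() and get_text() calls are part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_doc = """<html><head><title>Sample story</title></head>
<body><p class="intro">Links:
<a href="http://example.com/one" id="link1">one</a>
<a href="http://example.com/two" id="link2">two</a></p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Tags can be reached as attributes of the parsed object.
title_text = soup.title.string        # text inside <title>
parent_name = soup.title.parent.name  # name of the enclosing tag

# Tag attributes behave like dictionary keys; class is multi-valued.
first_p_class = soup.p["class"]

# find_all() returns every matching tag, find() only the first one.
hrefs = [a.get("href") for a in soup.find_all("a")]
second_link_text = soup.find(id="link2").get_text()

print(title_text, parent_name, first_p_class, hrefs, second_link_text)
```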
Tags in a Beautiful Soup parse tree have many attributes that can be accessed and edited. The extracted, parsed data can then be saved to a text file.
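One possible way to edit the tree and save the result is sketched here. The document, the new attribute values and the output file name are all hypothetical; the file is written to the system's temporary directory so the example does not clutter the working folder.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_doc = '<p class="intro">Visit <a href="http://example.com/old" id="link1">the site</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Attributes are edited like dictionary entries on the tag.
link = soup.a
link["href"] = "http://example.com/new"

# Assigning to .string replaces the text inside the tag.
link.string = "the new site"

# Save the extracted text, followed by the modified markup, to a text file.
out_path = os.path.join(tempfile.gettempdir(), "parsed_output.txt")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(soup.get_text() + "\n")
    fh.write(str(soup) + "\n")

print("Saved to", out_path)
```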
Click here to know more about web scraping with BeautifulSoup.
Scrapy VS Beautiful Soup
Scrapy is an open-source framework, whereas Beautiful Soup is a Python library designed for quick-turnaround projects like screen-scraping. A framework inverts control of the program: it calls the developer's code and dictates what the developer must provide. With a library, by contrast, the developer calls the library where and when they need it.
Scrapy schedules requests asynchronously and has built-in support for generating feed exports in multiple formats, as well as for selecting and extracting data from various sources, so it can generally be said to be faster than Beautiful Soup on multi-page jobs. Work built on Beautiful Soup can be sped up with multithreading.
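A minimal sketch of the multithreading idea, using the standard library's thread pool, is shown below. The in-memory PAGES dictionary and the fetch_html helper are hypothetical stand-ins for real downloads; threads mainly help while a script waits on the network, so the gain in a real scraper comes from overlapping those waits.

```python
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup

# Hypothetical in-memory "site" mapping URL to HTML. A real script would
# download each page over the network instead.
PAGES = {
    "http://example.com/1": "<html><head><title>Page one</title></head></html>",
    "http://example.com/2": "<html><head><title>Page two</title></head></html>",
    "http://example.com/3": "<html><head><title>Page three</title></head></html>",
}


def fetch_html(url):
    """Stand-in for a network download (hypothetical helper)."""
    return PAGES[url]


def scrape_title(url):
    """Fetch one page and pull its title out with Beautiful Soup."""
    soup = BeautifulSoup(fetch_html(url), "html.parser")
    return soup.title.string


# Each URL is handled by a worker thread; map() keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    titles = list(pool.map(scrape_title, PAGES))

print(titles)
```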
Beautiful Soup works best on smaller projects. Scrapy, on the other hand, may be the better choice for larger and more complex projects, as the framework lets you add custom functionality and build pipelines with flexibility and speed.
For a beginner trying web scraping hands-on for the first time, Beautiful Soup is the best place to start. Scrapy can also be used, but it is comparatively more complex.
Scrapy's developer community is stronger and larger than that of Beautiful Soup. The two tools can also be combined: developers can use Beautiful Soup to parse HTML responses in Scrapy callbacks by feeding the response's body into a BeautifulSoup object and extracting whatever data they need from it.