Cheerio: A Simple Tool to Create Your Web Scraping Bot

Web scraping is a technique of using robot scripts to crawl the internet for you and return reliable data. Implemented correctly, with the right toolset and programming language, web scrapers can crawl thousands of websites in minutes. It is a powerful way of obtaining large amounts of information that can then be cleaned and processed to extract insights. Web scraping is even used against counterfeit goods: scrapers can surf the internet for fake products, and since the results include links to all the offending sites, reporting them becomes easy. Before the web scraping era, manually searching and going through each website yourself was a very hectic job. While web scraping may seem simple, the actual process is not; it is a complex task that requires technical knowledge. Fortunately, the industry has introduced tools such as Diffbot and ParseHub that can be used without that technical knowledge.

Today we are going to discuss Cheerio, a Node.js framework that helps in interpreting and analyzing web pages using a jQuery-like syntax.

Cheerio is a fast, flexible, and lean implementation of core jQuery for the server. But why do we need it when we have Puppeteer, another Node.js-based web scraping tool? Puppeteer is used more for automating browser tasks, since it drives a real browser and supports real-time visual surfing of the internet as the script runs. Puppeteer works on websites built with Angular and React, and it can take screenshots and generate PDFs; but in terms of speed, Cheerio is the clear winner. It is a minimalist tool for web scraping, and you can combine it with other modules to build an end-to-end script that saves its output to CSV and returns other things too. Cheerio is a perfect fit for web scraping tasks: it works directly on raw HTML data, with no browser involved.

Also, Cheerio works with a simple, consistent DOM model, as shown below:

Figure: the DOM model used by Cheerio (image credit)

The Document Object Model (DOM) is an interface that treats an HTML or XML document as a tree-like structure, as shown in the image above, where each node is an object representing a part of the document. The DOM represents a document in logical tree order, and each branch of the tree ends in a node containing objects. For instance, a fragment like <ul><li>Apple</li></ul> becomes a tree in which the ul element is the parent node and the li element, with its text, is its child.

The history of the DOM traces back to the “browser wars” of the late 1990s:

  • DOM Level 1 provided a complete model for an entire HTML or XML document.
  • DOM Level 2, released in late 2000, introduced the getElementById() function and added support for XML namespaces and CSS.
  • DOM Level 3, released in 2004, added support for XPath and keyboard event handling.
  • DOM Level 4, released in 2015, was a snapshot of the WHATWG standard.

Cheerio is not a browser; it is a Node.js module that parses markup and provides an API for manipulating the resulting data structure. It does not interpret the result the way a web browser does. Specifically, it does not produce a visual representation, apply CSS, load external resources, or execute JavaScript.
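As a quick illustration of this parse-and-manipulate model, here is a minimal sketch taken from Cheerio's own documentation; the markup string is just an example:

const cheerio = require('cheerio');

// Parse a markup string into a queryable document
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Manipulate it with jQuery-like calls, then serialize back to HTML
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

console.log($.html());
//output=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>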

Features

  • Familiar syntax
  • Loads HTML as a string
  • Blazingly fast
  • Consistent DOM model
  • Incredibly flexible
  • Uses @FB55’s forgiving htmlparser2

Installation

To use npm commands, first install Node.js from here; it comes with the node package manager (npm) prebuilt.

npm i cheerio

Quickstart

Initialize project

Always initialize a project in a separate folder with a separate environment using the npm commands below; npm then records every dependency you install in that project.

mkdir foldername
cd foldername

# Initialize npm and
# store dependencies in the node_modules folder
npm init -y

# install cheerio in this environment
npm i cheerio

The above commands create a package.json file in your directory, which holds information about the project’s dependencies, author, GitHub repository, and version.
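For reference, the package.json generated by npm init -y looks roughly like this once cheerio is installed (the name mirrors your folder name, and version numbers will vary):

{
  "name": "foldername",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cheerio": "^1.0.0-rc.3"
  }
}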

Now, open your command prompt or terminal and type node to start the Node.js REPL, then run the command below to check that everything is installed and working; it imports the cheerio module.

const cheerio = require('cheerio');

Cheerio Methods Explanation

As an example, we will use this small HTML document. Save this code as file.html in the same directory.

<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>

Now, to load a file from the local directory, we use one more module: fs, the file system module, which handles files much as we read lines or update data in files in Python and other languages. fs ships as a core module with Node.js, so there is nothing extra to install; you simply require it. (Cheerio is already installed from the quickstart above.)

Loading

To load an HTML document with cheerio, use the following commands:

// importing modules
const cheerio = require('cheerio');
const fs = require('fs');

// loading the HTML document
const $ = cheerio.load(fs.readFileSync('file.html', 'utf8'));

Selector

Cheerio’s selector commands are simple and identical to jQuery’s, if you are familiar with that library. Append the command below to the end of your JavaScript file and save it as demo.js.

$('.apple', '#fruits').text()
//output=> Apple

Similarly,

$('ul .pear').attr('class')
//output=> pear

$('li[class=orange]').html()
//output=> Orange

Attributes

.attr(name, value) is used to read and modify attributes. If you set an attribute’s value to null, the attribute is removed. You can also pass a map or a function.

$('ul').attr('id')
//output=> fruits

$('.apple').attr('id', 'favorite').html()
//output=> Apple
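As noted above, passing null as the value removes the attribute; here is a minimal sketch on the same fruits document:

// Setting an attribute's value to null removes that attribute
$('.pear').attr('class', null)

$('li').last().attr('class')
//output=> undefined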

.prop(name, value) is used for getting and setting properties.

$('input[type="checkbox"]').prop('checked')

$('input[type="checkbox"]').prop('checked', true).val()

.data(name, value) is used for getting and setting data attributes. Run the commands below and observe the output.

$('<div data-apple-color="red"></div>').data()
//output=> { appleColor: 'red' }

$('<div data-apple-color="red"></div>').data('apple-color')
//output=> red

const apple = $('.apple').data('kind', 'mac')
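Once set, the data attribute can be read back by name; per the cheerio documentation, the last command above stored the value on the selection held in apple:

apple.data('kind')
//output=> mac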

There are many other methods. For forms, we have methods like .serialize() and .serializeArray().

For traversing, we have .find(selector), .nextAll([selector]), .filter(selector), .filter(selection), .filter(element), .filter(function(index, element)), .not(selector), .not(selection), .not(element), .not(function(index, elem)), and .has(selector).

For manipulation, we have .append(content, [content, ...]), .appendTo(target), .html([htmlString]), .css([propertyName]), .css([propertyNames]), .css([propertyName], [value]), .css([propertyName], [function]), and .css([properties]).

Read more about their usage here.
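As a quick taste of the traversal and manipulation methods, here is a short sketch on the same fruits document; the outputs follow the cheerio documentation:

// Traversing: find the <li> nodes inside the list
$('#fruits').find('li').length
//output=> 3

// Manipulation: append a new item, then read it back
$('ul').append('<li class="plum">Plum</li>')
$('#fruits .plum').text()
//output=> Plum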

Let’s Scrape Data from Wikipedia

We are going to scrape the table rows from the Wikipedia page List of largest companies by revenue, here. This example is adapted from StackOverflow; we will see how the script extracts the table from Wikipedia and exports it to CSV for our further data analysis.

  1. Install the additional modules this script needs: request-promise (with its companion request) for HTTP requests and json2csv for exporting our data; fs, used for file handling, is already built into Node.js. Install them with the commands below:
npm i request
npm i request-promise
npm i json2csv
  2. Import the libraries; note that the request-promise module is bound to the name req, which the request call in the later steps uses:
const req = require("request-promise")
const cheerio = require("cheerio")
const fs = require("fs")
const json2csv = require("json2csv").Parser
  3. Load the link into the wiki variable:
const wiki = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue";
  4. Now let’s make the request. Here await works on a function that returns a promise; uri is the Wikipedia page link, the destination of the HTTP request; headers is an object of HTTP headers (key-value pairs); and gzip: true sets the accept-encoding.
(async () => {
        const response = await req({
            uri: wiki,
            headers: {
                accept:
                    "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "accept-encoding": "gzip, deflate, br",
                "accept-language": "en-US,en;q=0.9"
            },
            gzip: true,
  5. Inspect the webpage, as we have already discussed a couple of times in previous articles, such as the one on Beautiful Soup. Right-click -> Inspect on the target webpage opens the page source in a panel, where you can look for all the elements holding your data, as shown below. Note the tags, classes, and ids for the extraction step.
Figure: inspecting the webpage before scraping
  6. Now load the source code with cheerio and initialize two arrays, data and data2. Collect the required elements from each table row as demonstrated below, then push them into the data array.
}).then(function (html) {
            let $ = cheerio.load(html);
            let data = [];
            let data2 = [];
            let name, rank, cols, col;
            let rows = $('table.wikitable tbody tr').each((idx, elem) => {
                rank = $(elem).find('th').text().replace(/[\n\r]+/g, '');
                //name = $(elem).find('td a').html();
                data2 = [];
                cols = $(elem).find('td').each((colidx, colelem) => {
                    col = $(colelem).text().replace(/[\n\r]+/g,'');
                    data2.push(col);
                });

                data.push({
                    rank,
                    ...data2,
                });
            });
  7. To export the data to CSV, use the lines below. We use the json2csv module to convert our data from JSON objects into CSV format, and then, with the help of the file system module fs, we write the result to output.csv.
            // exporting data into csv
            const j2cp = new json2csv()
            const csv = j2cp.parse(data);

            fs.writeFileSync("./output.csv", csv, "utf-8");
        }).catch(function (err) {
            console.log(err);
        });
    }
    )();
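Assuming the numbered snippets above are saved together, in order, as demo.js, the whole scraper runs with a single command and writes output.csv to the current folder:

node demo.js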

Output

Figure: spreadsheet of the data extracted from Wikipedia

Conclusion

Cheerio is a great tool that performs very fast in web scraping tasks. We learned about some of Cheerio’s methods, went into the history of the DOM model, and finished with a web scraper that can pull data from Wikipedia in seconds. Cheerio is actively maintained by cheeriojs, so if you are interested in contributing, you can start by reading their contribution instructions here. Cheeriojs has other projects too, like dom-serializer (renders DOM nodes), jsdoc (an API documentation generator for JavaScript), and cheerio-select (a CSS selector engine supporting jQuery selectors). You can learn more about cheerio here.


Mohit Maithani

Mohit is a data and technology enthusiast with good exposure to solving real-world problems in various avenues of IT and the deep learning domain. He believes in solving humans’ daily problems with the help of technology.