Web scraping is a technique for extracting data from the internet, which has become the main source of information. As the amount of data on the Web grows, web scraping techniques are trending more and more; you can see below how the worldwide popularity of web scraping has increased over the years.
There are many web scraping tools on the market: some offer a free service, some are paid, and others provide graphical interfaces for non-coders. Based on these different needs, we have already covered many web scraping tools, such as:
- Goutte (a PHP based web scraper)
- Cheerio (Node.js based web scraping tool)
- Puppeteer (Node.js framework)
- MechanicalSoup (Python framework)
- Diffbot (Graphical interface for web scraping without coding)
- Parsehub (No need for coding, full GUI enabled)
- BeautifulSoup (Most famous of all mostly used by developers)
- Selenium (Headless browsing, realtime web scraping)
- Urllib & Requests (Python-based scraping tools)
ScrapingBee
Today we are going to discuss ScrapingBee, which is very popular and used by many Fortune companies on a daily basis for web scraping tasks.
ScrapingBee is a web scraping tool created by Kevin Sahin and Pierre de Wulf. Kevin is a web scraping expert and the author of a Java web scraping book; Pierre is a data scientist. ScrapingBee makes scraping the web easy, and because it is exposed as an API, there is no need to worry about programming languages: it works well with every language. It solves common problems such as running headless Chrome on a server and dynamically rotating proxy IPs so you never get blocked. ScrapingBee waits about 2000 milliseconds before returning the source code, because it renders a page's HTML in a headless environment just like a real browser does. Here are some of the features you should know before getting started.
Features of ScrapingBee
- Used for price monitoring and other web scraping stuff.
- Extracting data without getting blocked.
- Uses a large proxy pool.
- No rate limit barrier due to dynamic proxies.
- Lead generation directly from Google Sheets.
- No need to worry about running headless Chrome on a server.
Getting Started
Go to the ScrapingBee website and sign up. They provide a free plan that includes 1,000 free API calls, which is enough to learn and test this API.
Now access the dashboard and copy the API key; we will need it later in this tutorial. ScrapingBee offers multi-language support, so you can use the API key directly in your projects from now on.
Installation
ScrapingBee provides a REST API, so it can be used from any language or tool, such as cURL, Python, Node.js, Java, Ruby, PHP, and Go. We are going to use Python with the Requests library and BeautifulSoup for further scraping. Install them using pip as follows:
# Install the Python Requests library:
pip install requests
# Additional modules we will need during this tutorial:
pip install beautifulsoup4
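Note that the import name differs from the package name: the package installs as beautifulsoup4 but is imported as bs4. A quick sanity check that both libraries installed correctly:

```python
# Verify that both libraries import correctly
import requests
import bs4

print("requests", requests.__version__)
print("bs4", bs4.__version__)
```

If either import fails, re-run the pip commands above before continuing.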
Quickstart
Use the code below to make your first call to the ScrapingBee web API. Here we create a GET request with the target URL and our API key as parameters, and in return the API responds with the HTML content of the target page.
Python
import requests

def send_request():
    response = requests.get(
        url="https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": "INSERT-YOUR-API-KEY",
            "url": "https://example.com/",
        },
    )
    print('Response HTTP Status Code: ', response.status_code)
    print('Response HTTP Response Body: ', response.content)

send_request()
We can make this output more readable by passing it through BeautifulSoup and calling prettify(); see the BeautifulSoup documentation to learn more.
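For instance, feeding the response body into BeautifulSoup and calling prettify() re-indents the HTML. Here is a minimal offline sketch, using a hard-coded snippet in place of a live API response:

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for response.content from the API call above
raw_html = "<html><body><h1>Example Domain</h1><p>Sample text.</p></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")
print(soup.prettify())  # each tag on its own line, nested tags indented
```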
Encoding
You can also encode the URL you want to scrape by using urllib.parse as follows:
import urllib.parse
encoded_url = urllib.parse.quote("YOUR URL")
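Encoding matters when the target URL itself contains query parameters or spaces, which would otherwise be misread as part of the API request. A sketch with a hypothetical search URL (passing safe="" so that slashes are encoded too):

```python
import urllib.parse

# Hypothetical target URL with characters that need escaping
raw_url = "https://example.com/search?q=web scraping&page=1"
encoded_url = urllib.parse.quote(raw_url, safe="")
print(encoded_url)
# → https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dweb%20scraping%26page%3D1
```

Note that when you pass the URL via the params dict of requests.get, as in the quickstart above, Requests performs this encoding for you.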
The ScrapingBee API also supports other languages, for example:
Java
import java.io.IOException;
import org.apache.http.client.fluent.*;

public class SendRequest
{
    public static void main(String[] args) {
        sendRequest();
    }

    private static void sendRequest() {
        // Classic (GET)
        try {
            // Create request
            Content content = Request.Get("https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL")
                // Fetch request and return content
                .execute().returnContent();
            // Print content
            System.out.println(content);
        }
        catch (IOException e) { System.out.println(e); }
    }
}
PHP
<?php
// get cURL resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL');
// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// send the request and save response to $response
$response = curl_exec($ch);
// stop if fails
if (!$response) {
    die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}
echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;
// close curl resource to free up system resources
curl_close($ch);
?>
Let’s scrape data from OLX using the ScrapingBee API
We will write a simple Python script using Requests, BeautifulSoup, and the ScrapingBee API for the URL request, and we will extract all the tablets listed on OLX along with their names and prices:
- Import the modules
# Import modules
import requests
from bs4 import BeautifulSoup
from time import sleep
- Initialize the URL and API parameters for our web API call, which will return the web page's source code.
KEY = 'Your_API_key'
URL = 'https://www.olx.in/tablets_c1455'
params = {'api_key': KEY, 'url': URL, 'render_js': 'false'}
- Send the request to the ScrapingBee web API and store the response in the variable "r":
r = requests.get('https://app.scrapingbee.com/api/v1/', params=params, timeout=20)
- Inspect OLX to find where the product names and prices live: open https://www.olx.in/tablets_c1455, right-click anywhere on the page and choose Inspect to open developer mode, then use the cursor icon at the top-left corner above the source code to start inspecting the page.
- Let’s check the API by verifying the returned status code, then use BeautifulSoup to select all elements with the class 'EIR5N':
if r.status_code == 200:
    html = r.text
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.EIR5N')
- Loop over the selected elements and extract the product name and price with find(), looking for the span tags where data-aut-id is itemTitle and itemPrice:
for span in links:
    product_name = span.find('span', {'data-aut-id': 'itemTitle'})
    print(product_name.text)
    price = span.find('span', {'data-aut-id': 'itemPrice'})
    print(price.text)
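The EIR5N class and data-aut-id attributes are specific to OLX's current markup and may change. The extraction logic itself can be tested offline against a hard-coded sample that mimics that structure:

```python
from bs4 import BeautifulSoup

# Hard-coded sample mimicking the OLX listing markup the selectors assume;
# the product names and prices here are made up for illustration
sample_html = """
<ul>
  <li class="EIR5N">
    <span data-aut-id="itemTitle">Samsung Galaxy Tab A7</span>
    <span data-aut-id="itemPrice">Rs 12,000</span>
  </li>
  <li class="EIR5N">
    <span data-aut-id="itemTitle">Apple iPad Mini</span>
    <span data-aut-id="itemPrice">Rs 25,500</span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for item in soup.select(".EIR5N"):
    name = item.find("span", {"data-aut-id": "itemTitle"})
    price = item.find("span", {"data-aut-id": "itemPrice"})
    print(name.text, "-", price.text)
```

If OLX renames the class, re-inspect the page and update the selector; the rest of the loop stays the same.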
- Full code
import requests
from bs4 import BeautifulSoup
from time import sleep

def main():
    KEY = 'Your_API_key'
    URL = 'https://www.olx.in/tablets_c1455'
    params = {'api_key': KEY, 'url': URL, 'render_js': 'false'}
    r = requests.get('https://app.scrapingbee.com/api/v1/', params=params, timeout=20)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        classes = soup.select('.EIR5N')
        for span in classes:
            product_name = span.find('span', {'data-aut-id': 'itemTitle'})
            print(product_name.text)
            price = span.find('span', {'data-aut-id': 'itemPrice'})
            print(price.text)

main()
Output
Conclusion
In this tutorial, we learned about ScrapingBee, an API for web scraping. This API is special because it provides JavaScript rendering of pages, something that otherwise requires tools such as Selenium with headless browsing; the rendering is based on the DOM model. We also walked through an example in which we scraped product names and prices from OLX using this API.
Remember, ScrapingBee is not a scraping tool in itself; it is a web API that works alongside your scraping scripts when websites impose heavy restrictions and you need a solution that still delivers output without ever getting blocked. The ScrapingBee API can request a single URL 1,000 times without being blocked, returns source code very quickly, and is very simple to use. For more information about this API, you can follow the official documentation.