Guide To Goutte: A Simple PHP Web Scraper


Web scraping is the automated extraction of data from websites. We have seen many web scraping tools so far, such as BeautifulSoup with Python, Diffbot (a GUI-based tool that needs no coding), and Puppeteer with Node.js. But is it possible to scrape data from a website using PHP?

Yes, Goutte makes it easy for developers to scrape data with PHP. Goutte was originally written by Fabien Potencier, the creator of the Symfony framework, and is now maintained by FriendsOfPHP. Goutte requires PHP 5.5+ and Guzzle 6+. Guzzle is a PHP HTTP client that Goutte uses to send HTTP requests. Some advantages of Guzzle are as follows:

  • Simple interface for building POST requests.
  • The same interface can send both synchronous and asynchronous requests.
  • PSR-7 interfaces for requests.
  • No hard dependency on cURL or PHP streams.

Read more about Guzzle here.
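
As a rough illustration of the points above, the sketch below sends a form-encoded POST request synchronously and a GET request asynchronously through the same client. It assumes Guzzle 6+ was installed with Composer in the same directory. The example.com URLs and the canned responses are placeholders invented for this sketch; Guzzle's MockHandler is used only so that the code runs without network access.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Queue two canned responses so the sketch runs offline (assumption for demo purposes).
$mock = new MockHandler([
    new Response(200, [], 'posted'),
    new Response(200, [], 'fetched'),
]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

// Synchronous POST with form-encoded fields.
$response = $client->request('POST', 'https://example.com/login', [
    'form_params' => ['user' => 'demo'],
]);
$postBody = (string) $response->getBody();

// Asynchronous GET: requestAsync() returns a promise, and wait() resolves it.
$promise = $client->requestAsync('GET', 'https://example.com/');
$getBody = (string) $promise->wait()->getBody();

echo $postBody, "\n", $getBody, "\n";
```

Dropping the 'handler' option from the client configuration would send real HTTP requests instead of returning the queued responses.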

PHP is a server-side scripting language and a powerful tool for building dynamic, interactive web pages. It is widely used and a strong competitor to Microsoft’s ASP. Still, when we talk about extracting data from the internet, PHP is usually the last thing that comes to mind. Goutte is built on components of the Symfony framework.

Symfony is a set of PHP components, a web application framework, a philosophy, and a community, all working together in harmony. In other words, it is both a PHP framework and a set of reusable components/libraries. Symfony was created by SensioLabs, was published as free software in 2005, and is released under the MIT license.


Symfony is used by a large number of developers and offers many useful features:

  • Supports building complex web applications.
  • Can be used as a standalone PHP micro-framework.
  • Very fast.
  • Stable framework.
  • Good open-source community and contributions.

Read more about the Symfony framework here.

Goutte provides a decent API to crawl websites and extract data from HTML/XML documents.


That means you can log in to websites, submit forms using POST, upload files, and much more, all with just the Goutte framework running on your server. You can also run it on a local computer.

Getting Started

First, let’s see how to set up a PHP environment: what the requirements are and how to install each additional dependency, one by one.

Requirements for Goutte

As discussed above, Goutte depends on:

  • PHP 5.5+. After downloading PHP, unzip the archive and add the extracted directory path to your PATH environment variable. For the installation procedure of PHP, visit here. To check that PHP is installed properly, use the command below:

php --version
  • After PHP, download Composer from here. It is a dependency manager for PHP.
  • Guzzle 6+ (use the Composer command below to install it). Read more.
composer require guzzlehttp/guzzle
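
The requirements listed above can also be checked from PHP itself. The following is a minimal sketch, assuming it is saved in the project root next to the vendor directory that Composer creates; the function name meetsPhpRequirement and the constant MIN_PHP_VERSION are hypothetical names chosen for this example.

```php
<?php
// Minimum PHP version stated in this article for Goutte.
const MIN_PHP_VERSION = '5.5.0';

/**
 * Report whether a PHP version string satisfies the minimum requirement.
 * (Hypothetical helper, not part of Goutte or Guzzle.)
 */
function meetsPhpRequirement($version, $minimum = MIN_PHP_VERSION)
{
    return version_compare($version, $minimum, '>=');
}

// Load Composer's autoloader if it exists, so installed packages can be detected.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

echo 'PHP ', PHP_VERSION, ': ', meetsPhpRequirement(PHP_VERSION) ? 'ok' : 'too old', "\n";
echo 'Guzzle: ', class_exists('GuzzleHttp\Client') ? 'installed' : 'not found', "\n";
```

Running the script with the php command prints one line per requirement, which makes a missing dependency easy to spot before moving on.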

Installing Goutte

Now install Goutte using Composer. This adds fabpot/goutte as a required dependency in your composer.json file:

composer require fabpot/goutte

Example

A web app that scrapes the repository list from your GitHub account, using Goutte, a PHP framework.

The following example is taken from here. We are going to create a script that logs in to your personal GitHub account and displays the list of all your repositories in your browser.

  1. Create a project folder and name 
  2. Download the Goutte library repository from GitHub using the command below. This gives you a directory named “Goutte”, which we are going to use in the following steps.
git clone https://github.com/FriendsOfPHP/Goutte.git
  3. Now run the Composer command below in your local project directory. It creates a composer.json file and installs Goutte and its dependencies in it.
composer require fabpot/goutte
  4. Let’s first create a basic interface so that anyone can extract their repository list from GitHub by entering their email address and password into the form below.
<form method="POST">
<div class="form-group">
<h1>Github Repositories scraper</h1>
<label for="git_email">Email address</label>
<input type="email" class="form-control" id="git_email" name="git_email" placeholder="Enter email">
</div>
<div class="form-group">
<label for="git_pwd">Password</label>
<input type="password" class="form-control" id="git_pwd" name="git_pwd" placeholder="Password">
</div>
<button type="submit" class="btn btn-primary">Submit</button>
</form>

This is the client-side user interface from which our scraper reads the username and password.

  5. Check whether the page is submitting the data. This PHP code runs first when the user clicks the submit button after filling in the form.
if (isset($_POST["git_email"]) && isset($_POST["git_pwd"]) && !empty($_POST["git_email"]) && !empty($_POST["git_pwd"])) {
    // The login and scraping code from the following steps goes here.
}
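
The same check can be wrapped in a small helper, which keeps the main script readable. This is only a sketch: the function name hasCredentials is hypothetical, and unlike empty() it treats whitespace-only input as missing and accepts the string "0".

```php
<?php
/**
 * Return true only when both credential fields are present and non-empty.
 * The field names match the form above; the function name is hypothetical.
 */
function hasCredentials(array $post)
{
    foreach (['git_email', 'git_pwd'] as $field) {
        // Missing keys, non-string values, and whitespace-only values all count as empty.
        if (!isset($post[$field]) || !is_string($post[$field]) || trim($post[$field]) === '') {
            return false;
        }
    }
    return true;
}

if (hasCredentials($_POST)) {
    // Safe to continue with the login and scraping steps.
}
```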
  6. Import the libraries. We load Composer's vendor/autoload.php, which autoloads PHP classes, and Client.php from the Goutte directory that we downloaded with the git clone command. Then we create the client.
require_once("vendor/autoload.php");
require_once("Goutte/Goutte/Client.php");
// Client is defined in the Goutte namespace, so refer to it by its fully qualified name.
$client = new \Goutte\Client();
  7. Inspect the GitHub login page and look for the login form elements: username and password.
  8. Initialize the crawler variable by sending a GET request to the login URL.
$crawler = $client->request('GET', 'https://github.com/login');
  9. Now that we know which elements hold the username and password, select the form and set its parameters:
$form = $crawler->selectButton('Sign in')->form();

$form->setValues(['login' => $_POST["git_email"], 'password' => $_POST["git_pwd"]]);

$crawler = $client->submit($form);
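
To see what selectButton(), form(), and setValues() do without contacting GitHub, the sketch below runs them against a made-up login page using Symfony's DomCrawler component, which Goutte builds on. It assumes the Composer install from earlier, so vendor/autoload.php exists next to the script. The HTML, the example.com URL, and the credentials are all invented for illustration.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// A made-up stand-in for a login page; real pages will have more fields.
$html = <<<'HTML'
<html><body>
<form action="/session" method="post">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" value="Sign in">
</form>
</body></html>
HTML;

// The second argument is the page URI, used to resolve the form's action.
$crawler = new Crawler($html, 'https://example.com/login');

// Locate the form through its submit button, then fill in the fields.
$form = $crawler->selectButton('Sign in')->form();
$form->setValues(['login' => 'demo@example.com', 'password' => 'secret']);

$values = $form->getValues();
echo $form->getMethod(), ' ', $form->getUri(), "\n";
echo 'login field: ', $values['login'], "\n";
```

In the real script, passing the filled form to $client->submit($form) is what sends these values to the resolved URI.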
  10. Check whether the login was successful by looking for the meta tag named “octolytics-actor-login”, whose content is the username.
$username = "";
// Capture $username by reference so the closure can update it.
$crawler->filter('meta')->each(function ($node) use (&$username) {
    if (trim((string) $node->attr("name")) === "octolytics-actor-login") {
        $username = $node->attr("content");
    }
});
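
One possible alternative to looping over every meta tag is a CSS attribute selector that targets the tag directly. The sketch below assumes the Composer install from earlier and uses made-up HTML fragments in place of GitHub's pages; the function name findLoggedInUser and the username octocat are hypothetical.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Return the content of <meta name="octolytics-actor-login">, or null when
 * the tag is absent (which this article treats as a failed login).
 * Hypothetical helper, not part of Goutte.
 */
function findLoggedInUser(Crawler $crawler)
{
    $meta = $crawler->filter('meta[name="octolytics-actor-login"]');
    // attr() throws on an empty node list, so check count() first.
    return $meta->count() > 0 ? $meta->attr('content') : null;
}

// Made-up page fragments for demonstration only.
$loggedIn = new Crawler('<html><head><meta name="octolytics-actor-login" content="octocat"></head><body></body></html>');
$loggedOut = new Crawler('<html><head><title>Sign in</title></head><body></body></html>');

var_dump(findLoggedInUser($loggedIn));
var_dump(findLoggedInUser($loggedOut));
```

Returning null instead of an empty string makes it easy to stop early when the login did not succeed.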
  11. Navigate to the URL of the user’s GitHub repositories page, for example https://github.com/mmaithani?tab=repositories
$crawler = $client->request('GET', 'https://github.com/'.$username.'?tab=repositories');
  12. Let’s inspect the repository page. All the repository names are inside anchor tags within list items that have the class “source”. We can use the filter function to extract the text from these anchor tags using filter('li.source a').
$crawler->filter('li.source a')->each(function ($node) {
    // Skip links whose text is purely numeric.
    if (is_numeric($node->text()) === false) {
        echo $node->text();
        echo "<br/>";
    }
});
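
If the names are needed as data rather than printed directly, each() can return them as an array. The sketch below assumes the Composer install from earlier and runs against made-up markup that mimics the structure described in this step; the function name extractRepositoryNames and the sample repository names are hypothetical.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Collect link texts from "li.source a", skipping purely numeric texts
 * such as counters. Hypothetical helper, not part of Goutte.
 */
function extractRepositoryNames(Crawler $crawler)
{
    // each() returns an array built from the closure's return values.
    $texts = $crawler->filter('li.source a')->each(function (Crawler $node) {
        return trim($node->text());
    });

    // Drop empty and numeric entries, then reindex the array from zero.
    return array_values(array_filter($texts, function ($text) {
        return $text !== '' && !is_numeric($text);
    }));
}

// Made-up markup for demonstration only.
$html = '<html><body><ul>'
    . '<li class="source"><a href="/u/alpha">alpha</a> <a href="/u/alpha/stargazers">12</a></li>'
    . '<li class="source"><a href="/u/beta">beta</a></li>'
    . '<li class="fork"><a href="/u/gamma">gamma</a></li>'
    . '</ul></body></html>';

print_r(extractRepositoryNames(new Crawler($html)));
```

Collecting the names first also makes it possible to escape them with htmlspecialchars() before echoing them into the page.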
  13. Before running the final script, let’s look at the files and directories we have. index.php is our main script, which is inspired by this article. The Goutte directory is essential for this project, the vendor directory contains our autoloader PHP script, and composer.json is our dependencies file.
  14. Let’s run our script in the browser. To start a lightweight server for our web app, open a terminal/command prompt (CMD) in this directory and use the following command:
php -S 127.0.0.1:8000
  15. Now go to http://127.0.0.1:8000/. It loads index.php automatically and shows the login form. Enter your GitHub credentials and click Submit.
  16. The output is shown in the browser: the full list of repositories, including private ones, because we scraped the data after logging in and therefore have full access to the user’s data.

Conclusion

Goutte is quite fast, can imitate basic user actions, supports async requests, and does not even require a browser.

We saw a web application that scrapes data from the user’s GitHub account. Goutte is a friend of PHP: it can work with a client-side application, take user input, and scrape accordingly with full control over the account. Goutte has some cons too: it doesn’t support JavaScript, and it can’t take screenshots as we can with Puppeteer. Overall, Goutte is a lightweight wrapper on top of the best frameworks.


Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
