Guide To Goutte: A Simple PHP Web Scraper

Web scraping is a way to automatically extract data from the internet. We have seen many web scraping tools so far, such as BeautifulSoup with Python, Diffbot (a GUI-based tool that requires no coding), and Puppeteer with Node.js. But is it possible to scrape data from a website using PHP?

Yes. Goutte makes it easy for developers to scrape data with PHP. Goutte was originally written by Fabien Potencier, the creator of the Symfony framework, and is now maintained by FriendsOfPHP. The library requires PHP 5.5+ and Guzzle 6+. Guzzle is a PHP HTTP client that Goutte uses to send HTTP requests. Some advantages of Guzzle are as follows:

  • Simple interface for building POST requests.
  • The same interface can send both synchronous and asynchronous requests.
  • PSR-7 interfaces for requests.
  • No hard dependency on cURL or PHP streams.

Read more about Guzzle here.

PHP is a server-side scripting language and a powerful tool for making dynamic and interactive web pages. PHP is widely used and is a strong competitor to Microsoft’s ASP. Yet when we talk about data extraction from the internet, PHP is often the last thing that comes to mind. Goutte is built on components of the Symfony framework.

Symfony is a set of PHP components, a web application framework, a philosophy, and a community, all working together in harmony. In other words, it is both a PHP framework and a set of reusable components/libraries. Symfony was created by SensioLabs, was published as free software in 2005, and is released under the MIT license.

Symfony is used by a large number of developers and offers many useful features:

  • Suitable for creating complex web applications.
  • Can be used as a standalone PHP micro-framework.
  • Very fast.
  • Stable framework.
  • Good open source community and contributions.

Read more about the Symfony framework here.

Goutte provides a decent API to crawl websites and extract data from HTML/XML documents.

That means you can log in to websites, submit forms using POST, upload files, and much more, all with the Goutte library on your server. You can also run it on a local computer.
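As a rough illustration of that API, the sketch below fetches one page and collects the text of its top-level headings. It assumes Goutte has already been installed with Composer (covered in the next section); the URL and the `h1` selector are placeholders chosen for this example, not part of the original article.

```php
<?php
// Assumes `composer require fabpot/goutte` has been run in this directory.
require_once __DIR__ . '/vendor/autoload.php';

$client = new \Goutte\Client();
$headings = [];

try {
    // Placeholder URL chosen for illustration; it is not from the article.
    $crawler = $client->request('GET', 'https://example.com');

    // filter() takes a CSS selector; each() calls the function for every
    // matching node and returns the results as an array.
    $headings = $crawler->filter('h1')->each(function ($node) {
        return trim($node->text());
    });
} catch (\Exception $e) {
    // Network or HTTP client failures end up here.
    echo 'Request failed: ', $e->getMessage(), PHP_EOL;
}

print_r($headings);
```

Wrapping the request in a try/catch keeps the script from aborting when the site is unreachable, which is worth doing in any scraper that runs unattended.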

Getting Started

First, let’s see how to set up a PHP environment: what the requirements are and how to install each additional package one by one.

Requirements for Goutte

As discussed above, Goutte depends on:

  • PHP 5.5+. After downloading PHP, unzip it and add the extracted directory path to your environment variable. For the installation procedure of PHP, visit here. To check that PHP is installed properly, use the command below:

php --version
  • After PHP, download Composer from here. It is a dependency manager for PHP.
  • Guzzle 6+ (use the Composer command below to install it). Read more.
composer require guzzlehttp/guzzle

Installing Goutte

Now install Goutte using Composer. This adds fabpot/goutte as a required dependency in your composer.json file:

composer require fabpot/goutte
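
One way to confirm the installation is to check that Composer’s autoloader can find the Goutte client class. This is a minimal sketch, assuming the script is saved in the project root next to the vendor directory; the variable names are illustrative.

```php
<?php
// Path to the autoloader that Composer generates in the project root.
$autoload = __DIR__ . '/vendor/autoload.php';
$installed = false;

if (is_file($autoload)) {
    require_once $autoload;
    // class_exists() triggers the autoloader, so this is true only when
    // the Goutte package is actually installed.
    $installed = class_exists('Goutte\\Client');
}

echo $installed
    ? 'Goutte is available.'
    : 'Goutte was not found; run "composer require fabpot/goutte" first.';
echo PHP_EOL;
```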

Example

A web app that scrapes the repository list from your GitHub account, using the Goutte PHP library!

The following example is taken from here. We are going to create a script that logs in to your personal GitHub account and scrapes the list of all your repositories into your browser.

  1. Create a project folder and give it a name.
  2. Clone the Goutte library repository from GitHub using the command below. You will get a directory named “Goutte”, which we are going to use in the following steps.
git clone https://github.com/FriendsOfPHP/Goutte.git
  3. Now use the Composer command below to initialize your local directory with a composer.json file and install Goutte and its dependencies into it.
composer require fabpot/goutte
  4. Let’s first create a basic interface so that anyone can extract their repository list from GitHub by entering their username and password into the form below.
<form method="POST">
<div class="form-group">
<h1>Github Repositories scraper</h1>
<label for="git_email">Email address</label>
<input type="email" class="form-control" id="git_email" name="git_email" placeholder="Enter email">
</div>
<div class="form-group">
<label for="git_pwd">Password</label>
<input type="password" class="form-control" id="git_pwd" name="git_pwd" placeholder="Password">
</div>
<button type="submit" class="btn btn-primary">Submit</button>
</form>

This is the client-side user interface from which our scraper reads the username and password.

  5. Check whether the page has submitted the data. This PHP code runs first when the user hits the submit button after filling in the form.
// !empty() is false for missing keys, so it also covers the isset() check.
if (!empty($_POST["git_email"]) && !empty($_POST["git_pwd"])) {
}
  6. Load the libraries: vendor/autoload.php, which autoloads PHP classes, and Client.php from the Goutte directory that we downloaded with the git clone command. Then create the client.
require_once("vendor/autoload.php");
require_once("Goutte/Goutte/Client.php");
// Client lives in the Goutte namespace, so use its fully qualified name.
$client = new \Goutte\Client();
  7. Inspect the GitHub login page and find the login elements: username and password.
  8. Initialize the crawler variable by sending a GET request to the login URL.
$crawler = $client->request('GET', 'https://github.com/login');
  9. Now that we know where the username and password fields are, select the form and set its values:
$form = $crawler->selectButton('Sign in')->form();

$form->setValues(['login' => $_POST["git_email"], 'password' => $_POST["git_pwd"]]);

$crawler = $client->submit($form);
  10. Check whether the login was successful by looking for a meta tag with the name “octolytics-actor-login”.
$username = "";
// Capture $username by reference so the callback can set the outer variable.
$crawler->filter('meta')->each(function ($node) use (&$username) {
    if (trim((string) $node->attr("name")) == "octolytics-actor-login") {
        $username = $node->attr("content");
    }
});
  11. Navigate to the URL of the user’s repository list, for example https://github.com/mmaithani?tab=repositories
$crawler = $client->request('GET', 'https://github.com/'.$username.'?tab=repositories');
  12. Let’s inspect the repository page. All the repository names are inside anchor tags within list items that have the class “source”. We can use the filter function to extract the text from these anchor tags using filter('li.source a').
$crawler->filter('li.source a')->each(function ($node) {
    if (is_numeric($node->text()) === false) {
        echo $node->text();
        echo "<br/>";
    }
});
  13. Before running the final script, let’s look at the files and directories we have. index.php is our main script, which is inspired by this article. The Goutte directory is essential for this project, vendor contains our autoloader PHP script, and composer.json is our dependencies file.
  14. Let’s run our script in the browser. To start a lightweight server for our web app, open a terminal/command prompt (CMD) in this directory and use the following command:
php -S 127.0.0.1:8000
  15. Now go to http://127.0.0.1:8000/. It will load index.php automatically and show the login form. Log in with your GitHub account and click Submit.
  16. The output is shown in the browser: the full list of repositories, including private ones, because we scraped the data after logging in and therefore have full access to the user’s data.
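
The two extraction steps in the walkthrough (finding the login meta tag and listing the repository links) can also be tried offline, without logging in to GitHub. The sketch below runs Symfony’s DomCrawler, which is installed alongside Goutte, against a small inline HTML sample. The markup and the names in it (octocat, hello-world, spoon-knife) are hypothetical and heavily simplified; the real GitHub pages are more complex and may have changed. It also swaps the loop over every meta tag for an attribute selector, which is one possible alternative rather than the article’s method.

```php
<?php
// Assumes Goutte (and with it symfony/dom-crawler and symfony/css-selector)
// was installed via Composer in this directory.
require_once __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical, simplified markup standing in for the GitHub pages.
$html = <<<'HTML'
<html>
  <head>
    <meta name="octolytics-actor-login" content="octocat">
  </head>
  <body>
    <ul>
      <li class="source"><a href="/octocat/hello-world">hello-world</a> <a href="/octocat/hello-world/stargazers">12</a></li>
      <li class="source"><a href="/octocat/spoon-knife">spoon-knife</a></li>
      <li class="fork"><a href="/octocat/some-fork">some-fork</a></li>
    </ul>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// An attribute selector finds the meta tag directly; an empty match means
// the login marker is absent.
$meta = $crawler->filter('meta[name="octolytics-actor-login"]');
$username = $meta->count() > 0 ? $meta->attr('content') : '';

// each() returns the callback results as an array; numeric link texts
// (such as counters) are dropped, as in the article's loop.
$texts = $crawler->filter('li.source a')->each(function (Crawler $node) {
    return trim($node->text());
});
$repos = array_values(array_filter($texts, function ($text) {
    return !is_numeric($text);
}));

echo $username, PHP_EOL;
echo implode(PHP_EOL, $repos), PHP_EOL;
```

With this sample, the fork entry is skipped because it does not match li.source, and the link whose text is 12 is skipped by the is_numeric check. Testing selectors against saved markup like this makes it easier to spot when a site changes its HTML.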

Conclusion

Goutte is quite fast, can imitate basic user actions, supports async requests, and doesn’t even require a browser.

We built a web application that can scrape data from a user’s GitHub account. Goutte, from FriendsOfPHP, works behind a client-side interface: it can take user input and scrape accordingly, with full access to the account. Goutte has some drawbacks too: it doesn’t support JavaScript and can’t take screenshots as Puppeteer can. Still, Goutte is a lightweight wrapper on top of well-established libraries.

Copyright Analytics India Magazine Pvt Ltd