Powerful Custom Entities with the Diffbot PHP Client

Bruno Skvorc

A while back, we looked at Diffbot, the machine learning AI for processing web pages, as a means to extract SitePoint author portfolios. That tutorial focused on using the Diffbot UI only, and consuming the resulting API meant calling its endpoint manually. Additionally, the design of the pages we processed has changed since then, so the API no longer works reliably.

In this tutorial, apart from rebuilding the API so that it works again, we’ll use the official Diffbot client to build custom entities that correspond to the data we seek (author portfolios).


Bootstrapping

We’ll be using Homestead Improved as usual. The following few commands will bootstrap the Vagrant box, create the project folder, and install the Diffbot client.

git clone https://github.com/swader/homestead_improved hi_diffbot_authorfolio; cd hi_diffbot_authorfolio
./bin/folderfix.sh
vagrant up; vagrant ssh
mkdir -p Code/Project/public; cd Code/Project; touch public/index.php
composer require swader/diffbot-php-client

Additionally, we can install Symfony’s vardumper as a development requirement, just to get prettier debug outputs.

composer require symfony/var-dumper --dev

If we now give index.php the following content, and provided we added homestead.app to our host machine’s /etc/hosts file, we should see “Hello World” when we visit http://homestead.app in our browser:

<?php
// public/index.php

// Resolve the autoloader relative to this file, not the working directory
require __DIR__ . '/../vendor/autoload.php';

echo "Hello World";

Diffbot Initialization

Note that to follow along, you’ll need a free Diffbot token – get one here.

define('TOKEN', 'token');
use Swader\Diffbot\Diffbot;

$d = new Diffbot(TOKEN);

This is all we need to init Diffbot. Let’s test it on a sample article.

echo $d->createArticleAPI('https://www.sitepoint.com/crawling-searching-entire-domains-diffbot')->call()->getAuthor(); // Bruno Skvorc

Custom API

First, we need to rebuild our API from the last post, so that it can become operational again. We do this by logging into the dev panel and going to https://www.diffbot.com/dev/customize/.

Let’s create a new API:

API creation screenshot

After entering a sample URL like www.sitepoint.com/author/bskvorc/, we can add some custom fields, like author:

Author field creation

We can use this same approach to define fields like bio and nextPage. Defining nextPage is what activates Diffbot’s automatic pagination:

Next page field creation

We also need to define a collection which will gather all the article cards and process them. Making a collection entails picking a selector that matches an element repeated multiple times on the page. In our case, that’s the li element of the .article-list class.

Defining a new collection

Within that collection, we define fields for each card (when in doubt, the browser’s dev tools can help us identify the classes and elements we need to specify as selectors to get the desired result):

Defining fields inside the collection

Primary category field definition

Besides title and primary category, we should also extract the date of publication, primary category URL, article URLs, number of likes, etc. For the sake of brevity, we’ll skip defining those here.

If we now access our endpoint directly rather than in the API toolkit, we should get the fully merged 9 pages of posts back, processed just the way we want them.

http://api.diffbot.com/v3/diffpoint?token=token&url=https://www.sitepoint.com/author/bskvorc/

Diffpoint custom API result

We can see that the API successfully found all the pages in the set and returned even the oldest of posts.
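The endpoint URL used above can also be assembled in code, which takes care of encoding the page URL passed as a query parameter. The following is a minimal sketch: buildCustomApiUrl is a hypothetical helper of our own, not part of the Diffbot client, and 'token' is the same placeholder used throughout this tutorial.

```php
<?php

// Hypothetical helper (not part of the Diffbot client): assembles the
// request URL for a custom API and URL-encodes the query parameters.
function buildCustomApiUrl(string $token, string $apiName, string $pageUrl): string
{
    $query = http_build_query(
        ['token' => $token, 'url' => $pageUrl],
        '',
        '&'
    );

    return 'http://api.diffbot.com/v3/' . rawurlencode($apiName) . '?' . $query;
}

echo buildCustomApiUrl(
    'token',
    'diffpoint',
    'https://www.sitepoint.com/author/bskvorc/'
), PHP_EOL;
```

The only difference from the hand-typed URL is that the url parameter comes out percent-encoded, which is the safer form when the target page has its own query string.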

Extending the Client

Let’s see if the Custom API behaves as expected.

echo $d->createCustomAPI('https://www.sitepoint.com/author/bskvorc', 'diffpoint')->call()->getBio();

This should echo the correct bio.

This step is, in a way, optional. We could consume the returned data as is, and just iterate through keys and arrays, but let’s pretend our data is much more complex than a simple portfolio page and do it right regardless.
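For comparison, the “consume it as is” approach might look something like the sketch below. The JSON is a made-up sample, not real API output: the keys author, bio, articles and title mirror the fields we defined earlier, while primaryCategory is an assumed name for the primary category field, and the real response may be shaped differently.

```php
<?php

// Made-up sample standing in for a decoded custom API response.
// The actual response structure may differ.
$json = <<<'JSON'
{
    "author": "Jane Doe",
    "bio": "Placeholder bio text.",
    "articles": [
        {"title": "First placeholder post", "primaryCategory": "PHP"},
        {"title": "Second placeholder post", "primaryCategory": "Web"}
    ]
}
JSON;

$data = json_decode($json, true);

// Walk the nested arrays by hand, guarding against missing keys.
$titles = [];
foreach ($data['articles'] ?? [] as $article) {
    $titles[] = $article['title'] ?? '(untitled)';
}

echo $data['author'], ' has ', count($titles), ' posts:', PHP_EOL;
foreach ($titles as $title) {
    echo ' - ', $title, PHP_EOL;
}
```

This works, but every consumer of the data has to know the key names and repeat the same guards, which is the kind of duplication a dedicated entity class removes.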

We need two new classes: an Entity, and an Entity Factory. Let’s create them at src/AuthorFolio.php and src/CustomFactory.php, relative to the root of our project (src is in the root folder).

AuthorFolio

Let’s start with the new entity. As per the docs, we have an abstract class we can extend.

<?php

// src/AuthorFolio.php

namespace My\Custom;

use Swader\Diffbot\Abstracts\Entity;

class AuthorFolio extends Entity
{

}

We extend the abstract entity and give our new entity its own namespace. This is optional, but useful. At this point, the entity would already be usable – it is essentially identical to the Wildcard entity which uses magic methods to resolve requests for various properties of the returned data (which is why the getBio method in the example above worked without us having to define anything). But the goal is to have the AuthorFolio class verbose, with support for custom, SitePoint-specific data and maybe some shortcut methods. Let’s do this now.
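To make the magic method mechanism less mysterious, here is a simplified stand-in. It is not the client’s actual Entity code, and MagicEntity is a name invented for this illustration; it only shows how PHP’s __call and __get can turn a call like getBio() into a lookup in the returned data.

```php
<?php

// Simplified stand-in, NOT the Diffbot client's real implementation.
class MagicEntity
{
    /** @var array */
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    // Resolves getSomething() to $data['something'].
    public function __call(string $name, array $arguments)
    {
        if (strpos($name, 'get') === 0) {
            $key = lcfirst(substr($name, 3));

            return $this->data[$key] ?? null;
        }

        throw new BadMethodCallException("No such method: $name");
    }

    // Resolves $entity->something to $data['something'].
    public function __get(string $name)
    {
        return $this->data[$name] ?? null;
    }
}

$entity = new MagicEntity(['bio' => 'Placeholder bio text.']);
echo $entity->getBio(), PHP_EOL;
```

Because nothing is declared up front, IDEs and static analysis cannot see these methods and properties, which is one reason to spell out the important ones in a dedicated class.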

The API will return the full list of an author’s articles – but not their count. To find out how many posts an author has, we’d have to count the articles property, so let’s wrap that process in a shortcut method. We can also tell PhpStorm that the class will have an articles property using the @property tag, so it stops complaining about accessing the field with magic methods:

<?php

// src/AuthorFolio.php

namespace My\Custom;

use Swader\Diffbot\Abstracts\Entity;

/**
 * Class AuthorFolio
 * @property array $articles
 * @package My\Custom
 */
class AuthorFolio extends Entity
{
    public function getType()
    {
        return 'authorfolio';
    }

    public function getNumPosts()
    {
        return count($this->articles);
    }
}

Other methods we could define are totalLikes, activeSince, favoredCategory, etc.
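As a rough idea of what two of those shortcuts could look like, the sketch below uses standalone functions over a plain array of articles. The keys likes and primaryCategory are assumed names for custom fields we skipped defining earlier, and the sample data is invented.

```php
<?php

// Stand-in helpers operating on a plain array of articles. The keys
// 'likes' and 'primaryCategory' are assumed custom field names.
function totalLikes(array $articles): int
{
    return (int) array_sum(array_column($articles, 'likes'));
}

function favoredCategory(array $articles): ?string
{
    // Count how often each category appears, then pick the most common.
    $counts = array_count_values(array_column($articles, 'primaryCategory'));
    if ($counts === []) {
        return null;
    }
    arsort($counts);

    return (string) array_key_first($counts);
}

// Invented sample data for illustration only.
$articles = [
    ['title' => 'Post A', 'likes' => 12, 'primaryCategory' => 'PHP'],
    ['title' => 'Post B', 'likes' => 5, 'primaryCategory' => 'Web'],
    ['title' => 'Post C', 'likes' => 3, 'primaryCategory' => 'PHP'],
];

echo totalLikes($articles), ' likes, mostly in ', favoredCategory($articles), PHP_EOL;
```

Inside AuthorFolio, the same logic would read from $this->articles instead of taking the array as an argument.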

CustomFactory

With the entity ready, it’s time to define a custom factory to bind it to the type of return data we’re getting from our custom API. We’re writing an alternative to the default factory, but the original class already contains some methods we can use – it’s designed to be reused by its children. As such, we merely need to extend the original, map the new type to our custom entity, and we’re done.

<?php

// src/CustomFactory.php

namespace My\Custom;

use Swader\Diffbot\Factory\Entity;

class CustomFactory extends Entity
{
    public function __construct()
    {
        $this->apiEntities = array_merge(
            $this->apiEntities,
            ['diffpoint' => '\My\Custom\AuthorFolio']
        );
    }
}

We merged the original API-to-entity list with our own custom binding, thereby telling the Factory class to keep an eye on both the standard types and APIs and our new ones. This means we can keep using this factory for default Diffbot APIs as well.
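The merge itself is plain array_merge behavior with string keys: existing entries are kept, new keys are appended, and a later entry wins when a key repeats. The snippet below illustrates this with a made-up default map; the \Some\Default\* and \My\Custom\Article class names are placeholders, not the client’s real bindings.

```php
<?php

// Illustrative default map only; the real factory's list may differ.
$defaults = [
    'article' => '\Some\Default\Article',
    'product' => '\Some\Default\Product',
];

// Adding a new key keeps every default binding intact.
$merged = array_merge($defaults, ['diffpoint' => '\My\Custom\AuthorFolio']);

// Reusing an existing key replaces that binding with the later value.
$overridden = array_merge($merged, ['article' => '\My\Custom\Article']);

print_r($merged);
print_r($overridden);
```

The same call can therefore be used to swap out one of the default entities as well as to register new ones.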

Plugging the Factory In

To make our classes autoloadable, we should probably add them to composer.json:

  "autoload": {
    "psr-4": {
      "My\\Custom\\": "src"
    }
  }

We activate these new autoload mappings by running composer dump-autoload.
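For readers unfamiliar with PSR-4, the mapping boils down to stripping the namespace prefix and turning what is left into a file path under the base directory. The function below is a rough sketch of that rule, not Composer’s actual autoloader, and psr4Path is a name made up for this example.

```php
<?php

// Rough sketch of PSR-4 resolution, not Composer's actual autoloader.
function psr4Path(string $class, string $prefix, string $baseDir): ?string
{
    $class = ltrim($class, '\\');

    // The class must live under the registered namespace prefix.
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    // Remaining namespace separators become directory separators.
    $relative = substr($class, strlen($prefix));

    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

echo psr4Path('My\\Custom\\AuthorFolio', 'My\\Custom\\', 'src'), PHP_EOL;
echo psr4Path('My\\Custom\\CustomFactory', 'My\\Custom\\', 'src'), PHP_EOL;
```

With the mapping from our composer.json, both classes resolve to the files we created in the src folder.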

Next, we instantiate the new factory, plug it into our Diffbot instance, and test the API:

$d = new Diffbot(TOKEN);

$d->setEntityFactory(new My\Custom\CustomFactory());

$api = $d->createCustomAPI('https://www.sitepoint.com/author/bskvorc', 'diffpoint');
$api->setTimeout(120000);

$result = $api->call();

dump($result->getNumPosts());

Result image

Note that we increased the timeout because a heavily paginated set of posts can take a while to render on Diffbot’s end.

Conclusion

In this tutorial, by using the official Diffbot client, we constructed custom entities and built a custom API which returns them. We saw how easy it is to leverage machine learning and optical content processing for grabbing arbitrary data from websites of any type, and we saw how heavily customizable the Diffbot client is.

While this was a rather simple example, it isn’t difficult to imagine advanced use cases on more complex entities, or perhaps several of them spread over multiple APIs, all processed through a single EntityFactory, with each custom API corresponding to a special Entity type. With a well-trained visual neural network, the only processing limit is one’s imagination.

If you’d like to read more about the Diffbot client, check out the full docs and play around for yourself – just don’t forget to fetch a fresh free two-week demo token!