Solving the search problem with Laravel and TNTSearch

Solving the search problem

Adding search functionality to a project can be tiresome. You often need to search large text documents, where the database LIKE operator simply isn't enough: it only matches when the text contains your query exactly as written. The pattern "%Laravel - The PHP Framework%" is not the same as "%The PHP Framework - Laravel%", so a query with the words in a different order often returns no results. Performance is poor as well, because a pattern with a leading wildcard cannot use a regular index and forces MySQL to scan every row.
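
To make the word-order problem concrete, here is a minimal sketch in plain PHP. Both helpers are hypothetical: likeContains() only mimics what a LIKE '%...%' pattern does (a substring test), and allWordsMatch() is a simple token-based check for comparison, not how TNTSearch works internally.

```php
<?php

// Hypothetical stand-in for `column LIKE '%needle%'`: a case-insensitive substring test.
function likeContains(string $haystack, string $needle): bool
{
    return stripos($haystack, $needle) !== false;
}

// Hypothetical token-based check: every word of the query must occur somewhere in the text.
function allWordsMatch(string $text, string $query): bool
{
    $tokenize = function (string $s): array {
        // Split on anything that is not a letter or digit and drop empty pieces.
        return preg_split('/[^a-z0-9]+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
    };

    // No query word may be missing from the text.
    return array_diff($tokenize($query), $tokenize($text)) === [];
}

$title = 'Laravel - The PHP Framework';

var_dump(likeContains($title, 'Laravel - The PHP Framework'));  // bool(true)
var_dump(likeContains($title, 'The PHP Framework - Laravel'));  // bool(false)
var_dump(allWordsMatch($title, 'The PHP Framework - Laravel')); // bool(true)
```

The substring test fails as soon as the word order changes, while the token-based check still matches. That is the gap a full text index is meant to close.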

You might argue that databases like MySQL have full text search support and that it shouldn't be too hard to use. You might be right, but this approach has its downsides too. MySQL versions before 5.6 have no full text search support for InnoDB tables. Where it is supported, searching several columns at once requires a FULLTEXT index that covers exactly those columns. The MATCH queries are a little complicated, and they don't always return good results. Of course, there are great dedicated solutions for full text search, the most famous being Elasticsearch and Sphinx, but they often feel like overkill, especially for "smaller" projects.

So what's the solution, you may ask? Luckily, there is a package written in pure PHP that does the job for you. You might be wondering how good or fast pure PHP can be. To reassure you, here are some numbers from a demo page that makes 57,000 TV shows and 130,000 actors searchable:

Indexing: 30sec
Fetching results: 0.001 - 0.02sec (on a MacBook Pro computer depending on search query)

As you can see, fetching is extremely fast, and because the package uses state-of-the-art relevance algorithms, the results are much more relevant than those of other PHP solutions. The performance is this good because the index is built on top of an SQLite database, and the SQLite extension is written in C and ships with PHP by default. And you cannot beat C :)

In this tutorial we'll show how simple it is to build a documentation site with full text search support. PHPUnit has great documentation, but it lacks search functionality. Our task in this tutorial is to add it.

For the documentation layout we'll be using UIkit (http://getuikit.com/). It is, as they claim, a lightweight and modular front-end framework for developing fast and powerful web interfaces.

Before we can build an inverted index, we need to get the PHPUnit documentation somehow. Our approach is to simply scrape the latest version of the docs. We'll create a Laravel command that is responsible for scraping the PHPUnit site. After the scraping is done, we'll create the index.

Let's create a file and call it IndexDocumentation.php. It will be placed in app/Console/Commands. The command name will be docs:index, so that we can later run php artisan docs:index.

The same command name is used to index the official Laravel documentation btw :)

<?php namespace App\Console\Commands;

use Illuminate\Console\Command;
use TeamTNT\TNTSearch\TNTSearch;
use Config;
use Goutte\Client;

class IndexDocumentation extends Command
{
    /**
     * The console command name.
     *
     * @var string
     */
    protected $name = 'docs:index';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Index all documentation with TNTSearch';
}

Scraping the PHPUnit site is very simple and can be done with one function. We go to the phpunit.de site, take the table of contents and save it to index.html. Each chapter is saved to its own HTML file. All of the documentation is saved to resources/docs/.

<?php

public function scrapePHPUnitDe()
{
    $client = new Client();
    $crawler = $client->request('GET', 'https://phpunit.de/manual/current/en/index.html');
    $toc = $crawler->filter('.toc');
    file_put_contents(base_path('resources/docs/').'index.html', $toc->html());

    $crawler->filter('.toc > dt a')->each(function($node) use ($client) {
        $href = $node->attr('href');
        $this->info("Scraped: " . $href);
        $crawler = $client->request('GET', $href);
        $chapter = $crawler->filter('.col-md-8 .chapter, .col-md-8 .appendix')->html();
        file_put_contents(base_path('resources/docs/').$href, $chapter);
    });
}

Once we have scraped all of the documentation, it's time to build the inverted index. An inverted index stores all of the words that are contained in the documents, together with their positions. For example, it tells us that the words "Martin Fowler" can be found only in writing-tests-for-phpunit.html. Here's a definition from Wikipedia:

In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a Forward Index, which maps from documents to content).
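
As a rough illustration of that definition, the sketch below builds a toy inverted index in plain PHP. The function name buildInvertedIndex() and the two sample documents are made up for this example, and the in-memory array layout is only an assumption for demonstration; it is not the format TNTSearch uses for its index.

```php
<?php

// Build a toy inverted index shaped as: term => [file => [word positions]].
// Illustration only; this is not TNTSearch's on-disk format.
function buildInvertedIndex(array $documents): array
{
    $index = [];

    foreach ($documents as $file => $text) {
        // Lowercase the text and split it into words on non-alphanumeric characters.
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        foreach ($words as $position => $word) {
            // Record where (which file, which word offset) the term occurs.
            $index[$word][$file][] = $position;
        }
    }

    return $index;
}

// Hypothetical sample documents standing in for the scraped chapters.
$documents = [
    'installation.html'              => 'Installing PHPUnit with Composer',
    'writing-tests-for-phpunit.html' => 'Martin Fowler on writing tests for PHPUnit',
];

$index = buildInvertedIndex($documents);

print_r(array_keys($index['fowler']));  // only writing-tests-for-phpunit.html
print_r(array_keys($index['phpunit'])); // both files
```

Looking up a word is now a single array access instead of a scan through every document, which is what makes searching an inverted index fast.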

Using TNTSearch, building the index is trivial. We only need to provide some basic configuration. The storage directive sets where the index will be stored. TNTSearch can also read from a database, but here we want to index the files we downloaded earlier, so the driver is filesystem. The location directive points to the directory that holds the documentation files. The exclude and extension directives are self-explanatory.

<?php

public function handle()
{
    $this->scrapePHPUnitDe();

    $tnt = new TNTSearch;

    $config = [
        "storage"   => storage_path(),
        "driver"    => 'filesystem',
        "location"  => base_path('resources/docs/'),
        "extension" => "html",
        "exclude"   => ['index.html']
    ];

    $tnt->loadConfig($config);
    $indexer = $tnt->createIndex('docs');
    $indexer->run();
}

The command is now complete and we can run php artisan docs:index. This will create an index that we can query against.

Searching the index

Now that we have the index, searching for relevant documentation is straightforward. The frontend uses Twitter's typeahead plugin, which hits the server for search results on every keypress. The route we'll hit is /search, and it dispatches the request to DocsController.php. The search method looks like this:

<?php

public function search(Request $request)
{
    $this->tnt->loadConfig([
        "storage"   => storage_path(),
        "driver"    => 'filesystem',
    ]);

    $this->tnt->selectIndex("docs");
    $this->tnt->asYouType = true;

    $results = $this->tnt->search($request->get('query'), $request->get('params')['hitsPerPage']);

    return $this->processResults($results, $request);
}

A handy feature is the asYouType property: when it is enabled, results are returned even for single letters. The only thing left is to prepare the output for the frontend:

<?php

public function processResults($res, $request)
{
    $data = ['hits' => [], 'nbHits' => count($res)];

    foreach ($res as $result) {
        $file = file_get_contents($result['path']);
        $crawler = new Crawler;
        $crawler->addHtmlContent($file);
        $title = $crawler->filter('h1')->text();

        $relevant = $this->tnt->snippet($request->get('query'), strip_tags($file));

        $data['hits'][] = [
            'link' => basename($result['path']),
            '_highlightResult' => [
                'h1' => [
                    'value' => $this->tnt->highlight($title, $request->get('query')),
                ],
                'content' => [
                    'value' => $this->tnt->highlight($relevant, $request->get('query')),
                ]
            ]
        ];
    }

    return response()->json($data);
}
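
To give a feel for what the highlighting step in the method above does, here is a rough plain-PHP sketch. The function highlightTerms() and its default em tag are hypothetical and written for this example; it illustrates the idea and is not the implementation behind TNTSearch's highlight().

```php
<?php

// Hypothetical helper: wrap every whole-word occurrence of a query term in a tag.
// A rough illustration of the idea, not TNTSearch's highlight() implementation.
function highlightTerms(string $text, string $query, string $tag = 'em'): string
{
    // Split the query into individual terms on whitespace.
    $terms = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    if ($terms === []) {
        return $text; // nothing to highlight
    }

    // Escape each term so it is treated literally inside the pattern.
    $escaped = array_map(function (string $term): string {
        return preg_quote($term, '/');
    }, $terms);

    // Match any of the terms as a whole word, ignoring case.
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';

    return preg_replace($pattern, "<$tag>$1</$tag>", $text);
}

echo highlightTerms('Writing Tests for PHPUnit', 'phpunit tests'), PHP_EOL;
// Writing <em>Tests</em> for <em>PHPUnit</em>
```

The frontend can then style the wrapped words however it likes, for example with a background color.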

There are a couple of helper functions, like highlight and snippet, which let you format the output better. A working example of this documentation can be found here or if you want another showcase take a look here. If you like the package, don't forget to star it on GitHub.


