Full-text search with Elasticsearch

The Elasticsearch River plugin is not compatible with RethinkDB 2.2 and higher. We’ll be revisiting this article to update it with the new official Java driver and the RethinkDB Logstash input plugin soon.

Q: What’s the best way to perform full-text searches with RethinkDB?
A: Use the Elasticsearch River for RethinkDB.

Before you start

What Elasticsearch does

Elasticsearch is a database that stores documents in a crafty way that makes it fast to search large fields of pure text. For instance, it indexes words in different ways depending on how frequent they are in your overall data. It doesn’t waste time checking common words like “is” and “to” when returning results unless they actually make a difference. It also performs stemming, so that a search for “looked” will return results containing the words “looks” and “looking.”

It also returns results ordered from most relevant to least, not worrying about small differences. Say you want to ask the question: “What documents best match the phrase ‘Holy guacamole, Batman’?” If the hoped-for guacamole reference isn’t found, a full-text search should reply with documents containing good matches like “Holy smokes, Batman!” and “Holy armadillo, Batman!” In short, you should be using a full-text search database like Elasticsearch if you find yourself writing convoluted regular expressions to grep through big text fields.

For those applications that need full-text search, we’ve written a plugin for Elasticsearch (called a river) that keeps RethinkDB synced up with Elasticsearch’s indexes. It uses changefeeds to push new, updated, and deleted documents to Elasticsearch in real time. In addition, it loads existing documents from your RethinkDB tables, so you can get going right away.

Warning! If the RethinkDB river plugin loses its connection to the RethinkDB server it’s pulling data from, there’s no way to guarantee that every document makes it into Elasticsearch. This should change in the future with improvements to changefeeds, but currently the only way to be sure is to backfill every time, and even that will miss deleted documents.

For now, the plugin works best when backfilling or re-replicating into Elasticsearch is an option, and when it’s acceptable to risk having some stale data in the index.

Venturing into the river

To install the river, we’ll use the plugin program that comes with Elasticsearch. On most platforms the program is named plugin, but it’s sometimes called elasticsearch-plugin:

plugin --install river-rethinkdb --url http://goo.gl/JmMwTf

Depending on how you’ve installed Elasticsearch, you may need to become the elasticsearch user or root to run this command.
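
For example, on a typical Linux package install where the script lives under /usr/share/elasticsearch/bin (the exact path is an assumption and varies by platform), the command might look like this:

sudo -u elasticsearch /usr/share/elasticsearch/bin/plugin --install river-rethinkdb --url http://goo.gl/JmMwTf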

Now that we’ve installed the plugin, the next step is to configure it to connect to our RethinkDB instance. We do that through Elasticsearch’s REST API, which involves three concepts: indexes, types, and documents. A document is the actual data being stored, represented as a JSON object. A type contains documents and is similar to a table in RethinkDB. An index contains types and is similar to a database in RethinkDB.
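
To make these pieces concrete, fetching a single document is just a GET request that names the index, the type, and the document’s id, along these lines (the names here are placeholders, not anything the river creates):

$ curl -XGET localhost:9200/myIndex/myType/myDocumentId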

To configure our river, we need to create a type called rethinkdb in the _river index. Then we need to insert a document with the id _meta into that type. Elasticsearch lets us create the document and the type in one go with a PUT request:

$ curl -XPUT localhost:9200/_river/rethinkdb/_meta -d '
{
  "type": "rethinkdb",
  "rethinkdb": {
    "host": "localhost",
    "port": 28015,
    "databases": {
      "blog": {
        "posts": { "backfill": true },
        "comments": { "backfill": true }
      }
    }
  }
}'

Here we’ve told the river to watch two tables in the blog database: posts and comments. The river should also pull in all existing documents from those tables before it starts watching for updates to the tables. By default, the river inserts documents into a type named after its table, and into an index named after its database. So, in the example above, we’d get a new index named “blog” with two types: “posts” and “comments.”
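
Once the river has started syncing, one way to check what it created is Elasticsearch’s mapping API; assuming the blog index exists by then, a request like this should list the posts and comments types:

$ curl localhost:9200/blog/_mapping?pretty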

You can also specify explicitly which index and type you want synced documents to go to:

$ curl -XPUT localhost:9200/_river/rethinkdb/_meta -d '
{
  "type": "rethinkdb",
  "rethinkdb": {
    "host": "localhost",
    "port": 28015,
    "databases": {
      "blog": {
        "posts": {
          "backfill": true,
          "index": "fooBlog",
          "type": "barPosts"
        }
      }
    }
  }
}'
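
With that configuration the synced documents live under fooBlog/barPosts instead of blog/posts, so searches have to point at those names instead. A quick sketch (the title field and search term are just assumed examples):

$ curl localhost:9200/fooBlog/barPosts/_search?q=title:thanksgiving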

Once you’ve got the data in your Elasticsearch server, you’re ready to go. Here’s an example of a simple query using the Elasticsearch REST API:

$ curl localhost:9200/blog/posts/_search?q=body:yams

The results might look something like this:

{
    "_shards": {
        "failed": 0,
        "successful": 1,
        "total": 1
    },
    "hits": {
        "hits": [
            {
                "_id": "261f4990-627b-4844-96ed-08b182121c5e",
                "_index": "blog",
                "_score": 1.0,
                "_source": {
                    "body": "You won't believe these ten amazing ways to cook yams...",
                    "id": "261f4990-627b-4844-96ed-08b182121c5e",
                    "title": "Thanksgiving dinner blog",
                    "userId": 10.0
                },
                "_type": "posts"
            }
        ],
        "max_score": 1.0,
        "total": 1
    },
    "timed_out": false,
    "took": 6
}
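
URI searches like the one above are handy for quick checks, but for anything more involved you can send a JSON body using Elasticsearch’s query DSL. Here’s a minimal sketch using a standard match query against the same index, type, and field:

$ curl -XPOST localhost:9200/blog/posts/_search -d '
{
  "query": {
    "match": {
      "body": "yams"
    }
  }
}'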

For the full details on querying, you’ll want to read up on how to query Elasticsearch.