Build an IRC bot in Go with RethinkDB changefeeds

Dan Cannon's GoRethink project is among the most popular and well-maintained third-party client drivers for RethinkDB. Dan recently updated the driver to make it compatible with RethinkDB 1.16, adding support for changefeeds. The language's native concurrency features make it easy to consume changefeeds in a realtime Go application.

To see GoRethink in action, I built a simple IRC bot that monitors a RethinkDB cluster and sends notifications to an IRC channel when issues are detected. I built the bot with Go, using Dan's driver and an IRC client library called GoIRC.

Monitor the issues table

As I described in my last blog post, RethinkDB 1.16 introduced a new set of system tables that you can use to monitor and configure a RethinkDB cluster. You can interact with the system tables using ReQL queries, just like you would with any other RethinkDB table.

The current_issues table contains a list of problems that currently affect the operation of the cluster. RethinkDB adds items to this table when servers drop from the cluster or other similar incidents occur. When a user intervenes to resolve an issue, the cluster will remove it from the table.

RethinkDB changefeeds provide a way to subscribe to a stream of realtime database updates. I used the following Go code to attach a changefeed to the current_issues table, watching for new issues that are characterized as critical. When issues are found, it prints them to the terminal:

type Issue struct {
    Description, Type string
}

db, err := r.Connect(r.ConnectOpts{Address: "localhost:28015"})
if err != nil {
    log.Fatal("Database connection failed:", err)
}

issues, _ := r.Db("rethinkdb").Table("current_issues").Filter(
    r.Row.Field("critical").Eq(true)).Changes().Field("new_val").Run(db)

go func() {
    var issue Issue
    for issues.Next(&issue) {
        if issue.Type != "" {
            log.Println(issue.Description)
        }
    }
}()

The ReQL expression uses the filter command to match only the issues whose critical property is set to true. The changefeed attached to the query will only emit documents that match the filter condition.

When consuming the output of the changefeed, you can wrap the handler in a goroutine (as demonstrated in the code example above) so that it will operate asynchronously in the background instead of blocking execution. Using goroutines and channels for asynchronous programming can simplify the architecture of your realtime application.
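
The pattern can be sketched without a database connection. In this standalone example, a producer goroutine stands in for the changefeed cursor and pushes issues into a Go channel, which the consumer drains without ever blocking on the database (the Issue type mirrors the example above; the drain helper is just for illustration):

```go
package main

import "fmt"

type Issue struct {
	Description, Type string
}

// drain receives issues until the channel is closed, standing in for the
// loop that prints each issue (or, later, forwards it to IRC).
func drain(feed <-chan Issue) []Issue {
	var got []Issue
	for issue := range feed {
		got = append(got, issue)
	}
	return got
}

func main() {
	// The channel plays the role of the changefeed cursor: the producer
	// goroutine pushes issues in the background while the consumer blocks
	// only on the channel receive, never on the database.
	feed := make(chan Issue)
	go func() {
		feed <- Issue{Description: "Server 'a' is disconnected", Type: "server_disconnected"}
		close(feed)
	}()

	for _, issue := range drain(feed) {
		fmt.Println(issue.Type+":", issue.Description)
	}
}
```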

The Go driver can unmarshal JSON data returned by your ReQL queries and map the document properties to struct fields. In the example above, I defined a struct called Issue that has Description and Type fields. When I use the Next method to pull a document from the changefeed and assign it to a variable of type Issue, the fields map to the document properties with the same names. You can also optionally use struct field tags to manually associate fields with specific properties.
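
The driver's decoder behaves much like Go's standard JSON unmarshaling. The sketch below uses encoding/json directly (an assumption for illustration, so it runs without a database) to show how a changefeed document maps onto the Issue struct by field name; the gorethink struct tags show where you would pin a field to a specific property:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Issue mirrors the struct above. The gorethink tags are optional: without
// them, fields are matched to document properties by name.
type Issue struct {
	Description string `gorethink:"description"`
	Type        string `gorethink:"type"`
}

// decodeIssue stands in for what the driver does when Next unmarshals a
// changefeed document into a struct.
func decodeIssue(doc []byte) (Issue, error) {
	var issue Issue
	err := json.Unmarshal(doc, &issue)
	return issue, err
}

func main() {
	// A document shaped like one emitted by the current_issues changefeed.
	doc := []byte(`{"description": "Server 'a' is disconnected", "type": "server_disconnected", "critical": true}`)
	issue, err := decodeIssue(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("(%s) %s\n", issue.Type, issue.Description)
}
```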

Make an IRC bot

The GoIRC library makes it relatively easy to create a simple IRC bot. The following code connects to an IRC server and instructs the bot to join a specific channel:

ircConf := irc.NewConfig("mybot")
ircConf.Server = "localhost:6667" 
bot := irc.Client(ircConf)

bot.HandleFunc("connected", func(conn *irc.Conn, line *irc.Line) {
    log.Println("Connected to IRC server")
    conn.Join("#mychannel")
})

To make the IRC bot push cluster issue notifications into the desired channel, I just had to add a few lines to the changefeed handler in the previous code example:

issues, _ := r.Db("rethinkdb").Table("current_issues").Filter(
    r.Row.Field("critical").Eq(true)).Changes().Field("new_val").Run(db)

go func() {
    var issue Issue
    for issues.Next(&issue) {
        if issue.Type != "" {
            text := strings.Split(issue.Description, "\n")[0]
            message := fmt.Sprintf("(%s) %s ...", issue.Type, text)
            bot.Privmsg("#mychannel", message)
        }
    }
}()

I also wanted to give my bot the ability to handle some basic commands from the user. Specifically, I wanted the program to continue running until a user in the IRC channel tells the bot to quit. I created a handler for the privmsg event and set up a channel to keep the bot running until it receives the command:

quit := make(chan bool, 1)

...

bot.HandleFunc("privmsg", func(conn *irc.Conn, line *irc.Line) {
    log.Println("Received:", line.Nick, line.Text())
    if strings.HasPrefix(line.Text(), config.IRC.Nickname) {
        parts := strings.Split(line.Text(), " ")
        if len(parts) < 2 {
            return
        }
        command := parts[1]
        switch command {
        case "quit":
            log.Println("Received command to quit")
            quit <- true
        }
        ...
    }
})

...

<-quit

I used a switch statement so that I can easily introduce new commands in the future by adding additional cases that match other strings. For now, I'll keep it simple. The whole bot is implemented in just 80 lines of code, which you can see on GitHub. You can easily adapt this example to make IRC bots that pipe any data you want from your RethinkDB applications into an IRC channel.
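
The command-parsing step can be factored into a small pure function, which also makes new commands easy to test. This is a sketch with a hypothetical parseCommand helper (not part of the bot as published), mirroring the prefix check and split in the handler above:

```go
package main

import (
	"fmt"
	"strings"
)

// parseCommand extracts the command addressed to the bot from a channel
// message. It returns "" when the message is not addressed to the bot or
// carries no command after the nickname.
func parseCommand(nick, text string) string {
	if !strings.HasPrefix(text, nick) {
		return ""
	}
	parts := strings.Split(text, " ")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

func main() {
	fmt.Println(parseCommand("mybot", "mybot quit")) // quit
	fmt.Println(parseCommand("mybot", "hello there") == "") // true
}
```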

Build realtime web apps with Go and RethinkDB

IRC integration is a great exercise, but I also wanted to see what it is like to build realtime web applications with the Go driver. I decided to build a Go version of the simple cluster monitoring application that I demonstrated in my previous blog post.

The new version written in Go is just as succinct as the original Node.js implementation. I used a third-party Socket.io library to broadcast data from a changefeed that monitors RethinkDB's stats table:

server, _ := socketio.NewServer(nil)

conn, _ := r.Connect(r.ConnectOpts{Address: "localhost:28015"})
stats, _ := r.Db("rethinkdb").Table("stats").Filter(
    r.Row.Field("id").AtIndex(0).Eq("cluster")).Changes().Run(conn)

go func() {
    var change r.WriteChanges
    for stats.Next(&change) {
        server.BroadcastTo("monitor", "stats", change.NewValue)
    }
}()

http.Handle("/socket.io/", server)
http.Handle("/", http.FileServer(http.Dir("public")))
log.Fatal(http.ListenAndServe(":8091", nil))

The frontend, as detailed in the previous blog post, receives the data from Socket.io and graphs it in realtime with Fastly's Epoch library. You can see the complete source code of the Go version of the cluster monitoring demo on GitHub.

Concluding thoughts

Go is ostensibly a systems language, but it is fairly conducive to web application development. The Go library ecosystem has much of what you need to build modern web applications, including template processors and URL routing frameworks.

Working with JSON in conventional statically-typed languages is often a painful exercise—but it's not as painful in Go, because you can naturally map complex JSON documents to nested structs. That capability is fairly compelling when working with the output of ReQL queries.

If you'd like to see a more complete example of a realtime web application built with RethinkDB and Go, you can check out Dan's Todo List demo on GitHub.

Want to try it yourself? Install RethinkDB and check out the thirty-second quick start guide.


A realtime RethinkDB cluster monitoring app with live graphs

When we announced RethinkDB 1.16 last week, we showed how you can use changefeeds with the new ReQL-based admin API to monitor the status of a RethinkDB cluster. In this blog post, I'm going to expand on that example and show you how I used the same underlying RethinkDB features to build a realtime cluster monitoring dashboard with live graphs—much like the dashboard built into the RethinkDB web UI.

Query the stats table

RethinkDB 1.16 introduced a new set of system tables that you can use to monitor and configure a RethinkDB cluster. You can interact with the system tables using ReQL queries, just like you would with any other RethinkDB table.

The built-in stats table contains statistics about the current activity of the cluster. You can query the stats table to see, for example, the current number of queries performed per second on individual tables and servers or the entire cluster. To get just the cluster-wide statistics, I use the following query:

r.db("rethinkdb").table("stats").filter(r.row("id")(0).eq("cluster"))

To get a live feed of the data, I simply chain the changes command to the end of the query. The changefeed will continuously emit updates with the latest statistical data.

Stream changes to your frontend with Socket.io

I used Node.js and Socket.io to create a web-based dashboard with a live view of the cluster statistics. The backend attaches a changefeed to the cluster monitoring query and emits all of the updates through Socket.io:

var express = require("express");
var sockio = require("socket.io");
var r = require("rethinkdb");

var app = express();
app.use(express.static(__dirname + "/public"));

var io = sockio.listen(app.listen(8099), {log: false});
console.log("Server started on port " + 8099);

r.connect({db: "rethinkdb"}).then(function(c) {
  r.table("stats").filter(r.row("id")(0).eq("cluster")).changes().run(c)
    .then(function(cursor) {
      cursor.each(function(err, item) {
        io.sockets.emit("stats", item);
      });
    });
});

The frontend is built with the data binding system from Polymer, an open source Web Components framework. The Socket.io client catches all of the updates from the server and uses data bindings to display the data to the end user:

<template id="cluster" is="auto-binding">
  <ul class="stats">
    <li>Reads/sec: {{toFixed(stats.read_docs_per_sec, 2)}}</li>
    <li>Writes/sec: {{toFixed(stats.written_docs_per_sec, 2)}}</li>
    <li>Queries/sec: {{toFixed(stats.queries_per_sec, 2)}}</li>
    <li>Clients: {{stats.clients_active}}/{{stats.client_connections}}</li>
  </ul>
</template>

<script>
  var cluster = document.querySelector("#cluster");
  cluster.toFixed = function(value, precision) {
    return Number(value).toFixed(precision);
  };

  var socket = io.connect();
  socket.on("stats", function(data) {
    cluster.stats = data.new_val.query_engine;
  });
</script>

Polymer's data bindings operate on plain JavaScript objects, so all I have to do is take the latest data from Socket.io and assign it to a property on the template. All of the data bindings that access the property will update automatically every time the value changes.

Display realtime data with live graphs

I used an open source library called Epoch to display live graphs of the realtime data. Epoch is built on top of the D3 visualization framework, but it abstracts away a lot of D3's underlying complexity. In situations where you just want a simple realtime chart, Epoch can save you some time. I added the following line of HTML at the location in the page where I want the live graph:

<div class="epoch category40" id="chart" style="width: 600px; height: 200px;"></div>

I also added some JavaScript code to initialize the graph and add data nodes every time Socket.io picks up new stats from the server:

function timestamp() { return (new Date).getTime() / 1000; }

var chart = $("#chart").epoch({
  type: "time.line",
  axes: ["left", "bottom"],
  data: [
    {label: "Writes", values: [{time: timestamp(), y: 0}]},
    {label: "Reads", values: [{time: timestamp(), y: 0}]}
  ]
});

var socket = io.connect();
socket.on("stats", function(data) {
  cluster.stats = data.new_val.query_engine;
  chart.push([
    { time: timestamp(), y: cluster.stats.written_docs_per_sec },
    { time: timestamp(), y: cluster.stats.read_docs_per_sec}
  ]);
});

Each point in the graph has a Y axis value and a UNIX timestamp that correlates with its position on the X axis. The Epoch library and D3 take care of everything else, including interpolating the points, setting up the axis ticks, and animating the graph as time passes.

The graph has two lines, so that it can simultaneously display the volume of read and write operations. When initially configuring Epoch, you can tell it to have multiple data lines by setting up the data property as an array with multiple objects. When I use the push method to add a new point, I pass in an array with one object for each line.

Display server status information

In addition to a live graph of cluster statistics, I also want the dashboard to show the status of every server in the cluster. RethinkDB has a system table called server_status that you can query to obtain that information. I attached a changefeed to a simple query on the server_status table in order to get a live stream of changes:

r.connect({db: "rethinkdb"}).then(function(c) {
  r.table("server_status").changes().run(c)
    .then(function(cursor) {
      cursor.each(function(err, item) {
        io.sockets.emit("servers", item);
      });
    });
});

I also need to propagate the initial state of the table to the frontend whenever a user loads the page. I wired up a callback that triggers every time the application receives a new connection from a Socket.io client. In the callback, the application connects to the database, fetches the current contents of the server_status table, and transmits the data to the user:

io.sockets.on("connection", function(socket) {
  var conn;
  r.connect({db: "rethinkdb"}).then(function(c) {
    conn = c;
    return r.table("server_status").run(conn);
  })
  .then(function(cursor) { return cursor.toArray(); })
  .then(function(result) {
    socket.emit("servers", result);
  })
  .error(function(err) { console.log("Failure:", err); })
  .finally(function() {
    if (conn)
      conn.close();
  });
});

On the frontend, I can use that initial data to populate the table and then modify it as needed when further updates are available:

socket.on("servers", function(data) {
  if (data.length)
    return cluster.servers = data;

  if (!data.old_val)
    return cluster.servers.push(data.new_val);

  for (var s in cluster.servers)
    if (cluster.servers[s].id == data.old_val.id)
      cluster.servers[s] = data.new_val;
});

In this particular situation, updating the changing values is a little tricky. In most cases, an update will represent a change in status for an existing server that is already in the list. When that occurs, I have to iterate through the list, find the record with the matching id, and replace it with the new data.

I also get notifications, however, when the user adds or removes a server from the cluster. If the record's old_val property is empty, I can assume that it's a new server, which means that all I have to do is append it to the list. I could also similarly remove records when I get an update with an empty new_val property, but I chose not to bother because it's easier to have the template conditionally hide empty records.
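
The three branches above boil down to a small classification rule. As a sketch in Go rather than the frontend's JavaScript (with a hypothetical Change type standing in for the old_val/new_val pair), the logic is:

```go
package main

import "fmt"

// Change is a hypothetical stand-in for the old_val/new_val pair that a
// changefeed emits for the server_status table.
type Change struct {
	OldVal, NewVal map[string]interface{}
}

// classify applies the same rules as the handler above: an empty old_val
// means a newly added server, an empty new_val means a removed one, and
// anything else is a status update for an existing server.
func classify(c Change) string {
	switch {
	case c.OldVal == nil:
		return "added"
	case c.NewVal == nil:
		return "removed"
	default:
		return "updated"
	}
}

func main() {
	added := Change{NewVal: map[string]interface{}{"name": "server_b"}}
	fmt.Println(classify(added)) // added
}
```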

I added the following markup to my HTML template to display the list of servers. Notice the use of conditional expressions to hide empty records and highlight disconnected servers:

<table>
  <template repeat="{{server in servers}}">
    <tr style="display: {{server.name ? 'table-row' : 'none'}}"
        class="{{server.status != 'connected' ? 'disconnected' : ''}}">
      <td>{{server.name}}</td>
      <td>{{server.network.hostname}}</td>
      <td>{{server.process.pid}}</td>
      <td>{{server.status}}</td>
    </tr>
  </template>
</table>

As you can see, it doesn't take much code to stream realtime RethinkDB table updates to a frontend web client. It's worth noting that the techniques used in this article are broadly applicable to data stored in RethinkDB, not just the special system tables. If you want to add live graphs and realtime streaming to your RethinkDB application, it's quite easy to add changefeeds to your queries and use Socket.io (or an equivalent library) to get the data to your frontend.

Want to try it yourself? Install RethinkDB and check out the thirty-second quick start guide.


RethinkDB 1.16: cluster management API, realtime push

Today, we're happy to announce RethinkDB 1.16. Download it now!

The 1.16 release is a precursor to the upcoming 2.0 release, and is the biggest RethinkDB release to date with over 300 enhancements. This release includes two exciting new features: a comprehensive API for large cluster management, and realtime push functionality that dramatically simplifies the development of realtime web apps.

The cluster management API builds upon the sharding and replication functionality in previous versions of RethinkDB, and adds complete control and visibility into the operational details. It includes:

  • A reconfigure command to manipulate shards and replicas
  • A rebalance command to balance data across shards on demand
  • A writable table_config system table that gives precise control of sharding and replication configuration
  • A table_status system table that gives detailed visibility into the state of every table in the cluster
  • A stats system table that gives access to comprehensive statistics
  • A writable jobs system table that gives control of the background jobs running in the cluster
  • cluster_config, current_issues, db_config, logs, server_config, and server_status system tables for additional control and visibility

The realtime push functionality is the start of an exciting new database access model -- instead of polling the database for changes, the developer can tell RethinkDB to continuously push updated query results to applications in realtime. We dramatically expanded the changes command to support the following queries:

  • r.table(TABLE).get(ID).changes()
  • r.table(TABLE).between(LEFT_ID, RIGHT_ID).changes()
  • r.table(TABLE).filter(CONDITION).changes()
  • r.table(TABLE).map(TRANSFORMATION).changes()
  • r.table(TABLE).orderBy(CONDITION).limit(NUMBER).changes()
  • r.table(TABLE).min(INDEX).changes()
  • r.table(TABLE).max(INDEX).changes()

If you're upgrading from previous versions, you may need to recreate your indexes.

Note: In RethinkDB 1.16 the rethinkdb admin command has been removed and replaced with the new ReQL management API.

Programmatic cluster management

In previous versions of RethinkDB, some cluster management operations were available through the web interface and others were accessible through a specialized command line tool. RethinkDB 1.16 unifies all of the cluster management capabilities supported by the database and exposes them via a simple ReQL API.

We worked with users running large RethinkDB deployments to design the new cluster management API, and settled on three major design goals:

  • All cluster management and monitoring functionality should be accessible programmatically
  • Performing common operations should be simple and intuitive
  • Detailed control and visibility should be available, and should be as simple as possible

As of this release, you can now perform cluster management tasks with ReQL queries in a REPL or with scripts written in any programming language that has a RethinkDB driver.

Sharding and replication

ReQL's createTable command now accepts two new optional arguments: shards and replicas. If you specify the sharding and replication factor, the database will automatically partition and distribute the table. You can modify the settings later by calling the reconfigure command on the table object. You can also optionally use tagging to explicitly control how many replicas are assigned to individual servers:

r.table('users').reconfigure(
    shards=2,
    replicas={'us_west':3, 'us_east':2},
    primary_replica_tag='us_east'
).run(conn)

New sharding web interface

The web UI has been completely rebuilt to take advantage of the new ReQL clustering API. We've also updated the sharding and replication web interface to give administrators more visibility and control:

As you change the number of shards and replicas, RethinkDB presents a visual diff of the current and proposed cluster configurations. Administrators can see exactly where the data will go before approving the proposed plan.

Precise control

RethinkDB 1.16 introduces a number of system tables that expose database settings and the internal state of the cluster. You can query and interact with system tables using conventional ReQL commands, just like you would with any other RethinkDB table.

To exercise granular control over sharding and replication, you can use the new table_config table. Each document in table_config represents a different table in your database cluster, and includes details on sharding and replication settings. A table_config document typically looks like this:

{
  id: "31c92680-f70c-4a4b-a49e-b238eb12c023",
  name: "tablename",
  db: "test",
  primary_key: "id",
  shards: [
    {primary_replica: "a", "replicas": ["a", "b"]},
    {primary_replica: "b", "replicas": ["a", "b"]}
    ],
  write_acks: "majority",
  durability: "hard"
}

When you modify those properties using the update command, the cluster will apply the new settings. You can also use this approach to tweak some advanced table settings for behaviors like durability.

The high-level reconfigure command is a porcelain command on top of the table_config system table. When you call reconfigure, the command compiles high-level settings like the number of shards and replicas into a concrete configuration document, and updates the appropriate document in the table_config system table. You can also call reconfigure with a dry_run optional argument to see the proposed configuration before applying it.
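
As a rough illustration of that compilation step (this is not RethinkDB's actual placement algorithm, which also accounts for server tags, load, and existing data locality), a reconfigure-style helper might expand shard and replica counts into table_config-shaped shard entries by assigning servers round-robin:

```go
package main

import "fmt"

// ShardConfig mirrors one entry of the shards array in table_config.
type ShardConfig struct {
	PrimaryReplica string
	Replicas       []string
}

// planShards is a simplified sketch of what reconfigure computes before
// writing to table_config: each shard gets `replicas` servers, chosen
// round-robin, with the first one acting as primary.
func planShards(servers []string, shards, replicas int) []ShardConfig {
	plan := make([]ShardConfig, shards)
	next := 0
	for i := range plan {
		for j := 0; j < replicas; j++ {
			s := servers[next%len(servers)]
			next++
			plan[i].Replicas = append(plan[i].Replicas, s)
		}
		plan[i].PrimaryReplica = plan[i].Replicas[0]
	}
	return plan
}

func main() {
	// Two shards with two replicas each, spread over three servers.
	fmt.Println(planShards([]string{"a", "b", "c"}, 2, 2))
}
```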

Using the high-level configuration commands and the finer-grained control offered by table_config, you can create elaborate scripts that automate much of your cluster configuration in a testable, repeatable way.

Monitoring

Alongside the configuration table, RethinkDB 1.16 also introduces several new read-only tables that you can query to get detailed information about the status of the cluster:

  • The table_status table contains information about table availability. You can see if the table is ready for reads and writes and you can see the status of all of the table's shards.
  • The server_status table shows the status and availability of individual servers within your RethinkDB cluster. Each document in the table represents a single RethinkDB server instance. It shows network configuration details, the process PID, and other administrative information.
  • The stats table exposes detailed statistics that reflect the current state of servers, tables, and your cluster. You can see queries, reads, and writes per second, the number of active client connections, and other relevant statistics.

Job control

Another much-anticipated feature in RethinkDB 1.16 is support for managing long-running operations. The new jobs table shows all of the background tasks and queries in progress on your cluster. A typical document in the jobs table might look like this:

{
  "duration_sec": 0.00759,
  "id": ["query", "3f6d08ae-d643-44b3-b643-e2812bfbbf93"],
  "info": {"client_address":"::1", "client_port":56751},
  "servers": ["batman_4rl"],
  "type":"query"
}

If you want to terminate a query, simply delete the corresponding row from the table:

r.db('rethinkdb').table('jobs').get(["query", "3f6d08ae-d643-44b3-b643-e2812bfbbf93"]).delete()

Realtime push

Instead of polling the database for changes, you can now tell RethinkDB to continuously push updated query results to applications in realtime. This is the start of an exciting new database access model that should make building modern, realtime apps dramatically easier.

For example, suppose you're building a realtime leaderboard for a game. You can get started with the database by using a familiar request-response query paradigm:

r.table('gameplays').orderBy(r.desc('score')).limit(5).run(conn)

As of RethinkDB 1.16, you can also ask the database to push changes to your app every time a gameplay that modifies the leaderboard is recorded in the database:

r.table('gameplays').orderBy(r.desc('score')).limit(5).changes().run(conn)

The first result of the query is just the top five gameplays. However, when the developer tacks on the changes command, RethinkDB will keep the cursor open, and push updates onto the cursor any time a relevant change occurs in the database. The expanded changes command works on a wide variety of queries:

  • r.table(TABLE).get(ID).changes()
  • r.table(TABLE).between(LEFT_ID, RIGHT_ID).changes()
  • r.table(TABLE).filter(CONDITION).changes()
  • r.table(TABLE).map(TRANSFORMATION).changes()
  • r.table(TABLE).orderBy(CONDITION).limit(NUMBER).changes()
  • r.table(TABLE).min(INDEX).changes()
  • r.table(TABLE).max(INDEX).changes()

It also includes bells and whistles, like latency awareness, that make building realtime apps much more convenient. For example, if the query results change too quickly and you don't want to update the DOM more frequently than every fifty milliseconds, you can tell changes to squash updates over a fifty-millisecond window, and the database will take care of aggregating diffs and removing duplicates:

r.table('projects').get(PROJECT_ID).changes(squash=0.05).run(conn)
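
The squashing behavior itself is easy to picture. As a sketch (with a hypothetical Diff type, not the driver's API), aggregating a burst of diffs for one document keeps the old_val from the first change and the new_val from the last, so intermediate states never reach the client:

```go
package main

import "fmt"

// Diff is a hypothetical stand-in for a single changefeed update.
type Diff struct {
	OldVal, NewVal string
}

// squash collapses a window of diffs for the same document into one diff,
// the way the squash option aggregates changes server-side: only the
// starting and ending states survive.
func squash(window []Diff) Diff {
	return Diff{OldVal: window[0].OldVal, NewVal: window[len(window)-1].NewVal}
}

func main() {
	// Two rapid position changes inside one fifty-millisecond window.
	window := []Diff{
		{OldVal: "pos 100,100", NewVal: "pos 150,150"},
		{OldVal: "pos 150,150", NewVal: "pos 200,200"},
	}
	fmt.Println(squash(window))
}
```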

To learn more about how the changes command can make building realtime apps dramatically easier, read our post "Advancing the realtime web".

Realtime monitoring

The new features in the 1.16 release can be used together in composable ways. For example, you can attach changefeeds to queries performed on the system monitoring tables in order to get live updates about the state of the cluster.

For example, if you want to create an animated line graph of operation statistics for all tables in your production database, you could set up a feed on the internal statistics table to monitor the RethinkDB cluster itself:

r.db('rethinkdb').table('stats').filter({ 'db': 'prod' }).changes()

You can set up changefeeds on other system tables like jobs, logs, and server_status to get a realtime stream of updates on the state of the cluster.

More improvements

There are many other exciting improvements in this release:

  • A new range command that generates a range of numbers
  • A new wait command that lets you wait for a table to become ready
  • A new toJsonString command that converts a datum to a JSON string
  • The map command is now variadic for mapping over multiple sequences in parallel
  • The min and max commands now accept an index for more efficient evaluation
  • rethinkdb export now exports secondary index information and rethinkdb import re-creates exported indexes
  • kqueue is now used instead of poll for dramatically better performance on OS X

For a full list of over 300 improvements, see the changelog.

Next steps

See the full list of enhancements, and take the new release for a spin!

The team is already hard at work on the upcoming 2.0 release. The 2.0 release will focus on operational and API stability, and will be the first production-ready release of RethinkDB. As always, if there is something you'd like us to prioritize or if you have any feedback on the release, please let us know.

Help work on the 2.0 release: RethinkDB is hiring.

RethinkDB 1.16 webcast: learn about upcoming features

Join us tomorrow (Friday January 30th) for a live webcast at 1:30PM PT. Daniel Mewes, RethinkDB's director of engineering, and systems engineer Tim Maxwell will introduce some of the new features included in our upcoming 1.16 release.

The webcast will offer a first look at RethinkDB's new comprehensive cluster management API, which makes it easy to configure sharding and replication with ReQL expressions. You'll also get a hands-on introduction to RethinkDB's realtime push functionality, demonstrating how RethinkDB can dramatically simplify the development of realtime web apps. The webcast will conclude with a live Q&A segment, where you'll have a chance to query the RethinkDB team in realtime.

Visit the RethinkDB Bay Area Meetup Group page to RSVP for the webcast.

Advancing the realtime web

Over the past few months the team at RethinkDB has been working on a project to make building modern, realtime apps dramatically easier. The upcoming features are the start of an exciting new database access model -- instead of polling the database for changes, the developer can tell RethinkDB to continuously push updated query results to applications in realtime.

This work started as an innocuous feature to help developers integrate RethinkDB with other realtime systems. A few releases ago we shipped changefeeds -- a way to subscribe to change notifications in the database. Whenever a document changes in a table, the server pushes a notification describing the change to subscribed clients. You can subscribe to changes on a table like this:

r.table('accounts').changes().run(conn)

Originally we intended this feature to help developers push data from RethinkDB to specialized data stores like ElasticSearch and message systems like RabbitMQ, but the release generated enormous excitement we didn't expect. Digging deeper, we saw that many web developers used changefeeds as a solution to a much broader problem -- how do you adapt the database to push realtime data to applications?

This turned out to be an important problem for so many developers that we expanded RethinkDB's architecture to explicitly support realtime apps. The first batch of the new features will ship in a few days in the upcoming 1.16 release of RethinkDB, and I'm very excited to share what we've been working on in this post.

Why is building realtime apps so hard?

The query-response database access model works well on the web because it maps directly to HTTP's request-response. However, modern marketplaces, streaming analytics apps, multiplayer games, and collaborative web and mobile apps require sending data directly to the client in realtime. For example, when a user changes the position of a button in a collaborative design app, the server has to notify other users that are simultaneously working on the same project. Web browsers support these use cases via WebSockets and long-lived HTTP connections, but adapting database systems to realtime needs still presents a huge engineering challenge.

A naive way to support live updates is to periodically poll the database for changes, but this solution is unworkable because it entails a tradeoff between the number of concurrent users and the polling interval. Even a small number of users polling the database will place a tremendous load on the database servers, requiring the administrator to increase the polling interval. In turn, high polling intervals very quickly result in an untenable user experience.
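
The tradeoff is easy to quantify. With hypothetical numbers, the steady-state query load from polling is simply the number of users divided by the polling interval:

```go
package main

import "fmt"

// pollingQPS returns the steady-state queries per second generated by
// `users` clients each polling once every `intervalSec` seconds.
func pollingQPS(users int, intervalSec float64) float64 {
	return float64(users) / intervalSec
}

func main() {
	// 10,000 connected users polling every 2 seconds already means
	// 5,000 queries per second against the database, before any real work.
	fmt.Println(pollingQPS(10000, 2)) // 5000
}
```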

A scalable solution to this problem involves many cumbersome steps:

  • Hooking into replication logs of the database servers, or writing custom data invalidating logic for realtime UI components.
  • Adding messaging infrastructure (e.g. RabbitMQ) to your project.
  • Writing sophisticated routing logic to avoid broadcasting every message to every web server.
  • Reimplementing database functionality in the backend if your app requires realtime computation (e.g. realtime leaderboards).

All this requires an enormous commitment of time and engineering resources. This tech presentation from Quora gives a good overview of how challenging it can be. The upcoming 1.16 release of RethinkDB is our take on helping developers build realtime apps with minimal effort, and includes the first batch of realtime push features to tackle this problem.

The database for the realtime web

A major design goal was to make the implementation non-invasive and simple to use. RethinkDB users can get started with the database by using a familiar request-response query paradigm. For example, if you're generating a web page for a visual web design app, you can load the UI elements of a particular project like this:

> r.table('ui_elements').get_all(PROJECT_ID, index='projects').run(conn)
{ 'id': UI_ELEMENT_ID,
  'project_id': PROJECT_ID,
  'type': 'button',
  'position': [100, 100],
  'size': [200, 100] }

But what if your design app is collaborative, and you want to show updates to all designers of a project in realtime? The 1.16 release of RethinkDB significantly expands the changes command to work on a much larger set of queries. The changes command lets you get the result of the query, but also asks the database to continue pushing updates to the web server as they happen in realtime, without the developer doing any additional work:

> r.table('ui_elements').get_all(PROJECT_ID, index='projects').changes().run(conn)
{ 'new_val':
  { 'id': UI_ELEMENT_ID,
    'project_id': PROJECT_ID,
    'type': 'button',
    'position': [100, 100],
    'size': [200, 100] }
}

The first result of the query is just the value of the document. However, when the developer tacks on the changes command, RethinkDB will keep the cursor open, and push updates onto the cursor any time a relevant change occurs in the database. For example, if a different user moves the button in a project, the database will push a diff to every connected web server interested in the particular project, informing them of the change:

{ 'old_val':
  { 'id': UI_ELEMENT_ID,
    'project_id': PROJECT_ID,
    'type': 'button',
    'position': [100, 100],
    'size': [200, 100] },
  'new_val':
  { 'id': UI_ELEMENT_ID,
    'project_id': PROJECT_ID,
    'type': 'button',
    'position': [200, 200],  # the position has changed
    'size': [200, 100] }
}
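Because each update arrives as a plain `old_val`/`new_val` pair, a web server consuming the feed can work out which fields actually changed before forwarding anything to its clients. A minimal sketch, using a hypothetical `changed_fields` helper (not part of the driver):

```python
def changed_fields(change):
    """Return the fields whose values differ between old_val and new_val.
    old_val is None for inserts, new_val is None for deletions."""
    old = change.get('old_val') or {}
    new = change.get('new_val') or {}
    return {k: new[k] for k in new if old.get(k) != new[k]}

# A diff like the one pushed above, after a button is moved:
change = {
    'old_val': {'id': 'btn-1', 'type': 'button',
                'position': [100, 100], 'size': [200, 100]},
    'new_val': {'id': 'btn-1', 'type': 'button',
                'position': [200, 200], 'size': [200, 100]},
}

print(changed_fields(change))  # {'position': [200, 200]}
```

A server could forward only this minimal patch over a WebSocket instead of re-sending the whole document.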

Any time a web or mobile client connects to your Python, Ruby, or Node.js application, you can create a realtime feed using the official RethinkDB drivers. The database will continuously push query result updates to your web server, which can forward the changes back to the client in realtime using WebSockets or one of the many wrapper libraries like SockJS, socket.io, or SignalR. Additionally, you'll be able to access the functionality from most languages using one of the many community-supported drivers.

The push access model eliminates the need for invalidation logic in the UI components, additional messaging infrastructure, complex routing logic on your servers, and custom code to reimplement aggregation and sorting in the application. The changes command works on a large subset of queries and is tightly integrated into RethinkDB's architecture. For example, if you wanted to create an animated line graph of operation statistics for all tables in your production database, you could set up a feed on the internal statistics table to monitor the RethinkDB cluster itself:

> r.db('rethinkdb').table('stats').filter({ 'db': 'prod' }).changes().run(conn)

The architecture is designed to be scalable. We're still running benchmarks, but you should be able to create thousands of concurrent changefeeds to scale your realtime apps, and the results will be pushed within milliseconds.

We've also built in many bells and whistles, like latency awareness, that make building realtime apps much more convenient. For example, if the query results change too quickly and you don't want to update the DOM more often than every fifty milliseconds, you can tell changes to squash updates over a fifty-millisecond window, and the database will take care of aggregating diffs and removing duplicates:

> r.table('ui_elements').get_all(PROJECT_ID, index='projects').changes(squash=0.05).run(conn)
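Conceptually, squashing keeps the earliest `old_val` and the latest `new_val` for each document within the window, and discards changes that cancel out. A plain-Python simulation of that merging logic (an illustration of the idea, not the driver's actual implementation):

```python
def squash(changes):
    """Collapse a burst of changes within one squash window into at most
    one diff per document: keep the earliest old_val and the latest
    new_val, and drop documents that ended up back where they started."""
    merged = {}
    for c in changes:
        doc_id = (c['new_val'] or c['old_val'])['id']
        if doc_id not in merged:
            merged[doc_id] = {'old_val': c['old_val'], 'new_val': c['new_val']}
        else:
            merged[doc_id]['new_val'] = c['new_val']
    return [c for c in merged.values() if c['old_val'] != c['new_val']]

# Two rapid moves of the same button collapse into a single diff:
burst = [
    {'old_val': {'id': 1, 'position': [100, 100]},
     'new_val': {'id': 1, 'position': [150, 150]}},
    {'old_val': {'id': 1, 'position': [150, 150]},
     'new_val': {'id': 1, 'position': [200, 200]}},
]
print(squash(burst))
# [{'old_val': {'id': 1, 'position': [100, 100]},
#   'new_val': {'id': 1, 'position': [200, 200]}}]
```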

Comparison with realtime sync services

There are many existing realtime sync services that significantly ease the pain of building realtime applications. Firebase, PubNub, and Pusher are notable examples, and there are many others. These services are excellent for getting up and running quickly. They let you sync documents across multiple browsers, offer sophisticated security models, and integrate with many existing web frameworks.

The upcoming features in RethinkDB are fundamentally different from realtime sync services in four critical ways.

Firstly, most existing realtime sync services offer very limited querying capabilities. You can query for a specific document and perhaps a range of documents, but you can't express even simple queries that involve any computation. For example, sorting, advanced filtering, aggregation, joins, or subqueries are either limited or not available at all. This limitation turns out to be critical for real world applications, so most users end up using realtime sync services side by side with traditional database systems, and build up complex code to duplicate data between the two.

In contrast, RethinkDB is a general purpose database that allows you to easily express queries of arbitrary complexity. This eliminates the need for multiple pieces of infrastructure and additional code to duplicate data and keep it in sync across multiple services.

Secondly, the push functionality of realtime sync services is limited to single documents. You can sync documents across clients, but you can't get a realtime incremental feed for more complex operations. In contrast, RethinkDB allows you to get a feed on queries, not just documents. For example, suppose you wanted to build a realtime leaderboard of top five gameplays in your game world. This requires sorting the gameplays by score in descending order, limiting the resultset to five top gameplays, and getting a continuous incremental feed that pushes updates to your clients any time the resultset changes. This functionality isn't available in realtime sync services, but is trivial in RethinkDB:

r.table('gameplays').order_by(index=r.desc('score')).limit(5).changes().run(conn)

Any time the database gets updated with a new gameplay, this query will inform the developer which items dropped off the leaderboard, and which new gameplays should be included. Internally, the database doesn't merely rerun the query any time there is a change to the gameplays table -- the changefeeds are recomputed incrementally and efficiently.
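To get a feel for what that incremental maintenance looks like, here is a toy plain-Python model of a feed on the top-N query above. The `Top5Feed` class and its diff format are illustrative stand-ins, not RethinkDB's internals:

```python
class Top5Feed:
    """Toy model of an incremental feed on
    order_by(desc('score')).limit(n): maintains the current leaderboard
    and emits a diff only when the leaderboard actually changes."""

    def __init__(self, limit=5):
        self.limit = limit
        self.board = []  # sorted descending by score

    def insert(self, gameplay):
        board = sorted(self.board + [gameplay],
                       key=lambda g: g['score'], reverse=True)[:self.limit]
        if board == self.board:
            return None  # score too low to matter: nothing to push
        dropped = [g for g in self.board if g not in board]
        self.board = board
        return {'old_val': dropped[0] if dropped else None,
                'new_val': gameplay}

feed = Top5Feed(limit=3)
for score in [10, 50, 30]:
    feed.insert({'player': f'p{score}', 'score': score})

# A new high score enters, pushing the lowest entry off the board:
print(feed.insert({'player': 'alice', 'score': 40}))
# {'old_val': {'player': 'p10', 'score': 10},
#  'new_val': {'player': 'alice', 'score': 40}}
```

Note that low scores produce no event at all, which is exactly why the database pushing only relevant diffs is so much cheaper than rerunning the query.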

Thirdly, realtime sync services are closed ecosystems that run in the cloud. While a hosted version of RethinkDB is available through our partners at Compose.io, both the protocol and the implementation are, and always will be, open-source.

Finally, most existing realtime sync services are built to allow access to their API directly from the web browser. This eliminates the need for building a backend in simple applications, and lets new users quickly deploy their apps with less hassle. As a general-purpose database, RethinkDB expects to be accessed from a backend server, and does not yet provide a sufficiently robust security model to be accessed directly from the web browser. We're playing with the idea of building a secure proxy server to let web clients access RethinkDB directly from the browser, so eventually you might not need to write backend code if your application is simple enough. However, unlike realtime sync services, for now you have to access RethinkDB feeds through backend code running on your web server.

Comparison with hooking into the replication log

Most traditional database systems offer access to their replication log, which allows clients to learn about the updates happening in the database in realtime. Many infrastructures for realtime apps are built on top of this functionality. There are three fundamental differences between RethinkDB's changefeeds and hooking into the replication log of a database.

Firstly, like with realtime sync, hooking into the replication log gives you access to updates on individual documents. In contrast, RethinkDB's changefeeds allow you to get feeds on query resultsets. Consider the example above, where we're building a leaderboard of top five gameplays in a game world:

r.table('gameplays').order_by(index=r.desc('score')).limit(5).changes().run(conn)

To rebuild this functionality on top of a replication log, your application would need to keep track of the top five gameplays, and you'd have to write custom code that compares each new record in the gameplays table against the current leaderboard to decide whether it displaces an existing entry. More importantly, consider what happens if the game admin decides a player cheated and their gameplay score has to be reduced. Your code would have to go back to the database and recompute the query from scratch, because it has no way of knowing which gameplay should take the demoted entry's place on the leaderboard.

Writing this code is doable, but is fairly complex and error-prone. In a large application, the complexity can add up quickly if you have many realtime elements. In contrast, RethinkDB's query engine eliminates this complexity by automatically taking care of the computation and sending you the correct updates as the resultset changes in realtime.

Secondly, as you move to sharded environments, working with a replication log presents additional complexity as there isn't a single replication log to deal with. Your application would need to subscribe to multiple replication logs, and manually aggregate the events from replication logs for each shard. In contrast, RethinkDB automatically takes care of handling shards in the cluster, and changefeeds present unified views to your application.
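To make the bookkeeping concrete, here is a minimal sketch of manually unifying per-shard logs, assuming each shard's log is already ordered by a `ts` timestamp (the field name and log shape are hypothetical):

```python
import heapq

def merge_shard_logs(*shard_logs):
    """Merge per-shard replication logs (each already sorted by timestamp)
    into one timestamp-ordered event stream -- bookkeeping a unified
    changefeed cursor makes unnecessary."""
    return list(heapq.merge(*shard_logs, key=lambda e: e['ts']))

shard_a = [{'ts': 1, 'doc': 'x'}, {'ts': 4, 'doc': 'y'}]
shard_b = [{'ts': 2, 'doc': 'z'}, {'ts': 3, 'doc': 'w'}]

print([e['doc'] for e in merge_shard_logs(shard_a, shard_b)])
# ['x', 'z', 'w', 'y']
```

And this sketch still ignores reconnection, resharding, and clock skew between shards, all of which the application would also have to handle itself.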

Finally, most database systems don't offer granular filtering functionality for replication logs, so your clients can't get only the parts of the log they're interested in. This presents non-trivial scalability challenges because your infrastructure has to deal with the firehose of all database events, and you need to write custom code to route only the relevant events to appropriate web servers. In contrast, RethinkDB handles scalability issues in the cluster, and each feed gives you exactly the information you need for a particular client.

RethinkDB's changefeeds operate on a higher level of abstraction than traditional replication logs, which significantly reduces the amount of custom code and operational challenges the application developer has to consider.

Integrating with realtime web frameworks

One of the more notable projects that helps developers build realtime apps is Meteor. Meteor is an open-source platform for building realtime apps in JavaScript that promises a significantly improved developer experience. It handles a lot of the boilerplate necessary to build responsive interfaces with live updates, provides a complete platform with client-side and server-side components, and offers many advanced features like latency compensation and security out of the box. The team is making great strides in scalability and maturity of the platform, and many companies are starting to use Meteor to build the next generation of web applications.

Meteor is part of the Node.js ecosystem, and multiple other projects have popped up to bring its functionality to other languages. Volt is a framework that implements similar functionality in Ruby, and webalchemy is an alternative platform for Python. These projects are less mature, but have picked up a lot of interest in their respective ecosystems, and are likely to gain a lot of momentum once they accumulate enough functionality to let developers build high quality, scalable apps.

The Meteor, Volt, and webalchemy frameworks all run on top of existing databases, so they're ultimately constrained by the realtime functionality and scalability of those database systems. We've been collaborating with the Meteor team to ensure our design will work well with these and other similar projects. A few community members have been working on RethinkDB integrations with Meteor and Volt, and we expect robust integrations to become available in the coming months.

More work ahead

The upcoming 1.16 release contains only a subset of the functionality we'd like to include. In the next few releases we plan to expand realtime push even further:

  • We're discussing the implementation for restartable feeds here and here. Feedback welcome!
  • We'd like to make more complex queries available via realtime push. In particular, efficient realtime push implementations for the eq_join command and map/reduce are fairly complex, and aren't making it into 1.16.
  • Exposing the database to the internet entails serious security concerns, so we're kicking around ideas for a secure proxy to enable direct browser access of realtime feeds.

This work is guided by three high level design principles:

  • We believe it's important for realtime database infrastructure to be open. Both the protocol and the implementation are, and always will be, open-source.
  • The implementation should be non-invasive and very simple to use. Developers shouldn't have to care about realtime features until they're ready to add the functionality to their apps.
  • Realtime functionality should be efficient, scalable, and tightly integrated with the rest of the database. It shouldn't feel like an afterthought.

Advancing the realtime web

The new functionality is the start of an exciting new database access model that eliminates many of the complex steps necessary for building realtime apps today. There is no need to poll the database for changes or introduce additional infrastructure like RabbitMQ. RethinkDB pushes relevant changes to the web server the instant they occur. The amount of additional code the developer has to write to implement realtime functionality in their apps is minimal, and all scalability issues are handled by the RethinkDB cluster.

We'll be releasing the realtime extensions to RethinkDB in the next few days, along with tutorials and documentation. In the meantime, you can watch the video with a live demo of the features.

We're hoping RethinkDB 1.16 will make building realtime apps dramatically simpler and more accessible. Stay tuned for more updates, and please share your feedback with the RethinkDB team!