Query RethinkDB tables from PostgreSQL with foreign data wrappers

Rick Otten (@rotten) recently released a foreign data wrapper for PostgreSQL that provides a bridge to RethinkDB. The wrapper makes it possible to expose individual tables from a RethinkDB database in PostgreSQL, enabling users to access their RethinkDB data with SQL queries.

The wrapper could prove especially useful in cases where a developer wants to incorporate RethinkDB into an existing application built on PostgreSQL, taking advantage of RethinkDB features like changefeeds to easily add realtime updates. You could, for example, use RethinkDB to store and propagate realtime events while continuing to use PostgreSQL for things like account management and other data persistence.

To try the foreign data wrapper myself, I used it to access cat pictures from my CatThink demo in PostgreSQL. I built CatThink last year to illustrate how RethinkDB changefeeds can simplify the architecture of realtime applications. CatThink, which is built with Node.js and Socket.io, uses Instagram's realtime APIs and a RethinkDB changefeed to display a stream of the latest cat pictures posted to the popular photo sharing service.

As I will show you in this article, I used the foreign data wrapper to connect a PostgreSQL instance to a RethinkDB database so that I could retrieve cat picture URLs with simple SQL queries.

Configure the foreign data wrapper

Rick's RethinkDB wrapper is built with Multicorn, a PostgreSQL extension that lets developers implement foreign data wrappers in Python. Using Multicorn made it possible to build the wrapper around the official RethinkDB Python driver. The wrapper is currently a read-only implementation—you can perform queries that retrieve data from the RethinkDB tables, but you can't manipulate the wrapped tables with operations like INSERT or UPDATE.

I performed my experiment on a Linux system running Ubuntu 14.10, RethinkDB 1.15, and PostgreSQL 9.4. I installed the following packages with APT:

apt-get install python-setuptools python-dev postgresql-server-dev-9.4 pgxnclient postgresql rethinkdb git python-pip

To install Multicorn and the RethinkDB foreign data wrapper, I followed the instructions from the project's documentation.
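
The PostgreSQL side of the setup boils down to enabling the Multicorn extension in your database before defining the foreign server (a quick sketch; the project's documentation has the authoritative steps):

CREATE EXTENSION multicorn;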

Retrieve RethinkDB data with SQL queries

I used the following command to initialize the foreign data wrapper, specifying the name of the desired database and the host and port of the RethinkDB server:

CREATE SERVER rethink FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'rethinkdb_fdw.rethinkdb_fdw.RethinkDBFDW', host 'localhost', port '28015', database 'cats');

I used the following SQL expression to expose the instacat table, which contains image posting data from Instagram:

CREATE FOREIGN TABLE instacat (id varchar, "user" json, caption json, images json, time timestamp) server rethink options (table_name 'instacat');

In the command, I defined columns that correspond with top-level properties from the documents in the instacat RethinkDB table. I can use those columns when performing queries against the foreign table. Each column in the table is defined with an associated type; I use the json type for properties that contain objects with other nested values. Note that I double-quoted the user column so that it won't be mistaken for the SQL keyword. I only created columns for a subset of the properties available in each record; you can create columns for as many properties as you want.

To see the foreign table in action, I performed a simple select query in the SQL console:

SELECT * FROM instacat;

The operation worked as expected, displaying the values from the RethinkDB table. It's also possible to extract individual sub-properties from the JSON objects. PostgreSQL 9.3 introduced a number of specialized SQL operators for working with JSON data. The following query shows how to extract a few individual values out of nested JSON structures in each record from the table:

select "user"->'full_name', caption->'text', images#>'{low_resolution,url}' from instacat;

The -> operator extracts the value of a given field as JSON, while its ->> counterpart returns the value as text. The #> operator lets you specify a path of keys so that you can retrieve a value from an arbitrary depth within a nested JSON structure. The expression images#>'{low_resolution,url}' in PostgreSQL is equivalent to something like r.row("images")("low_resolution")("url") in ReQL. Thanks to the magic of foreign data wrappers, I can now access kitties in my PostgreSQL applications.

Although you can't modify the data, many SQL operations will work as expected. You can even use joins, performing queries that operate across foreign tables and conventional PostgreSQL tables.
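
As a sketch of what that looks like, assume a hypothetical local table named favorites that stores the IDs of bookmarked posts alongside a note; joining it against the foreign table is plain SQL:

SELECT f.note, i.caption->>'text' AS caption
FROM favorites f
JOIN instacat i ON i.id = f.post_id;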

Final notes

Given that every query against a foreign table entails a query against the RethinkDB instance through the Python driver, there's a fair amount of overhead involved. The wrapper's documentation recommends using a materialized view in performance-sensitive usage scenarios.
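
A minimal sketch of that approach, caching the instacat foreign table defined above in a local view that you can refresh on demand:

-- cache the foreign table locally; re-run REFRESH to pick up new rows
CREATE MATERIALIZED VIEW instacat_cached AS SELECT * FROM instacat;
REFRESH MATERIALIZED VIEW instacat_cached;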

The documentation also suggests setting log_min_messages to debug1 in your postgresql.conf file (/etc/postgresql/9.4/main/postgresql.conf on Ubuntu) during troubleshooting. That will expose errors from the foreign data wrapper in your logs, which makes it a bit easier to see what's going on.
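
In other words, the relevant line in postgresql.conf is just:

log_min_messages = debug1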

Rick's foreign data wrapper makes it easy to incorporate RethinkDB into existing applications built on PostgreSQL. It's also a pretty compelling example of how Multicorn simplifies interoperability between PostgreSQL and external data sources.

Want to try it yourself? Install RethinkDB and check out the thirty-second quick start guide.

Using RethinkDB with io.js: exploring ES6 generators and the future of JavaScript

A group of prominent Node.js contributors recently launched a community-driven fork called io.js. One of the most promising advantages of the new fork is that it incorporates a much more recent version of the V8 JavaScript runtime, which supports a range of useful ECMAScript 6 (ES6) features right out of the box.

Although io.js is still too new for production deployment, I couldn't resist taking it for a test drive. I used io.js and the experimental rethinkdbdash driver to get an early glimpse at the future of ES6-enabled RethinkDB application development.

ES6 in Node.js and io.js

ES6, codenamed Harmony, is a new version of the specification on which the JavaScript language is based. It defines new syntax and other improvements that greatly modernize the language. The infusion of new hotness makes the development experience a lot more pleasant.

Node.js has a special command-line flag (--harmony) that allows users to enable experimental support for ES6 features, but the latest stable version of Node doesn't give you very much due to its reliance on a highly outdated version of V8. The unstable Node 0.11.x pre-release builds provide more and better ES6 support, but it is still hidden behind the command-line flag.

In io.js, the ES6 features that are stable and maturely implemented in V8 are flipped on by default, while ES6 features that are the subject of ongoing development remain available behind command-line flags. The io.js approach is granular, and it strongly encourages adoption of the features that are considered safe to use.

Among the most exciting ES6 features available in both Node 0.11.x and io.js is support for generators. A generator function, which is signified by placing an asterisk after the function keyword, returns an iterator instead of a conventional return value. Inside of a generator function, the developer uses the yield keyword to express the values that the iterator emits.
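
The mechanics are easy to see in a standalone sketch (this assumes only an ES6-enabled runtime such as io.js):

function* countdown() {
  yield 3;
  yield 2;
  yield 1;
}

for (var n of countdown()) {
  console.log(n); // prints 3, then 2, then 1
}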

It's a relatively straightforward feature, but some novel uses open up a few very interesting doors. Most notably, developers can use generators to simplify asynchronous programming. When an asynchronous task is expressed with a generator, the yield keyword can suspend execution of the current function and resume it when the desired operation is complete. Much like the C# programming language's await keyword, this flattens out asynchronous code and allows it to be written in a more conventional, synchronous style.

Introducing rethinkdbdash

Developed by RethinkDB contributor Michel Tu, rethinkdbdash is an experimental RethinkDB driver for Node.js that provides a connection pool and several other advanced features. When used in an environment that supports generators, rethinkdbdash optionally lets you handle asynchronous query responses with the yield keyword as an alternative to callbacks or promises.

The following example uses rethinkdbdash with generators to perform a sequence of asynchronous operations. It will create a database, table, and index, which it will then populate with remote data:

var bluebird = require("bluebird");
var r = require("rethinkdbdash")();

var feedUrl = "earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_month.geojson";

bluebird.coroutine(function *() {
  try {
    yield r.dbCreate("quake").run();
    yield r.db("quake").tableCreate("quakes").run();
    yield r.db("quake").table("quakes")
                       .indexCreate("geometry", {geo: true}).run();

    yield r.db("quake").table("quakes")
                       .insert(r.http(feedUrl)("features")).run();
  }
  catch (err) {
    if (err.message.indexOf("already exists") == -1)
      console.log(err.message);
  }
})();

Each time that the path of execution hits the yield keyword, it jumps out and waits for the operation to finish before continuing. The behavior is similar to what you would get if you used a promise chain, separating each operation into a then method call. The following is the equivalent code, as you would write it today using promises and the official RethinkDB JavaScript driver:

var r = require("rethinkdb");

var conn;
r.connect().then(function(c) {
  conn = c;
  return r.dbCreate("quake").run(conn);
})
.then(function() {
  return r.db("quake").tableCreate("quakes").run(conn);
})
.then(function() {
  return r.db("quake").table("quakes").indexCreate(
    "geometry", {geo: true}).run(conn);
})
.then(function() { 
  return r.db("quake").table("quakes")
                      .insert(r.http(feedUrl)("features")).run(conn); 
})
.error(function(err) {
  if (err.msg.indexOf("already exists") == -1)
    console.log(err);
})
.finally(function() {
  if (conn)
    conn.close();
});

The built-in connection pooling in rethinkdbdash carves off a few lines by itself, but even after you factor that in, the version that uses the yield keyword is obviously a lot more intuitive and concise. The generator approach also happens to be a lot more conducive to using traditional exception-based error handling.

Use rethinkdbdash in a web application

To build a working application, I used rethinkdbdash with Koa, a next-generation Node.js web framework designed by the team behind Express. Koa is very similar to Express, but it makes extensive use of generators to provide a cleaner way of integrating middleware components.

The following example defines a route served by the application that returns the value of a simple RethinkDB query. It fetches and orders records, emitting the raw JSON value:

var app = require("koa")();
var route = require("koa-route");
var r = require("rethinkdbdash")();

app.use(route.get("/quakes", function *() {
  try {
    this.body = yield r.db("quake").table("quakes").orderBy(
      r.desc(r.row("properties")("mag"))).run();
  }
  catch (err) {
    this.status = 500;
    this.body = {success: false, err: err};
  }
}));

When the asynchronous RethinkDB query operation completes, its output becomes the return value of the yield expression. The route handler function takes the returned JSON value and assigns it to a property that represents the response body for the HTTP GET request. It doesn't get much more elegant than that.

A promising future awaits

Generators offer a compelling approach to asynchronous programming. In order to make the pattern easier to express and use with promises, developers have already proposed adding an official await keyword to future versions of the language.

By bringing the latest stable V8 features to the masses, the io.js project holds the potential to bring us the future just a little bit faster. Although the developers characterize it as "beta" quality in its current form, it's worth checking out today if you want to get a tantalizing glimpse of what's coming .next().

Give RethinkDB a try with io.js or Node. You can follow our thirty-second RethinkDB quickstart guide.

Hands-on with Remodel: a new Python ODM for RethinkDB

This week, Andrei Horak released Remodel, a new Python-based object document mapping (ODM) library for RethinkDB. Remodel simplifies RethinkDB application development by automating much of the underlying logic that comes into play when working with relations.

Remodel users create high-level model objects and rely on a set of simple class attributes to define relationships. The framework then uses the model objects to generate tables and indexes. It abstracts away the need to do manual work like performing join queries or populating relation attributes when inserting new items. Remodel also has built-in support for connection pooling, which obviates the need to create and manage connections. In this brief tutorial, I'll give you a hands-on look at Remodel and show you how to use it in a web application.

Define your models

To start using Remodel, first install the library. You can use the setup.py script included in the source code, or you can install the package from PyPI by typing pip install remodel at the command line.

For the purposes of this tutorial, let's assume that we want to build a Starfleet crew roster that correlates crew members with their starships. The first step is to define the models and create the tables:

import remodel.utils
import remodel.connection
from remodel.models import Model

remodel.connection.pool.configure(db="fleet")

class Starship(Model):
    has_many = ("Crewmember",)

class Crewmember(Model):
    belongs_to = ("Starship",)

remodel.utils.create_tables()
remodel.utils.create_indexes()

In an application built with Remodel, all of the model classes must inherit from remodel.models.Model. In this application, there are two models: Starship and Crewmember. The has_many and belongs_to class attributes are used to define the relationships between objects. In this case, each Starship can have many Crewmember instances, and each Crewmember instance belongs to only one Starship.

The create_tables and create_indexes functions will, as the names suggest, automatically generate tables and indexes based on your defined models. Remodel pluralizes your table names, which means that the Starship model will get a starships table.

The framework instantiates a connection pool, accessible at remodel.connection.pool. You can use the pool's configure method to adjust its behavior and specify connection options, such as the desired database name, host, and port.

Populate the database

Now that the models are defined, you can populate the database with content. To create a new database record, call the create method on one of the model classes:

voyager = Starship.create(name="Voyager", category="Intrepid", registry="NCC-74656")

Remodel doesn't enforce any schemas, so you can use whatever properties you want when you create a record. The create method used above will automatically add the Voyager record to the starships table. Because the Starship model defines a has_many relationship with the Crewmember model, the voyager record comes with a crewmembers property that you can use to access the collection of crew members that are associated with the ship. Use the following code to add new crew members:

voyager["crewmembers"].add(
    Crewmember(name="Janeway", rank="Captain", species="Human"),
    Crewmember(name="Neelix", rank="Morale Officer", species="Talaxian"),
    Crewmember(name="Tuvok", rank="Lt Commander", species="Vulcan"))

The records provided to the add method are instantiated directly from the Crewmember class. You don't want to use the create method in this case because the add method called on the Voyager instance handles the actual database insertion. It will also automatically populate the relation data, adding a starship_id property to each Crewmember record.
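
You can verify the relation with a quick check (a sketch that assumes the voyager object created above):

janeway = Crewmember.get(name="Janeway")
print janeway["starship_id"] == voyager["id"]  # True: the relation was populated for us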

To make the example more interesting, add a few more Starship records to the database:

enterprise = Starship.create(name="Enterprise", category="Galaxy", registry="NCC-1701-D")
enterprise["crewmembers"].add(
    Crewmember(name="Picard", rank="Captain", species="Human"),
    Crewmember(name="Data", rank="Lt Commander", species="Android"),
    Crewmember(name="Troi", rank="Counselor", species="Betazed"))

defiant = Starship.create(name="Defiant", category="Defiant", registry="NX-74205")
defiant["crewmembers"].add(
    Crewmember(name="Sisko", rank="Captain", species="Human"),
    Crewmember(name="Dax", rank="Lt Commander", species="Trill"),
    Crewmember(name="Kira", rank="Major", species="Bajoran"))

Query the database

When you want to retrieve a record, you can invoke the get method on a model class. When you call the get method, you can either provide the ID of the specific record that you want or you can provide keyword arguments that perform a query against record attributes. If you want to get a specific starship by name, for example, you can do the following:

voyager = Starship.get(name="Voyager")

You can take advantage of the relations that you defined in your models. If you want to find all of the human members of Voyager's crew, you can simply use the filter method on the crewmembers property:

voyager = Starship.get(name="Voyager")
for human in voyager["crewmembers"].filter(species="Human"):
  print human["name"]

Perform filtering on an entire table by calling the filter method on a model class. The following code shows how to display the captain of each ship:

for person in Crewmember.filter(rank="Captain"):
  print person["name"], "captain of", person["starship"]["name"]

As you might have noticed, the starship property of the Crewmember instance points to the actual starship record. Remodel populates the property automatically to handle the Crewmember model's belongs_to relationship.

When you want to perform more sophisticated queries, you can use ReQL in conjunction with Remodel. Let's say that you want to evaluate Starfleet's diversity by determining how many crew members are of each species. You can use ReQL's group command:

Crewmember.table.group("species").ungroup() \
          .map(lambda item: [item["group"], item["reduction"].count()]) \
          .coerce_to("object").run()

The table property of a model class provides the equivalent of a ReQL r.table expression. You can chain additional ReQL commands to the table property just as you would when creating any ReQL query.
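
Given the roster created above, the query returns a single object that maps each species to its crew count:

{"Human": 3, "Android": 1, "Bajoran": 1, "Betazed": 1,
 "Talaxian": 1, "Trill": 1, "Vulcan": 1}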

Put it all together

Just for fun, I'm going to show you how to build a web application for browsing the Starfleet crew roster. The app is built with Flask, a lightweight framework for web application development. The example also uses Jinja, a popular server-side templating system that is commonly used with Flask.

In a Flask application, the developer defines URL routes that are responsible for displaying specific kinds of information. The application uses templates to render the data in HTML format. Create a route at the application root:

import flask

app = flask.Flask(__name__)

@app.route("/")
def ships():
    return flask.render_template("ships.html", ships=Starship.all())

if __name__ == "__main__":
    app.run(host="localhost", port=8090, debug=True)

When the user visits the site root, the application will fetch all of the starships from the database and display them by rendering the ships.html template. The following is from the template file:

<ul>
  {% for ship in ships %}
  <li><a href="/ship/{{ ship['id'] }}">{{ ship['name'] }}</a></li>
  {% endfor %}
</ul>

In the example above, the template iterates over every ship and displays a list item for each one. The list item includes an anchor tag that points to a URL with the ship's ID.

To make the application display the crew members of the ship when the user clicks one of the links, create a new /ship/x route that takes an arbitrary ship ID as a parameter:

@app.route("/ship/<ship_id>")
def ship(ship_id):
    ship = Starship.get(ship_id)
    crew = ship["crewmembers"].all()
    return flask.render_template("ship.html", ship=ship, crew=crew)

Fetch the desired ship from the database using the provided ID. In a real-world application, you might want to check to make sure that the record exists and throw an error if it doesn't. Once you have the ship, fetch the crew via the crewmembers property. Pass both the ship and the crew to the template:

<h1>{{ ship['name'] }}</h1>
<ul>
  {% for member in crew %}
  <li><a href="/member/{{ member['id'] }}">{{ member['name'] }}</a></li>
  {% endfor %}
</ul>

Now create a /member/x route so that the user can see additional information about a crewman when they click one in the list:

@app.route("/member/<member_id>")
def member(member_id):
    member = Crewmember.get(member_id)
    return flask.render_template("crew.html", member=member)

Finally, define the template for that route:

<h1>{{ member['name'] }}</h1>
<ul>
  <li><strong>Rank:</strong> {{ member['rank'] }}</li>
  <li><strong>Species:</strong> {{ member['species'] }}</li>
</ul>

The template HTML files should go in a templates folder alongside your Python script; that is where Flask looks for them by default. When you run the Python script, it will start a Flask server at the desired port. You should be able to visit the URL and see the application in action.

Check out Remodel and install RethinkDB to try it for yourself.

Make beautiful charts with RethinkDB queries and Charted.co

While building applications with RethinkDB, I often find cases where I want to be able to produce simple visualizations to help me better understand my data. Ideally, I'd like to take the output of a simple query and see what it looks like in a graph with as little work as possible. A new project recently introduced by the developers at Medium offers a compelling solution.

Medium's product science team built a lightweight web application called Charted that makes it easy for users to generate and share graphs. As input, the user provides a URL that points to CSV data. Charted processes the data and produces simple graphs with a clean and elegant design. No configuration is needed, though it allows the user to choose between bar and line formats and customize certain aspects of the output.

Charted is built on D3, a popular frontend JavaScript library that is widely used for data visualization. Simplicity is the chief advantage that Charted offers over rolling your own D3-based data visualizations by hand. Medium runs a hosted instance at Charted.co that anyone can use to publish and share graphs. You can also download the Charted source code from GitHub and run your own installation.

In order to use Charted with RethinkDB, you will need to convert the output of the desired query into CSV format and publish it at a URL. Fortunately, there are a number of libraries that make it very easy to perform the necessary conversion. In this tutorial, I will show you how I used the Python-based CSVKit library with Flask to expose the output of a RethinkDB query in a form that I could pass to Charted.

Prepare your data with CSVKit

CSVKit is an open source toolkit for manipulating CSV content. It's primarily intended for use at the command line, but you can also consume it as a library in a Python script. It has a wide range of features, but we are primarily interested in using its built-in support for converting JSON to CSV.

You can import the json2csv function from the csvkit.convert.js module. The function expects to receive a file-like object, which means that you will need to wrap the content in StringIO if you would like to use a string instead of a file:

import StringIO

from csvkit.convert.js import json2csv

data = """[
  {"name": "Scott Summers", "codename": "Cyclops"},
  {"name": "Hank McCoy", "codename": "Best"},
  {"name": "Warren Worthington", "codename": "Angel"}
]"""

print json2csv(StringIO.StringIO(data))

If you run the code above, it will correlate the matching keys and display a comma-separated table of the values:

name,codename
Scott Summers,Cyclops
Hank McCoy,Beast
Warren Worthington,Angel

Not bad so far, right? The conversion process is relatively straightforward. If you have nested objects, it will simply ignore them—it only operates on the top-level keys.

Transform data from RethinkDB

Now that you know how to convert JSON to CSV, the next step is applying the function to the output of your desired query. For the purposes of this tutorial, I'm going to use a feed of earthquake data from the USGS. As you might recall, I used that same data a few months ago in a tutorial that introduced geospatial queries.

In this case, I want to get the total number of earthquakes for each given day so that I will be able to plot it on a graph. Start by creating the table and loading the earthquake feed into the database:

import rethinkdb as r

c = r.connect()
r.db_create("quake").run(c)
r.db("quake").table_create("quakes").run(c)

url = "earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_month.geojson"
r.db("quake").table("quakes").insert(r.http(url)["features"]).run(c)

To retrieve the relevant data, start by using the group command to organize the earthquakes by date. Next, append the ungroup command to chain additional operations to the grouped output. Finally, use the merge command to add a property that contains a total count of the records for each individual group:

output = r.db("quake").table("quakes") \
    .group(r.epoch_time(r.row["properties"]["time"] / 1000).date()) \
    .ungroup().merge({"count": r.row["reduction"].count()}).run(conn)

The group command will create a property called reduction that contains all of the values for each group. To get the total number of items in a group, you can simply call the count method on the array stored in reduction. The USGS feed uses millisecond timestamps, so you have to divide the value of the time property by 1000 to get the number of seconds before applying the epoch_time command.
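
Concretely, each item in the output sequence has roughly this shape (placeholders instead of real values):

{"group": <date object>, "reduction": [<earthquake document>, ...], "count": <number of documents>}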

There are a few minor wrinkles that you have to sort out before you convert the output to CSV. The group keys are date objects, which you can't really use for graphing. You must convert those timestamps to simple date strings that are suitable for use in the graph. The order of the keys is also important, because Charted will automatically use the first column as the x-axis in its graphs.

In order to specify the key order and format the timestamps, you will need to iterate over each item in the result set and create an OrderedDict that contains all of the values:

import json
from collections import OrderedDict

data = json.dumps([OrderedDict([
    ["date", item["group"].strftime("%D")],
    ["count", item["count"]]]) for item in output])

print json2csv(StringIO.StringIO(data))

Serve the output

In order to get the data into Charted, you will need to serve the generated CSV content through a public URL. For the purposes of this tutorial, I chose to accomplish that with Flask, a simple Python library for building server-side web applications.

In a Flask application, you use a Python decorator to associate a function with a URL route. I chose to create two routes, one that exposes the content in JSON format and one that exposes it in CSV format. The latter simply wraps the output of the former:

@app.route("/quakes")
def quakesJSON():
    conn = r.connect()
    output = r.db("quake").table("quakes") \
        .group(r.epoch_time(r.row["properties"]["time"] / 1000).date()) \
        .ungroup().merge({"count": r.row["reduction"].count()}).run(conn)

    conn.close()
    return json.dumps([OrderedDict([
        ["date", item["group"].strftime("%D")],
        ["count", item["count"]]]) for item in output])

@app.route("/quakes/csv")
def quakesCSV():
    return json2csv(StringIO.StringIO(quakesJSON()))

Now that you have a running server that outputs your data set in CSV format, you can take the URL and provide it to Charted. If you intend to use the public instance of Charted that is hosted at Charted.co, you will need to make sure that your Flask application server is publicly accessible. You might want to consider using a tool like ngrok to make a Flask application running on your local system accessible to the rest of the Internet. If you don't want to publicly expose your data, you could also optionally run your own local instance of Charted.
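
For example, with a recent version of the ngrok client, tunneling to the Flask development server on its default port is a one-liner (adjust the port if you run the app elsewhere):

ngrok http 5000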

You can find a complete 50-line example by visiting this gist on GitHub. Install RethinkDB to try it for yourself.

A tasty RethinkDB video roundup for Thanksgiving

Thanksgiving is almost here: it is time to configure your dinner tables for family clusters and prepare some turkey for batch insertion of stuffing. To show how thankful we are for our amazing community, we put together this tasty video playlist with our best leftovers from October and November. It will help keep you entertained while you try not to succumb to the inevitable post-turkey tryptophan coma. Enjoy!


RethinkDB on FLOSS Weekly

RethinkDB co-founder Slava Akhmechet participated in a recent episode of TWiT's FLOSS Weekly video podcast. The hour-long interview includes a lengthy discussion about RethinkDB's origins, open source values, and suitability for real-time application development. Slava also shared lessons learned during RethinkDB development and talked about some future plans for the project.


Scale up RethinkDB apps on AWS with Docker

Climb Amazon's Elastic Beanstalk with RethinkDB co-founder Michael Glukhovsky in a presentation filmed at this month's Docker meetup. During the 20-minute talk, Michael demonstrated how to deploy a RethinkDB application on AWS with Docker. Learn how Docker containers and RethinkDB changefeeds make it easy to scale real-time apps in the cloud.


Build realtime location-aware apps with RethinkDB

RethinkDB Developer Evangelist Ryan Paul shook up a crowd last month with a short presentation about earthquake mapping. Ryan demonstrated how to use geospatial queries in RethinkDB to plot earthquake data on a map. Ryan also demonstrated how location-aware applications can take advantage of RethinkDB changefeeds to deliver real-time updates.


Pub/Sub made easy with RethinkDB

RethinkDB engineer Josh Kuhn made Gotham's streets a little safer during a five-minute presentation at last month's RethinkDB meetup. He demonstrated how to track comic book superhero match-ups in real-time using repubsub, a lightweight pub/sub library that uses RethinkDB as a message exchange.


RethinkDB hosting webinar with Compose

Our friends at Compose now offer a managed RethinkDB hosting service in the cloud. In a webinar last month, RethinkDB co-founder Slava Akhmechet and Compose CEO Kurt Mackey demonstrated how to use the service and discussed how it works.