@antirez, thank you for Redis!

Yesterday, Salvatore posted an amazing write-up on implementing the really neat HyperLogLog data structure in Redis. We keep being amazed by Redis and Salvatore — he's taught us a great deal about good APIs, usability, and software development in general. We learned so much from Redis, we drew this to celebrate his work. Thanks for your hard work, @antirez!

RethinkDB 1.12: simplified map/reduce, ARM port, new caching infrastructure

Today, we're happy to announce RethinkDB 1.12. Download it now!

With over 200 enhancements, the 1.12 release is one of the biggest releases to date. This release includes:

  • Dramatically simplified map/reduce and aggregation commands.
  • Big improvements to caching that do away with long-standing stability and performance limitations.
  • A port to the ARM architecture.
  • Four new ReQL commands for object and string manipulation.
  • Dozens of bug fixes, stability enhancements, and performance improvements.

Upgrading to 1.12? Make sure to migrate your data before upgrading to RethinkDB 1.12. →

Please note a breaking change: the 1.12 release replaces the commands group_by and grouped_map_reduce with a single new command group. You will have to adapt your applications to this change when you upgrade. See the 1.12 migration guide for details.

Simplified map/reduce and aggregation

Let's say you have a table plays where you keep track of gameplay outcomes for users of your game:

[{ play_id: 1, player: 'coffeemug', score: 100 },
 { play_id: 2, player: 'mlucy', score: 1000 },
 { play_id: 3, player: 'mlucy', score: 1200 },
 { play_id: 4, player: 'coffeemug', score: 200 }]

In RethinkDB, you could always count the number of games in the table by running a count command:

> r.table('plays').count().run(conn)
4

The built-in count command is a shortcut for a map/reduce query:

> r.table('plays').map(lambda x: 1).reduce(lambda x, y: x + y).run(conn)
4
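For intuition, here is the same map/reduce pattern in plain Python, operating on an in-memory list rather than a RethinkDB table:

```python
from functools import reduce

# In-memory stand-in for the plays table above.
plays = [
    {'play_id': 1, 'player': 'coffeemug', 'score': 100},
    {'play_id': 2, 'player': 'mlucy', 'score': 1000},
    {'play_id': 3, 'player': 'mlucy', 'score': 1200},
    {'play_id': 4, 'player': 'coffeemug', 'score': 200},
]

# Map each document to 1, then reduce by addition -- exactly the
# shape of the ReQL query above.
count = reduce(lambda x, y: x + y, map(lambda doc: 1, plays))
print(count)  # 4
```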

The new release removes the old group_by and grouped_map_reduce commands, and replaces them with a single, much more powerful new command called group. This command breaks up a sequence of documents into groups. Any commands chained after group are called on each group individually, rather than all the documents in the sequence.

Let's say we want to count the number of games for each player:

> r.table('plays').group('player').count().run(conn)
{ 'mlucy': 2, 'coffeemug': 2 }

Of course, instead of using the shortcut, you could write out the full map/reduce query with the group command:

> r.table('plays').group('player').map(lambda x: 1).reduce(lambda x, y: x + y).run(conn)
{ 'mlucy': 2, 'coffeemug': 2 }

In addition to the already available aggregators like count, sum, and avg, the 1.12 release adds new aggregators min and max. You can now run all five aggregators on any sequence of documents or on groups, resulting in a unified, powerful API for data aggregation.
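To make the group semantics concrete, here is a plain-Python sketch (not ReQL) of grouping the sample plays by player and then applying per-group aggregations:

```python
from collections import defaultdict

plays = [
    {'play_id': 1, 'player': 'coffeemug', 'score': 100},
    {'play_id': 2, 'player': 'mlucy', 'score': 1000},
    {'play_id': 3, 'player': 'mlucy', 'score': 1200},
    {'play_id': 4, 'player': 'coffeemug', 'score': 200},
]

# group('player'): break the sequence into per-player groups.
groups = defaultdict(list)
for doc in plays:
    groups[doc['player']].append(doc)

# Commands chained after group run once per group, not once over
# the whole sequence.
counts = {player: len(docs) for player, docs in groups.items()}
top_scores = {player: max(d['score'] for d in docs)
              for player, docs in groups.items()}

print(counts)      # {'coffeemug': 2, 'mlucy': 2}
print(top_scores)  # {'coffeemug': 200, 'mlucy': 1200}
```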

Chaining after group isn't limited to built-in aggregators. We can chain any command, or series of commands, after the group command. For example, let's get a random sample of two games from each player:

> r.table('plays').group('player').sample(2).run(conn)

These examples only scratch the surface of what's possible with group. Read more about the group command and the new map/reduce infrastructure.

Big improvements to caching

The 1.12 release includes a number of improvements to the caching infrastructure. The biggest user-facing change is that you no longer have to manually specify cache sizes for tables to avoid running out of memory and into swap. Instead, RethinkDB adjusts cache sizes for you on the fly, based on usage statistics for different tables and the amount of memory available on your system.

We've also made a lot of changes under the hood to help with various stability problems users have been reporting. Strenuous workloads and exotic cluster configurations are much less likely to cause stability problems in 1.12.

A port to ARM

Four months ago David Thomas (@davidthomas426 on GitHub) contributed a pull request with the changes necessary to compile and run RethinkDB on ARM. After months of testing and various additional fixes, the ARM port has been merged into RethinkDB mainline.

You shouldn't have to do anything special. Just run ./configure and make as you normally would:

$ ./configure --allow-fetch
$ make

Note that ARM support is experimental, and there are still some issues (such as #239) to work out.

Special thanks to David for the port, and to the many folks who did the testing that made the merge possible!

Object and string manipulation commands

The 1.12 release includes new commands for string manipulation and object creation.

Firstly, ReQL now includes commands for changing the case of strings:

> r.expr('Hello World').downcase().run(conn)
'hello world'

> r.expr('Hello World').upcase().run(conn)
'HELLO WORLD'

We also added a split command for breaking up strings, which behaves similarly to the native Python split:

> r.expr('Hello World').split().run(conn)
['Hello', 'World']

> r.expr('Hello, World').split(',').run(conn)
['Hello', ' World']
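For comparison, Python's built-in str.split produces the same results on these inputs:

```python
# No argument: split on runs of whitespace.
print('Hello World'.split())      # ['Hello', 'World']

# With a separator: split on that exact string; surrounding
# whitespace in the pieces is preserved.
print('Hello, World'.split(','))  # ['Hello', ' World']
```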

Finally, the 1.12 release includes an object command that allows programmatically creating JSON objects from key-value pairs:

> r.object('a', 1, 'b', 2).run(conn)
{ 'a': 1, 'b': 2 }
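The same pairing logic in plain Python, as a sketch of what object does with its arguments (the helper name is ours, for illustration):

```python
def object_from_pairs(*args):
    """Build a dict from a flat list of alternating keys and values,
    mirroring the shape of r.object('a', 1, 'b', 2)."""
    if len(args) % 2 != 0:
        raise ValueError('object requires an even number of arguments')
    return dict(zip(args[::2], args[1::2]))

print(object_from_pairs('a', 1, 'b', 2))  # {'a': 1, 'b': 2}
```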

You can learn more about the commands in the API documentation.

Performance and stability improvements

In addition to stability work by almost everyone on the RethinkDB team, for the past four months @danielmewes dedicated his time almost entirely to stability and performance improvements. He uncovered and fixed dozens of latency and memory problems, stability issues with long-running clusters, and slowdowns during highly concurrent workloads.

See the full list of enhancements, including dozens of stability fixes that ship with the 1.12 release, and take the new release for a spin!

Help work on the 1.13 release: RethinkDB is hiring.

RethinkDB raises an $8M Series A

Today we're delighted to announce our Series A! We've raised $8M to fund development, grow the RethinkDB community, and ultimately make database tools feel indistinguishable from magic.

Here's what we're planning to accomplish with this new development budget:

  • Get to a long-term support release: Since we first shipped the beta version of RethinkDB we've been humbled by the feedback, support, and encouragement from our users. Over the past thirteen months we've shipped ten new releases, fixed hundreds of bugs, and added dozens of new features. We're now hard at work polishing the final rough edges in preparation for the upcoming LTS release.

  • Grow the community: An amazing community has sprung up around RethinkDB over the past year. We're incredibly grateful to all the people who've spent time developing and supporting client drivers for 15 different languages, building frameworks, ORMs, and designing new admin tools. We're now in a position to give back to our community by sponsoring conferences, promoting community projects, improving documentation, and introducing more developers to RethinkDB.

  • Offer commercial support: Thousands of developers are already building applications backed by RethinkDB; from mobile games to gene sequence analysis, and everything in between. In the coming months we'll be adding commercial support options to give these teams everything they need to confidently use RethinkDB in production deployments.

We're also excited to welcome Peter Bell from Highland Capital Partners to our board. Peter has a deep background in infrastructure technology and led the investment, joined by Josh Stein from DFJ, and the amazing team at Webb Investment Network (Maynard Webb's early stage investment fund) — all incredibly supportive early seed investors in RethinkDB. We're excited to continue working with them to build a long-term open-source technology company.

In the meantime, you can look forward to an open and rapid development process, new features, and steady improvements to performance and reliability. The team is already hard at work on the upcoming 1.12 release — check out the GitHub milestone, and send us your feedback!

Help us get there: RethinkDB is hiring.

RethinkDB 1.11: query profiler, new streaming algorithm, devops enhancements

Today, we're happy to announce RethinkDB 1.11, which improves the experience of operating live RethinkDB deployments. Download it now!

The 1.11 release features more than 70 enhancements, including:

  • A new query profiler to analyze the performance of ReQL queries.
  • An improved streaming algorithm that reduces query latency.
  • DevOps enhancements, including new ReQL commands designed for operations

Upgrading to 1.11? Make sure to migrate your data before upgrading to RethinkDB 1.11. →

Query profiler

Prior to RethinkDB 1.11, there was no way to analyze the performance of queries. Optimizing queries was a process of trial and error. As of the 1.11 release, RethinkDB includes a developer preview of a query profiler that will make this process a lot easier.

You can enable the query profiler on a given query by passing the profile=True option to run, or by using the Profile tab in the Data Explorer.

r.table('foo').sample(1).run(conn, profile=True)

When you run a query with profiling enabled, the server returns the query's result along with a trace of its execution.

[
  {
    "description": "Evaluating sample.",
    "duration(ms)": 1.320703,
    "sub_tasks": [
      {
        "description": "Evaluating datum.",
        "duration(ms)": 0.001529,
        "sub_tasks": []
      },
      {
        "description": "Evaluating table.",
        "duration(ms)": 0.097089,
        "sub_tasks": [
            ...
        ]
      },
      {
        "description": "Sampling elements.",
        "mean_duration(ms)": 0.160003,
        "n_samples": 7
      }
    ]
  }
]

The trace includes a breakdown of operations performed on the cluster, the time for each operation, and information about which parts of the query were performed in parallel.
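Since the trace is ordinary JSON, it's easy to post-process. Here's a small, hypothetical helper (not part of the RethinkDB client) that flattens a trace like the one above into (depth, description, duration) rows:

```python
def flatten_trace(tasks, depth=0):
    """Yield (depth, description, duration_ms) for every task in a
    profile trace. Sampled tasks report a mean duration instead of
    a total, so we accept either key."""
    for task in tasks:
        duration = task.get('duration(ms)', task.get('mean_duration(ms)'))
        yield depth, task.get('description', ''), duration
        yield from flatten_trace(task.get('sub_tasks', []), depth + 1)

# A trimmed-down trace in the shape shown above.
trace = [{'description': 'Evaluating sample.', 'duration(ms)': 1.32,
          'sub_tasks': [{'description': 'Evaluating datum.',
                         'duration(ms)': 0.0015, 'sub_tasks': []}]}]

for depth, desc, ms in flatten_trace(trace):
    print('  ' * depth + f'{desc} ({ms} ms)')
```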

The query profiler is included in this release as a developer preview, so it is limited in scope. Coming releases will add important metrics like memory and disk usage, and will improve readability for complex ReQL commands.

Latency improvements

One of the goals of the 1.11 release was to reduce query latency. Much work has gone into identifying, understanding, and removing the sources of slowdowns. To better understand the behavior of the system under load we ran dozens of benchmarks, implemented a new coroutine profiler, and expanded backtraces to span over coroutine boundaries.

We were able to greatly improve the responsiveness of the server in many situations, such as while creating new indexes or during periods of high cluster traffic. Below are some of the important changes we made:

New streaming algorithm

Prior to version 1.11, RethinkDB used a fixed batch size for all types of documents: 1MB for communication between the nodes in the cluster, and 1000 documents between the server and the client. This implementation skewed the system toward high throughput at a significant cost to realtime latency.

In the 1.11 release, batching is significantly improved. The new batching algorithm adjusts the size of each batch depending on the document size and the query latency. In practice, this results in significant speedups: latency for queries returning very large documents is often reduced by more than 100x.
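As an illustration of the general idea (not RethinkDB's actual implementation), an adaptive batcher might target a byte budget per batch instead of a fixed document count, so large documents travel in small batches and small documents in large ones:

```python
def batch_by_bytes(docs, sizeof, budget_bytes=1024):
    """Group docs into batches whose serialized size stays near a
    byte budget: many small documents per batch, few large ones.
    Illustrative sketch only."""
    batch, used = [], 0
    for doc in docs:
        size = sizeof(doc)
        if batch and used + size > budget_bytes:
            yield batch
            batch, used = [], 0
        batch.append(doc)
        used += size
    if batch:
        yield batch

# Ten 300-byte documents with a 1024-byte budget -> batches of 3.
docs = ['x' * 300] * 10
batches = list(batch_by_bytes(docs, len))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```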

Improved write operations

In addition to the new streaming algorithm, RethinkDB 1.11 includes many other changes that improve performance for various workloads.

  • In previous versions, every write transaction caused at least three separate disk writes. For most transactions, we reduced the number of writes to two.
  • We added a new algorithm that merges certain separate writes (such as index writes) into a single operation. The algorithm is designed to improve the performance of RethinkDB on rotational drives, but it also improves latency and throughput on SSDs.
  • At the disk level, we introduced more parallelism that allows RethinkDB to read more data at once from the disk.
  • Clients can now request that rows be returned as JSON to bypass slow protobuf implementations (the official clients for Python, JavaScript, and Ruby now do this).

DevOps enhancements

One of the immediate goals for the development team is to improve the experience of running RethinkDB on live deployments. Our work toward this goal is guided by two simple principles:

  • The administrator should always be able to answer any question about the state of the cluster.
  • No single query or set of queries should be able to monopolize cluster resources.

We've added the following enhancements to the 1.11 release in pursuit of this goal.

Determining secondary index status

As of the 1.11 release, you no longer have to guess whether a newly created secondary index is ready for use. We added two new commands for observing the status of index creation: indexStatus and indexWait. As the names suggest, indexStatus allows determining the status of a newly created secondary index, and indexWait allows the client to wait until the secondary index is successfully created. See the API reference for more details.
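Conceptually, indexWait is a blocking loop over indexStatus; the real command runs server-side, but the relationship can be sketched in plain Python with a stand-in status function:

```python
import time

def index_wait(get_status, index, poll_interval=0.0):
    """Block until get_status(index) reports the index as ready.
    get_status stands in for an indexStatus query; this helper is
    illustrative, not part of the RethinkDB driver."""
    while True:
        status = get_status(index)
        if status['ready']:
            return status
        time.sleep(poll_interval)

# Fake status source: reports ready on the third poll.
calls = {'n': 0}
def fake_status(index):
    calls['n'] += 1
    return {'index': index, 'ready': calls['n'] >= 3}

print(index_wait(fake_status, 'tags'))  # {'index': 'tags', 'ready': True}
```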

Better control over soft durability

Prior to the 1.11 release, users who chose to use soft durability (off by default, of course) had no way to ensure that the data they'd inserted in soft durability mode had been committed to disk. We've now added a new sync command for flushing soft durability writes to disk. Calling the sync command ensures that any data written in soft durability mode on a given table has been flushed to disk before the command returns. See the API reference for more details.

Getting to an LTS release

We're still hard at work on our first LTS release, due in early 2014. In pursuit of that, our next few releases will continue to focus on performance and stability.

As previously mentioned, here is how the LTS release will be different from the beta releases we've been shipping until now:

  • It will go through a longer QA process that will include rigorous automated and manual testing
  • All known high impact bugs and performance issues will be solved
  • We'll publish results of our tests on high demand, large scale workloads in a clustered environment
  • The LTS release will have an additional margin of safety since it will first be battle tested by our pilot customers
  • We will be offering commercial support, training, and consulting options

If you have feedback or questions about this process, we'd love to hear from you!

Help us get there: RethinkDB is hiring.

RethinkDB 1.10: multi-indexes and serialization improvements

We are happy to announce RethinkDB 1.10. Download it now!

This release lets you index a row by multiple values at once, which can make entire classes of queries much faster (see below). It also includes major improvements to the way we serialize small values, which should increase the performance of many disk/network-bound workloads.

Take a look at the full list of improvements, or watch Daniel Mewes (@danielmewes), an engineer at RethinkDB, talk about the 1.10 release in this two-minute video:

Upgrading to 1.10? Make sure to migrate your data before upgrading to RethinkDB 1.10. →

Multi-indexes

Prior to RethinkDB 1.10, there was no way to index a single document by multiple values. This made certain workloads incurably slow. For example, if you had a table of GitHub issues with a field tags:

{
  ...,
  "tags": ["feature", "ReQL", "secondary_indexes"],
  ...
}

There was no fast way to retrieve all the GitHub issues with the tag feature. Now, you can create a multi-index to do exactly that:

r.table('github_issues').index_create('tags', multi=True).run(conn)
r.table('github_issues').get_all('feature', index='tags').run(conn)

Want to start using multi-indexes right away? Head to our API page.
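Under the hood, a multi-index stores one index entry per element of the array, so a single document is reachable through each of its tags. A plain-Python sketch of that inverted mapping (illustrative, not RethinkDB's storage format):

```python
from collections import defaultdict

issues = [
    {'id': 1, 'tags': ['feature', 'ReQL', 'secondary_indexes']},
    {'id': 2, 'tags': ['bug', 'ReQL']},
    {'id': 3, 'tags': ['feature']},
]

# One index entry per tag per document -- the essence of multi=True.
index = defaultdict(list)
for issue in issues:
    for tag in issue['tags']:
        index[tag].append(issue['id'])

print(index['feature'])  # [1, 3]
print(index['ReQL'])     # [1, 2]
```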

Serialization improvements

We've made several improvements to the way we serialize values, especially small values. For example, RethinkDB stores all numbers as doubles internally, and previously we would serialize a full double to disk every time we wrote a number. Now, in the (common) case where documents contain small integers, we use a variable-length integer encoding instead. After lots of small improvements in this vein (which you can find in the release notes), we're seeing up to 30% improvements in the serialized size of previously problematic JSON documents.
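The post doesn't spell out the exact encoding, but a typical variable-length integer scheme (LEB128-style, as used by Protocol Buffers among others) shows why this helps: small integers take one byte instead of the eight a full double needs.

```python
def encode_varint(n):
    """LEB128-style unsigned varint: 7 payload bits per byte, high
    bit set on every byte except the last. Illustrative of the
    general technique, not RethinkDB's exact on-disk format."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_varint(42)))      # 1 byte, vs. 8 for a double
print(len(encode_varint(100000)))  # 3 bytes
```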

Getting to an LTS release

We're still hard at work on our first LTS release, due (hopefully) near the end of the year. In pursuit of that, our next few releases will continue to focus on performance and stability.

As previously mentioned, here is how the LTS release will be different from the beta releases we've been shipping until now:

  • It will go through a longer QA process that will include rigorous automated and manual testing
  • All known high impact bugs and performance issues will be solved
  • We'll publish results of our tests on high demand, large scale workloads in a clustered environment
  • The LTS release will have an additional margin of safety since it will first be battle tested by our pilot customers
  • We will be offering commercial support, training, and consulting options

If you have feedback or questions about this process, we'd love to hear from you!

Get a free RethinkDB T-shirt

We'd love to hear what you've built with RethinkDB, so we're handing out shirts for stories. Tell us how you're using RethinkDB, and we'll send you a swanky RethinkDB T-shirt (American Apparel 50/50, super soft, super awesome).