RethinkDB 1.9: indexing and query runtime performance

We are happy to announce RethinkDB 1.9 (Kagemusha). Download it now!

This release includes massive enhancements that improve performance of many queries by orders of magnitude. See the full list of enhancements, or watch Joe Doliner (@jdoliner), an engineer at RethinkDB, talk about the 1.9 release in this two-minute video:

Upgrading to RethinkDB 1.9? Make sure to migrate your data before upgrading to RethinkDB 1.9.

Secondary index performance

Prior to RethinkDB 1.9, the secondary indexes implementation stored a full copy of the table for each index. This architecture worked quite well for small documents, but users whose documents were larger than 250 bytes encountered four problems when using secondary indexes:

  • The write throughput dropped drastically with the creation of every additional index
  • Conversely, the latency of every write operation increased
  • Disk space usage quickly got out of hand
  • Every copy of the document had to be loaded into the cache, significantly decreasing cache efficiency and decreasing read performance

RethinkDB 1.9 makes the necessary changes to the secondary index implementation to avoid copying data. All secondary indexes now share a reference to the relevant documents on disk without making copies of the documents.

Users that have multiple secondary indexes on tables with documents larger than 250 bytes will see massive improvements in write throughput, a decrease in latency for write operations, drastic improvements in disk space utilization, and improved cache memory efficiency which results in large speedups for read queries.

As of RethinkDB 1.9 you can create dozens of secondary indexes with minimum performance impact, regardless of the size of your documents. We ran some basic tests on some of our users’s workloads and saw an order of magnitude improvement on write operations for heavily indexed tables (note that these benchmarks aren’t rigorous or scientific, but you should be able to see similar improvements on your workloads if they’re heavy on secondary indexes).

Query language runtime performance

The 1.9 release introduced significant changes to the query language runtime layer. Our benchmarking of CPU intensive workloads (typically read-heavy workloads where the active data set fits into RAM) has shown that in-memory object copying was a significant performance bottleneck. The 1.9 release introduced a change that minimizes much of the unnecessary copying of objects in memory during query evaluation. This results in significant increases in throughput and decreases in latency for CPU-bound workloads.

There is still some work to be done to minimize unnecessary object copying in the query language runtime, but the current change is a great first step in that direction. If you have a read-heavy workload where the active data set fits into RAM, you should be able to reap the benefits starting with the 1.9 release.

Counting documents

One significant limitation of RethinkDB prior to the 1.9 release was the disk-heavy implementation of the count command. Previous versions of RethinkDB required count to read all data into memory, including reading in the documents themselves. On large data sets this had significant implications on performance when counting documents.

As of RethinkDB 1.9, the count command no longer loads documents into RAM when executed directly on the table. Queries of the type table('foo').count() now execute orders of magnitude faster than in prior versions of RethinkDB. On a quick test with a 10GB data set, the performance of the query went from nearly two minutes to mere forty milliseconds.

There are still more improvements to make to count when it’s executed on more complex queries (for example, following get_all). We’re brainstorming different designs to make this possible, and will be incorporating additional changes into upcoming releases.

Getting to an LTS release

We’re hard at work on additional optimizations, benchmarking and improving performance on high demand workloads, testing clustered environments, and improving stablity. We will be making point releases to make these enhancements available as soon as they’re implemented and tested.

Our goal is to get to a long term support (LTS) release of RethinkDB by the end of the year. Here is how the LTS release will be different from the beta releases we’ve been shipping until now:

  • It will go through a longer QA process that will include rigorous automated and manual testing
  • All known high impact bugs and performance issues will be solved
  • We’ll publish results of our tests on high demand, large scale workloads in a clustered environment
  • The LTS release will have an additional margin of safety since it will first be battle tested by our pilot customers
  • We will be offering commercial support, training, and consulting options

If you have feedback or questions about this process, we’d love to hear from you!