Understanding how RethinkDB parallelizes queries can improve the performance of your applications—sometimes significantly.
The basic rule is:
Processing happens where the data is until an operation needs to combine it.
In other words, ReQL queries that involve multiple shards will be processed on those shards whenever possible.
Let’s follow the processing of a simple query. (This example uses JavaScript, but the commands are virtually identical in other languages.)
r.table('users').filter({role: 'admin'}).run(conn, callback);
RethinkDB will process this query with the following steps:
users
table.filter
is sent from the shards to the query server and combined.However, an orderBy query will be executed differently.
r.table('users').orderBy('username').run(conn, callback);
orderBy
operation is performed on the query server.An orderBy
operation (without an index) can’t be distributed across the shards for parallel execution—it needs all the data in the table to perform a sort.
The following commands can be distributed across shards:
between
, get_all
, filter
map
, concat_map
, reduce
group
pluck
, with_field
, count
, eq_join
order_by
with indexesThe order in which you chain ReQL commands can affect performance. For an example, imagine combining the previous two queries to return an ordered list of names of admin users. The filter
operation can be distributed across shards, but the orderBy
operation cannot. So this query:
r.table('users').filter({role: 'admin'}).orderBy('name').run(conn, callback);
Is preferable to this query:
r.table('users').orderBy('name').filter({role: 'admin'}).run(conn, callback);
Commands that stop subsequent commands from being parallelized include:
order_by
(with or without indexes)distinct
eq_join
reduce
, fold
limit
, skip
, slice
max
, min
, avg
Any command that requires the results from the shards to be combined on the server executing the query will finish executing on that server rather than being distributed. Optimize your queries by putting commands that can execute in parallel before commands that combine the result set whenever possible.
RethinkDB’s defaults tend to prioritize safety over performance. One of those defaults is that queries will be sent to the primary replicas for shards, which will always have current data (although that data may be returned to a query before it’s been committed to disk).
You can increase the performance of a query by using the outdated
read mode, which allows the cluster to return values from memory on arbitrarily-selected replicas.
r.table('users', {readMode: 'outdated'}).
filter({role: 'admin'}).run(conn, callback);
While outdated
reads are faster, they are the least consistent. For more information on this option, read “Balancing safety and performance” in the Consistency guarantees documentation.
Starting RethinkDB with the proxy
command turns a server into a proxy node, which acts as a query router. This increases cluster performance by reducing intracluster traffic and, if you’re using changefeeds, de-duplicating feed messages.
For more information about proxy nodes, read “Running a proxy node” under Scaling, sharding and replication.