Inside QuestDB's Query Engine: Tracing Three Queries
Jun 08, 2026I’ve been exploring the QuestDB codebase, a time series columnar database, and, specifically, trying to learn how its query engine works. And whenever we discuss a query engine execution, certain kinds of dimensions and options within those dimensions arise in the conversation. For example:
| Dimension | Options |
|---|---|
| Dataflow | Pull-based vs. Push-based |
| Processing Granularity | Tuple-at-a-time vs. Vectorized (batch-at-a-time) |
| Code Generation | Interpreted vs. Compiled |
| Parallelism Model | Single-threaded vs. Operator Parallelism vs. Morsel-Driven Parallelism |
This taxonomy is not necessarily something that is agreed on in the literature or in the database industry. It’s just how I usually think about it based on my research, and also for the purposes of this blog post we are mostly focused on the execution part of a query engine, so we’ll ignore the parsing, optimization, and most of the planning.
You might think that those dimensions are orthogonal to each other and a query engine is essentially composed of one option of each dimension. But in reality, things are much more nuanced, and a query engine can express a unique mix of those options.
In this blog post, I explore how QuestDB expresses these dimensions and try to build a strong mental model of its query engine execution. This blog post assumes some understanding of these dimensions. Also, I try not to get too distracted by details of the codebase and implementation and focus more on what techniques are applied and in which context they are applied.
My exploration was much messier than how I make it look in this blog post. It took a couple of weeks and lots of back and forth, especially because I was not familiar with how things such as vectorization, compiled queries, and query parallelism were applied in a real database. And I had a lot of catching up to do to have a reasonable understanding of what is involved when a query is executed inside QuestDB.
I ended up with 3 queries that I think are good enough to get a gist of what is going on:
SELECT symbol, SUM(amount) FROM trades GROUP BY symbol;
SELECT symbol, side, SUM(amount) FROM trades GROUP BY symbol, side;
SELECT symbol, side, SUM(amount) FROM trades WHERE side != 'SELL' GROUP BY symbol, side;
I’ve created a script that creates a trades table with 500,000 trades, and I added a custom query tracer, on a branch of mine, that captures specific events on certain spots that helped me answer some questions.
Dataflow
When I started, one of the first things that I noticed was the presence of classes extending AbstractRecordCursorFactory everywhere. For example, AsyncFilteredRecordCursorFactory, GroupByRecordCursorFactory , and AsyncGroupByRecordCursorFactory.
AbstractRecordCursorFactory is an abstract class that implements RecordCursorFactory and from a Javadoc we figure out the dataflow:
/**
* Factory for creating a SQL execution plan.
* Queries may be executed more than once without changing execution plan.
* <p>
* Interfaces which extend Closeable are not optionally-closeable.
* close() method must be called after other calls are complete.
* <p>
* Example:
* <pre>
* final SqlExecutionContextImpl ctx = new SqlExecutionContextImpl(engine, 1);
* try (SqlCompiler compiler = new SqlCompiler(engine)) {
* try (RecordCursorFactory factory = compiler.compile("abc", ctx).getRecordCursorFactory()) {
* try (RecordCursor cursor = factory.getCursor(ctx)) {
* final Record record = cursor.getRecord();
* while (cursor.hasNext()) {
* // access 'record' instance for field values
* }
* }
* }
* }
* </pre>
*/
SqlCompiler returns a cursor in which one record at a time is returned. From that we infer that QuestDB is implementing a pull-based tuple-at-a-time dataflow à la Volcano. You might be thinking, how in the world could there be vectorization, or JIT, or parallelism underneath that? That’s what we explore next.
First query
Tracing our first query, SELECT symbol, SUM(amount) FROM trades GROUP BY symbol, we get the following trace log:
[shared-network_7] > vect.GroupByRecordCursorFactory.getCursor [vectorized parallel aggregate, aggregates=1 shards=14]
[shared-network_7] < vect.GroupByRecordCursorFactory.getCursor
[shared-network_7] > vect.GroupByRecordCursorFactory.getCursor [vectorized parallel aggregate, aggregates=1 shards=14]
[shared-network_7] < vect.GroupByRecordCursorFactory.getCursor
[shared-network_7] . vect.GroupByRecordCursorFactory.buildMaps [frames=5 aggregates=1 => total tasks dispatched=5]
[shared-network_7] . vect.GroupByRecordCursorFactory.runWhatsLeft [dispatcher (workerId=7) drained queue cursor=12 after dispatch finished]
[shared-network_7] . SumDoubleVectorAggregateFunction.aggregate [native keyedIntSumDouble kernel processing 100000 rows (one SIMD-friendly C++ loop over key + value columns)]
[shared-query_3] . GroupByVectorAggregateJob.doRun [worker=3 cursor=13 (executes Vect.sumDouble etc on a page-frame column)]
[shared-query_3] . SumDoubleVectorAggregateFunction.aggregate [native keyedIntSumDouble kernel processing 100000 rows (one SIMD-friendly C++ loop over key + value columns)]
[shared-query_8] . GroupByVectorAggregateJob.doRun [worker=8 cursor=14 (executes Vect.sumDouble etc on a page-frame column)]
[shared-query_12] . GroupByVectorAggregateJob.doRun [worker=12 cursor=15 (executes Vect.sumDouble etc on a page-frame column)]
[shared-query_12] . SumDoubleVectorAggregateFunction.aggregate [native keyedIntSumDouble kernel processing 100000 rows (one SIMD-friendly C++ loop over key + value columns)]
[shared-query_8] . SumDoubleVectorAggregateFunction.aggregate [native keyedIntSumDouble kernel processing 100000 rows (one SIMD-friendly C++ loop over key + value columns)]
[shared-query_1] . GroupByVectorAggregateJob.doRun [worker=1 cursor=16 (executes Vect.sumDouble etc on a page-frame column)]
[shared-query_1] . SumDoubleVectorAggregateFunction.aggregate [native keyedIntSumDouble kernel processing 100000 rows (one SIMD-friendly C++ loop over key + value columns)]
[shared-network_7] . vect.GroupByRecordCursorFactory.merge [vaf=0 fold pRosti[3] (size=10) into pRostiBig (size=10) via native kernel]
[shared-network_7] . vect.GroupByRecordCursorFactory.merge [vaf=0 fold pRosti[7] (size=10) into pRostiBig (size=10) via native kernel]
[shared-network_7] . vect.GroupByRecordCursorFactory.merge [vaf=0 fold pRosti[8] (size=10) into pRostiBig (size=10) via native kernel]
[shared-network_7] . vect.GroupByRecordCursorFactory.merge [vaf=0 fold pRosti[12] (size=10) into pRostiBig (size=10) via native kernel]
The first thing to notice is the thread name. That already hints to us the parallelism going on under the hood. shared-network_7 is one worker, from the shared-network pool of workers, that is responsible for starting the execution of the query and returning the results. The shared-query workers are part of another worker pool executing tasks in parallel.
The operator that represents our query is RostiRecordCursor created by vect.GroupByRecordCursorFactory and when hasNext (we don’t trace the hasNext call because that would pollute our log), it tries to build the internal hash map by calling buildMaps. The operator has access to the storage through PageFrameCursor that returns chunks of the data called frames.
Before running our query, we set up, manually, in server.conf the following config: cairo.sql.page.frame.max.rows=100000. Meaning our frames will have at most 100,000 rows.
So 5 frames are delivered to our operator, and it creates 5 tasks, dispatched on a queue for shared-query workers to grab and compute parts of the hash table in parallel.
If you look closely, you’ll see only 4 doRun logs, as if only 4 tasks were executed by shared-query workers. What is interesting about QuestDB’s parallelism engine is that the dispatcher can also execute tasks. And that is seen in the trace the runWhatsLeft log. The shared-network_7 dispatcher thread dispatched 5 tasks, and instead of waiting idly for them to finish, it took 1 task and executed it. There’s also a thing called work stealing, not captured in this trace. If the dispatcher tries to dispatch but the queue is full, it steals work from the queue and executes the task, freeing up a slot.
From my research on query parallelism, these techniques resemble a lot of morsel-driven parallelism from Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age; frames (morsels) being the unit of dispatch, and workers working in parallel on those units. Hopefully, in a future blog post, we can explore more of the parallelism: how publishing/consuming tasks to/from a lock-free ring queue works.
And what is being executed on doRun? It executes a SIMD-friendly C++ vectorized double sum on 100,000 rows. Then the dispatcher thread merges the partial results into a final hash table. The cursor hasNext is essentially iterating over that hash table.
Ok, with this simple query, we could see some interesting things happening under the hood. Even though QuestDB offers a tuple-at-a-time external interface, it implements vectorized operations and an interesting parallelism mechanism.
Second query
Before tracing our next query, SELECT symbol, side, SUM(amount) FROM trades GROUP BY symbol, side;, let’s compare its query plan to the query plan of the former query.
// First query
GroupBy vectorized: true workers: 14
keys: [symbol]
values: [sum(amount)]
PageFrame
Row forward scan
Frame forward scan on: trades
// Second query
Async Group By workers: 14
keys: [symbol,side]
values: [sum(amount)]
filter: null
PageFrame
Row forward scan
Frame forward scan on: trades
The SQL difference between the queries is that another key was added. And you can see that they output different query plans. I could trace the place in the code that makes that decision. In SqlCodeGenerator.java, one key column builds GroupByRecordCursorFactory and multi-key falls through and builds, eventually, the AsyncGroupByRecordCursorFactory
if (tempKeyIndexesInBase.size() == 1) {
...
return generateFill(
model,
new GroupByRecordCursorFactory(
executionContext.getCairoEngine(),
configuration,
factory,
meta,
arrayColumnTypes,
executionContext.getSharedQueryWorkerCount(),
tempVaf,
tempKeyIndexesInBase.getQuick(0),
tempKeyIndex.getQuick(0),
tempSymbolSkewIndexes
),
executionContext
);
I don’t fully understand why they made that decision. My guess is that the one key column path leads to a more optimized execution so a different cursor is needed. Let’s analyze, then, the trace log of the second query:
[shared-network_9] > AsyncGroupByRecordCursorFactory.getCursor [workers=14]
[shared-network_9] < AsyncGroupByRecordCursorFactory.getCursor
[shared-network_9] > AsyncGroupByRecordCursor.buildMap
[shared-network_9] > UnorderedPageFrameSequence.dispatchAndAwait [frames=5]
[shared-network_9] . UnorderedPageFrameReduceJob.consumeQueue [worker=-1 frame=0 stealing=true]
[shared-network_9] . AsyncGroupByRecordCursorFactory.aggregate [worker=-1 frame=0 rows=100000]
[shared-query_12] . UnorderedPageFrameReduceJob.consumeQueue [worker=12 frame=1 stealing=false]
[shared-query_3] . UnorderedPageFrameReduceJob.consumeQueue [worker=3 frame=2 stealing=false]
[shared-query_12] . AsyncGroupByRecordCursorFactory.aggregate [worker=12 frame=1 rows=100000]
[shared-query_5] . UnorderedPageFrameReduceJob.consumeQueue [worker=5 frame=3 stealing=false]
[shared-query_3] . AsyncGroupByRecordCursorFactory.aggregate [worker=3 frame=2 rows=100000]
[shared-query_5] . AsyncGroupByRecordCursorFactory.aggregate [worker=5 frame=3 rows=100000]
[shared-query_8] . UnorderedPageFrameReduceJob.consumeQueue [worker=8 frame=4 stealing=false]
[shared-query_8] . AsyncGroupByRecordCursorFactory.aggregate [worker=8 frame=4 rows=100000]
[shared-network_9] < UnorderedPageFrameSequence.dispatchAndAwait
[shared-network_9] . AsyncGroupByRecordCursor.buildMap [merge sharded=false]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[0].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[1].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[2].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[3].map (size=20) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[4].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[5].map (size=20) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[6].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[7].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[8].map (size=20) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[9].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[10].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[11].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[12].map (size=20) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] . GroupByShardingContext.mergeOwnerMap [fold worker[13].map (size=0) into destMap (size=20) via per-key Java callbacks]
[shared-network_9] < AsyncGroupByRecordCursor.buildMap
We can see that AsyncGroupByRecordCursor is the cursor that is built, and after the buildMap call a dispatchAndAwait call on UnorderedPageFrameSequence is called. From the thread names we can see the execution happening in parallel: shared-query_12 got frame 1, shared-query_3 got frame 2, shared-query_5got frame 3, and shared-query_8 got frame 4. What is interesting is that the dispatcher work-stole frame 0. I thought that was weird because the queue should not be full when the dispatcher tries to publish frame 0. But it seems that is just the same behavior as runWhatsLeft that we saw earlier.
Within aggregate (I didn’t trace this to avoid a longer trace log) computeKeyedBatch from SumDoubleGroupByFunction is called on a batch of 2048 rows. And you can see it’s a for-loop over the rows updating the value stored in the address of the hash map.
Lastly, the results are merged on the dispatcher side by calling GroupByShardingContext.mergeOwnerMap. The reason it is called 14 times is that we have 14 workers (14 CPUs in my laptop), and 14 slots, one for each worker, were pre-allocated to store the partial results.
We can see how a simple change in a query can change the query plan and the execution. Conceptually, the queries are very similar, but one aggregated using C++ SIMD code and the other using batched rows in plain Java. Also, although the parallelism machinery is the same in both cases, two kinds of jobs were used to execute the task: GroupByVectorAggregateJob and UnorderedPageFrameReduceJob. From my analysis UnorderedPageFrameReduceJob is a much more generic kind of job that can execute different kinds of operations. You can tell that just by the name, in which GroupByVectorAggregateJob is much more specific.
Third query
Now let’s take a look at our last query: SELECT symbol, side, SUM(amount) FROM trades WHERE side != 'SELL' GROUP BY symbol, side;
Here’s the query plan:
Async JIT Group By workers: 14
keys: [symbol,side]
values: [sum(amount)]
filter: side!='SELL'
PageFrame
Row forward scan
Frame forward scan on: trades
It’s similar to the second query, but now we have the JIT and the filter. Let’s see the trace log:
[shared-network_13] > AsyncGroupByRecordCursorFactory.getCursor [workers=14]
[shared-network_13] < AsyncGroupByRecordCursorFactory.getCursor
[shared-network_13] > AsyncGroupByRecordCursor.buildMap
[shared-network_13] > UnorderedPageFrameSequence.dispatchAndAwait [frames=5]
[shared-network_13] . UnorderedPageFrameReduceJob.consumeQueue [worker=-1 frame=0 stealing=true]
[shared-query_4] . UnorderedPageFrameReduceJob.consumeQueue [worker=4 frame=3 stealing=false]
[shared-query_5] . UnorderedPageFrameReduceJob.consumeQueue [worker=5 frame=1 stealing=false]
[shared-query_2] . UnorderedPageFrameReduceJob.consumeQueue [worker=2 frame=4 stealing=false]
[shared-query_13] . UnorderedPageFrameReduceJob.consumeQueue [worker=13 frame=2 stealing=false]
[shared-network_13] . AsyncGroupByRecordCursorFactory.filterAndAggregate [worker=-1 frame=0 rows=100000 (stolen filter fused with aggregate)]
[shared-query_4] . AsyncGroupByRecordCursorFactory.filterAndAggregate [worker=4 frame=3 rows=100000 (stolen filter fused with aggregate)]
[shared-query_5] . AsyncGroupByRecordCursorFactory.filterAndAggregate [worker=5 frame=1 rows=100000 (stolen filter fused with aggregate)]
[shared-query_13] . AsyncGroupByRecordCursorFactory.filterAndAggregate [worker=13 frame=2 rows=100000 (stolen filter fused with aggregate)]
[shared-query_2] . AsyncGroupByRecordCursorFactory.filterAndAggregate [worker=2 frame=4 rows=100000 (stolen filter fused with aggregate)]
[shared-network_13] < UnorderedPageFrameSequence.dispatchAndAwait
[shared-network_13] . AsyncGroupByRecordCursor.buildMap [merge sharded=false]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[0].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[1].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[2].map (size=10) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[3].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[4].map (size=10) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[5].map (size=10) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[6].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[7].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[8].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[9].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[10].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[11].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[12].map (size=0) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] . GroupByShardingContext.mergeOwnerMap [fold worker[13].map (size=10) into destMap (size=10) via per-key Java callbacks]
[shared-network_13] < AsyncGroupByRecordCursor.buildMap
The trace log is similar to the last query, except for filterAndAggregate. What I found interesting here is that, since we added a new operator to our query, I was expecting some kind of cursor composition where the group-by cursor receives the filter cursor as a child. I found out that what is happening here is something called filter stealing. The group-by cursor “steals” the filter, and the filter operation becomes embedded in the group-by cursor without the need for another cursor.
It turns out that a cursor called AsyncJitFilteredRecordCursorFactory exists and it has a method called supportsFilterStealing. I made it return false, and here’s the query plan I got:
GroupBy vectorized: false
keys: [symbol,side]
values: [sum(amount)]
Async JIT Filter workers: 14
filter: side!='SELL'
PageFrame
Row forward scan
Frame forward scan on: trades
Now we have a composition between the group-by cursor and the async JIT filter cursor. Also, note the vectorized set to false. This observation answers something I’ve been wondering about since the first time I noticed the cursor pattern was being used. How can parallelism work in a query plan with composed cursors if the cursor returns only one record at a time? I think the answer is it can’t. From my observations of the QuestDB codebase, the parallelism only works on cursors at the leaf level that have access to the frames and support the page frame cursor. Anytime a cursor materializes some query results and it needs to pass to the next cursor, the cursor interface does not allow the next cursor to parallelize its operation because it’s getting records one at a time. That’s why the filter stealing is such an interesting technique, because you avoid that composition by putting the filter inside the group-by. And now, this new group-by with a filter has direct access to the page frame cursor and can be parallelized.
To confirm the above, let’s look at SqlCodeGenerator.java. In L8828,
boolean supportsParallelism = factory.supportsPageFrameCursor();
we see if the AsyncJitFilteredRecordCursorFactory supports parallelism. The only cursor that supports parallelism is the leaf PageFrameRecordCursorFactory, so supportsParallelism is false. In L8829-L8850:
CompiledFilter compiledFilter = null;
if (!supportsParallelism && factory.supportsFilterStealing()) {
..
compiledFilter = filterFactory.getCompiledFilter();
..
supportsParallelism = true;
..
}
Because AsyncJitFilteredRecordCursorFactory supports filter stealing, we enter the branch. Then we get the compiled filter and switch supportsParallelism to true indicating that group by can go parallel.
A few lines below, AsyncGroupByRecordCursorFactory is constructed and the stolen filter is passed in:
return new AsyncGroupByRecordCursorFactory(
...,
compiledFilter,
...
);
And inside AsyncGroupByRecordCursorFactory constructor on lines L133-L140, the FILTER_AND_AGGREGATE is picked because a filter was passed in.
this.frameSequence = new UnorderedPageFrameSequence<>(
..
filter != null ? FILTER_AND_AGGREGATE : AGGREGATE,
..
);
Now to the JIT. From what I could verify, only WHERE clauses are JIT’d. If JIT is enabled and the expression is JIT-able, the expression is compiled to native code, and a function pointer is stored in CompiledFilter.fnAddress. And we saw how the compiled filter was passed to AsyncGroupByRecordCursorFactory, AsyncGroupByRecordCursorFactory.filterAndAggregate calls that function to filter the rows of a frame before aggregating them. I don’t understand how the full JIT machinery works, but it’s interesting to see the technique being applied for certain contexts.
Wrapping up
In this blog post, we investigated 3 queries that gave us a good sense of what is going on in QuestDB’s engine in terms of the dimensions we talked about. We clearly see it’s an engine that applies different techniques according to the context of the query. Although it is a pull-based tuple-at-a-time engine, internally, leaf nodes can parallelize the query execution using a morsel-driven-like design. The data is broken into frames, and jobs are queued for workers to process. In some cases it outsources operations to native code, as in the case of SUM, using SIMD to calculate the aggregate, and WHERE, using JIT’d expressions to filter rows.
I had lots of fun doing this, and I’ll probably keep exploring different paths and techniques by tracing different queries and see that I can learn from there. Also, do similar stuff on different databases to see how these techniques compare.
I’d like to point out that there could be wrong interpretations of my analysis, and I am sure I am missing many details. Furthermore, this is a current snapshot; things will evolve, and this analysis can become completely obsolete in the future.
#databases #query-engine #questdb #vectorization #jit #parallelism #simd