Benchmarks
This project has two kinds of benchmarks: local and remote.
Local Benchmarks
When contributing, it’s recommended to run these benchmarks locally to check that a change introduces no performance regressions.
Generating Test Data
First, generate a TPC-H dataset:
cd benchmarks
SCALE_FACTOR=10 ./gen-tpch.sh
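Generation might take a while, and takes longer at larger scale factors.
Running Benchmarks
After generating the data, use the run.sh script to run the benchmarks. The example command uses paths relative to the repository root, so run it from there rather than from the benchmarks directory.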
A good setup is 8 workers, each limited to 2 physical threads. This gives a reasonably
accurate approximation of a distributed system on a single machine.
WORKERS=8 ./benchmarks/run.sh --threads 2 --path benchmarks/data/tpch_sf10
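If the data directory is missing, the run cannot succeed, so a small guard can save a confusing failure. The following is a minimal sketch, not part of the project's scripts: the `data_status` helper is hypothetical, and the `benchmarks/data/tpch_sf<SCALE_FACTOR>` location is an assumption inferred from the command above.

```shell
#!/usr/bin/env bash
# Sketch: run the benchmarks only if the TPC-H data appears to exist.
# Assumes it is executed from the repository root.
set -euo pipefail

SCALE_FACTOR="${SCALE_FACTOR:-10}"
# Assumed output location, inferred from the tpch_sf10 path used above.
DATA_DIR="benchmarks/data/tpch_sf${SCALE_FACTOR}"

# Hypothetical helper: prints "present" if the directory exists and is
# non-empty, otherwise "missing".
data_status() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1")" ]; then
    echo present
  else
    echo missing
  fi
}

if [ "$(data_status "$DATA_DIR")" = "missing" ]; then
  echo "No data in $DATA_DIR; generate it first:"
  echo "  (cd benchmarks && SCALE_FACTOR=$SCALE_FACTOR ./gen-tpch.sh)"
else
  WORKERS=8 ./benchmarks/run.sh --threads 2 --path "$DATA_DIR"
fi
```

The guard only checks that the directory is non-empty; it does not verify that generation finished or that the files are valid.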
Each run compares its results against the previous one, so a useful trick for measuring the impact of a PR
is to run the benchmarks on main first, and then on the PR branch.
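The main-then-PR workflow can be scripted roughly as follows. This is a sketch under assumptions: `PR_BRANCH`, `DRY_RUN`, and the `run` and `compare_branches` helpers are hypothetical names introduced here, the script is assumed to run from the repository root with a clean working tree, and the data is assumed to be generated already. It defaults to printing the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: benchmark main first, then the PR branch, so that the second
# run is compared against the first.
set -euo pipefail

PR_BRANCH="${PR_BRANCH:-my-feature-branch}"  # hypothetical branch name
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Check out each branch in turn and run the same benchmark command on it.
compare_branches() {
  local branch
  for branch in main "$PR_BRANCH"; do
    run git checkout "$branch"
    run env WORKERS=8 ./benchmarks/run.sh --threads 2 --path benchmarks/data/tpch_sf10
  done
}

compare_branches
```

To execute the commands for real, set `DRY_RUN=0` and `PR_BRANCH` to your branch. Keeping the worker count, thread count, and data path identical across both runs is what makes the comparison meaningful.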
More information about these benchmarks can be found in the benchmarks README.
Remote Benchmarks
These benchmarks run on a remote EC2 cluster against Parquet files stored in S3. They are the most realistic benchmarks, but also the most expensive to run: every code change requires an AWS CDK deploy, which slows development iteration, and running a real EC2 cluster costs money.
For running these benchmarks, refer to the CDK benchmarks README.