# Benchmarks

There are two kinds of benchmarks in this project.

## Local Benchmarks

It's recommended to run these benchmarks locally when contributing to ensure there are no performance regressions.

### Generating Test Data

First, a TPCH dataset must be generated:

```bash
cd benchmarks
SCALE_FACTOR=10 ./gen-tpch.sh
```

This might take a while.

### Running Benchmarks

After generating the data, it's recommended to use the `run.sh` script to run the benchmarks. A good setup is to run 8 workers throttled to 2 physical threads per worker, which provides a reasonably accurate local approximation of a distributed environment.

```bash
WORKERS=8 ./benchmarks/run.sh --threads 2 --path benchmarks/data/tpch_sf10
```

Each run compares its results against the previous one, so a useful trick for measuring the impact of a PR is to run the benchmarks on `main` first and then on the PR branch (a short example workflow is sketched at the end of this document).

More information about these benchmarks can be found in the [benchmarks README](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/benchmarks/README.md).

## Remote Benchmarks

These benchmarks run on a remote EC2 cluster against Parquet files stored in S3. They are the most realistic benchmarks, but also the most expensive to run, both in development iteration cycles (every code change requires an AWS CDK deploy) and in cost, since they use a real EC2 cluster.

To run these benchmarks, refer to the [CDK benchmarks README](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/benchmarks/cdk/README.md).
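As referenced in the Local Benchmarks section, below is a minimal sketch of the main-vs-PR comparison workflow. The branch name `my-feature` is a placeholder for your PR branch; the benchmark invocation is the same `run.sh` command shown above.

```bash
# Establish a baseline on main.
git checkout main
WORKERS=8 ./benchmarks/run.sh --threads 2 --path benchmarks/data/tpch_sf10

# Switch to the PR branch (placeholder name) and re-run;
# this run is compared against the previous results from main.
git checkout my-feature
WORKERS=8 ./benchmarks/run.sh --threads 2 --path benchmarks/data/tpch_sf10
```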