Building a TaskEstimator#
The TaskEstimator trait controls how many distributed tasks the planner allocates to each stage of the query plan.
The number of tasks is assigned to the different stages in a bottom-up fashion. You can refer to the
Plan Annotation docs
for an explanation on how this works. A TaskEstimator is what hints this process how many tasks should be
used.
While a default implementation exists for file-based DataSourceExec nodes (those backed by FileScanConfig), you
can provide custom TaskEstimator implementations for your own ExecutionPlan types.
Providing your own TaskEstimator#
Providing a TaskEstimator allows you to do two things:
Tell the distributed planner how many tasks should be used for your own
ExecutionPlans.Tell the distributed planner how to “scale up” your
ExecutionPlanin order to account for it running in multiple distributed tasks.
If your custom nodes will execute in a distributed manner, you must handle this during execution. When your
TaskEstimator specifies N tasks for a node, your execution logic must respond to the
DistributedTaskContext
present in DataFusion’s TaskContext to determine which subset of data this task should process.
There’s an example of how to do that in the examples/ folder:
custom_execution_plan.rs - A complete example showing how to implement a custom execution plan (
numbers(start, end)table function) that works with distributed DataFusion, including a custom codec and TaskEstimator.