
Repeated Task Execution Using The Distributed Dask Scheduler

I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute(). When the number of tasks is, say, 20 (a number higher than the number of workers), some tasks appear to be executed more than once, which causes duplicate-insert errors against my database.

Solution 1:

Correct: if a task is allocated to one worker and another worker becomes free, the free worker may choose to steal excess tasks from its peers. There is a chance that it will steal a task that has just started to run, in which case the task will run twice.
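
If duplicate execution is unacceptable even at the cost of load balancing, work stealing can be switched off. A minimal sketch, assuming the `distributed.scheduler.work-stealing` configuration key available in recent versions of distributed (check the configuration reference for your version):

```python
# Sketch: disable work stealing before the scheduler starts, assuming the
# "distributed.scheduler.work-stealing" config key exists in your version.
import dask
from dask.distributed import Client

dask.config.set({"distributed.scheduler.work-stealing": False})

# Local scheduler plus 5 workers, matching the setup in the question.
client = Client(n_workers=5)
```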

The clean way to handle this problem is to ensure that your tasks are idempotent, i.e. that they produce the same result even if run twice. This might mean handling your database error within your task, as in the sketch below.
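
A minimal sketch of that pattern, assuming results are written to a SQLite table keyed by task id; the table name, schema, and the body of `process()` are hypothetical stand-ins for the real workload:

```python
# Sketch: an idempotent task that tolerates being run twice by treating the
# duplicate-key error on its second insert as success.
import sqlite3

from dask import delayed, compute

DB_PATH = "results.db"  # hypothetical database path


def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(task_id INTEGER PRIMARY KEY, value INTEGER)"
        )


@delayed
def process(task_id):
    value = task_id * 2  # stand-in for the real computation
    with sqlite3.connect(DB_PATH) as conn:
        try:
            conn.execute(
                "INSERT INTO results (task_id, value) VALUES (?, ?)",
                (task_id, value),
            )
        except sqlite3.IntegrityError:
            # A stolen copy of this task already inserted the row; swallowing
            # the duplicate-key error makes a second run harmless.
            pass
    return value


if __name__ == "__main__":
    init_db()
    results = compute(*[process(i) for i in range(20)])
    print(results)
```

Because every run of `process(i)` writes the same row and ignores the conflict, it no longer matters whether the scheduler executes a task once or twice.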

Work stealing is one of those policies that is great for data-intensive computing workloads but terrible for data-engineering workloads. It's tricky to design a system that satisfies both needs simultaneously.
