
Collecting Attributes From Dask Dataframe Providers

TL;DR: How can I collect metadata (such as errors raised during parsing) from distributed reads into a dask dataframe collection? I currently have a proprietary file format I'm using to feed into …

Solution 1:

There are a few potential questions here:

  • Q: How do I load data from many files in a custom format into a single dask dataframe?
  • A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe (see the first sketch after this list). Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions.

  • Q: How do I store arbitrary metadata on a dask.dataframe?

  • A: This is not supported. Generally, I recommend using a different data structure to store your metadata if possible (the second sketch after this list shows one such pattern). If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that is the case, please raise an issue. Generally, though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.

  • Q: I use multi-indexes heavily in Pandas; how can I integrate this workflow into dask.dataframe?

  • A: Unfortunately, dask.dataframe does not currently support multi-indexes. They would clearly be helpful.
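
For the first question, here is a minimal sketch of the dask.delayed + dask.dataframe.from_delayed pattern. The parse_custom_file function and its columns are stand-ins for whatever your proprietary parser actually returns:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def parse_custom_file(path):
        # Stand-in for the proprietary parser; replace with real logic.
        return pd.DataFrame({"x": [1, 2, 3], "source": [path] * 3})

    paths = ["part-0.dat", "part-1.dat", "part-2.dat"]

    # One delayed pandas DataFrame per input file.
    parts = [dask.delayed(parse_custom_file)(p) for p in paths]

    # `meta` describes the column names/dtypes so dask does not have to
    # compute a partition eagerly to infer them.
    meta = pd.DataFrame({"x": pd.Series(dtype="int64"),
                         "source": pd.Series(dtype="object")})
    df = dd.from_delayed(parts, meta=meta)

    print(df.compute())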
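For the metadata question, one workaround is to keep the metadata in its own structure rather than on the dataframe: have each delayed loader return a (dataframe, errors) pair, build the dask dataframe from the first elements, and compute the error lists alongside it. This is only a sketch under that assumption; parse_with_errors and its column layout are hypothetical:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def parse_with_errors(path):
        # Hypothetical parser that reports per-file problems alongside data.
        errors = []
        try:
            frame = pd.DataFrame({"x": [1, 2]})  # pretend the parse succeeded
        except ValueError as exc:
            errors.append(f"{path}: {exc}")
            frame = pd.DataFrame({"x": pd.Series(dtype="int64")})
        return frame, errors

    paths = ["part-0.dat", "part-1.dat"]

    # nout=2 lets each delayed call be unpacked into two delayed values.
    pairs = [dask.delayed(parse_with_errors, nout=2)(p) for p in paths]
    frames = [frame for frame, _ in pairs]
    error_lists = [errs for _, errs in pairs]

    df = dd.from_delayed(frames,
                         meta=pd.DataFrame({"x": pd.Series(dtype="int64")}))

    # The dataframe and its metadata are computed together but stored apart.
    data, collected = dask.compute(df, error_lists)
    print(data)
    print([msg for errs in collected for msg in errs])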
