Assign (add) A New Column To A Dask Dataframe Based On Values Of 2 Existing Columns - Involves A Conditional Statement
Solution 1:
You can either use fillna (fast) or you can use apply (slow but flexible).
Fillna
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150
>>> ddf.compute()
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
Of course, because your function uses y, the result will be null whenever y is null. I'm assuming that you didn't intend this, so I changed the output slightly.
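When the condition is something other than a plain null check, fillna no longer fits, but the same vectorized style is available through Series.where. The following is a minimal sketch in plain pandas, assuming the same sample frame as above. Dask series also provide a where method, but the Dask form is not shown here.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})

# Series.where keeps the original value where the condition is True
# and takes the replacement (x + 100) where it is False.
df['z'] = df['y'].where(df['y'].notnull(), df['x'] + 100)

print(df)
```

Any boolean series can stand in for the notnull() condition, which is what makes this form more general than fillna.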
Use apply
As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
   ...:
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...:
In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
Solution 2:
I do not have any experience with Dask, but the boolean test in funcUpdate will not catch the 2nd element as null. Null values in pandas are None or NaN, not "".
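def funcUpdate(row):
    try:
        return round((1 + row['x'])/(1 + 1/row['y']), 4)
    except (TypeError, ZeroDivisionError):
        # Fall back to y itself when the arithmetic fails (e.g. y is None or "")
        return row['y']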
This is a possible workaround, but you would need to validate the data beforehand.
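To illustrate why a comparison against "" misses the missing value, and why NaN does not trigger the except branch, here is a small sketch using only the standard library. The helper names is_missing and func_update are hypothetical and not part of pandas or Dask. In pandas code, pd.isnull covers both None and NaN.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def func_update(x, y):
    """Apply the ratio formula, returning y unchanged when it is missing or zero."""
    if is_missing(y) or y == 0:
        return y
    return round((1 + x) / (1 + 1 / y), 4)


nan = float("nan")

# NaN is not equal to the empty string, so a test against "" lets it through.
print(nan == "")

# Arithmetic on NaN raises no exception; the result is simply NaN again.
print((1 + 2) / (1 + 1 / nan))

# An explicit null check handles the missing value before any arithmetic runs.
print(func_update(2, nan))
print(func_update(1, 0.2))
```

Checking for missing values up front keeps the fallback explicit instead of relying on an exception that NaN never raises.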