Remove Duplicates From Rows And Columns (cell) In A Dataframe, Python
I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index  x  y
1      1  ec, us, us, gbr, lst
2      5  ec, us, us, us
Solution 1:
Split, apply set, and join, i.e.:

df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object
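For reference, a minimal self-contained version of that chain, using sample data shaped like the question's (column names assumed; sets are unordered, so the joined order can vary between runs):

import pandas as pd

# sample data modeled on the question
df = pd.DataFrame({'x': [1, 5],
                   'y': ['ec, us, us, gbr, lst', 'ec, us, us, us']})

# split each cell into items, deduplicate with set, rejoin
df['y'] = df['y'].str.split(', ').apply(set).str.join(', ')
print(df['y'].tolist())  # e.g. ['us, ec, gbr, lst', 'us, ec']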
Update based on comment:

df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)
# Replace all the braces and nan with `''`, then split and apply set and join
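As a rough illustration of what each step handles, here is the same chain on made-up messy input containing braces and the string 'nan' (the exact input shape is an assumption based on the comment):

import pandas as pd

s = pd.Series(['{ec, us, nan, us}', '{us, nan, us}'])  # assumed messy input
cleaned = (s.str.replace(r'nan|[{}\s]', '', regex=True)  # drop braces, whitespace and 'nan'
            .str.split(',').apply(set).str.join(',')     # deduplicate items
            .str.strip(',')                              # trim leading/trailing commas
            .str.replace(r',{2,}', ',', regex=True))     # collapse empty slots
print(cleaned.tolist())  # e.g. ['ec,us', 'us']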
Solution 2:
If you don't care about item order, and assuming everything in column y is a string, you can use the following snippet:
df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))
The set() conversion is what removes duplicates. Note that sets are unordered in every Python version; it is dicts that guarantee insertion order (since Python 3.7), so if you need a stable order, use dict.fromkeys() instead, as sketched below.
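A minimal sketch of that order-preserving variant:

import pandas as pd

df = pd.DataFrame({'y': ['ec, us, us, gbr, lst', 'ec, us, us, us']})

# dict.fromkeys keeps the first occurrence of each item and
# preserves insertion order (guaranteed since Python 3.7)
df['y'] = df['y'].apply(lambda s: ', '.join(dict.fromkeys(s.split(', '))))
print(df['y'].tolist())  # ['ec, us, gbr, lst', 'ec, us']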
Solution 3:
Use the apply method on the DataFrame:
# change this function according to your needs
def dedup(row):
    # split first so set() removes duplicate items, not characters
    return list(set(row.y.split(', ')))
df['deduped'] = df.apply(dedup, axis=1)
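Note that df.apply(..., axis=1) passes the whole row to the function; when only one column needs deduplicating, applying to that Series directly is simpler. A minimal sketch under the same assumptions about the data:

import pandas as pd

df = pd.DataFrame({'y': ['ec, us, us, gbr, lst', 'ec, us, us, us']})

# operate on the single column instead of whole rows
df['deduped'] = df['y'].apply(lambda s: list(set(s.split(', '))))
print(df['deduped'].tolist())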