Difference Of Sets Of Multiple Values For Single Column In Pandas
I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference
Solution 1:
Setup
df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple
import pandas as pd

# Split each comma-separated string into a list, then freeze it into a set.
tokens = df.Tokens.str.split(',', expand=False).apply(frozenset)
tokens
0 (apple, banana)
1 (orange, apple)
2 (banana)
3 (orange, kumquat)
4 (orange)
5 (orange, pear)
6 (orange, pear)
7 (apple, banana, pear)
8 (persimmon, banana)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object
# union logic - https://stackoverflow.com/a/46402781/4909087
df = df.assign(Tokens=tokens)\
       .groupby(['Dyad', 'Participant']).apply(\
           lambda x: (x.Tokens.str.len() -
                      x.Tokens.diff().str.len()) \
                     / pd.Series([len(k[0].union(k[1]))
                                  for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))\
       .reset_index(level=[0, 1], name='TokenOverlap')\
       .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
       .sort_values(['Dyad', 'Timestep', 'Participant'])\
       .fillna('(no value)')\
       [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
df
    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1
In a nutshell, this code groups by Dyad and Participant, and then, for each pair of consecutive rows within a group, computes the ratio of shared tokens to the union of tokens. This needs a somewhat complicated groupby and apply, since we need a few set union and difference operations. The core logic is inside the groupby.apply, while the rest is just prettification.
This code runs in:
10 loops, best of 3: 19.2 ms per loop
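To make the ratio computed inside the groupby.apply easier to follow, here is a minimal pure-Python sketch of the same arithmetic for a single group. It is an illustration only, not the answer's code: the names overlap_ratios and participant_a are hypothetical, and None stands in for the '(no value)' placeholder.

```python
def overlap_ratios(token_sets):
    """For each set, return |current & previous| / |current | previous|.

    The first set has no predecessor, so its ratio is None.
    (Hypothetical helper for illustration; not part of the original answer.)
    """
    if not token_sets:
        return []
    ratios = [None]
    for previous, current in zip(token_sets, token_sets[1:]):
        # Same arithmetic as the pandas version: size of the current set
        # minus the size of its difference with the previous set.
        shared = len(current) - len(current - previous)
        union = len(current | previous)
        ratios.append(shared / union if union else None)
    return ratios


# Participant A of Dyad 1, taken from the Setup table.
participant_a = [frozenset(s.split(',')) for s in ['apple,banana', 'banana', 'orange']]
print(overlap_ratios(participant_a))  # [None, 0.5, 0.0]
```

The printed values match the TokenOverlap entries for Dyad 1, Participant A in the output table.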
Breakdown
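# Replace the comma-separated strings with the frozensets built earlier.
df2 = df.assign(Tokens=tokens)
# Per (Dyad, Participant) group: tokens shared with the previous row, divided by the size of the union.
df2 = df2.groupby(['Dyad', 'Participant']).apply(\
    lambda x: (x.Tokens.str.len() -
               x.Tokens.diff().str.len()) \
              / pd.Series([len(k[0].union(k[1]))
                           for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))  # the for loop is part of this huge line
# Turn the group keys back into columns and name the result column.
df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')
# Bring back the Timestep and Tokens columns from df (aligned on the index).
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
# Sort for display and label the first row of each group, which has no predecessor.
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')
# Reorder the columns.
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]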