Difference Of Sets Of Multiple Values For Single Column In Pandas

I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference

Solution 1:

Setup

df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple

tokens = df.Tokens.str.split(',', expand=False).apply(frozenset) 

tokens
0           (apple, banana)
1           (orange, apple)
2                  (banana)
3         (orange, kumquat)
4                  (orange)
5            (orange, pear)
6            (orange, pear)
7     (apple, banana, pear)
8       (persimmon, banana)
9                   (apple)
10                 (banana)
11                  (apple)
Name: Tokens, dtype: object

import pandas as pd  # needed for pd.Series below

# union logic - https://stackoverflow.com/a/46402781/4909087
df =  df.assign(Tokens=tokens)\
        .groupby(['Dyad', 'Participant']).apply(\
               lambda x: (x.Tokens.str.len() - 
                      x.Tokens.diff().str.len()) \
                    / pd.Series([len(k[0].union(k[1])) 
   for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))\
        .reset_index(level=[0, 1], name='TokenOverlap')\
        .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
        .sort_values(['Dyad', 'Timestep', 'Participant'])\
        .fillna('(no value)')\
         [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]

df

    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1

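In a nutshell, this code groups by Dyad and Participant and then, for each row, computes a pairwise ratio against the previous timestep in the same group: the number of tokens shared with the previous set divided by the size of the union of the two sets. The numerator is the current set's length minus the length of its difference from the previous set, which is the size of their intersection. The first timestep in each group has no predecessor, so it comes out as NaN and is shown as '(no value)'. This needs a somewhat complicated groupby and apply, since a few set union and difference operations are involved. The core logic is inside the groupby.apply, while the rest is just prettification.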
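To make the ratio easier to check by hand, here is a minimal sketch that computes the same intersection-over-union value with explicit set operations and a plain loop, rather than the groupby.apply one-liner above. The helper name jaccard and the variables token_sets, overlap and result are my own (hypothetical, not part of the original answer), and the frame is just the Dyad 1 rows from the setup.

```python
import pandas as pd

def jaccard(current, previous):
    """Return |current & previous| / |current | previous| for two sets."""
    union = current | previous
    return len(current & previous) / len(union) if union else float("nan")

# Dyad 1 rows from the setup above
df = pd.DataFrame({
    "Dyad": [1, 1, 1, 1, 1, 1],
    "Participant": ["A", "B", "A", "B", "A", "B"],
    "Timestep": [1, 1, 2, 2, 3, 3],
    "Tokens": ["apple,banana", "apple,orange", "banana",
               "orange,kumquat", "orange", "orange,pear"],
})

# One frozenset of tokens per row
token_sets = df["Tokens"].str.split(",").apply(frozenset)

# Start with NaN everywhere; the first timestep of each group stays NaN
overlap = pd.Series(float("nan"), index=df.index, name="TokenOverlap")

for _, rows in df.sort_values("Timestep").groupby(["Dyad", "Participant"]):
    # Walk consecutive (previous, current) row labels within the group
    for prev_idx, cur_idx in zip(rows.index[:-1], rows.index[1:]):
        overlap.loc[cur_idx] = jaccard(token_sets.loc[cur_idx],
                                       token_sets.loc[prev_idx])

result = df.assign(TokenOverlap=overlap)
print(result)
```

For these six rows the loop should give the same numbers as the table above, with NaN in place of '(no value)' for the first timestep. It is slower than a vectorised version would be on large frames, but it keeps the set logic in one small function that can be tested on its own.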

This code runs in:

10 loops, best of 3: 19.2 ms per loop

Breakdown

df2 = df.assign(Tokens=tokens)
df2 = df2.groupby(['Dyad', 'Participant']).apply(\
                   lambda x: (x.Tokens.str.len() - 
                          x.Tokens.diff().str.len()) \
                        / pd.Series([len(k[0].union(k[1])) 
       for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index)) # the for loop is part of this huge line

df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')    
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')    
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]