Difference Of Sets Of Multiple Values For Single Column In Pandas

January 31, 2024

I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference

Solution 1:

Setup

df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple

import pandas as pd

# Split each comma-separated string into a list, then freeze it into a set
tokens = df.Tokens.str.split(',', expand=False).apply(frozenset)

tokens
0           (apple, banana)
1           (orange, apple)
2                  (banana)
3         (orange, kumquat)
4                  (orange)
5            (orange, pear)
6            (orange, pear)
7     (apple, banana, pear)
8       (persimmon, banana)
9                   (apple)
10                 (banana)
11                  (apple)
Name: Tokens, dtype: object
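
The grouped expression that follows leans on a few frozenset behaviours. Here is a minimal plain-Python sketch of them, using the two token sets of Dyad 1, Participant A at timesteps 1 and 2 (the names `prev`, `cur`, `new_tokens`, `shared` and `union_size` are illustrative and not part of the original answer):

```python
# Token sets for Dyad 1, Participant A at timesteps 1 and 2.
prev = frozenset("apple,banana".split(","))
cur = frozenset("banana".split(","))

# Subtraction on frozensets is set difference: tokens in cur but not in prev.
new_tokens = cur - prev

# len(cur) - len(cur - prev) is therefore the number of shared tokens.
shared = len(cur) - len(new_tokens)

# Size of the union of both sets.
union_size = len(cur.union(prev))

# union() accepts any iterable, so an empty string acts like an empty set.
# This is why filling the missing first "previous" value with '' is harmless.
same = cur.union("")

print(shared, union_size, shared / union_size)
```

The last value printed is the overlap ratio for that row, which should match the TokenOverlap reported for index 2 in the final result.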

# union logic - https://stackoverflow.com/a/46402781/4909087
df = df.assign(Tokens=tokens)\
       .groupby(['Dyad', 'Participant']).apply(
           lambda x: (x.Tokens.str.len() - x.Tokens.diff().str.len())
                     / pd.Series(
                         [len(k[0].union(k[1]))
                          for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))],
                         index=x.index))\
       .reset_index(level=[0, 1], name='TokenOverlap')\
       .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
       .sort_values(['Dyad', 'Timestep', 'Participant'])\
       .fillna('(no value)')\
       [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]

df

    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1

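In a nutshell, this code groups by Dyad and Participant and then, for each row, computes the ratio of the number of tokens shared with the previous timestep to the size of the union of the two token sets. The first timestep in each group has no predecessor, so it gets '(no value)'. This needs a somewhat complicated groupby and apply, since it involves a few set union and difference operations. The core logic is inside the groupby.apply, while the rest is just prettification.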
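
To make the per-group ratio easier to follow, here is a plain-Python sketch of the same idea without pandas. The helper name `token_overlap` is hypothetical and not from the original answer, and it uses the `&` and `|` set operators directly rather than the length-difference trick, which should give the same numbers:

```python
def token_overlap(token_sets):
    """For each set, return |cur & prev| / |cur | prev| against its predecessor.

    The first element has no predecessor, so its overlap is None.
    """
    if not token_sets:
        return []
    overlaps = [None]
    for prev, cur in zip(token_sets, token_sets[1:]):
        union = prev | cur
        # Guard against two empty sets, where the ratio is undefined.
        overlaps.append(len(prev & cur) / len(union) if union else None)
    return overlaps


# Token sets for Dyad 1, Participant B at timesteps 1, 2 and 3.
dyad1_b = [frozenset(s.split(","))
           for s in ["apple,orange", "orange,kumquat", "orange,pear"]]
result = token_overlap(dyad1_b)
print(result)
```

Each consecutive pair here shares only "orange" out of three distinct tokens, so both ratios come out to one third, in line with the 0.333333 entries for Participant B of Dyad 1 in the table above.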

This code runs in:

10 loops, best of 3: 19.2 ms per loop

Breakdown

df2 = df.assign(Tokens=tokens)
# Core logic: per (Dyad, Participant) group, shared tokens / size of union
df2 = df2.groupby(['Dyad', 'Participant']).apply(
          lambda x: (x.Tokens.str.len() - x.Tokens.diff().str.len())
                    / pd.Series(
                        [len(k[0].union(k[1]))
                         for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))],
                        index=x.index))

# Move the group keys back into columns and name the result
df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')
# Restore the original Timestep and string Tokens columns
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
# Order the rows and label the first timestep of each group
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')
# Put the columns in their final order
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
