
Combine Consecutive Rows With The Same Column Values

I have something that looks like this. How do I go from this: 0 d 0 The DT 1 Skoll ORGANIZATION 2 Foundation ORGANIZATION 3 , , 4

Solution 1:

@rfan's answer of course works. As an alternative, here's an approach using pandas groupby.

The .groupby() groups the data by the 'b' column; sort=False is necessary to keep the original row order intact. The .apply() applies a function to each group of 'a' values, in this case joining the strings together separated by spaces.

In [67]: df.groupby('b', sort=False)['a'].apply(' '.join)
Out[67]: 
b
DT                       The
Org         Skoll Foundation
,                          ,
VBN                    based
IN                        in
Location      Silicon Valley
Name: a, dtype: object
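To see why sort=False matters here, the following is a minimal sketch that runs the same groupby with and without it. The sample frame is an assumption based on the data used elsewhere on this page, and the variable names are illustrative only.

```python
import pandas as pd

# Assumed sample data, mirroring the frame used in the answers on this page.
df = pd.DataFrame({'a': ['The', 'Skoll', 'Foundation', ',',
                         'based', 'in', 'Silicon', 'Valley'],
                   'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
                         'Location', 'Location']})

# sort=False keeps groups in order of first appearance.
unsorted_result = df.groupby('b', sort=False)['a'].apply(' '.join)

# The default (sort=True) orders groups by the sorted key values instead.
sorted_result = df.groupby('b')['a'].apply(' '.join)

print(unsorted_result.index.tolist())
print(sorted_result.index.tolist())
```

With the default sorting, the sentence order is lost, which is why the answer passes sort=False.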

EDIT:

To handle the more general case (repeated non-consecutive values), one approach is to first add a sentinel column that tracks which run of consecutive values each row belongs to, like this:

df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()
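It may help to look at the intermediate steps of that expression. This is a small sketch on a hypothetical label Series (not the data from the question) in which 'Org' appears in two separate runs.

```python
import pandas as pd

# Hypothetical labels; 'Org' occurs in two non-adjacent runs.
b = pd.Series(['DT', 'Org', 'Org', ',', 'Org'])

# True wherever a label differs from the one directly above it.
# The first row is compared against a missing value, so it is always True.
starts_new_run = b != b.shift(1)

# The cumulative sum of those flags gives each consecutive run its own id.
key = starts_new_run.astype(int).cumsum()

print(starts_new_run.tolist())  # [True, True, False, True, True]
print(key.tolist())             # [1, 2, 2, 3, 4]
```

The two 'Org' runs end up with different ids (2 and 4), so grouping on the key keeps them apart.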

Then add the key to the groupby and it should work even with repeated values. For example, with this dummy data with repeats:

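from pandas import DataFrame
df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley', 'A', 'Foundation'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location', 'Org', 'Org']})
# Compute the key on this data (same expression as above)
df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()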

Applying the groupby:

In [897]: df.groupby(['key', 'b'])['a'].apply(' '.join)
Out[897]: 
key  b       
1    DT                       The
2    Org         Skoll Foundation
3    ,                          ,
4    VBN                    based
5    IN                        in
6    Location      Silicon Valley
7    Org             A Foundation
Name: a, dtype: object
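If a flat DataFrame is wanted rather than a Series with a two-level index, the same idea can be wrapped in a helper. This is only a sketch: collapse_runs and its parameter names are hypothetical, and it uses .agg() with a per-column mapping in place of the .apply() call shown above.

```python
import pandas as pd

def collapse_runs(df, text_col='a', label_col='b'):
    """Join text_col over consecutive rows that share the same label_col value."""
    # Id for each run of consecutive equal labels (same trick as the 'key' column).
    run_id = (df[label_col] != df[label_col].shift(1)).cumsum().rename('run')
    return (df.groupby(run_id, sort=False)
              .agg({text_col: ' '.join, label_col: 'first'})
              .reset_index(drop=True))

# Dummy data with a repeated, non-consecutive 'Org' label.
df = pd.DataFrame({'a': ['The', 'Skoll', 'Foundation', ',',
                         'based', 'in', 'Silicon', 'Valley', 'A', 'Foundation'],
                   'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
                         'Location', 'Location', 'Org', 'Org']})

result = collapse_runs(df)
print(result)
```

The helper leaves the input frame untouched, since the run id is kept as a separate Series instead of being added as a column.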

Solution 2:

I actually think the groupby solution by @chrisb is better, but you would need to create another groupby key variable to track non-consecutive repeated values if those are potentially present. The approach below works as a quick-and-dirty solution for smaller problems, though.


I think this is a situation where it's easier to work with basic iterators, rather than try to use pandas functions. I can imagine a situation using groupby, but it seems difficult to maintain the consecutive condition if the second variable repeats.

This can probably be cleaned up, but a sample:

from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location']})

# Initialize result lists with the first row of df
result1 = [df['a'][0]]  
result2 = [df['b'][0]]

# Use zip() to iterate over the two columns of df simultaneously,
# making sure to skip the first row, which is already added
for a, b in zip(df['a'][1:], df['b'][1:]):
    if b == result2[-1]:        # If b matches the last value in result2,
        result1[-1] += " " + a  # add a to the last value of result1
    else:                       # Otherwise add a new row with the values
        result1.append(a)
        result2.append(b)

# Create a new dataframe using these result lists
df = DataFrame({'a': result1, 'b': result2})
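The same iterator idea can also be written with itertools.groupby from the standard library, which only merges adjacent items with equal keys. This is a sketch of a swapped-in technique, not the code from the answer above; the rows list is an assumed plain-Python copy of the sample data.

```python
from itertools import groupby

# Assumed (word, label) pairs mirroring the sample data above.
rows = [('The', 'DT'), ('Skoll', 'Org'), ('Foundation', 'Org'), (',', ','),
        ('based', 'VBN'), ('in', 'IN'),
        ('Silicon', 'Location'), ('Valley', 'Location')]

# itertools.groupby starts a new group whenever the key changes,
# so non-consecutive repeats of a label stay in separate groups.
merged = [(' '.join(word for word, _ in group), label)
          for label, group in groupby(rows, key=lambda row: row[1])]

for text, label in merged:
    print(label, text)
```

From a DataFrame, pairs like these could be produced with zip(df['a'], df['b']), and merged could be turned back into a frame with DataFrame(merged, columns=['a', 'b']).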
