
Removing Duplicates From A Series Based On A Symmetric Matrix In Pandas

I am new to Pandas and have been unable to find a succinct solution to the following problem. Say I have a Series of data based on a symmetric (distance) matrix: what is the most concise way to remove the duplicate entries, given that the distance from a to b is the same as the distance from b to a?

Solution 1:

The following code runs faster than the currently accepted answer:

import numpy as np

def dm_to_series1(df):
    df = df.astype(float)
    df.values[np.triu_indices_from(df, k=1)] = np.nan
    return df.unstack().dropna()

The type of the DataFrame is converted to float so that elements can be nulled with np.nan. In practice, a distance matrix would probably already store floats so this step may not be strictly necessary. The upper triangle (excluding the diagonal) is nulled and these entries are removed after converting the DataFrame to a Series.
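To make those steps concrete, here is a short, self-contained walk-through on the same 3x3 example matrix used for the timings below; the intermediate masked DataFrame and the final Series are shown in the comments:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2],
                   [1, 0, 3],
                   [2, 3, 0]],
                  index=['a', 'b', 'c'],
                  columns=['a', 'b', 'c'])

df = df.astype(float)                              # so np.nan can be stored
df.values[np.triu_indices_from(df, k=1)] = np.nan  # null the upper triangle

# df is now:
#      a    b    c
# a  0.0  NaN  NaN
# b  1.0  0.0  NaN
# c  2.0  3.0  0.0

ser = df.unstack().dropna()  # Series indexed by (column, row); NaNs dropped
# a  a    0.0
#    b    1.0
#    c    2.0
# b  b    0.0
#    c    3.0
# c  c    0.0
# dtype: float64

Note that unstack() puts the column label first in the resulting MultiIndex, which is why the surviving off-diagonal entries appear as ('a', 'b') rather than ('b', 'a').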

I adapted the currently accepted solution in order to compare runtimes. Note that I updated it to use a set for the duplicate check, which is faster than testing membership in a list; a list is still used to index the Series at the end so that the order of entries is preserved:

def dm_to_series2(df):
    ser = df.stack()

    seen = set()
    keep = []
    for tup in ser.index.tolist():
        if tup[::-1] in seen:
            continue
        seen.add(tup)
        keep.append(tup)

    return ser[keep]

Testing the two solutions on the original example dataset:

import pandas as pd

df = pd.DataFrame([[0, 1, 2],
                   [1, 0, 3],
                   [2, 3, 0]], 
                  index=['a', 'b', 'c'], 
                  columns=['a', 'b', 'c'])

My solution:

In [4]: %timeit dm_to_series1(df)
1000 loops, best of 3: 538 µs per loop

@Marius' solution:

In [5]: %timeit dm_to_series2(df)
1000 loops, best of 3: 816 µs per loop

I also tested against a larger distance matrix by randomly generating a 50x50 matrix using scikit-bio's skbio.stats.distance.randdm function and converting that to a DataFrame:

from skbio.stats.distance import randdm
big_dm = randdm(50)
big_df = pd.DataFrame(big_dm.data, index=big_dm.ids, columns=big_dm.ids)

My solution:

In [7]: %timeit dm_to_series1(big_df)
1000 loops, best of 3: 649 µs per loop

@Marius' solution:

In [8]: %timeit dm_to_series2(big_df)
100 loops, best of 3: 3.61 ms per loop

Note that my solution may not be as memory-efficient as @Marius' solution because I'm creating a copy of the input DataFrame and making modifications to it. If it is acceptable to modify the input DataFrame, the code could be updated to be more memory-efficient by using in-place DataFrame operations.
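For instance, a minimal sketch of such an in-place variant might look like the following (the function name is illustrative; it assumes the caller's DataFrame already holds floats and may be mutated):

import numpy as np

def dm_to_series1_inplace(df):
    """Like dm_to_series1, but nulls the upper triangle of df in place
    instead of working on a float copy. Assumes df already has a float
    dtype (np.nan cannot be stored in integer columns) and that mutating
    the caller's DataFrame is acceptable."""
    df.values[np.triu_indices_from(df, k=1)] = np.nan
    return df.unstack().dropna()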

Note: my solution was inspired by the answers in this SO question.

Solution 2:

I'm not sure how efficient this is, but this works (here ser is the stacked form of the distance-matrix DataFrame, i.e. ser = df.stack()):

seen = []

for tup in ser.index.tolist():
    if tup[::-1] in seen:
        continue
    seen.append(tup)

ser_reduced = ser[seen]

ser_reduced
Out[9]: 
a  a    0
   b    1
   c    2
b  b    0
   c    3
c  c    0
dtype: int64
