Removing Duplicates From A Series Based On A Symmetric Matrix In Pandas
Solution 1:
The following code runs faster than the currently accepted answer:
import numpy as np
def dm_to_series1(df):
    df = df.astype(float)
    df.values[np.triu_indices_from(df, k=1)] = np.nan
    return df.unstack().dropna()
The type of the DataFrame is converted to float so that elements can be nulled with np.nan. In practice, a distance matrix would probably already store floats, so this step may not be strictly necessary. The upper triangle (excluding the diagonal) is nulled, and these entries are removed after converting the DataFrame to a Series.
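To make the masking step concrete, here is a tiny sketch of my own (not from the original answer) showing which positions np.triu_indices_from selects on a 3x3 matrix:
import numpy as np

# Illustration only: for a 3x3 matrix, np.triu_indices_from(..., k=1) selects
# the strictly-upper-triangular positions, which dm_to_series1 sets to NaN.
m = np.zeros((3, 3))
rows, cols = np.triu_indices_from(m, k=1)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # [(0, 1), (0, 2), (1, 2)]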
I adapted the currently accepted solution in order to compare runtimes. Note that I updated it to use a set for the membership checks (instead of a list) for faster runtime:
def dm_to_series2(df):
    ser = df.stack()
    seen = set()  # set membership tests are O(1), unlike a list
    keep = []     # preserve the original (row, column) order for indexing
    for tup in ser.index.tolist():
        if tup[::-1] in seen:
            continue
        seen.add(tup)
        keep.append(tup)
    return ser[keep]
Testing the two solutions on the original example dataset:
import pandas as pd
df = pd.DataFrame([[0, 1, 2],
                   [1, 0, 3],
                   [2, 3, 0]],
                  index=['a', 'b', 'c'],
                  columns=['a', 'b', 'c'])
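As a quick sanity check (my own sketch, not part of either answer), the two functions should agree on this symmetric frame once the dtypes are aligned:
# Both should keep the six entries (a, a), (a, b), (a, c), (b, b), (b, c), (c, c).
s1 = dm_to_series1(df)                 # returns floats
s2 = dm_to_series2(df).astype(float)   # cast so equals() compares like dtypes
assert s1.sort_index().equals(s2.sort_index())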
My solution:
In [4]: %timeit dm_to_series1(df)
1000 loops, best of 3: 538 µs per loop
@Marius' solution:
In [5]: %timeit dm_to_series2(df)
1000 loops, best of 3: 816 µs per loop
I also tested against a larger distance matrix by randomly generating a 50x50 matrix using scikit-bio's skbio.stats.distance.randdm function and converting that to a DataFrame:
from skbio.stats.distance import randdm

big_dm = randdm(50)
big_df = pd.DataFrame(big_dm.data, index=big_dm.ids, columns=big_dm.ids)
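If scikit-bio is not installed, a rough NumPy-only alternative (my assumption, not part of the original answer) can produce a comparable random symmetric test matrix:
import numpy as np
import pandas as pd

# Build a random 50x50 symmetric matrix with a zero diagonal, like a distance matrix.
n = 50
rng = np.random.default_rng(0)
a = rng.random((n, n))
data = (a + a.T) / 2           # symmetrize
np.fill_diagonal(data, 0.0)    # zero self-distances
ids = [str(i) for i in range(n)]
big_df = pd.DataFrame(data, index=ids, columns=ids)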
My solution:
In [7]: %timeit dm_to_series1(big_df)
1000 loops, best of 3: 649 µs per loop
@Marius' solution:
In [8]: %timeit dm_to_series2(big_df)
100 loops, best of 3: 3.61 ms per loop
Note that my solution may not be as memory-efficient as @Marius' solution because I'm creating a copy of the input DataFrame and making modifications to it. If it is acceptable to modify the input DataFrame, the code could be updated to be more memory-efficient by using in-place DataFrame operations.
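As an illustration of that idea, a hypothetical in-place variant might look like the following sketch; it assumes the input already holds floats and, by design, it mutates the caller's DataFrame:
import numpy as np
import pandas as pd

def dm_to_series1_inplace(df):
    # Hypothetical variant (not from the original answer): skips the astype()
    # copy, so df must already hold floats; the caller's DataFrame is modified.
    mask = pd.DataFrame(np.triu(np.ones(df.shape, dtype=bool), k=1),
                        index=df.index, columns=df.columns)
    df[mask] = np.nan            # null the upper triangle in place
    return df.unstack().dropna()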
Note: my solution was inspired by the answers in this SO question.
Solution 2:
I'm not sure about how efficient this is, but this works:
# ser is the stacked distance matrix, i.e. ser = df.stack() as in dm_to_series2 above
seen = []
for tup in ser.index.tolist():
    if tup[::-1] in seen:
        continue
    seen.append(tup)

ser_reduced = ser[seen]
ser_reduced
Out[9]:
a  a    0
   b    1
   c    2
b  b    0
   c    3
c  c    0
dtype: int64