Convert Dataframe Rows To Python Set
I have this dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), col
Solution 1:
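Convert each row to a frozenset, then compare every row against every other row with a jaccard function (not defined in this answer; Solution 2 gives one as jaccard_similarity_score):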
series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
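Since the snippet above leaves `jaccard` undefined, here is a minimal runnable sketch of the same nested-apply idea. The `jaccard` helper and the three-row `demo` frame are assumptions made for illustration; they are not part of the original question or answer.

```python
import pandas as pd


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection / size of union."""
    union = len(a | b)
    # Treat two empty sets as identical (an assumption, to avoid dividing by zero).
    return len(a & b) / union if union else 1.0


# Hypothetical small frame, used only to keep the example readable.
demo = pd.DataFrame([["A", "1", "plus"],
                     ["A", "1", "minus"],
                     ["B", "2", "square"]])

# One frozenset per row.
row_sets = demo.apply(frozenset, axis=1)

# Outer apply walks the rows, inner apply compares that row with every row;
# because the inner call returns a Series, the result is a DataFrame.
similarity = row_sets.apply(lambda a: row_sets.apply(lambda b: jaccard(a, b)))
print(similarity)
```

The result is a square DataFrame whose entry at (i, j) is the similarity between row i and row j, so the diagonal is all ones.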
Solution 2:
You could get rid of the nested apply by vectorizing your function. First, build all pairwise combinations of the row sets, then pass them to a vectorized version of your function:
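import numpy as np

def jaccard_similarity_score(a, b):
    # intersection size / union size (union via inclusion-exclusion)
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

i = df.apply(frozenset, axis=1).to_frame()    # one frozenset per row
j = i.assign(foo=1)                           # constant key to force a cross join
k = j.merge(j, on='foo').drop(columns='foo')  # every (row, row) pair
k.columns = ['A', 'B']
fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)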
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
       [ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
       [ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
       [ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
       [ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
       [ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
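As a side note on the pairing step: the constant `foo` key exists only to force a cross join. On pandas 1.2 or later, `merge(how='cross')` expresses the same thing directly. The sketch below assumes that pandas version and uses a hypothetical three-row `demo` frame rather than the question's data.

```python
import numpy as np
import pandas as pd


def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Hypothetical small frame, used only for illustration.
demo = pd.DataFrame([["A", "1", "plus"],
                     ["A", "1", "minus"],
                     ["B", "2", "square"]])

# One frozenset per row, held in a single column named "A".
sets = demo.apply(frozenset, axis=1).to_frame(name="A")

# Cross join: every row of `sets` paired with every row of the renamed copy.
pairs = sets.merge(sets.rename(columns={"A": "B"}), how="cross")

# Score each pair, then fold the flat result back into an n x n matrix.
scores = np.vectorize(jaccard_similarity_score)(pairs["A"], pairs["B"])
matrix = scores.reshape(len(demo), -1)
print(matrix)
```

The cross join keeps the left rows in order and cycles through the right rows for each of them, which is why a plain `reshape` recovers the row-by-row matrix.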
The np.vectorize version is already faster than the nested apply, but let's see if we can get even faster.
Using senderle's fast cartesian_product:
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    # one axis per input array, plus a last axis holding each combination
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
       [ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
       [ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
       [ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
       [ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
       [ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
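Both result matrices are symmetric with ones on the diagonal, so roughly half of the pairwise calls are redundant. The sketch below is not from either answer. It is one possible way to use that symmetry with `itertools.combinations`, shown on a hypothetical three-row `demo` frame and assuming no row is empty.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Hypothetical small frame, used only for illustration.
demo = pd.DataFrame([["A", "1", "plus"],
                     ["A", "1", "minus"],
                     ["B", "2", "square"]])

sets = demo.apply(frozenset, axis=1).tolist()
n = len(sets)

# Start from the identity: a non-empty set has similarity 1 with itself.
matrix = np.eye(n)

# Visit each unordered pair once and mirror the score across the diagonal.
for (p, a), (q, b) in combinations(enumerate(sets), 2):
    matrix[p, q] = matrix[q, p] = jaccard_similarity_score(a, b)

print(matrix)
```

This still loops in Python, so whether it beats the approaches above depends on the data size; it is worth timing on your own frame before adopting it.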