Handling Ragged CSV Columns in Pandas
Solution 1:
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
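To see what names= does on ragged input, here is a minimal sketch. The inline data and the column count of 4 are assumptions for illustration only; in the real file you would pick an upper bound such as 121. Without names=, read_csv infers the column count from the first line and raises a parser error on the first longer row.

```python
import io
import pandas as pd

# Hypothetical ragged data standing in for df0.txt: rows have 2 and 3 fields.
raw = "1,3\n2,3,6\n"

# names=range(4) is an assumed upper bound on the row width.
# Short rows are padded with NaN on the right.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(4), sep=',')

# Columns beyond the widest row are entirely NaN and can be dropped.
df = df.dropna(axis=1, how="all")
print(df)
```

Choosing a bound that is too large is harmless, since the unused columns are simply empty. A bound that is too small is not safe, because pandas then has more fields than names to assign them to.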
As for your second problem, there's an existing solution that uses sklearn's OneHotEncoder (sklearn.preprocessing.OneHotEncoder). If you are looking to convert each column to a one-hot encoding, you may use it.
Solution 2:
I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a CSV of numbers.
2.) This is for a problem with 120 classes.
3.) You want a matrix with 1s and 0s for each class.
4.) Example: a CSV such as:
1, 3
2, 3, 6
would become the following feature matrix, where the first row lists the column labels:
1, 2, 3, 6
1, 0, 1, 0
0, 1, 1, 1
This code achieves that, but it is surely not optimized:
import pandas as pd

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify them.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

# `file` is the path to your CSV
df = pd.read_csv(file, header=None, names=range(121), sep=',')

# One-hot encode each column separately (dtype=int gives 1s and 0s)
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k], dtype=int))

# Merge the per-column encodings into a single matrix
for n, l in enumerate(one_hot):
    if n == 0:
        df = one_hot[n]
    else:
        df = func(df1=df, df2=one_hot[n])
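A more compact way to build the same indicator matrix might look like the sketch below. It is not the approach used above. It swaps the column-by-column join for stack plus a grouped max, and the inline data and the width of 3 are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical ragged input: each row lists the class labels present.
raw = "1,3\n2,3,6\n"

# names=range(3) is an assumption: it must be at least as wide as the widest row.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(3))

# stack() gives one entry per cell, indexed by (row, column);
# dropna() removes the NaN padding that short rows received.
labels = df.stack().dropna().astype(int)

# One indicator column per distinct label, then collapse back to
# one row per input line by taking the max within each original row.
indicators = pd.get_dummies(labels, dtype=int).groupby(level=0).max()
print(indicators)
```

Taking the max rather than the sum keeps every entry at 0 or 1 even if a label is repeated within a line.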
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
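from sklearn.preprocessing import OneHotEncoder
import sys
# OneHotEncoder takes no data in its constructor; fit_transform does the encoding
onehot = OneHotEncoder().fit_transform(df)  # SciPy sparse matrix by default
sys.getsizeof(onehot)  # smaller than Pandas (note: getsizeof reports only the object's shallow size)
sys.getsizeof(df)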
I'm unsure whether the assumptions I noted above match what you want done with your data; it seems perhaps they don't.
I took each line in your CSV to indicate the classes present for that line, but I'm still a little unclear on it.