
Handling Ragged CSV Columns in Pandas

I have a CSV file containing data (only the first rows of data are listed):

0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93

Solution 1:

For your first problem, you can pass a names=... parameter to read_csv:

df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
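As a minimal sketch of why names=... helps, the snippet below reads the sample rows from the question out of an in-memory string instead of 'df0.txt'. The variables raw and widest are illustrative only; widest is computed from the data rather than hard-coded to 121. Without names, read_csv sizes the table from the first row and fails on longer rows; with names, shorter rows are padded with NaN.

```python
import io

import pandas as pd

# Sample rows copied from the question; each row has a different number of fields.
raw = """0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
"""

# Width of the longest row (an assumption: no quoted commas in the data).
widest = max(len(line.split(",")) for line in raw.splitlines())

# Supplying enough column names lets pandas pad short rows with NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(widest))

print(df.shape)  # (7, 12)
```

Columns that contain padding are read as float64, because NaN is a floating-point value.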

As for your second problem, there's an existing solution that uses sklearn.preprocessing.OneHotEncoder. If you want to convert each column to a one-hot encoding, you can use it.

Solution 2:

I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking. Based on my own ML knowledge and your question, I took you to be asking the following:

1.) You have a CSV of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) Example: a CSV such as:

1, 3
2, 3, 6

would be the feature matrix

Column:
1, 2, 3, 6

1, 0, 1, 0
0, 1, 1, 1

Thus this code achieves that, but it is surely not optimized:

import pandas as pd

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify
    # which columns of df2 are new and which already exist in df1.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))

    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)

    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]

    return df3

# `file` is the path to your CSV
df = pd.read_csv(file, header=None, names=range(121), sep=',')

# One indicator frame per CSV column; dtype=int gives 1/0 instead of True/False
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k], dtype=int))

# Merge the per-column indicator frames into a single frame
df = one_hot[0]
for frame in one_hot[1:]:
    df = func(df1=df, df2=frame)
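For comparison, here is a more compact sketch of the same idea, run on the two-row example above. It is an alternative under the same assumptions (every value in a row is a class label), not a drop-in replacement for the loop. The names raw, values and indicator are illustrative.

```python
import io

import pandas as pd

# The two-row example from above.
raw = "1,3\n2,3,6\n"
df = pd.read_csv(io.StringIO(raw), header=None, names=range(3))

# Flatten to one value per entry, keeping the original row index,
# then drop the NaN padding and restore integer labels.
values = df.melt(ignore_index=False)["value"].dropna().astype(int)

# One indicator column per distinct value, then collapse back to one row
# per original CSV row (max acts as a logical OR over that row's entries).
indicator = pd.get_dummies(values, dtype=int).groupby(level=0).max()

print(indicator)
```

On this input the columns are 1, 2, 3, 6 and the rows are 1, 0, 1, 0 and 0, 1, 1, 1, which matches the feature matrix described above.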

From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.

That would look like this:

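import sys
from sklearn.preprocessing import OneHotEncoder

# The data goes to fit_transform, not to the constructor
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df.to_numpy())  # SciPy sparse matrix by default
sys.getsizeof(onehot)  # smaller than Pandas, but this is only the shallow object size
sys.getsizeof(df)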
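If each CSV line really is a set of class labels, scikit-learn's MultiLabelBinarizer is another option that builds the 0/1 matrix directly and can return it in sparse form. This is a swapped-in technique rather than the OneHotEncoder route above, shown as a rough sketch on the two-row example; raw, rows and nbytes are illustrative names.

```python
import csv
import io

from sklearn.preprocessing import MultiLabelBinarizer

# The two-row example from above, parsed without pandas so ragged rows are no issue.
raw = "1,3\n2,3,6\n"
rows = [[int(v) for v in row] for row in csv.reader(io.StringIO(raw))]

# sparse_output=True returns a SciPy CSR matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(rows)

# Memory held by the sparse matrix's underlying arrays.
nbytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

print(mlb.classes_)
print(matrix.toarray())
print(nbytes)
```

Summing the nbytes of data, indices and indptr gives a more meaningful size than sys.getsizeof, which does not follow references into those arrays.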

I'm unsure whether the assumptions I noted above match what you want done with your data; it seems perhaps they don't.

I took each line in your CSV to indicate the classes that exist for that row. I'm still a little unclear on it.
