
What Is A More Efficient Way To Load 1 Column With 1 000 000+ Rows Than Pandas Read_csv()?

I'm trying to import large files (.tab/.txt, 300+ columns and 1 000 000+ rows) in Python. The files are tab separated. The columns are filled with integer values. One of my goals is …

Solution 1:

You can try something like this:

samples = []
sums = []

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:  # supposing the sample_name is the first row of each column
            samples = columns  # save sample names
            sums = [0 for s in samples]  # init the sums to 0
        else:
            for n, v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples,sums)) #{sample_name:sum, ...}

I am not sure this will work, since I don't know the content of your input file, but it describes the general procedure: you open the file only once, iterate over each line, split it to get the columns, and accumulate the data you need. Mind that this code does not deal with missing values.
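If missing values can occur, one option is to skip empty cells in the inner loop. This is only a sketch that assumes missing entries appear as empty strings or 'NA'; the actual markers in your file may differ:

for n, v in enumerate(columns):
    if v in ('', 'NA'):  # assumed missing-value markers; adjust to your file
        continue
    sums[n] += float(v)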

The else block can be improved using numpy:

import numpy as np
...
else:
    sums = np.add(sums, list(map(float, columns)))  # materialize map() so numpy gets a sequence (Python 3)
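For completeness, here is a sketch of the whole loop with numpy accumulation, under the same assumptions as above (tab-separated file, sample columns starting at column 10, sample names on the first line); it is not tested against your actual data:

import numpy as np

samples = []
sums = None

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:
            samples = columns                       # sample names from the header row
            sums = np.zeros(len(samples))           # one accumulator per sample
        else:
            sums += np.array(columns, dtype=float)  # vectorized row-wise addition

result = dict(zip(samples, sums))  # {sample_name: sum, ...}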
