
What Is A More Efficient Way To Load 1 Column With 1 000 000+ Rows Than Pandas Read_csv()?

I'm trying to import large files (.tab/.txt, 300+ columns and 1 000 000+ rows) in Python. The files are tab separated. The columns are filled with integer values. One of my goals is …

Solution 1:

You can try something like this:

samples = []
sums = []

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:  # supposing the sample_name is the first row of each column
            samples = columns  # save sample names
            sums = [0 for s in samples]  # init the sums to 0
        else:
            for n, v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples,sums)) #{sample_name:sum, ...}

I am not sure this will work, since I don't know the content of your input file, but it describes the general procedure: you open the file only once, iterate over each line, split it to get the columns, and accumulate the data you need. Mind that this code does not deal with missing values.
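If missing values can occur, one option is to skip empty cells in the inner loop. This is only a sketch that assumes missing entries appear as empty strings or 'NA'; the actual markers in your file may differ:

for n, v in enumerate(columns):
    if v in ('', 'NA'):  # assumed missing-value markers; adjust to your file
        continue
    sums[n] += float(v)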

The else block can be improved using numpy:

import numpy as np
...
else:
    sums = np.add(sums, list(map(float, columns)))  # materialize map() so numpy gets a sequence (Python 3)
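For completeness, here is a sketch of the whole loop with numpy accumulation, under the same assumptions as above (tab-separated file, sample columns starting at column 10, sample names on the first line); it is not tested against your actual data:

import numpy as np

samples = []
sums = None

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:
            samples = columns                       # sample names from the header row
            sums = np.zeros(len(samples))           # one accumulator per sample
        else:
            sums += np.array(columns, dtype=float)  # vectorized row-wise addition

result = dict(zip(samples, sums))  # {sample_name: sum, ...}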
