Only Read Certain Rows In A Csv File With Python

August 20, 2024

I want to read only a certain number of rows, starting from a certain row in a CSV file, without iterating over the whole file to reach that point. Let's say I have a csv

Solution 1:

An option would be to use Pandas. For example:

import pandas as pd

# Select file
infile = r'path/file'

# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows=50, nrows=10)
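Note that with an integer skiprows, the header line is skipped as well, so the first remaining line is used for the column names. A minimal sketch of one way to keep the original header, assuming the file has a single header line (the in-memory sample data and column names below are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: a header line followed by 100 data rows.
sample = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100))

# Skip file lines 1..50 (the first 50 data rows) but keep line 0, the header,
# then read the next 10 data rows.
data = pd.read_csv(io.StringIO(sample), skiprows=range(1, 51), nrows=10)

print(data["id"].tolist())
```

Passing a range (or any list-like) tells pandas exactly which file lines to drop, whereas an integer always drops lines from the very top.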

Solution 2:

You can use the chunksize parameter to read the file in pieces:

import pandas as pd

chunksize = 10 ** 6  # number of rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

Solution 3:

If the number of columns or the line lengths vary, it isn't possible to find the line you want without "reading" (i.e., processing) every character of the file that comes before it and counting the line terminators. And the fastest way to process them in Python is to use iteration.

As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:

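with open(file_name) as f:
    # Consume and discard the first 50 lines. range() comes first in zip()
    # so that no extra line is pulled from f when the count runs out.
    for _, line in zip(range(50), f):
        pass
    # Collect the next 10 lines
    lines = [line for _, line in zip(range(10), f)]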

...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.
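The skip-then-read loop above can also be written with itertools.islice, which still reads the first 50 rows but discards them without building a list. A minimal sketch, assuming a hypothetical helper name read_rows and using csv.reader to split the fields:

```python
import csv
import io
from itertools import islice


def read_rows(f, start, count):
    """Return `count` parsed rows beginning at 0-based row `start`.

    `f` is any open text file (or file-like object). Rows before `start`
    are still read and parsed, but they are thrown away immediately.
    """
    reader = csv.reader(f)
    return list(islice(reader, start, start + count))


# Hypothetical in-memory data standing in for a real file.
sample = io.StringIO("".join(f"{i},row{i}\n" for i in range(100)))
rows = read_rows(sample, 50, 10)
print(rows[0], rows[-1])
```

With a real file, the call would look like read_rows(open(file_name, newline=""), 50, 10), ideally inside a with block.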

However, if the file gets read a lot, iterating over the lines every time will be slow. If the file contents do not change, you could instead read the whole thing once and build a dict of the cumulative line lengths ahead of time:

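from itertools import accumulate

with open(file_name) as f:
    # cum_lens[n] is the total length of the first n lines,
    # i.e. the offset at which the following line starts
    cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))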

This would allow you to seek to any line number in the file without processing the whole thing ever again:

def seek_line(path, line_num, cum_lens):
    with open(path) as f:
        # Jump past the first line_num lines, then read the next one
        f.seek(cum_lens[line_num], 0)
        return f.readline()

class LineX:
    """A file reading object that can quickly obtain any line number."""

    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path

    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)

linex = LineX(file_name, cum_lens)
line50 = linex[50]  # the line that follows the first 50 lines
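The index above counts characters in text mode, which matches byte offsets only when every character is one byte and line endings are not translated. A sketch of a variant that works on bytes instead, assuming the file does not change after indexing (build_line_offsets and read_lines_at are hypothetical names):

```python
import os
import tempfile


def build_line_offsets(path):
    """Return a list where offsets[i] is the byte offset of 0-based line i."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_lines_at(path, offsets, start, count, encoding="utf-8"):
    """Seek straight to 0-based line `start` and read `count` lines from there."""
    with open(path, "rb") as f:
        f.seek(offsets[start])
        return [f.readline().decode(encoding) for _ in range(count)]


# Hypothetical demo file: 100 short lines containing a non-ASCII character
# (\u00e9), so character counts and byte counts differ.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.writelines(f"{i},\u00e9{i}\n" for i in range(100))
    demo_path = tmp.name

offsets = build_line_offsets(demo_path)
lines = read_lines_at(demo_path, offsets, 50, 10)
os.remove(demo_path)
print(lines[0].strip(), lines[-1].strip())
```

Building the index still requires one full pass over the file; only the later lookups avoid rescanning it.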

But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
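As a rough illustration of the database route, the sketch below loads the rows into SQLite (via the standard-library sqlite3 module) once and then fetches a slice by row number. The table name, column names, and in-memory sample data are all assumptions made for the example:

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO("".join(f"{i},row{i}\n" for i in range(100)))

conn = sqlite3.connect(":memory:")  # use a file path to persist the database
conn.execute(
    "CREATE TABLE csv_rows (line_no INTEGER PRIMARY KEY, id TEXT, label TEXT)"
)

# One-time load: store each row together with its 0-based line number.
conn.executemany(
    "INSERT INTO csv_rows VALUES (?, ?, ?)",
    ((n, *row) for n, row in enumerate(csv.reader(sample))),
)
conn.commit()

# Later lookups use the primary-key index instead of rescanning the file.
rows = conn.execute(
    "SELECT id, label FROM csv_rows "
    "WHERE line_no BETWEEN ? AND ? ORDER BY line_no",
    (50, 59),
).fetchall()
conn.close()
print(rows[0], rows[-1])
```

Whether this is worth it depends, as noted, on how often the file is queried and whether its contents ever change.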

Solution 4:

As others have said, the most obvious solution is to use pandas' read_csv. It has a parameter called skiprows.

From the docs:

skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can do something like this:

import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows=lambda x: x not in range(50, 60))
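Because the callable receives file line numbers starting at 0, the lambda above also skips the header line, so line 50 ends up being used for the column names. A sketch of a variant that keeps the header, assuming the file has exactly one header line (the sample data and the skip_row name are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: a header line followed by 100 data rows.
sample = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100))


def skip_row(line_no):
    """Return True for lines to skip: all except the header (0) and lines 50-59."""
    return line_no != 0 and line_no not in range(50, 60)


data = pd.read_csv(io.StringIO(sample), skiprows=skip_row)
print(data)
```

File line 50 is the 50th data row here, because line 0 is taken by the header.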

Since you say that memory is your problem, you can use the chunksize parameter, as described in this tutorial.

The author says:

The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.

df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

You can try this and iterate over the chunks to retrieve only the rows you are looking for.
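One way to do that, sketched under the assumption that the wanted rows are identified by their 0-based position among the data rows (rows_in_range is a hypothetical helper): with the default index, each chunk keeps the original row positions, so the loop can slice the overlap and stop early once the range has been passed.

```python
import io
import pandas as pd


def rows_in_range(source, start, stop, chunksize=1000):
    """Collect data rows with positions start <= i < stop, chunk by chunk."""
    pieces = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        if chunk.index[0] >= stop:
            break  # past the wanted range; no need to read further
        # Label-based slice on the running row index; .loc includes both ends.
        piece = chunk.loc[start:stop - 1]
        if not piece.empty:
            pieces.append(piece)
    return pd.concat(pieces) if pieces else pd.DataFrame()


# Hypothetical sample: a header line followed by 100 data rows.
sample = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100))
wanted = rows_in_range(io.StringIO(sample), 50, 60, chunksize=25)
print(wanted["id"].tolist())
```

Only one chunk is held in memory at a time, but every chunk before the wanted range is still parsed.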

For skiprows, the callable should return True if the row number is in the list of rows to skip.

Solution 5:

It's that easy:

# Note: readlines() loads the whole file into memory before slicing
with open("file.csv", "r") as file:
    print(file.readlines()[50:60])
