Only Read Certain Rows In A CSV File With Python
Solution 1:
An option would be to use Pandas. For example:
import pandas as pd

# Select file
infile = r'path/file'

# Use skiprows to choose the starting point and nrows to choose the number of rows
data = pd.read_csv(infile, skiprows=50, nrows=10)
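Note that skiprows=50 also skips the header row, so pandas will treat row 50 as the column names. If the file has a header you want to keep, you can pass a range instead; a minimal sketch, reusing the hypothetical infile path from above:

import pandas as pd

infile = r'path/file'
# Skip rows 1-49 but keep row 0 (the header), then read the next 10 rows
data = pd.read_csv(infile, skiprows=range(1, 50), nrows=10)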
Solution 2:
You can use the chunksize parameter:
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # process() is a placeholder for your own handling of each chunk
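To answer the original question with this pattern, you can filter each chunk for the rows you want; a minimal sketch, assuming the default integer index (row numbers continue across chunks) and that you want rows 50-59:

import pandas as pd

wanted = range(50, 60)  # the 0-indexed data rows to keep
pieces = []
for chunk in pd.read_csv(filename, chunksize=1000):
    pieces.append(chunk[chunk.index.isin(wanted)])
    if chunk.index[-1] >= wanted[-1]:
        break  # stop reading once we are past the last wanted row
data = pd.concat(pieces)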
Solution 3:
If the number of columns or the line lengths are variable, it isn't possible to find the line you want without "reading" (i.e., processing) every character of the file that comes before it and counting the line terminators. And the fastest way to process them in Python is to use iteration.
As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:
with open(file_name) as f:
    for line, _ in zip(f, range(50)):
        pass
    lines = [line for line, _ in zip(f, range(10))]
...or to read a character at a time using seek and count the newline characters. But it is certainly MUCH more convenient to do the first.
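For what it's worth, the same skip-then-take pattern can be written more directly with itertools.islice; a minimal sketch, assuming you want lines 50-59:

from itertools import islice

with open(file_name) as f:
    lines = list(islice(f, 50, 60))  # skip 50 lines, keep the next 10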
However, if the file gets read a lot, iterating up to the target lines will be slow over time. If the file contents do not change, you could instead read the whole thing once and build a dict of the cumulative line lengths ahead of time:
from itertools import accumulate

with open(file_name, 'rb') as f:  # binary mode, so len(line) counts bytes, matching what seek() expects
    cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))
cum_lens[0] = 0  # line 0 starts at offset 0
This would allow you to seek to any line number in the file without processing the whole thing ever again:
def seek_line(path, line_num, cum_lens):
    with open(path, 'rb') as f:
        f.seek(cum_lens[line_num], 0)
        return f.readline().decode()  # decode bytes back to str (assumes UTF-8)
class LineX:
    """A file reading object that can quickly obtain any line number."""

    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path

    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)
linex = LineX(file_name, cum_lens)
line50 = linex[50]  # 0-indexed: the 51st line of the file
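If you need the parsed fields rather than the raw text, you can hand the line to the standard csv module; a minimal sketch:

import csv

row50 = next(csv.reader([linex[50]]))  # split the raw line into its CSV fields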
But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do and what kind of data the file contains.
Solution 4:
As others have said, the most obvious solution is to use pandas' read_csv! The method has a parameter called skiprows; from the documentation:
skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
You can have something like this:
import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows=lambda x: x not in range(50, 60))
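Note that this also skips row 0, so the header is lost and the first kept row becomes the column names. If the file has a header you want to keep, a minimal sketch of a variant:

import pandas as pd

# keep row 0 (the header) plus rows 50-59; the callable returns True for rows to skip
data = pd.read_csv('path/to/your/file',
                   skiprows=lambda x: x != 0 and x not in range(50, 60))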
Since you specify that memory is your problem, you can use the chunksize parameter, as described in this tutorial, where the author says:
The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
You can try this and iterate over the chunks to retrieve only the rows you are looking for. Remember that the skiprows callable should return True for the row numbers you want to skip, not for the ones you want to keep.
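Combining the two parameters keeps memory usage low while yielding only the wanted rows; a minimal sketch, reusing the hypothetical path above and keeping the header row:

import pandas as pd

wanted = set(range(50, 60))
reader = pd.read_csv('path/to/your/file',
                     skiprows=lambda x: x != 0 and x not in wanted,  # True means skip
                     chunksize=5)
data = pd.concat(reader)  # each chunk already contains only the wanted rows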
Solution 5:
It's that easy:

with open("file.csv", "r") as file:
    print(file.readlines()[50:60])  # note: readlines() loads the whole file into memory first