Parsing Big Text File Using Regex
I have a huge text file (1 GB) in which each 'line' is separated by ##. For example:

## sentence 1 ## sentence 2 ## sentence 3

I'm trying to print the file according to the ## separator.
Solution 1:
This will read the file in chunks of chunksize bytes, thus avoiding the memory issues that come with reading too much of the file all at once:
import re

def open_delimited(filename, delimiter, *args, **kwargs):
    """
    Lazily yield the delimiter-separated pieces of a file.
    http://stackoverflow.com/a/17508761/190597
    """
    with open(filename, *args, **kwargs) as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            # Prepend the leftover from the previous chunk so a delimiter
            # that straddles a chunk boundary is still found by re.split.
            pieces = re.split(delimiter, remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            # The last piece may be incomplete; keep it for the next chunk.
            remainder = pieces[-1]
        if remainder:
            yield remainder
filename = 'post.txt'
for chunk in open_delimited(filename, '##', 'r'):
    print(chunk)
    print('-' * 80)
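For a quick sanity check, the generator can be exercised on the question's example. This is a minimal sketch; the sample file contents below are my own assumption based on the question:

# Write the question's example text to post.txt (assumed contents).
with open('post.txt', 'w') as f:
    f.write('## sentence 1 ## sentence 2 ## sentence 3')

for piece in open_delimited('post.txt', '##', 'r'):
    print(repr(piece))
# Prints '' (the empty piece before the first ##), then
# ' sentence 1 ', ' sentence 2 ', and ' sentence 3'.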
Solution 2:
You can use islice.
from itertools import islice

buffer = 1000  # you have to define this int; see the note below

file = open('file.txt', 'r')
while True:
    to_process = list(islice(file, buffer))  # read up to `buffer` lines
    if not to_process:
        break
    # process the to_process list here
file.close()
buffer is the number of lines you want to read at a time; the 1000 above is only a placeholder, so define an int that suits your memory budget.
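One caveat worth noting: islice batches by newline-delimited lines, while the question's records are separated by ##, so each batch still has to be split (and if the 1 GB file contains no newlines at all, line-based batching will not help). Below is a minimal sketch of that processing step, reusing the remainder idea from Solution 1; the filename, the buffer size, and the joining of each batch into one string are my own assumptions, not part of the original answer:

from itertools import islice
import re

buffer = 1000  # lines per batch; assumes the file actually contains newlines

with open('file.txt', 'r') as f:
    leftover = ''  # partial record that may continue into the next batch
    while True:
        batch = list(islice(f, buffer))
        if not batch:
            break
        pieces = re.split('##', leftover + ''.join(batch))
        for piece in pieces[:-1]:
            print(piece)
        leftover = pieces[-1]  # may be completed by the next batch
    if leftover:
        print(leftover)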