Fastest Way To Re-read A File In Python?
Solution 1:
You can use the mmap module to load that file into memory, then iterate.
Example:
import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that the new content must have the same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()
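Since the question is about re-reading, note that once the file is mapped you can make repeated passes just by rewinding the map with seek(0), instead of reopening the file each time. A minimal sketch (the file name data.txt and its contents are made up for the demo):

```python
import mmap

# create a small throwaway file for the demo
with open("data.txt", "wb") as f:
    f.write(b"line one\nline two\n")

with open("data.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    counts = []
    for _ in range(3):          # three full passes over the same content
        mm.seek(0)              # rewind instead of reopening the file
        counts.append(sum(1 for _ in iter(mm.readline, b"")))
    mm.close()

print(counts)  # every pass sees both lines again
```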
Solution 2:
Maybe switch your loops around? Make iterating over the file the outer loop, and iterating over the name list the inner loop.
name_and_positions = [
    ("name_a", 10, 45),
    ("name_b", 2, 500),
    ("name_c", 96, 243),
]

with open("somefile.txt") as f:
    for line in f:
        value = int(line.split()[2])
        for name, start, end in name_and_positions:
            if start <= value <= end:
                print("matched {} with {}".format(name, value))
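If the file fits comfortably in memory, an alternative to swapping the loops is to read its lines into a list once and then make as many in-memory passes as you like. A minimal self-contained sketch (the file name, contents, and name list are made up for illustration):

```python
name_and_positions = [("name_a", 10, 45), ("name_b", 2, 500)]

# create a small sample file for the demo
with open("somefile.txt", "w") as f:
    f.write("x y 20\nx y 999\n")

with open("somefile.txt") as f:
    lines = f.readlines()          # single disk read

matches = []
for name, start, end in name_and_positions:
    for line in lines:             # repeated passes, all in memory
        value = int(line.split()[2])
        if start <= value <= end:
            matches.append((name, value))

print(matches)
```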
Solution 3:
It seems to me that your problem is not so much re-reading files, but matching slices of a long list with a short list. As other answers have pointed out, you can use plain lists or memory-mapped files to speed up your program.
If you want a specific data structure for a further speed-up, I would advise you to look into blist, specifically because it offers better slicing performance than the standard Python list: they claim O(log n) instead of O(n).
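Slicing matters here because a plain list slice materializes a full copy of the (end - start) element references before the `in` test even starts, which is the cost blist's O(log n) slicing claim is about. A small stdlib-only illustration of the copying:

```python
# slicing a plain list builds a brand-new list object
big = list(range(1_000_000))

s = big[100:900_000]       # copies ~900k references up front
print(len(s), s is big)    # the slice is a separate object
```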
I have measured a speedup of almost 4x on lists of ~10MB:
import random

from blist import blist

LINE_NUMBER = 1000000


def write_files(line_length=LINE_NUMBER):
    with open('haystack.txt', 'w') as infile:
        for _ in range(line_length):
            infile.write('an example\n')

    with open('needles.txt', 'w') as infile:
        for _ in range(line_length // 100):
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            infile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))


def read_files():
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())
    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list


def match_over(list_structure):
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1
    return float(matches) / float(total)
As measured by IPython's %time magic, the blist takes 12 s where the plain list takes 46 s:
In [1]: import main
In [3]: main.write_files()
In [4]: !ls -lh *.txt
10M haystack.txt    233K needles.txt
In [5]: normal_list, enhanced_list = main.read_files()
In [8]: %time main.match_over(normal_list)
CPU times: user 44.9 s, sys: 1.47 s, total: 46.4 s
Wall time: 46.4 s
Out[8]: 0.005032
In [9]: %time main.match_over(enhanced_list)
CPU times: user 12.6 s, sys: 33.7 ms, total: 12.6 s
Wall time: 12.6 s
Out[9]: 0.005032