
Recording Data In A Long Running Python Simulation

I am running a simulation from which I need to record some small numpy arrays every cycle. My current solution is to load, write then save as follows: existing_data = np.load('exis
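
The question text is cut off, but the pattern it describes (load everything already written, append, save it all back each cycle) is roughly the sketch below; the file name and the new_data array are placeholders, not the original code:

import numpy as np

# Naive per-cycle logging: reload the whole history, append one entry, rewrite it all
existing_data = np.load('log.npy')  # hypothetical file name
existing_data = np.append(existing_data, new_data[np.newaxis, ...], axis=0)
np.save('log.npy', existing_data)

This reads and rewrites the entire file every cycle, which is what both solutions below avoid.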

Solution 1:

I have found a good working solution using the h5py library. Performance is far better, since no data has to be read back, and I have cut down on the number of numpy array append operations. A short example:

import h5py

with h5py.File("logfile_name", "a") as f:
    # maxshape=(3, 2, None) lets the last axis grow beyond the initial 100000
    ds = f.create_dataset("weights", shape=(3, 2, 100000), maxshape=(3, 2, None))
    ds[:, :, cycle_num] = weight_matrix
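
Because maxshape leaves the last axis unbounded, the dataset can also be grown later if the simulation runs past the preallocated 100000 cycles. A minimal sketch (the growth check and the chunk of 100000 are my own assumption, not part of the answer above):

# Hypothetical: enlarge the last axis once cycle_num reaches the current size
if cycle_num >= ds.shape[2]:
    ds.resize(ds.shape[2] + 100000, axis=2)
ds[:, :, cycle_num] = weight_matrix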

I am not sure whether the numpy-style slicing means the matrix gets copied, but there is a write_direct(source, source_sel=None, dest_sel=None) method to avoid this happening, which could be useful for larger matrices.
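
For reference, a rough sketch of how write_direct could replace the slice assignment above, using numpy.s_ to build the destination selection (weight_matrix and cycle_num are the placeholders from the question, not defined here):

import numpy as np

# Write weight_matrix straight into the (:, :, cycle_num) slice, avoiding an
# intermediate copy on the h5py side
ds.write_direct(weight_matrix, dest_sel=np.s_[:, :, cycle_num])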

Solution 2:

I think one solution is to use a memory-mapped file via numpy.memmap. The code can be found below. The documentation contains important information for understanding the code.

import numpy as np
from os.path import getsize
from time import time

filename = "data.bin"# Datatype used for memmap
dtype = np.int32

# Create memmap for the first time (w+). Arbitrary shape. Probably good to try and guess the correct size.
mm = np.memmap(filename, dtype=dtype, mode='w+', shape=(1, ))
print("File has {} bytes".format(getsize(filename)))


N = 20
num_data_per_loop = 10**7

# Main loop to append data
for i in range(N):
    # will extend the file because mode='r+'
    starttime = time()
    mm = np.memmap(filename,
                   dtype=dtype,
                   mode='r+',
                   offset=np.dtype(dtype).itemsize*num_data_per_loop*i,
                   shape=(num_data_per_loop, ))
    mm[:] = np.arange(start=num_data_per_loop*i, stop=num_data_per_loop*(i+1))
    mm.flush()
    endtime = time()
    print("{:3d}/{:3d} ({:6.4f} sec): File has {} bytes".format(i, N, endtime-starttime, getsize(filename)))

A = np.array(np.memmap(filename, dtype=dtype, mode='r'))
if np.array_equal(A, np.arange(num_data_per_loop*N, dtype=dtype)):
    print("Correct")

The output I get is:

File has 4 bytes
  0/ 20 (0.2167 sec): File has 40000000 bytes
  1/ 20 (0.2200 sec): File has 80000000 bytes
  2/ 20 (0.2131 sec): File has 120000000 bytes
  3/ 20 (0.2180 sec): File has 160000000 bytes
  4/ 20 (0.2215 sec): File has 200000000 bytes
  5/ 20 (0.2141 sec): File has 240000000 bytes
  6/ 20 (0.2187 sec): File has 280000000 bytes
  7/ 20 (0.2138 sec): File has 320000000 bytes
  8/ 20 (0.2137 sec): File has 360000000 bytes
  9/ 20 (0.2227 sec): File has 400000000 bytes
 10/ 20 (0.2168 sec): File has 440000000 bytes
 11/ 20 (0.2141 sec): File has 480000000 bytes
 12/ 20 (0.2150 sec): File has 520000000 bytes
 13/ 20 (0.2144 sec): File has 560000000 bytes
 14/ 20 (0.2190 sec): File has 600000000 bytes
 15/ 20 (0.2186 sec): File has 640000000 bytes
 16/ 20 (0.2210 sec): File has 680000000 bytes
 17/ 20 (0.2146 sec): File has 720000000 bytes
 18/ 20 (0.2178 sec): File has 760000000 bytes
 19/ 20 (0.2182 sec): File has 800000000 bytes
Correct

The time per iteration is approximately constant because the offset argument means each memmap only maps the newly written chunk, not the whole file. The amount of RAM needed is also constant (apart from loading the whole memmap for the check at the end).
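
Applied to the original problem of logging one small array per cycle, the same offset trick could look roughly like the sketch below; the helper name and the single-dtype assumption are mine, not part of the answer above:

import numpy as np
from os.path import getsize, exists

def append_to_memmap(filename, arr):
    # Hypothetical helper: arrays of one fixed dtype are written back to back,
    # so the current file size is exactly the byte offset of the next write.
    offset = getsize(filename) if exists(filename) else 0
    mm = np.memmap(filename, dtype=arr.dtype, mode='r+' if offset else 'w+',
                   offset=offset, shape=arr.shape)
    mm[:] = arr
    mm.flush()

# e.g. once per simulation cycle:
# append_to_memmap("weights.bin", weight_matrix)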

I hope this solves your performance issues.

kind regards

Lukas

Edit 1: It seems the poster has solved his own question. I leave this answer up as an alternative.
