
Creating Reference To Hdf Dataset In H5py Using Astype

From the h5py docs, I see that I can cast an HDF dataset to another type using the dataset's astype method. This returns a context manager which performs the conversion on the fly.

Solution 1:

d.astype() returns an AstypeContext object. If you look at the source for AstypeContext you'll get a better idea of what's going on:

class AstypeContext(object):

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)

    def __enter__(self):
        self._dset._local.astype = self._dtype

    def __exit__(self, *args):
        self._dset._local.astype = None

When you enter the AstypeContext, the ._local.astype attribute of your dataset is set to the desired type, and when you exit the context it is reset to None (not to whatever value it held before), so reads return the stored type again.
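To make the mechanism concrete, here is a toy model of the same pattern that runs without h5py. ToyDataset and ToyAstypeContext are hypothetical stand-ins invented for illustration, and a plain callable (float) takes the place of a numpy dtype; this is only a sketch of the set-on-enter, reset-on-exit idea, not h5py's actual read path.

```python
from types import SimpleNamespace


class ToyDataset:
    """Hypothetical stand-in for a dataset: holds a list of Python ints."""

    def __init__(self, values):
        self._values = list(values)
        # Mirrors the ._local.astype slot; None means "no conversion".
        self._local = SimpleNamespace(astype=None)

    def __getitem__(self, key):
        out = self._values[key]
        convert = self._local.astype
        if convert is None:
            return out
        # Apply the conversion to a slice result or to a single element.
        if isinstance(out, list):
            return [convert(v) for v in out]
        return convert(out)


class ToyAstypeContext:
    """Same shape as AstypeContext, but stores a callable, not a dtype."""

    def __init__(self, dset, convert):
        self._dset = dset
        self._convert = convert

    def __enter__(self):
        self._dset._local.astype = self._convert

    def __exit__(self, *args):
        self._dset._local.astype = None


d = ToyDataset([81, 65, 33])
before = d[:]                      # plain ints
with ToyAstypeContext(d, float):
    inside = d[:]                  # converted while the context is active
after = d[:]                       # plain ints again
```

The conversion lives on the dataset object, not on the context, which is why setting the attribute by hand (as in the function below) has the same effect as staying inside the with block.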

You can therefore get more or less the behaviour you're looking for like this:

# HDF is assumed to be h5py imported under this alias
import h5py as HDF
import numpy as np

def get_dataset_as_type(d, dtype='float32'):

    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)

    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)

    return d_new

When you now read from d_new, you will get float32 numpy arrays back rather than uint16:

d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')

print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81.,  65.,  33.,  22.,  67.,  57.,  94.,  63.,  89.,  68.], dtype=float32)

print(d.dtype, d_new.dtype)
# uint16, uint16

Note that this doesn't update the .dtype attribute of d_new (which seems to be immutable). If you also wanted to change the dtype attribute, you'd probably need to subclass h5py.Dataset in order to do so.
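Since the approach above relies on a private attribute, it may not survive h5py upgrades. Here is a minimal sketch of an alternative, assuming h5py 3.0 or later, where astype() returns a wrapper that can be indexed directly instead of only being used in a with block. The in-memory file (driver='core', backing_store=False) and the sample values are just for illustration.

```python
import h5py
import numpy as np

# In-memory HDF5 file so the sketch leaves nothing on disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as hf:
    d = hf.create_dataset("data", data=np.array([81, 65, 33], dtype="uint16"))

    # Assumption: h5py >= 3.0, where the object returned by astype()
    # supports indexing and converts on read.
    d_float = d.astype("float32")

    raw = d[:]              # read with the stored type
    converted = d_float[:]  # read through the converting wrapper
    stored_dtype = d.dtype  # the dataset itself is unchanged

print(raw.dtype, converted.dtype, stored_dtype)
```

The wrapper can be passed around in place of the dataset wherever only slicing is needed, but it is only usable while the file is open.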

Solution 2:

The astype docs seem to imply that its purpose is to read the data into a new location. Returning d[:] is therefore the most reasonable approach if you want to reuse the float-cast data with many functions on separate occasions.

If you know what you need the casting for and only need it once, you could switch things around and do something like:

def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]
            return tuple(f(d2) for f in funcs)

In any case, you want to make sure that hf is closed before leaving the function or else you will run into problems later on.

In general, I would suggest separating the casting and the loading/creating of the data-set entirely and passing the dataset as one of the function's parameters.

The function above can be called as follows:

In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
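The suggestion to separate the casting from the loading could look roughly like this. It is a sketch, assuming h5py 3.0 or later (where astype() returns an indexable wrapper). apply_as_float is a hypothetical name, and the sample data and in-memory file are illustrative only.

```python
import h5py
import numpy as np


def apply_as_float(dset, *funcs, dtype="float32"):
    """Read dset once as `dtype` and apply each function to the result."""
    # Assumption: h5py >= 3.0, where astype() returns an indexable wrapper.
    data = dset.astype(dtype)[...]
    return tuple(f(data) for f in funcs)


intdata = np.arange(10, dtype="uint16")  # illustrative sample data

# The caller opens and closes the file; the helper only sees the dataset.
with h5py.File("data.h5", "w", driver="core", backing_store=False) as hf:
    d = hf.create_dataset("data", data=intdata)
    results = apply_as_float(d, np.min, np.max, np.mean)

print(results)
```

Because the helper never opens a file, the with block in the caller guarantees that hf is closed, and the same function works on any dataset regardless of where it came from.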
