Split .tfrecords File Into Many .tfrecords Files

May 30, 2024

Is there any way to split a .tfrecords file into many .tfrecords files directly, without writing back each Dataset example?

Solution 1:

In TensorFlow 2.0.0, this will work:

import tensorflow as tf

raw_dataset = tf.data.TFRecordDataset("input_file.tfrecord")

shards = 10
for i in range(shards):
    writer = tf.data.experimental.TFRecordWriter(f"output_file-part-{i}.tfrecord")
    writer.write(raw_dataset.shard(shards, i))
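To see which records end up in which file, note that `shard(num_shards, index)` keeps every `num_shards`-th record starting at position `index`, so the output files are interleaved rather than contiguous. A minimal pure-Python sketch of that selection rule (the helper name `shard_indices` is made up for illustration and is not part of TensorFlow):

```python
def shard_indices(num_records, num_shards, index):
    """Positions of the records that shard(num_shards, index) would keep."""
    return list(range(index, num_records, num_shards))


# With 7 records and 3 shards, the records are dealt out round-robin.
for i in range(3):
    print(i, shard_indices(7, 3, i))
# 0 [0, 3, 6]
# 1 [1, 4]
# 2 [2, 5]
```

If the output files need to hold contiguous runs of records, a batching approach is probably the better fit.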

Solution 2:

You can use a function like this (it relies on the TensorFlow 1.x session API):

import tensorflow as tf

def split_tfrecord(tfrecord_path, split_size):
    with tf.Graph().as_default(), tf.Session() as sess:
        # Each batch holds up to split_size serialized records.
        ds = tf.data.TFRecordDataset(tfrecord_path).batch(split_size)
        batch = ds.make_one_shot_iterator().get_next()
        part_num = 0
        while True:
            try:
                records = sess.run(batch)
                part_path = tfrecord_path + '.{:03d}'.format(part_num)
                with tf.python_io.TFRecordWriter(part_path) as writer:
                    for record in records:
                        writer.write(record)
                part_num += 1
            except tf.errors.OutOfRangeError:
                # Raised once the input file is exhausted.
                break

For example, to split the file my_records.tfrecord into parts of 100 records each, you would do:

split_tfrecord('my_records.tfrecord', 100)

This would create multiple smaller record files my_records.tfrecord.000, my_records.tfrecord.001, etc.
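As a rough check on what to expect, the number of parts and the size of the last one follow from simple arithmetic, since `.batch()` keeps the final, smaller batch by default. The sketch below is plain Python; the helper `part_layout` and the total of 250 records are assumptions made up for illustration:

```python
import math


def part_layout(tfrecord_path, total_records, split_size):
    """List (part_path, record_count) pairs for a split into fixed-size parts."""
    num_parts = math.ceil(total_records / split_size)
    layout = []
    for part_num in range(num_parts):
        # The last part only holds whatever records are left over.
        count = min(split_size, total_records - part_num * split_size)
        layout.append((tfrecord_path + '.{:03d}'.format(part_num), count))
    return layout


# Hypothetical input: 250 records split into parts of 100.
for path, count in part_layout('my_records.tfrecord', 250, 100):
    print(path, count)
# my_records.tfrecord.000 100
# my_records.tfrecord.001 100
# my_records.tfrecord.002 50
```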

Solution 3:

Using `.batch()` instead of `.shard()` to avoid iterating over dataset multiple times

A more performant approach (compared to using tf.data.Dataset.shard()) would be to use batching:

import tensorflow as tf

ITEMS_PER_FILE = 100  # Assuming we are saving 100 items per .tfrecord file

raw_dataset = tf.data.TFRecordDataset('in.tfrecord')

batch_idx = 0
for batch in raw_dataset.batch(ITEMS_PER_FILE):

    # Converting `batch` back into a `Dataset`. For a raw TFRecordDataset,
    # `batch` is a single 1-D tensor of serialized records.
    batch_ds = tf.data.Dataset.from_tensor_slices(batch)
    filename = f'out.tfrecord.{batch_idx:03d}'

    writer = tf.data.experimental.TFRecordWriter(filename)
    writer.write(batch_ds)

    batch_idx += 1
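The performance argument can be illustrated without TensorFlow. In the sketch below, a plain Python iterable stands in for the record file and counts how often a record is read; `CountingSource`, `split_by_shard` and `split_by_batch` are hypothetical names used only for this illustration:

```python
from itertools import islice


class CountingSource:
    """Iterable stand-in for a record file that counts every record read."""

    def __init__(self, records):
        self.records = list(records)
        self.reads = 0

    def __iter__(self):
        for record in self.records:
            self.reads += 1
            yield record


def split_by_shard(source, num_shards):
    """One full pass over the source per output shard."""
    return [
        [rec for pos, rec in enumerate(source) if pos % num_shards == k]
        for k in range(num_shards)
    ]


def split_by_batch(source, items_per_file):
    """A single pass, cutting the stream into contiguous chunks."""
    it = iter(source)
    chunks = []
    while True:
        chunk = list(islice(it, items_per_file))
        if not chunk:
            return chunks
        chunks.append(chunk)


sharded_source = CountingSource(range(10))
split_by_shard(sharded_source, 5)
print(sharded_source.reads)   # 50: ten records read five times

batched_source = CountingSource(range(10))
split_by_batch(batched_source, 2)
print(batched_source.reads)   # 10: every record read once
```

Both calls produce five outputs of two records each, but the shard-style split reads the input once per output file.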

Solution 4:

Very efficient way for TensorFlow 2.x

As mentioned by @yongjieyongjie, you should use .batch() instead of .shard() to avoid iterating over the dataset more often than needed. But if you have a very large dataset, too big for memory, that approach fails silently (no error is raised): you get only a few files containing a fraction of your original dataset.

First, batch your dataset, using the number of records you want per file as the batch size (I assume your dataset is already in serialized format, otherwise see here).

dataset = dataset.batch(ITEMS_PER_FILE)

Next, use a generator to avoid running out of memory.

def write_generator():
    # `dataset`, `save_to` and `name` are assumed to be defined beforehand.
    i = 0
    iterator = iter(dataset)
    optional = iterator.get_next_as_optional()
    while optional.has_value().numpy():
        ds = optional.get_value()
        optional = iterator.get_next_as_optional()
        batch_ds = tf.data.Dataset.from_tensor_slices(ds)
        writer = tf.data.experimental.TFRecordWriter(save_to + "\\" + name + "-" + str(i) + ".tfrecord", compression_type='GZIP')
        # Yield the index used in the file name, then advance it.
        yield batch_ds, writer, i
        i += 1

Now simply use the generator in a normal for-loop:

import time

# NAME_OF_FILES is assumed to hold the same prefix as `name` above.
for data, wri, i in write_generator():
    start_time = time.time()
    wri.write(data)
    print("Time needed: ", time.time() - start_time, "s", "\t", NAME_OF_FILES + "-" + str(i) + ".tfrecord")

As long as the raw records of a single output file fit in memory, this should work fine.

Solution 5:

Uneven splits

Most of the other answers work if you want to split evenly into files of equal size. This will work with uneven splits:

from typing import Iterator, List

import tensorflow as tf

# `splits` is a list of the number of records you want in each output file
def split_files(filename: str, splits: List[int]) -> None:
    dataset: tf.data.Dataset = tf.data.TFRecordDataset(filename)
    rec_counter: int = 0

    # An extra iteration over the data to get the size
    total_records: int = len([r for r in dataset])
    print(f"Found {total_records} records in source file.")

    if sum(splits) != total_records:
        raise ValueError(f"Sum of splits {sum(splits)} does not equal "
                         f"total number of records {total_records}")

    rec_iter: Iterator = iter(dataset)
    split: int
    for split_idx, split in enumerate(splits):
        outfile: str = filename + f".{split_idx}-{split}"
        with tf.io.TFRecordWriter(outfile) as writer:
            for out_idx in range(split):
                rec: tf.Tensor = next(rec_iter, None)
                rec_counter += 1
                writer.write(rec.numpy())
        print(f"Finished writing {split} records to file {split_idx}")

Though I suppose technically the OP asked for a way to do this without writing back each Dataset example (which is what this does), this at least does it without deserializing each example.

It is a bit slow for very large files. There is probably a way to modify some of the other batching-based answers in order to use batched input reading but still write uneven splits, but I haven't tried.
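For uncompressed files, one possible way around the speed issue is to skip TensorFlow and copy the record frames byte for byte. The TFRecord format stores each record as an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The sketch below relies only on that layout; the function name `split_tfrecord_raw` is made up for illustration, it does not verify the CRCs, and it will not work on GZIP- or ZLIB-compressed files:

```python
import struct


def split_tfrecord_raw(in_path, splits):
    """Copy records from in_path into one output file per entry in splits.

    Frames are copied unchanged, so the existing CRCs are reused rather
    than checked or recomputed. Assumes an uncompressed TFRecord file.
    """
    out_paths = []
    with open(in_path, 'rb') as src:
        for split_idx, count in enumerate(splits):
            out_path = f'{in_path}.{split_idx}-{count}'
            with open(out_path, 'wb') as dst:
                for _ in range(count):
                    # 8-byte record length followed by its 4-byte CRC.
                    header = src.read(12)
                    if len(header) < 12:
                        raise ValueError('input ended before all splits were filled')
                    (length,) = struct.unpack('<Q', header[:8])
                    # Payload followed by its 4-byte CRC.
                    body = src.read(length + 4)
                    if len(body) < length + 4:
                        raise ValueError('truncated record')
                    dst.write(header + body)
            out_paths.append(out_path)
    return out_paths
```

A call such as `split_tfrecord_raw('in.tfrecord', [100, 250, 50])` would write three files in a single pass. Records beyond `sum(splits)` are left unread, so a length check like the one above may still be worth adding.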
