%global _empty_manifest_terminate_build 0
Name: python-tfrecord
Version: 1.14.3
Release: 1
Summary: TFRecord reader
License: MIT
URL: https://github.com/vahidk/tfrecord
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/4e/16/f61a9612e938c8e95571d878d3b1bcf48eaa46ded35b62137f5f98b79290/tfrecord-1.14.3.tar.gz
BuildArch: noarch

Requires: python3-numpy
Requires: python3-protobuf
Requires: python3-crc32c

%description
# TFRecord reader and writer

This library allows reading and writing tfrecord files efficiently in Python. The library also provides an IterableDataset reader of tfrecord files for PyTorch. Currently uncompressed and gzip-compressed TFRecords are supported.

## Installation

```pip3 install tfrecord```

## Usage

It's recommended to create an index file for each TFRecord file. An index file must be provided when using multiple workers, otherwise the loader may return duplicate records. You can create an index file for an individual tfrecord file with this utility program:

```
python3 -m tfrecord.tools.tfrecord2idx <tfrecord path> <index path>
```

To create "*.tfindex" files for all "*.tfrecord" files in a directory, run:

```
tfrecord2idx <data dir>
```
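Index files can also be created programmatically. A minimal sketch, assuming the `create_index` helper that backs the command-line tool above (the directory path is a placeholder):

```python
import glob
import os

from tfrecord.tools import tfrecord2idx

# Index every *.tfrecord file in a directory; "/tmp" is a placeholder.
for tfrecord_file in glob.glob("/tmp/*.tfrecord"):
    index_file = os.path.splitext(tfrecord_file)[0] + ".tfindex"
    tfrecord2idx.create_index(tfrecord_file, index_file)
```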
## Reading & Writing tf.train.Example

### Reading tf.Example records in PyTorch

Use `TFRecordDataset` to read TFRecord files in PyTorch.

```python
import torch
from tfrecord.torch.dataset import TFRecordDataset

tfrecord_path = "/tmp/data.tfrecord"
index_path = None
description = {"image": "byte", "label": "float"}
dataset = TFRecordDataset(tfrecord_path, index_path, description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)
```

Use `MultiTFRecordDataset` to read multiple TFRecord files. This class samples from the given tfrecord files with the given probabilities.

```python
import torch
from tfrecord.torch.dataset import MultiTFRecordDataset

tfrecord_pattern = "/tmp/{}.tfrecord"
index_pattern = "/tmp/{}.index"
splits = {
    "dataset1": 0.8,
    "dataset2": 0.2,
}
description = {"image": "byte", "label": "int"}
dataset = MultiTFRecordDataset(tfrecord_pattern, index_pattern, splits, description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)
```

### Infinite and finite PyTorch dataset

By default, `MultiTFRecordDataset` is infinite, meaning that it samples the data forever. You can make it finite by providing the appropriate flag:

```
dataset = MultiTFRecordDataset(..., infinite=False)
```

### Shuffling the data

Both `TFRecordDataset` and `MultiTFRecordDataset` automatically shuffle the data when you provide a queue size:

```
dataset = TFRecordDataset(..., shuffle_queue_size=1024)
```
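Putting the pieces together: with an index file in place (see Usage above), the data can be shuffled and split across multiple DataLoader workers. A sketch, assuming the index file was produced as shown earlier (paths are placeholders):

```python
import torch
from tfrecord.torch.dataset import TFRecordDataset

# Assumes /tmp/data.tfindex was created with tfrecord2idx.
dataset = TFRecordDataset("/tmp/data.tfrecord",
                          index_path="/tmp/data.tfindex",
                          description={"image": "byte", "label": "float"},
                          shuffle_queue_size=256)
# With an index present, each worker reads a distinct shard of the file
# instead of returning duplicate records.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)

data = next(iter(loader))
print(data)
```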
### Transforming input data

You can optionally pass a function as the `transform` argument to perform post-processing of features before returning them. This can, for example, be used to decode images, normalize colors to a certain range, or pad variable-length sequences.

```python
import tfrecord
import cv2

def decode_image(features):
    # get BGR image from bytes
    features["image"] = cv2.imdecode(features["image"], -1)
    return features

description = {
    "image": "byte",
}

dataset = tfrecord.torch.TFRecordDataset("/tmp/data.tfrecord",
                                         index_path=None,
                                         description=description,
                                         transform=decode_image)

data = next(iter(dataset))
print(data)
```

### Writing tf.Example records in Python

```python
import tfrecord

writer = tfrecord.TFRecordWriter("/tmp/data.tfrecord")
writer.write({
    "image": (image_bytes, "byte"),
    "label": (label, "float"),
    "index": (index, "int")
})
writer.close()
```

### Reading tf.Example records in Python

```python
import tfrecord

loader = tfrecord.tfrecord_loader("/tmp/data.tfrecord", None, {
    "image": "byte",
    "label": "float",
    "index": "int"
})
for record in loader:
    print(record["label"])
```
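The writing snippet above assumes `image_bytes`, `label`, and `index` already exist. A self-contained round trip, with made-up placeholder values standing in for real data:

```python
import numpy as np
import tfrecord

# Write one record with placeholder values...
writer = tfrecord.TFRecordWriter("/tmp/toy.tfrecord")
image_bytes = np.zeros(8, dtype=np.uint8).tobytes()  # stand-in for an encoded image
writer.write({"image": (image_bytes, "byte"),
              "label": (0.5, "float"),
              "index": (7, "int")})
writer.close()

# ...then read it back with the matching description.
loader = tfrecord.tfrecord_loader("/tmp/toy.tfrecord", None,
                                  {"image": "byte", "label": "float", "index": "int"})
for record in loader:
    print(record["index"], record["label"])
```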
## Reading & Writing tf.train.SequenceExample

SequenceExamples can be read and written using the same methods shown above with an extra argument (`sequence_description` for reading and `sequence_datum` for writing), which causes the respective read/write functions to treat the data as a SequenceExample.

### Writing SequenceExamples to file

```python
import tfrecord

writer = tfrecord.TFRecordWriter("/tmp/data.tfrecord")
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1, 1], 'int')})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1], 'int')})
writer.close()
```

### Reading SequenceExamples in Python

Reading from a SequenceExample yields a tuple containing two elements.

```python
import tfrecord

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
loader = tfrecord.tfrecord_loader("/tmp/data.tfrecord", None,
                                  context_description,
                                  sequence_description=sequence_description)

for context, sequence_feats in loader:
    print(context["label"])
    print(sequence_feats["seq_labels"])
```

### Reading SequenceExamples in PyTorch

As described in the section on `Transforming Input`, one can pass a function as the `transform` argument to perform post-processing of features. This is especially useful for sequence features, since these are variable-length sequences that need to be padded out before being batched.

```python
import torch
import numpy as np
from tfrecord.torch.dataset import TFRecordDataset

PAD_WIDTH = 5
def pad_sequence_feats(data):
    context, features = data
    for k, v in features.items():
        features[k] = np.pad(v, ((0, PAD_WIDTH - len(v)), (0, 0)), 'constant')
    return (context, features)

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
dataset = TFRecordDataset("/tmp/data.tfrecord",
                          index_path=None,
                          description=context_description,
                          transform=pad_sequence_feats,
                          sequence_description=sequence_description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
data = next(iter(loader))
print(data)
```

Alternatively, you could choose to implement a custom `collate_fn` in order to assemble the batch, for example, to perform dynamic padding. (The `transform` from the previous example is no longer needed, since padding now happens in `collate_fn`.)

```python
import torch
from tfrecord.torch.dataset import TFRecordDataset

def collate_fn(batch):
    from torch.utils.data._utils import collate
    from torch.nn.utils import rnn
    context, feats = zip(*batch)
    feats_ = {k: [torch.Tensor(d[k]) for d in feats] for k in feats[0]}
    return (collate.default_collate(context),
            {k: rnn.pad_sequence(f, True) for (k, f) in feats_.items()})

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
dataset = TFRecordDataset("/tmp/data.tfrecord",
                          index_path=None,
                          description=context_description,
                          sequence_description=sequence_description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
data = next(iter(loader))
print(data)
```
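As noted in the introduction, gzip-compressed TFRecords can also be read. A minimal sketch, assuming the `compression_type` keyword accepted by the readers in this release and a hypothetical pre-compressed file:

```python
from tfrecord.torch.dataset import TFRecordDataset

# /tmp/data_gz.tfrecord is a hypothetical gzip-compressed TFRecord file.
dataset = TFRecordDataset("/tmp/data_gz.tfrecord",
                          index_path=None,
                          description={"image": "byte", "label": "float"},
                          compression_type="gzip")
data = next(iter(dataset))
print(data)
```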
%package -n python3-tfrecord
Summary: TFRecord reader
Provides: python-tfrecord
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-tfrecord
# TFRecord reader and writer

This library allows reading and writing tfrecord files efficiently in Python. The library also provides an IterableDataset reader of tfrecord files for PyTorch. Currently uncompressed and gzip-compressed TFRecords are supported.

## Installation

```pip3 install tfrecord```

## Usage

It's recommended to create an index file for each TFRecord file. An index file must be provided when using multiple workers, otherwise the loader may return duplicate records. You can create an index file for an individual tfrecord file with this utility program:

```
python3 -m tfrecord.tools.tfrecord2idx <tfrecord path> <index path>
```

To create "*.tfindex" files for all "*.tfrecord" files in a directory, run:

```
tfrecord2idx <data dir>
```

## Reading & Writing tf.train.Example

### Reading tf.Example records in PyTorch

Use `TFRecordDataset` to read TFRecord files in PyTorch.

```python
import torch
from tfrecord.torch.dataset import TFRecordDataset

tfrecord_path = "/tmp/data.tfrecord"
index_path = None
description = {"image": "byte", "label": "float"}
dataset = TFRecordDataset(tfrecord_path, index_path, description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)
```

Use `MultiTFRecordDataset` to read multiple TFRecord files. This class samples from the given tfrecord files with the given probabilities.

```python
import torch
from tfrecord.torch.dataset import MultiTFRecordDataset

tfrecord_pattern = "/tmp/{}.tfrecord"
index_pattern = "/tmp/{}.index"
splits = {
    "dataset1": 0.8,
    "dataset2": 0.2,
}
description = {"image": "byte", "label": "int"}
dataset = MultiTFRecordDataset(tfrecord_pattern, index_pattern, splits, description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)
```

### Infinite and finite PyTorch dataset

By default, `MultiTFRecordDataset` is infinite, meaning that it samples the data forever. You can make it finite by providing the appropriate flag:

```
dataset = MultiTFRecordDataset(..., infinite=False)
```
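For epoch-based training, a finite dataset can simply be re-iterated by the DataLoader. A sketch under a hypothetical two-file layout (`/tmp/train-0.tfrecord` and `/tmp/train-1.tfrecord`, with matching `.index` files):

```python
import torch
from tfrecord.torch.dataset import MultiTFRecordDataset

dataset = MultiTFRecordDataset("/tmp/{}.tfrecord", "/tmp/{}.index",
                               {"train-0": 0.5, "train-1": 0.5},
                               {"image": "byte", "label": "int"},
                               infinite=False)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
for epoch in range(3):
    # Each epoch creates a fresh iterator; it ends once all files are exhausted.
    for batch in loader:
        pass
```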
### Shuffling the data

Both `TFRecordDataset` and `MultiTFRecordDataset` automatically shuffle the data when you provide a queue size:

```
dataset = TFRecordDataset(..., shuffle_queue_size=1024)
```

### Transforming input data

You can optionally pass a function as the `transform` argument to perform post-processing of features before returning them. This can, for example, be used to decode images, normalize colors to a certain range, or pad variable-length sequences.

```python
import tfrecord
import cv2

def decode_image(features):
    # get BGR image from bytes
    features["image"] = cv2.imdecode(features["image"], -1)
    return features

description = {
    "image": "byte",
}

dataset = tfrecord.torch.TFRecordDataset("/tmp/data.tfrecord",
                                         index_path=None,
                                         description=description,
                                         transform=decode_image)

data = next(iter(dataset))
print(data)
```

### Writing tf.Example records in Python

```python
import tfrecord

writer = tfrecord.TFRecordWriter("/tmp/data.tfrecord")
writer.write({
    "image": (image_bytes, "byte"),
    "label": (label, "float"),
    "index": (index, "int")
})
writer.close()
```

### Reading tf.Example records in Python

```python
import tfrecord

loader = tfrecord.tfrecord_loader("/tmp/data.tfrecord", None, {
    "image": "byte",
    "label": "float",
    "index": "int"
})
for record in loader:
    print(record["label"])
```

## Reading & Writing tf.train.SequenceExample

SequenceExamples can be read and written using the same methods shown above with an extra argument (`sequence_description` for reading and `sequence_datum` for writing), which causes the respective read/write functions to treat the data as a SequenceExample.

### Writing SequenceExamples to file

```python
import tfrecord

writer = tfrecord.TFRecordWriter("/tmp/data.tfrecord")
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1, 1], 'int')})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1], 'int')})
writer.close()
```

### Reading SequenceExamples in Python

Reading from a SequenceExample yields a tuple containing two elements.

```python
import tfrecord

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
loader = tfrecord.tfrecord_loader("/tmp/data.tfrecord", None,
                                  context_description,
                                  sequence_description=sequence_description)

for context, sequence_feats in loader:
    print(context["label"])
    print(sequence_feats["seq_labels"])
```

### Reading SequenceExamples in PyTorch

As described in the section on `Transforming Input`, one can pass a function as the `transform` argument to perform post-processing of features. This is especially useful for sequence features, since these are variable-length sequences that need to be padded out before being batched.

```python
import torch
import numpy as np
from tfrecord.torch.dataset import TFRecordDataset

PAD_WIDTH = 5
def pad_sequence_feats(data):
    context, features = data
    for k, v in features.items():
        features[k] = np.pad(v, ((0, PAD_WIDTH - len(v)), (0, 0)), 'constant')
    return (context, features)

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
dataset = TFRecordDataset("/tmp/data.tfrecord",
                          index_path=None,
                          description=context_description,
                          transform=pad_sequence_feats,
                          sequence_description=sequence_description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
data = next(iter(loader))
print(data)
```
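To make the `np.pad` call above concrete: the second record written earlier has only two token steps, and padding extends it to `PAD_WIDTH` steps.

```python
import numpy as np

PAD_WIDTH = 5
tokens = np.array([[0, 0, 1], [1, 0, 0]])  # 2 steps of 3-dim tokens
padded = np.pad(tokens, ((0, PAD_WIDTH - len(tokens)), (0, 0)), 'constant')
print(padded.shape)  # (5, 3): three all-zero rows appended
```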
Alternatively, you could choose to implement a custom `collate_fn` in order to assemble the batch, for example, to perform dynamic padding. (The `transform` from the previous example is no longer needed, since padding now happens in `collate_fn`.)

```python
import torch
from tfrecord.torch.dataset import TFRecordDataset

def collate_fn(batch):
    from torch.utils.data._utils import collate
    from torch.nn.utils import rnn
    context, feats = zip(*batch)
    feats_ = {k: [torch.Tensor(d[k]) for d in feats] for k in feats[0]}
    return (collate.default_collate(context),
            {k: rnn.pad_sequence(f, True) for (k, f) in feats_.items()})

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
dataset = TFRecordDataset("/tmp/data.tfrecord",
                          index_path=None,
                          description=context_description,
                          sequence_description=sequence_description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
data = next(iter(loader))
print(data)
```

%prep
%autosetup -n tfrecord-1.14.3

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-tfrecord -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue May 30 2023 Python_Bot <Python_Bot@openeuler.org> - 1.14.3-1
- Package Spec generated