| | | |
|---|---|---|
| author | CoprDistGit <infra@openeuler.org> | 2023-05-10 08:53:17 +0000 |
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-10 08:53:17 +0000 |
| commit | 236d5c608edf789f4ec80ba07f12225eb8c4824f (patch) | |
| tree | 66f9e8e956432cf2493ed6572cd69652add11182 /python-lunas.spec | |
| parent | 86c363f6fa300af95b061c29b3eaf5edc7b7aff0 (diff) | |
automatic import of python-lunas
Diffstat (limited to 'python-lunas.spec')
| -rw-r--r-- | python-lunas.spec | 788 |
1 file changed, 788 insertions, 0 deletions
diff --git a/python-lunas.spec b/python-lunas.spec
new file mode 100644
index 0000000..8e07067
--- /dev/null
+++ b/python-lunas.spec
@@ -0,0 +1,788 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-Lunas
+Version:	0.5.1
+Release:	1
+Summary:	Building customisable data processing pipelines and data iterators for machine learning.
+License:	MIT
+URL:		https://github.com/pluiez/lunas
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/78/a5/4e4e93c0a8814fb3b0e83f420b3ff33c81599b55b2bc971edfe2cb8fe387/Lunas-0.5.1.tar.gz
+BuildArch:	noarch
+
+Requires:	python3-numpy
+Requires:	python3-torch
+
+%description
+# Lunas
+
+[](https://github.com/pluiez/lunas)
+
+**Lunas** is a Python-based library that mimics TensorFlow's `dataset` API and its logic for building data
+processing pipelines over arbitrary datasets.
+
+The implementation mostly draws on TensorFlow, but in a simplified, pure-Python fashion.
+
+## License
+
+This project is released under the [MIT](LICENSE) license.
+
+## Features
+
+A `Dataset` represents a dataset and optionally holds custom operations on dataset elements.
+
+Operations are evaluated lazily, trading some speed for a smaller memory footprint.
+
+### Datasets
+
+Currently the following datasets are supported:
+
+1. `TextLine`: iterates through a text file in read mode, line by line.
+2. `Stdin`: wraps the standard input as a dataset.
+3. `Array`: wraps an iterable object as a dataset.
+4. `Range`: wraps a range of integers as a dataset, simulating the builtin `range`.
+5. `Enumerate`: wraps a dataset with an index for each element, simulating the builtin `enumerate`.
+6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets.
+7. `Concat`: concatenates multiple datasets as one dataset.
+8. `Group`: groups several samples together.
+9. `Flatten`: flattens a sample into multiple samples.
+10. `Glob`: wraps the standard `glob.glob` as a dataset.
+11. `Map`: transforms elements by a given mapping function.
+12. `Where`: filters elements by a given predicate function.
+13. `Repeat`: repeats the dataset for multiple epochs.
+14. `Interleave`: maps a dataset into multiple datasets and interleaves between them.
+15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation.
+16. `Sort`: sorts the dataset.
+17. `Slice`: slices the dataset.
+18. `Shard`: shards the dataset into different partitions.
+19. `Sampling`: draws samples from several datasets given a sampling distribution.
+
+Additionally, chaining-style operations are available for the following datasets:
+`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`.
+
+For example, a dataset can invoke the following to create a new dataset:
+
+```python
+ds = (lunas.Range(100)
+      .map(lambda x: 2 * x)
+      .where(lambda x: x < 50)
+      .shuffle(buffer_size=100))
+
+print(list(ds))
+```
+
+### Batch Iterators
+
+Batch iterators are provided to generate batches from a given dataset; they currently include:
+
+1. `ConstantIterator`: generates batches with a constant number of samples.
+2. `BucketIterator`: generates varying-sized batches, with the sample size determined by a custom function.
+3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features.
+
+### Persistence
+
+Both datasets and batch iterators support persistence through the `state()` and `load()` interface.
+`state()` takes a checkpoint of the current iteration state, while `load()` restores the iteration state from a given
+checkpoint.
+
+## Requirements
+
+- Python >= 3.7
+- numpy
+- pytorch >= 1.5.0
+
+## Installation
+
+Install using pip:
+
+```shell
+pip install -U lunas
+```
+
+## Basics
+
+1. Create a dataset and iterate through it:
+
+    ```python
+    from lunas import Range
+
+    ds = Range(1000).shuffle(buffer_size=100)
+    for x in ds:  # epoch 1
+        print(x)
+    for x in ds:  # epoch 2
+        print(x)
+
+    ds = Range(1000).shuffle(buffer_size=100).repeat(2)
+    for x in ds:  # 2 epochs
+        print(x)
+    ```
+
+    - A dataset can be scanned through for several epochs.
+    - `Dataset.shuffle()` performs buffered shuffling. The shuffling does not happen immediately at dataset creation,
+      but rather begins when an element is first accessed from the dataset.
+    - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice.
+
+2. Build a data processing pipeline:
+
+    ```python
+    from lunas import *
+    ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0)
+    ```
+
+    - The chained calls on a `Dataset` object define a processing pipeline over the original dataset.
+
+3. Deal with multiple data sources:
+
+    ```python
+    from lunas import *
+
+    ds1 = Range(10)
+    ds2 = Range(start=10, stop=20, step=1)
+    ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True)
+
+    ds3 = Range(10)
+    ds4 = Range(100)
+    ds5 = Range(1000)
+    ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True)
+    ```
+
+    - The two datasets here are zipped into a `Zip` dataset. A `Zip` dataset returns a tuple drawn from its internal
+      child datasets, that is, `ds1` and `ds2`.
+
+    - `Zip` requires the datasets to be strictly aligned by default. It also allows zipping multiple datasets of
+      different sizes by providing the additional `mode` and `padding` arguments to indicate whether the smaller
+      datasets should be padded or the bigger ones truncated.
+
+4. Example usage in a more complicated distributed multilingual language-modelling training case:
+
+    ```python
+    from lunas import *
+
+
+    corpus_paths = ['train.zh', 'train.en', 'train.ru']
+    sampling_weights = [0.3, 0.4, 0.3]
+
+    # Shard each dataset so that every worker holds a unique shard of the original corpus.
+    # Sharding should be done before shuffling to avoid unnecessary shuffling effort in each worker.
+    datasets = []
+    for corpus in corpus_paths:
+        ds = TextLine(corpus) \
+            .shard(dist_world_size, dist_local_rank) \
+            .shuffle(buffer_size=10000)
+        # Tokenise plain text into token ids
+        ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)})
+        # Group 128 consecutive samples together, then concatenate and split the samples in that group into equal
+        # lengths to reduce padding. Finally, flatten the grouped samples back into separate samples.
+        ds = ds.group(group_size=128) \
+            .map(lambda xs: concat_and_split(xs, target_length=1024)) \
+            .flatten()
+
+        datasets.append(ds)
+    # Define a sampling strategy over the datasets
+    ds = Sampling(datasets, sampling_weights, virtual_size=1000000)
+
+    batch_itr = BucketIterator(
+        ds,
+        # each batch has at most 4096 tokens
+        batch_size=4096,
+        # the size of each sample is measured as the number of tokens in the target language
+        get_length_fn=lambda x: len(x),
+        bucket_boundaries=get_bucket_boundaries()
+    )
+
+    dataloader = DataLoader(
+        batch_itr,
+        num_workers=6,
+        collate_fn=collate_fn,
+    )
+
+    for epoch in range(max_epoch):
+        for batch in dataloader:
+            ...
+    ```
+
+5. Resume iteration:
+
+    ```python
+    import pickle
+    # Stop at the 10th element
+    for i, x in enumerate(it):
+        if i == 10:
+            break
+    pickle.dump(it.state(), open('state.pkl', 'wb'))
+    # ...
+    state = pickle.load(open('state.pkl', 'rb'))
+    it.load(state)
+    # Start from the 11th element
+    for i, x in enumerate(it):
+        ...
+    ```
+
+    - `it` here can be a dataset or a batch iterator object.
+    - `state()` returns a picklable dictionary, which can be passed to `it.load()` to resume the iteration.
+    - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a
+      counting pointer in `Dataset`. For dataset implementations that manage iteration through internal buffering, such
+      as `Shuffle`, `Sort` and `BucketIterator`, `load()` loses the content of the buffer.
+
+6. Extend the dataset:
+
+    - You can refer to the implementation of `TextLine` to customise your own dataset.
+
+## Known issues
+
+1. Parallel processing is not yet supported due to Python's limited support for parallelisation.
+
+    Multi-threading can help with resource-intensive data-loading operations, but not with CPU-intensive data
+    processing operations. While multi-processing does facilitate CPU-intensive scenarios, it comes with a few
+    limitations, which further complicate the use of the library.
+
+    Although the lunas APIs themselves would not change, users would have to take extra care to make multi-processing
+    work correctly. For example, multi-processing does not accept lambda expressions or other unpicklable objects as
+    arguments. A more severe problem is that once a child process terminates with certain fatal errors (for example,
+    a segmentation fault), the parent process is never notified of the child's termination. Tracking the states of
+    child processes therefore requires extra effort, and the standard `multiprocessing` library is not up to the task.
+
+    We are likely to opt for a C++-based implementation of the parallelisation features, just as TensorFlow did.
+
+2. The `Stdin` dataset cannot be used in a multiprocessing context.
+
+    Multiprocessing can mess up standard input, since /dev/stdin cannot be distributed to multiple processes with a
+    trivial implementation. Furthermore, there seems to be little practical need to spread stdin across multiple
+    processes, so the problem is simply left aside.
+
+
+
+
+%package -n python3-Lunas
+Summary:	Building customisable data processing pipelines and data iterators for machine learning.
+Provides:	python-Lunas
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-Lunas
+# Lunas
+
+[](https://github.com/pluiez/lunas)
+
+**Lunas** is a Python-based library that mimics TensorFlow's `dataset` API and its logic for building data
+processing pipelines over arbitrary datasets.
+
+The implementation mostly draws on TensorFlow, but in a simplified, pure-Python fashion.
+
+## License
+
+This project is released under the [MIT](LICENSE) license.
+
+## Features
+
+A `Dataset` represents a dataset and optionally holds custom operations on dataset elements.
+
+Operations are evaluated lazily, trading some speed for a smaller memory footprint.
+
+### Datasets
+
+Currently the following datasets are supported:
+
+1. `TextLine`: iterates through a text file in read mode, line by line.
+2. `Stdin`: wraps the standard input as a dataset.
+3. `Array`: wraps an iterable object as a dataset.
+4. 
`Range`: wraps a range of integers as a dataset, simulating builtin `range`. +5. `Enumerate`: wraps a dataset with index for each element, simulating builtin `enumerate`. +6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets. +7. `Concat`: concatenates multiple datasets as one dataset. +8. `Group`: group several samples together. +9. `Flatten`: flattens a sample into multiple samples. +10. `Glob`: wraps the standard `glob.glob` as a dataset. +11. `Map`: transforms elements by a given mapping function. +12. `Where`: filters elements by a given predicate function. +13. `Repeat`: repeats the dataset for multiple epochs. +14. `Interleave`: maps a dataset into multiple datasets and interleave between the datasets. +15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation. +16. `Sort`: sorts the dataset. +17. `Slice`: slices the dataset. +18. `Shard`: shards the dataset into different partitions. +19. `Sampling`: draws samples from several datasets given a sampling distribution. + +Additionally, chaining-style dataset operation is available for following datasets: +`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`. + +For example, a dataset can invoke the following to create a new dataset: + +```python +ds = lunas.Range(100) +.map(lambda x: 2 * x) +.where(lambda x: x < 50) +.shuffle(buffer_size=100) + +print(list(ds)) +``` + +### Batch Iterators + +The batch iterators are provided to generate batches from a given dataset, currently including: + +1. `ConstantIterator`: generates batches with a constant number of samples. +2. `BucketIterator`: generates varying-sized batches with sample size determined by a custom function. +3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features. + +### Persistence + +Both datasets and batch iterators support persistence using `state()` and `load()` interface. +`state()` takes a checkpoint of current iteration state, while `load()` restores iteration state from a given +checkpoint. + +## Requirements + +- Python >= 3.7 +- numpy +- pytorch >= 1.5.0 + +## Installation + +Install using pip: + +```shell +pip install -U lunas +``` + +## Basics + +1. Create a dataset and iterate through it: + + ```python + from lunas import Range + + ds = Range(1000).shuffle(buffer_size=100) + for x in ds: # epoch 1 + print(x) + for x in ds: # epoch 2 + print(x) + + ds = Range(1000).shuffle(buffer_size=100).repeat(2) + for x in ds: # 2 epochs + print(x) + ``` + + - A dataset can be scanned through for several epochs. + - Dataset.shuffle() performs a buffered shuffling. The shuffling does not happen immediately at dataset creation, + but rather begins when trying to access an element from the dataset. + - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice. + +2. Build a data processing pipeline: + + ```python + from lunas import * + ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0) + ``` + + - The chaining calls of a `Dataset` object defines a processing pipeline on the original dataset. + +3. 
Deal with multiple data sources: + + ```python + from lunas import * + + ds1 = Range(10) + ds2 = Range(start=10, stop=20, step=1) + ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True) + + ds3 = Range(10) + ds4 = Range(100) + ds5 = Range(1000) + ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True) + ``` + + - Two datasets here are zipped as a `Zip` dataset. A `Zip` dataset returns a tuple from the internal child-datasets, + that is `ds1` and `ds2`. + + - `Zip` requires strictly the datasets to be aligned by default. It also allows zipping multiple datasets of + different sizes by providing additional `mode` and `paddinng` argument to indicate either padding smaller dataset + or truncating bigger dataset. + +4. Example usage in a more complicated distributed multilingual Language Modeling training case: + + ```python + from lunas import * + + + corpus_paths = ['train.zh', 'train.en', 'train.ru'] + sampling_weights = [0.3, 0.4, 0.3] + + # Shards a dataset so that each worker holds a unique shard of the original corpus. + # Sharding should be done before shuffling to avoid unnecessary shuffling efforts in each worker. + datasets = [] + for corpus in corpus_paths: + ds = TextLine(corpus) \ + .shard(dist_word_size, dist_local_rank) \ + .shuffle(buffer_size=10000) + # Tokenizes plain text into token ids + ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)}) + # Group consecutive 128 samples together, then concat and split the samples in that group into the same length + # to reduce padding. Finally, flatten the samples group into separate samples. + ds = ds.group(group_size=128) \ + .map(lambda xs: concat_and_split(xs, target_length=1024)) \ + .flatten() + + datasets.append(ds) + # Defines a sampling strategy from the datasets + ds = Sampling(datasets, sampling_weights, virtual_size=1000000) + + batch_itr = BucketIterator( + ds, + # each batch size has at most 4096 tokens + batch_size=4096, + # size for each sample is measured in number of tokens in target language + get_length_fn=lambda x: len(x), + bucket_boundaries=get_bucket_boundaries() + ) + + dataloader = DataLoader( + batch_itr, + num_workers=6, + collate_fn=collate_fn, + ) + + for epoch in range(max_epoch): + for bathc in dataloader: + ... + ``` + +5. Resume iteration: + + ```python + import pickle + # Stops at the 10-th element + for i, x in enumerate(it): + if i == 10: + break + pickle.dump(it.state(), open('state.pkl', 'wb')) + # ... + state = pickle.load(open('state.pkl', 'rb')) + it.load(state) + # Starts from the 11-th element + for i, x in enumerate(it): + ... + ``` + + - `it` here can be a dataset or batch iterator object. + - `state()` returns a picklable dictionary, which can be loaded by `it.load()` to resume the iteration. + - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a + counting pointer in `Dataset`. For those dataset implementations that manage iteration by internal buffering, such + as `Shuffle`, `Sort` and `BucketIterator`, `load()` would loss content in the buffer. + +6. Extend the dataset: + + - You can refer to the implementation of `TextLine` to customize your own data dataset. + +## Known issues + +1. Parallel processing is not yet supported due to Python's limited support for parallelization. + + Multi-threading can be helpful for resource-intensive data loading operations, but not for CPU-intensive data + processing operations. 
Whereas multi-processing is facilitates CPU-intensive scenarios, there are a few limitations, + which further introduce complexity in the use of the library. + + Although it won't cause any difference for lunas APIs, the users will have to pay more attention in order to ensure + multi-processing work correctly. For example, multi-processing does not accept lambda expressions and any unpicklable + objects as arguments. The more severe problem is that once the child-process terminated with certain fatal errors ( + for example, a segment fault), the parent process will never be notified the termination of the child. It thus + requires extra efforts on accounting the states of child processes and the standard `multiprocessing` library does + not come to use. + + We are likely to opt to C++ based implementation for parallelization features just as TensorFlow did. + +2. Stdin dataset cannot be used in potential multiprocessing context. + + multiprocessing can mess up standard input since we can't distribute /dev/stdin to multiple processes with trivial + implementation. Furthermore, there seems to be little preferential needs to spread stdin to multiple processes, so + the problem is simply left aside. + + + + +%package help +Summary: Development documents and examples for Lunas +Provides: python3-Lunas-doc +%description help +# Lunas + +[](https://github.com/pluiez/lunas) + +**Lunas** is a Python based library that mimics TensorFlow's `dataset` API and also its logics to build a data +processing pipeline for arbitrary datasets. + +The implementation mostly draws on TensorFlow but in a simplified and pure-Python fashion. + +## License + +This project uses [MIT](LICENSE) license. + +## Features + +A `Dataset` represents a dataset and optionally holds custom operations on dataset elements. + +The evaluation of operations are performed lazily, hence it's a trade-off for memory against speed. + +### Datasets + +Currently the following datasets are supported: + +1. `TextLine`: iterates through a text file in read mode line by line. +2. `Stdin`: wraps the standard input as a dataset. +3. `Array`: wraps an iterable object as a dataset. +4. `Range`: wraps a range of integers as a dataset, simulating builtin `range`. +5. `Enumerate`: wraps a dataset with index for each element, simulating builtin `enumerate`. +6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets. +7. `Concat`: concatenates multiple datasets as one dataset. +8. `Group`: group several samples together. +9. `Flatten`: flattens a sample into multiple samples. +10. `Glob`: wraps the standard `glob.glob` as a dataset. +11. `Map`: transforms elements by a given mapping function. +12. `Where`: filters elements by a given predicate function. +13. `Repeat`: repeats the dataset for multiple epochs. +14. `Interleave`: maps a dataset into multiple datasets and interleave between the datasets. +15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation. +16. `Sort`: sorts the dataset. +17. `Slice`: slices the dataset. +18. `Shard`: shards the dataset into different partitions. +19. `Sampling`: draws samples from several datasets given a sampling distribution. + +Additionally, chaining-style dataset operation is available for following datasets: +`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`. 
+ +For example, a dataset can invoke the following to create a new dataset: + +```python +ds = lunas.Range(100) +.map(lambda x: 2 * x) +.where(lambda x: x < 50) +.shuffle(buffer_size=100) + +print(list(ds)) +``` + +### Batch Iterators + +The batch iterators are provided to generate batches from a given dataset, currently including: + +1. `ConstantIterator`: generates batches with a constant number of samples. +2. `BucketIterator`: generates varying-sized batches with sample size determined by a custom function. +3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features. + +### Persistence + +Both datasets and batch iterators support persistence using `state()` and `load()` interface. +`state()` takes a checkpoint of current iteration state, while `load()` restores iteration state from a given +checkpoint. + +## Requirements + +- Python >= 3.7 +- numpy +- pytorch >= 1.5.0 + +## Installation + +Install using pip: + +```shell +pip install -U lunas +``` + +## Basics + +1. Create a dataset and iterate through it: + + ```python + from lunas import Range + + ds = Range(1000).shuffle(buffer_size=100) + for x in ds: # epoch 1 + print(x) + for x in ds: # epoch 2 + print(x) + + ds = Range(1000).shuffle(buffer_size=100).repeat(2) + for x in ds: # 2 epochs + print(x) + ``` + + - A dataset can be scanned through for several epochs. + - Dataset.shuffle() performs a buffered shuffling. The shuffling does not happen immediately at dataset creation, + but rather begins when trying to access an element from the dataset. + - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice. + +2. Build a data processing pipeline: + + ```python + from lunas import * + ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0) + ``` + + - The chaining calls of a `Dataset` object defines a processing pipeline on the original dataset. + +3. Deal with multiple data sources: + + ```python + from lunas import * + + ds1 = Range(10) + ds2 = Range(start=10, stop=20, step=1) + ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True) + + ds3 = Range(10) + ds4 = Range(100) + ds5 = Range(1000) + ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True) + ``` + + - Two datasets here are zipped as a `Zip` dataset. A `Zip` dataset returns a tuple from the internal child-datasets, + that is `ds1` and `ds2`. + + - `Zip` requires strictly the datasets to be aligned by default. It also allows zipping multiple datasets of + different sizes by providing additional `mode` and `paddinng` argument to indicate either padding smaller dataset + or truncating bigger dataset. + +4. Example usage in a more complicated distributed multilingual Language Modeling training case: + + ```python + from lunas import * + + + corpus_paths = ['train.zh', 'train.en', 'train.ru'] + sampling_weights = [0.3, 0.4, 0.3] + + # Shards a dataset so that each worker holds a unique shard of the original corpus. + # Sharding should be done before shuffling to avoid unnecessary shuffling efforts in each worker. + datasets = [] + for corpus in corpus_paths: + ds = TextLine(corpus) \ + .shard(dist_word_size, dist_local_rank) \ + .shuffle(buffer_size=10000) + # Tokenizes plain text into token ids + ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)}) + # Group consecutive 128 samples together, then concat and split the samples in that group into the same length + # to reduce padding. 
+        # Finally, flatten the grouped samples back into separate samples.
+        ds = ds.group(group_size=128) \
+            .map(lambda xs: concat_and_split(xs, target_length=1024)) \
+            .flatten()
+
+        datasets.append(ds)
+    # Define a sampling strategy over the datasets
+    ds = Sampling(datasets, sampling_weights, virtual_size=1000000)
+
+    batch_itr = BucketIterator(
+        ds,
+        # each batch has at most 4096 tokens
+        batch_size=4096,
+        # the size of each sample is measured as the number of tokens in the target language
+        get_length_fn=lambda x: len(x),
+        bucket_boundaries=get_bucket_boundaries()
+    )
+
+    dataloader = DataLoader(
+        batch_itr,
+        num_workers=6,
+        collate_fn=collate_fn,
+    )
+
+    for epoch in range(max_epoch):
+        for batch in dataloader:
+            ...
+    ```
+
+5. Resume iteration:
+
+    ```python
+    import pickle
+    # Stop at the 10th element
+    for i, x in enumerate(it):
+        if i == 10:
+            break
+    pickle.dump(it.state(), open('state.pkl', 'wb'))
+    # ...
+    state = pickle.load(open('state.pkl', 'rb'))
+    it.load(state)
+    # Start from the 11th element
+    for i, x in enumerate(it):
+        ...
+    ```
+
+    - `it` here can be a dataset or a batch iterator object.
+    - `state()` returns a picklable dictionary, which can be passed to `it.load()` to resume the iteration.
+    - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a
+      counting pointer in `Dataset`. For dataset implementations that manage iteration through internal buffering, such
+      as `Shuffle`, `Sort` and `BucketIterator`, `load()` loses the content of the buffer.
+
+6. Extend the dataset:
+
+    - You can refer to the implementation of `TextLine` to customise your own dataset.
+
+## Known issues
+
+1. Parallel processing is not yet supported due to Python's limited support for parallelisation.
+
+    Multi-threading can help with resource-intensive data-loading operations, but not with CPU-intensive data
+    processing operations. While multi-processing does facilitate CPU-intensive scenarios, it comes with a few
+    limitations, which further complicate the use of the library.
+
+    Although the lunas APIs themselves would not change, users would have to take extra care to make multi-processing
+    work correctly. For example, multi-processing does not accept lambda expressions or other unpicklable objects as
+    arguments. A more severe problem is that once a child process terminates with certain fatal errors (for example,
+    a segmentation fault), the parent process is never notified of the child's termination. Tracking the states of
+    child processes therefore requires extra effort, and the standard `multiprocessing` library is not up to the task.
+
+    We are likely to opt for a C++-based implementation of the parallelisation features, just as TensorFlow did.
+
+2. The `Stdin` dataset cannot be used in a multiprocessing context.
+
+    Multiprocessing can mess up standard input, since /dev/stdin cannot be distributed to multiple processes with a
+    trivial implementation. Furthermore, there seems to be little practical need to spread stdin across multiple
+    processes, so the problem is simply left aside.
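+
+As a recap of the `state()`/`load()` persistence interface described above, the sketch below checkpoints a dataset
+mid-iteration and later resumes it. It is a minimal example that only uses calls already shown in this README
+(`Range`, `map`, `state()`, `load()` and `pickle`); re-creating an identically configured dataset before calling
+`load()` is an assumption of this sketch, not a documented requirement, and the file name `state.pkl` is arbitrary.
+
+```python
+import pickle
+
+from lunas import Range
+
+ds = Range(100).map(lambda x: x * 2)
+
+# Consume the first 10 elements, then checkpoint the iteration state.
+for i, x in enumerate(ds):
+    if i == 9:
+        break
+
+with open('state.pkl', 'wb') as f:
+    pickle.dump(ds.state(), f)
+
+# Later, possibly in a new process: rebuild the same pipeline and restore the state.
+ds = Range(100).map(lambda x: x * 2)
+with open('state.pkl', 'rb') as f:
+    ds.load(pickle.load(f))
+
+for x in ds:  # continues from the element after the checkpoint
+    ...
+```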
+ + + + +%prep +%autosetup -n Lunas-0.5.1 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-Lunas -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.1-1 +- Package Spec generated |
