path: root/python-lunas.spec
author    CoprDistGit <infra@openeuler.org>    2023-05-10 08:53:17 +0000
committer CoprDistGit <infra@openeuler.org>    2023-05-10 08:53:17 +0000
commit    236d5c608edf789f4ec80ba07f12225eb8c4824f (patch)
tree      66f9e8e956432cf2493ed6572cd69652add11182 /python-lunas.spec
parent    86c363f6fa300af95b061c29b3eaf5edc7b7aff0 (diff)
automatic import of python-lunas
Diffstat (limited to 'python-lunas.spec')
-rw-r--r--    python-lunas.spec    788
1 file changed, 788 insertions(+), 0 deletions(-)
diff --git a/python-lunas.spec b/python-lunas.spec
new file mode 100644
index 0000000..8e07067
--- /dev/null
+++ b/python-lunas.spec
@@ -0,0 +1,788 @@
+%global _empty_manifest_terminate_build 0
+Name: python-Lunas
+Version: 0.5.1
+Release: 1
+Summary: Building customisable data processing pipelines and data iterators for machine learning.
+License: MIT
+URL: https://github.com/pluiez/lunas
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/78/a5/4e4e93c0a8814fb3b0e83f420b3ff33c81599b55b2bc971edfe2cb8fe387/Lunas-0.5.1.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-torch
+
+%description
+# Lunas
+
+[![PyPI version](https://img.shields.io/badge/pypi-v0.5.1-limegreen.svg)](https://github.com/pluiez/lunas)
+
+**Lunas** is a Python-based library that mimics TensorFlow's `dataset` API, as well as its logic, to build data
+processing pipelines for arbitrary datasets.
+
+The implementation mostly draws on TensorFlow but in a simplified and pure-Python fashion.
+
+## License
+
+This project uses the [MIT](LICENSE) license.
+
+## Features
+
+A `Dataset` represents a dataset and optionally holds custom operations on dataset elements.
+
+Operations are evaluated lazily, trading some speed for a smaller memory footprint.
+
+### Datasets
+
+Currently the following datasets are supported:
+
+1. `TextLine`: iterates through a text file in read mode line by line.
+2. `Stdin`: wraps the standard input as a dataset.
+3. `Array`: wraps an iterable object as a dataset.
+4. `Range`: wraps a range of integers as a dataset, simulating builtin `range`.
+5. `Enumerate`: wraps a dataset with index for each element, simulating builtin `enumerate`.
+6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets.
+7. `Concat`: concatenates multiple datasets as one dataset.
+8. `Group`: groups several samples together.
+9. `Flatten`: flattens a sample into multiple samples.
+10. `Glob`: wraps the standard `glob.glob` as a dataset.
+11. `Map`: transforms elements by a given mapping function.
+12. `Where`: filters elements by a given predicate function.
+13. `Repeat`: repeats the dataset for multiple epochs.
+14. `Interleave`: maps a dataset into multiple datasets and interleaves between them.
+15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation.
+16. `Sort`: sorts the dataset.
+17. `Slice`: slices the dataset.
+18. `Shard`: shards the dataset into different partitions.
+19. `Sampling`: draws samples from several datasets given a sampling distribution.
+
+Additionally, chaining-style invocation is available for the following datasets:
+`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`.
+
+For example, a dataset can invoke the following to create a new dataset:
+
+```python
+import lunas
+
+ds = lunas.Range(100) \
+    .map(lambda x: 2 * x) \
+    .where(lambda x: x < 50) \
+    .shuffle(buffer_size=100)
+
+print(list(ds))
+```
+
+### Batch Iterators
+
+Batch iterators generate batches from a given dataset; the following are currently included:
+
+1. `ConstantIterator`: generates batches with a constant number of samples.
+2. `BucketIterator`: generates varying-sized batches with sample size determined by a custom function.
+3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features.
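+
+For instance, a minimal sketch of batching with `ConstantIterator` (the constructor
+arguments shown are assumptions for illustration, not the documented signature):
+
+```python
+from lunas import ConstantIterator, Range
+
+ds = Range(100)
+# Hypothetical arguments: the source dataset and a fixed number of samples per batch.
+batch_itr = ConstantIterator(ds, batch_size=16)
+for batch in batch_itr:
+    ...  # assuming each batch is a collection of 16 consecutive samples
+```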
+
+### Persistence
+
+Both datasets and batch iterators support persistence via the `state()` and `load()` interfaces:
+`state()` takes a checkpoint of the current iteration state, while `load()` restores the iteration state from a given
+checkpoint.
+
+## Requirements
+
+- Python >= 3.7
+- numpy
+- PyTorch >= 1.5.0
+
+## Installation
+
+Install using pip:
+
+```shell
+pip install -U lunas
+```
+
+## Basics
+
+1. Create a dataset and iterate through it:
+
+ ```python
+ from lunas import Range
+
+ ds = Range(1000).shuffle(buffer_size=100)
+ for x in ds:  # epoch 1
+     print(x)
+ for x in ds:  # epoch 2
+     print(x)
+
+ ds = Range(1000).shuffle(buffer_size=100).repeat(2)
+ for x in ds:  # 2 epochs
+     print(x)
+ ```
+
+ - A dataset can be scanned through for several epochs.
+ - `Dataset.shuffle()` performs buffered shuffling. The shuffling does not happen immediately at dataset creation,
+ but begins when an element is first accessed from the dataset.
+ - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice.
+
+2. Build a data processing pipeline:
+
+ ```python
+ from lunas import *
+ ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0)
+ ```
+
+ - The chained calls on a `Dataset` object define a processing pipeline over the original dataset.
+
+3. Deal with multiple data sources:
+
+ ```python
+ from lunas import *
+
+ ds1 = Range(10)
+ ds2 = Range(start=10, stop=20, step=1)
+ ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True)
+
+ ds3 = Range(10)
+ ds4 = Range(100)
+ ds5 = Range(1000)
+ ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True)
+ ```
+
+ - The two datasets here are zipped into a `Zip` dataset, which returns a tuple drawn from its child datasets,
+ that is, `ds1` and `ds2`.
+
+ - By default, `Zip` strictly requires the datasets to be aligned in size. It also allows zipping multiple datasets
+ of different sizes, with the additional `mode` and `padding` arguments indicating whether to pad the smaller
+ datasets or truncate the bigger ones.
+
+4. Example usage in a more involved case, distributed multilingual language modeling training:
+
+ ```python
+ from lunas import *
+
+
+ corpus_paths = ['train.zh', 'train.en', 'train.ru']
+ sampling_weights = [0.3, 0.4, 0.3]
+
+ # Shards a dataset so that each worker holds a unique shard of the original corpus.
+ # Sharding should be done before shuffling to avoid unnecessary shuffling efforts in each worker.
+ datasets = []
+ for corpus in corpus_paths:
+     ds = TextLine(corpus) \
+         .shard(dist_world_size, dist_local_rank) \
+         .shuffle(buffer_size=10000)
+     # Tokenize plain text into token ids
+     ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)})
+     # Group 128 consecutive samples together, then concat and split each group
+     # into samples of the same length to reduce padding. Finally, flatten the
+     # groups back into separate samples.
+     ds = ds.group(group_size=128) \
+         .map(lambda xs: concat_and_split(xs, target_length=1024)) \
+         .flatten()
+     datasets.append(ds)
+
+ # Define a sampling strategy over the datasets
+ ds = Sampling(datasets, sampling_weights, virtual_size=1000000)
+
+ batch_itr = BucketIterator(
+     ds,
+     # each batch has at most 4096 tokens
+     batch_size=4096,
+     # the size of each sample is measured as its number of tokens
+     get_length_fn=lambda x: len(x),
+     bucket_boundaries=get_bucket_boundaries()
+ )
+
+ dataloader = DataLoader(
+     batch_itr,
+     num_workers=6,
+     collate_fn=collate_fn,
+ )
+
+ for epoch in range(max_epoch):
+     for batch in dataloader:
+         ...
+ ```
+
+5. Resume iteration:
+
+ ```python
+ import pickle
+
+ # Stop after the first 10 elements
+ for i, x in enumerate(it):
+     if i == 10:
+         break
+ pickle.dump(it.state(), open('state.pkl', 'wb'))
+ # ...
+ state = pickle.load(open('state.pkl', 'rb'))
+ it.load(state)
+ # Resume from the 11th element
+ for i, x in enumerate(it):
+     ...
+ ```
+
+ - `it` here can be a dataset or batch iterator object.
+ - `state()` returns a picklable dictionary, which can be loaded by `it.load()` to resume the iteration.
+ - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a
+ counting pointer in `Dataset`. For dataset implementations that manage iteration with internal buffering, such
+ as `Shuffle`, `Sort` and `BucketIterator`, `load()` will lose the content of the buffer.
+
+6. Extend the dataset:
+
+ - You can refer to the implementation of `TextLine` to customise your own dataset; a composition-based alternative is sketched below.
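+
+ Even without subclassing, a custom data source can often be composed from the documented
+ primitives. A minimal sketch (`train.jsonl` is a placeholder file name):
+
+ ```python
+ import json
+
+ from lunas import TextLine
+
+ # Parse a JSON-lines corpus by composing TextLine with map.
+ ds = TextLine('train.jsonl').map(lambda line: json.loads(line))
+ ```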
+
+## Known issues
+
+1. Parallel processing is not yet supported due to Python's limited support for parallelization.
+
+ Multi-threading can be helpful for resource-intensive data-loading operations, but not for CPU-intensive data
+ processing. While multi-processing facilitates CPU-intensive scenarios, it comes with a few limitations
+ that further complicate the use of the library.
+
+ Although this makes no difference to the lunas APIs themselves, users have to pay extra attention to make
+ multi-processing work correctly. For example, multi-processing accepts neither lambda expressions nor any other
+ unpicklable objects as arguments. A more severe problem is that once a child process terminates with certain fatal
+ errors (for example, a segmentation fault), the parent process is never notified of the child's termination.
+ Tracking the states of child processes therefore requires extra effort that the standard `multiprocessing`
+ library does not provide.
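+
+ As a quick illustration of the pickling limitation (standard Python behaviour, not specific to lunas):
+
+ ```python
+ import multiprocessing as mp
+
+ if __name__ == '__main__':
+     with mp.Pool(2) as pool:
+         # Lambdas are not picklable, so this raises a PicklingError.
+         pool.map(lambda x: 2 * x, range(4))
+ ```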
+
+ We are likely to opt for a C++-based implementation of the parallelisation features, just as TensorFlow did.
+
+2. `Stdin` dataset cannot be used in a multiprocessing context.
+
+ multiprocessing can mess up the standard input, since /dev/stdin cannot be distributed to multiple processes by any
+ trivial implementation. Furthermore, there seems to be little practical need to spread stdin across multiple
+ processes, so the problem is simply left aside.
+
+
+
+
+%package -n python3-Lunas
+Summary: Building customisable data processing pipelines and data iterators for machine learning.
+Provides: python-Lunas
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-Lunas
+# Lunas
+
+[![PyPI version](https://img.shields.io/badge/pypi-v0.5.1-limegreen.svg)](https://github.com/pluiez/lunas)
+
+**Lunas** is a Python-based library that mimics TensorFlow's `dataset` API, as well as its logic, to build data
+processing pipelines for arbitrary datasets.
+
+The implementation mostly draws on TensorFlow but in a simplified and pure-Python fashion.
+
+## License
+
+This project uses the [MIT](LICENSE) license.
+
+## Features
+
+A `Dataset` represents a dataset and optionally holds custom operations on dataset elements.
+
+Operations are evaluated lazily, trading some speed for a smaller memory footprint.
+
+### Datasets
+
+Currently the following datasets are supported:
+
+1. `TextLine`: iterates through a text file in read mode line by line.
+2. `Stdin`: wraps the standard input as a dataset.
+3. `Array`: wraps an iterable object as a dataset.
+4. `Range`: wraps a range of integers as a dataset, simulating builtin `range`.
+5. `Enumerate`: wraps a dataset with index for each element, simulating builtin `enumerate`.
+6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets.
+7. `Concat`: concatenates multiple datasets as one dataset.
+8. `Group`: groups several samples together.
+9. `Flatten`: flattens a sample into multiple samples.
+10. `Glob`: wraps the standard `glob.glob` as a dataset.
+11. `Map`: transforms elements by a given mapping function.
+12. `Where`: filters elements by a given predicate function.
+13. `Repeat`: repeats the dataset for multiple epochs.
+14. `Interleave`: maps a dataset into multiple datasets and interleaves between them.
+15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation.
+16. `Sort`: sorts the dataset.
+17. `Slice`: slices the dataset.
+18. `Shard`: shards the dataset into different partitions.
+19. `Sampling`: draws samples from several datasets given a sampling distribution.
+
+Additionally, chaining-style invocation is available for the following datasets:
+`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`.
+
+For example, a dataset can invoke the following to create a new dataset:
+
+```python
+import lunas
+
+ds = lunas.Range(100) \
+    .map(lambda x: 2 * x) \
+    .where(lambda x: x < 50) \
+    .shuffle(buffer_size=100)
+
+print(list(ds))
+```
+
+### Batch Iterators
+
+Batch iterators generate batches from a given dataset; the following are currently included:
+
+1. `ConstantIterator`: generates batches with a constant number of samples.
+2. `BucketIterator`: generates varying-sized batches with sample size determined by a custom function.
+3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features.
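+
+For instance, a minimal sketch of batching with `ConstantIterator` (the constructor
+arguments shown are assumptions for illustration, not the documented signature):
+
+```python
+from lunas import ConstantIterator, Range
+
+ds = Range(100)
+# Hypothetical arguments: the source dataset and a fixed number of samples per batch.
+batch_itr = ConstantIterator(ds, batch_size=16)
+for batch in batch_itr:
+    ...  # assuming each batch is a collection of 16 consecutive samples
+```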
+
+### Persistence
+
+Both datasets and batch iterators support persistence via the `state()` and `load()` interfaces:
+`state()` takes a checkpoint of the current iteration state, while `load()` restores the iteration state from a given
+checkpoint.
+
+## Requirements
+
+- Python >= 3.7
+- numpy
+- PyTorch >= 1.5.0
+
+## Installation
+
+Install using pip:
+
+```shell
+pip install -U lunas
+```
+
+## Basics
+
+1. Create a dataset and iterate through it:
+
+ ```python
+ from lunas import Range
+
+ ds = Range(1000).shuffle(buffer_size=100)
+ for x in ds:  # epoch 1
+     print(x)
+ for x in ds:  # epoch 2
+     print(x)
+
+ ds = Range(1000).shuffle(buffer_size=100).repeat(2)
+ for x in ds:  # 2 epochs
+     print(x)
+ ```
+
+ - A dataset can be scanned through for several epochs.
+ - `Dataset.shuffle()` performs buffered shuffling. The shuffling does not happen immediately at dataset creation,
+ but begins when an element is first accessed from the dataset.
+ - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice.
+
+2. Build a data processing pipeline:
+
+ ```python
+ from lunas import *
+ ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0)
+ ```
+
+ - The chained calls on a `Dataset` object define a processing pipeline over the original dataset.
+
+3. Deal with multiple data sources:
+
+ ```python
+ from lunas import *
+
+ ds1 = Range(10)
+ ds2 = Range(start=10, stop=20, step=1)
+ ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True)
+
+ ds3 = Range(10)
+ ds4 = Range(100)
+ ds5 = Range(1000)
+ ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True)
+ ```
+
+ - The two datasets here are zipped into a `Zip` dataset, which returns a tuple drawn from its child datasets,
+ that is, `ds1` and `ds2`.
+
+ - By default, `Zip` strictly requires the datasets to be aligned in size. It also allows zipping multiple datasets
+ of different sizes, with the additional `mode` and `padding` arguments indicating whether to pad the smaller
+ datasets or truncate the bigger ones.
+
+4. Example usage in a more involved case, distributed multilingual language modeling training:
+
+ ```python
+ from lunas import *
+
+
+ corpus_paths = ['train.zh', 'train.en', 'train.ru']
+ sampling_weights = [0.3, 0.4, 0.3]
+
+ # Shards a dataset so that each worker holds a unique shard of the original corpus.
+ # Sharding should be done before shuffling to avoid unnecessary shuffling efforts in each worker.
+ datasets = []
+ for corpus in corpus_paths:
+     ds = TextLine(corpus) \
+         .shard(dist_world_size, dist_local_rank) \
+         .shuffle(buffer_size=10000)
+     # Tokenize plain text into token ids
+     ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)})
+     # Group 128 consecutive samples together, then concat and split each group
+     # into samples of the same length to reduce padding. Finally, flatten the
+     # groups back into separate samples.
+     ds = ds.group(group_size=128) \
+         .map(lambda xs: concat_and_split(xs, target_length=1024)) \
+         .flatten()
+     datasets.append(ds)
+
+ # Define a sampling strategy over the datasets
+ ds = Sampling(datasets, sampling_weights, virtual_size=1000000)
+
+ batch_itr = BucketIterator(
+     ds,
+     # each batch has at most 4096 tokens
+     batch_size=4096,
+     # the size of each sample is measured as its number of tokens
+     get_length_fn=lambda x: len(x),
+     bucket_boundaries=get_bucket_boundaries()
+ )
+
+ dataloader = DataLoader(
+     batch_itr,
+     num_workers=6,
+     collate_fn=collate_fn,
+ )
+
+ for epoch in range(max_epoch):
+     for batch in dataloader:
+         ...
+ ```
+
+5. Resume iteration:
+
+ ```python
+ import pickle
+
+ # Stop after the first 10 elements
+ for i, x in enumerate(it):
+     if i == 10:
+         break
+ pickle.dump(it.state(), open('state.pkl', 'wb'))
+ # ...
+ state = pickle.load(open('state.pkl', 'rb'))
+ it.load(state)
+ # Resume from the 11th element
+ for i, x in enumerate(it):
+     ...
+ ```
+
+ - `it` here can be a dataset or batch iterator object.
+ - `state()` returns a picklable dictionary, which can be loaded by `it.load()` to resume the iteration.
+ - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a
+ counting pointer in `Dataset`. For dataset implementations that manage iteration with internal buffering, such
+ as `Shuffle`, `Sort` and `BucketIterator`, `load()` will lose the content of the buffer.
+
+6. Extend the dataset:
+
+ - You can refer to the implementation of `TextLine` to customise your own dataset; a composition-based alternative is sketched below.
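+
+ Even without subclassing, a custom data source can often be composed from the documented
+ primitives. A minimal sketch (`train.jsonl` is a placeholder file name):
+
+ ```python
+ import json
+
+ from lunas import TextLine
+
+ # Parse a JSON-lines corpus by composing TextLine with map.
+ ds = TextLine('train.jsonl').map(lambda line: json.loads(line))
+ ```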
+
+## Known issues
+
+1. Parallel processing is not yet supported due to Python's limited support for parallelization.
+
+ Multi-threading can be helpful for resource-intensive data-loading operations, but not for CPU-intensive data
+ processing. While multi-processing facilitates CPU-intensive scenarios, it comes with a few limitations
+ that further complicate the use of the library.
+
+ Although this makes no difference to the lunas APIs themselves, users have to pay extra attention to make
+ multi-processing work correctly. For example, multi-processing accepts neither lambda expressions nor any other
+ unpicklable objects as arguments. A more severe problem is that once a child process terminates with certain fatal
+ errors (for example, a segmentation fault), the parent process is never notified of the child's termination.
+ Tracking the states of child processes therefore requires extra effort that the standard `multiprocessing`
+ library does not provide.
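+
+ As a quick illustration of the pickling limitation (standard Python behaviour, not specific to lunas):
+
+ ```python
+ import multiprocessing as mp
+
+ if __name__ == '__main__':
+     with mp.Pool(2) as pool:
+         # Lambdas are not picklable, so this raises a PicklingError.
+         pool.map(lambda x: 2 * x, range(4))
+ ```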
+
+ We are likely to opt for a C++-based implementation of the parallelisation features, just as TensorFlow did.
+
+2. `Stdin` dataset cannot be used in a multiprocessing context.
+
+ multiprocessing can mess up the standard input, since /dev/stdin cannot be distributed to multiple processes by any
+ trivial implementation. Furthermore, there seems to be little practical need to spread stdin across multiple
+ processes, so the problem is simply left aside.
+
+
+
+
+%package help
+Summary: Development documents and examples for Lunas
+Provides: python3-Lunas-doc
+%description help
+# Lunas
+
+[![PyPI version](https://img.shields.io/badge/pypi-v0.5.1-limegreen.svg)](https://github.com/pluiez/lunas)
+
+**Lunas** is a Python-based library that mimics TensorFlow's `dataset` API, as well as its logic, to build data
+processing pipelines for arbitrary datasets.
+
+The implementation mostly draws on TensorFlow but in a simplified and pure-Python fashion.
+
+## License
+
+This project uses the [MIT](LICENSE) license.
+
+## Features
+
+A `Dataset` represents a dataset and optionally holds custom operations on dataset elements.
+
+Operations are evaluated lazily, trading some speed for a smaller memory footprint.
+
+### Datasets
+
+Currently the following datasets are supported:
+
+1. `TextLine`: iterates through a text file in read mode line by line.
+2. `Stdin`: wraps the standard input as a dataset.
+3. `Array`: wraps an iterable object as a dataset.
+4. `Range`: wraps a range of integers as a dataset, simulating builtin `range`.
+5. `Enumerate`: wraps a dataset with index for each element, simulating builtin `enumerate`.
+6. `Zip`: wraps multiple datasets as one dataset and supports custom padding for varying-sized datasets.
+7. `Concat`: concatenates multiple datasets as one dataset.
+8. `Group`: groups several samples together.
+9. `Flatten`: flattens a sample into multiple samples.
+10. `Glob`: wraps the standard `glob.glob` as a dataset.
+11. `Map`: transforms elements by a given mapping function.
+12. `Where`: filters elements by a given predicate function.
+13. `Repeat`: repeats the dataset for multiple epochs.
+14. `Interleave`: maps a dataset into multiple datasets and interleaves between them.
+15. `Shuffle`: shuffles a dataset using a buffer for memory-efficient randomisation.
+16. `Sort`: sorts the dataset.
+17. `Slice`: slices the dataset.
+18. `Shard`: shards the dataset into different partitions.
+19. `Sampling`: draws samples from several datasets given a sampling distribution.
+
+Additionally, chaining-style invocation is available for the following datasets:
+`Map`, `Where`, `Repeat`, `Shard`, `Shuffle`, `Sort`, `Slice`, `Enumerate`, `Group`, `Flatten` and `Concat`.
+
+For example, a dataset can invoke the following to create a new dataset:
+
+```python
+import lunas
+
+ds = lunas.Range(100) \
+    .map(lambda x: 2 * x) \
+    .where(lambda x: x < 50) \
+    .shuffle(buffer_size=100)
+
+print(list(ds))
+```
+
+### Batch Iterators
+
+Batch iterators generate batches from a given dataset; the following are currently included:
+
+1. `ConstantIterator`: generates batches with a constant number of samples.
+2. `BucketIterator`: generates varying-sized batches with sample size determined by a custom function.
+3. `DataLoader`: wraps PyTorch's `torch.utils.data.DataLoader` to provide multiprocessing data-loading features.
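+
+For instance, a minimal sketch of batching with `ConstantIterator` (the constructor
+arguments shown are assumptions for illustration, not the documented signature):
+
+```python
+from lunas import ConstantIterator, Range
+
+ds = Range(100)
+# Hypothetical arguments: the source dataset and a fixed number of samples per batch.
+batch_itr = ConstantIterator(ds, batch_size=16)
+for batch in batch_itr:
+    ...  # assuming each batch is a collection of 16 consecutive samples
+```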
+
+### Persistence
+
+Both datasets and batch iterators support persistence via the `state()` and `load()` interfaces:
+`state()` takes a checkpoint of the current iteration state, while `load()` restores the iteration state from a given
+checkpoint.
+
+## Requirements
+
+- Python >= 3.7
+- numpy
+- PyTorch >= 1.5.0
+
+## Installation
+
+Install using pip:
+
+```shell
+pip install -U lunas
+```
+
+## Basics
+
+1. Create a dataset and iterate through it:
+
+ ```python
+ from lunas import Range
+
+ ds = Range(1000).shuffle(buffer_size=100)
+ for x in ds:  # epoch 1
+     print(x)
+ for x in ds:  # epoch 2
+     print(x)
+
+ ds = Range(1000).shuffle(buffer_size=100).repeat(2)
+ for x in ds:  # 2 epochs
+     print(x)
+ ```
+
+ - A dataset can be scanned through for several epochs.
+ - `Dataset.shuffle()` performs buffered shuffling. The shuffling does not happen immediately at dataset creation,
+ but begins when an element is first accessed from the dataset.
+ - Alternatively, `Dataset.repeat(2)` creates another dataset that iterates through the original dataset twice.
+
+2. Build a data processing pipeline:
+
+ ```python
+ from lunas import *
+ ds = Range(10).map(lambda x: x * 2).where(lambda x: x % 2 == 0)
+ ```
+
+ - The chained calls on a `Dataset` object define a processing pipeline over the original dataset.
+
+3. Deal with multiple data sources:
+
+ ```python
+ from lunas import *
+
+ ds1 = Range(10)
+ ds2 = Range(start=10, stop=20, step=1)
+ ds = Zip([ds1, ds2]).map(lambda x, y: (x + y), unpack_args=True)
+
+ ds3 = Range(10)
+ ds4 = Range(100)
+ ds5 = Range(1000)
+ ds = Zip([ds3, ds4, ds5], mode='>', padding=True).map(lambda x, y, z: (x + y + z), unpack_args=True)
+ ```
+
+ - The two datasets here are zipped into a `Zip` dataset, which returns a tuple drawn from its child datasets,
+ that is, `ds1` and `ds2`.
+
+ - By default, `Zip` strictly requires the datasets to be aligned in size. It also allows zipping multiple datasets
+ of different sizes, with the additional `mode` and `padding` arguments indicating whether to pad the smaller
+ datasets or truncate the bigger ones.
+
+4. Example usage in a more involved case, distributed multilingual language modeling training:
+
+ ```python
+ from lunas import *
+
+
+ corpus_paths = ['train.zh', 'train.en', 'train.ru']
+ sampling_weights = [0.3, 0.4, 0.3]
+
+ # Shards a dataset so that each worker holds a unique shard of the original corpus.
+ # Sharding should be done before shuffling to avoid unnecessary shuffling efforts in each worker.
+ datasets = []
+ for corpus in corpus_paths:
+     ds = TextLine(corpus) \
+         .shard(dist_world_size, dist_local_rank) \
+         .shuffle(buffer_size=10000)
+     # Tokenize plain text into token ids
+     ds = ds.map(lambda x: {'input': tokenizer.tokenize(x)})
+     # Group 128 consecutive samples together, then concat and split each group
+     # into samples of the same length to reduce padding. Finally, flatten the
+     # groups back into separate samples.
+     ds = ds.group(group_size=128) \
+         .map(lambda xs: concat_and_split(xs, target_length=1024)) \
+         .flatten()
+     datasets.append(ds)
+
+ # Define a sampling strategy over the datasets
+ ds = Sampling(datasets, sampling_weights, virtual_size=1000000)
+
+ batch_itr = BucketIterator(
+     ds,
+     # each batch has at most 4096 tokens
+     batch_size=4096,
+     # the size of each sample is measured as its number of tokens
+     get_length_fn=lambda x: len(x),
+     bucket_boundaries=get_bucket_boundaries()
+ )
+
+ dataloader = DataLoader(
+     batch_itr,
+     num_workers=6,
+     collate_fn=collate_fn,
+ )
+
+ for epoch in range(max_epoch):
+     for batch in dataloader:
+         ...
+ ```
+
+5. Resume iteration:
+
+ ```python
+ import pickle
+
+ # Stop after the first 10 elements
+ for i, x in enumerate(it):
+     if i == 10:
+         break
+ pickle.dump(it.state(), open('state.pkl', 'wb'))
+ # ...
+ state = pickle.load(open('state.pkl', 'rb'))
+ it.load(state)
+ # Resume from the 11th element
+ for i, x in enumerate(it):
+     ...
+ ```
+
+ - `it` here can be a dataset or batch iterator object.
+ - `state()` returns a picklable dictionary, which can be loaded by `it.load()` to resume the iteration.
+ - lunas provides limited support for resumable iteration. Specifically, the iteration state is maintained by a
+ counting pointer in `Dataset`. For dataset implementations that manage iteration with internal buffering, such
+ as `Shuffle`, `Sort` and `BucketIterator`, `load()` will lose the content of the buffer.
+
+6. Extend the dataset:
+
+ - You can refer to the implementation of `TextLine` to customise your own dataset; a composition-based alternative is sketched below.
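+
+ Even without subclassing, a custom data source can often be composed from the documented
+ primitives. A minimal sketch (`train.jsonl` is a placeholder file name):
+
+ ```python
+ import json
+
+ from lunas import TextLine
+
+ # Parse a JSON-lines corpus by composing TextLine with map.
+ ds = TextLine('train.jsonl').map(lambda line: json.loads(line))
+ ```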
+
+## Known issues
+
+1. Parallel processing is not yet supported due to Python's limited support for parallelization.
+
+ Multi-threading can be helpful for resource-intensive data-loading operations, but not for CPU-intensive data
+ processing. While multi-processing facilitates CPU-intensive scenarios, it comes with a few limitations
+ that further complicate the use of the library.
+
+ Although this makes no difference to the lunas APIs themselves, users have to pay extra attention to make
+ multi-processing work correctly. For example, multi-processing accepts neither lambda expressions nor any other
+ unpicklable objects as arguments. A more severe problem is that once a child process terminates with certain fatal
+ errors (for example, a segmentation fault), the parent process is never notified of the child's termination.
+ Tracking the states of child processes therefore requires extra effort that the standard `multiprocessing`
+ library does not provide.
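+
+ As a quick illustration of the pickling limitation (standard Python behaviour, not specific to lunas):
+
+ ```python
+ import multiprocessing as mp
+
+ if __name__ == '__main__':
+     with mp.Pool(2) as pool:
+         # Lambdas are not picklable, so this raises a PicklingError.
+         pool.map(lambda x: 2 * x, range(4))
+ ```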
+
+ We are likely to opt for a C++-based implementation of the parallelisation features, just as TensorFlow did.
+
+2. `Stdin` dataset cannot be used in a multiprocessing context.
+
+ multiprocessing can mess up the standard input, since /dev/stdin cannot be distributed to multiple processes by any
+ trivial implementation. Furthermore, there seems to be little practical need to spread stdin across multiple
+ processes, so the problem is simply left aside.
+
+
+
+
+%prep
+%autosetup -n Lunas-0.5.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-Lunas -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.1-1
+- Package Spec generated