-rw-r--r--  .gitignore                  1
-rw-r--r--  python-youtokentome.spec  994
-rw-r--r--  sources                     1
3 files changed, 996 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..fbc7335 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/youtokentome-1.0.6.tar.gz
diff --git a/python-youtokentome.spec b/python-youtokentome.spec
new file mode 100644
index 0000000..5b960b0
--- /dev/null
+++ b/python-youtokentome.spec
@@ -0,0 +1,994 @@
+%global _empty_manifest_terminate_build 0
+Name: python-youtokentome
+Version: 1.0.6
+Release: 1
+Summary: Unsupervised text tokenizer focused on computational efficiency
+License: MIT
+URL: https://github.com/vkcom/youtokentome
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/9a/ae/f8b0d15696766eb35dda6cf84a23d42ae7f3ba37aa30e5e2287fd94ac053/youtokentome-1.0.6.tar.gz
+BuildArch: noarch
+
+Requires: python3-Click
+
+%description
+
+![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
+[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
+![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
+[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)
+
+# YouTokenToMe
+
+YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
+Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
+ and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster.
+ Check out our [benchmark](benchmark.md) results.
+
+Key advantages:
+
+* Multithreading for training and tokenization
+* The algorithm has `O(N)` complexity, where `N` is the length of training data
+* Highly efficient implementation in C++
+* Python wrapper and command-line interface
+
+Extra features:
+* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))
+
+As in the algorithm from the original paper, ours does not consider tokens
+that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
+
+For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into
+
+`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
+
+## Installation
+
+```bash
+pip install youtokentome
+```
+## Python interface
+
+### Example
+Let's start with a self-contained example.
+
+```python
+import random
+
+import youtokentome as yttm
+
+train_data_path = "train_data.txt"
+model_path = "example.model"
+
+# Generating random file with training data
+# 10000 lines with 100 characters in each line
+n_lines = 10000
+n_characters = 100
+with open(train_data_path, "w") as fout:
+    for _ in range(n_lines):
+        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)
+
+# Generating random text
+test_text = "".join([random.choice("abcde ") for _ in range(100)])
+
+# Training model
+yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)
+
+# Loading model
+bpe = yttm.BPE(model=model_path)
+
+# Two types of tokenization
+print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
+print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
+```
+
+&nbsp;
+### Training model
+```python
+youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
+```
+Trains a BPE model and saves it to a file.
+
+**Args:**
+
+* `data`: string, path to file with training data
+* `model`: string, path to where the trained model will be saved
+* `vocab_size`: int, number of tokens in the final vocabulary
+* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
+* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see [benchmark](benchmark.md#number-of-threads)).
+* `pad_id`: int, reserved id for padding
+* `unk_id`: int, reserved id for unknown symbols
+* `bos_id`: int, reserved id for begin of sentence token
+* `eos_id`: int, reserved id for end of sentence token
+
+**Returns:** Class `youtokentome.BPE` with the loaded model.
+
+
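+A sketch of a full call, reusing the paths from the example above (the
+parameter values are illustrative, not recommendations):
+
+```python
+import youtokentome as yttm
+
+yttm.BPE.train(
+    data="train_data.txt",   # one sentence per line
+    model="example.model",   # where the trained model is written
+    vocab_size=5000,
+    coverage=0.9999,         # let the rarest characters fall out of the alphabet
+    n_threads=-1,            # all available threads, capped at 8
+    pad_id=0,
+    unk_id=1,
+    bos_id=2,
+    eos_id=3,
+)
+```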
+&nbsp;
+
+### Model loading
+
+```python
+youtokentome.BPE(model, n_threads=-1)
+```
+
+Class constructor. Loads the trained model.
+
+* `model`: string, path to the trained model
+* `n_threads`: int, number of parallel threads used to run.
+ If equal to -1, then the maximum number of threads available will be used.
+
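+For example, loading the model trained above with an explicit thread count:
+
+```python
+import youtokentome as yttm
+
+bpe = yttm.BPE(model="example.model", n_threads=4)  # 4 worker threads
+```
+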
+&nbsp;
+
+### Methods
+Class `youtokentome.BPE` has the following methods:
+#### encode
+```python
+encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
+```
+
+**Args:**
+
+* `sentences`: list of strings, sentences for tokenization.
+* `output_type`: enum, sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
+* `bos`: bool, if True then token “beginning of sentence” will be added
+* `eos`: bool, if True then token “end of sentence” will be added
+* `reverse`: bool, if True the output sequence of tokens will be reversed
+* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].
+
+
+**Returns:** A list of lists of integers if `output_type` is `youtokentome.OutputType.ID`,
+or a list of lists of strings if it is `youtokentome.OutputType.SUBWORD`.
+
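+A sketch of these options, assuming the `bpe` model loaded above:
+
+```python
+sentences = ["Blazingly fast tokenization!"]
+
+# Ids with 'begin of sentence' and 'end of sentence' tokens added
+ids = bpe.encode(sentences, output_type=yttm.OutputType.ID, bos=True, eos=True)
+
+# Subwords with BPE-dropout: each merge is skipped with probability 0.1,
+# so repeated calls may segment the same sentence differently
+subwords = bpe.encode(sentences, output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1)
+```
+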
+&nbsp;
+#### vocab
+
+```python
+vocab(self)
+```
+
+**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds
+to the i-th subword.
+
+&nbsp;
+#### vocab_size
+
+```python
+vocab_size(self)
+```
+
+**Returns:** int. Size of vocabulary.
+
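+A sketch combining `vocab` and `vocab_size`, for the model loaded above:
+
+```python
+vocab = bpe.vocab()
+assert len(vocab) == bpe.vocab_size()
+print(vocab[:10])  # the index of a subword in this list is its id
+```
+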
+&nbsp;
+#### subword_to_id
+
+```python
+subword_to_id(self, subword)
+```
+**Args:**
+* `subword`: string.
+
+**Returns:**
+Integer from the range [0, vocab_size-1]: the id of the subword. If the
+subword is not present in the vocabulary, `unk_id` is returned.
+
+&nbsp;
+#### id_to_subword
+
+```python
+id_to_subword(self, id)
+```
+**Args:**
+* `id`: int, must be in the range [0, vocab_size-1]
+
+**Returns:** string. The subword corresponding to the given id.
+
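+`subword_to_id` and `id_to_subword` are inverses of each other; a small
+round-trip sketch (the subword here is hypothetical):
+
+```python
+subword_id = bpe.subword_to_id("▁fast")  # unk_id if '▁fast' was never learned
+print(bpe.id_to_subword(subword_id))
+```
+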
+&nbsp;
+#### decode
+```python
+decode(self, ids, ignore_ids=None)
+```
+Converts each id back to its subword and concatenates the results, restoring the space symbols.
+
+**Args:**
+
+ * `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
+ * `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]
+
+
+**Returns:** List of strings.
+
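+A sketch of an encode/decode round trip that strips the special tokens
+(ids 0-3 are the reserved pad/unk/bos/eos ids chosen at training time above):
+
+```python
+ids = bpe.encode(["Blazingly fast tokenization!"], output_type=yttm.OutputType.ID,
+                 bos=True, eos=True)
+texts = bpe.decode(ids, ignore_ids=[0, 1, 2, 3])  # drop pad/unk/bos/eos ids
+print(texts)  # the original text, assuming the model covers these characters
+```
+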
+## Command line interface
+
+### Example
+
+```bash
+$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
+$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
+```
+
+
+### Supported commands
+
+`YouTokenToMe` supports the following commands:
+
+```
+$ yttm --help
+
+Usage: yttm [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  bpe     Train BPE model.
+  decode  Decode ids to text.
+  encode  Encode text to ids or subwords.
+  vocab   Print list of learned subwords.
+```
+
+The `bpe` command trains a Byte Pair Encoding model on a text file.
+
+```
+$ yttm bpe --help
+
+Usage: yttm bpe [OPTIONS]
+
+ Train BPE model.
+
+Options:
+  --data PATH           Training data file path.  [required]
+  --model PATH          Output model file path.  [required]
+  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
+  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --pad_id INTEGER      Padding token id.  [default: 0]
+  --unk_id INTEGER      Unknown token id.  [default: 1]
+  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
+  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
+  --help                Show this message and exit.
+```
+
+
+The `encode` command applies BPE encoding to a corpus of sentences, reading from `stdin` and writing to `stdout`.
+
+By default, encoding runs in parallel using `n_threads` threads. The number of threads is capped
+at 8 (see [benchmark](benchmark.md#number-of-threads)).
+
+With the `--stream` option, `--n_threads` is ignored and sentences are processed one by one:
+each sentence is tokenized and written to `stdout` before the next one is read.
+
+
+```
+$ yttm encode --help
+
+Usage: yttm encode [OPTIONS]
+
+ Encode text to ids or subwords.
+
+Options:
+  --model PATH          Path to file with learned model.  [required]
+  --output_type TEXT    'id' or 'subword'.  [required]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --bos                 Add 'begin of sentence' token.
+  --eos                 Add 'end of sentence' token.
+  --reverse             Reverse output sequence of tokens.
+  --stream              Process each line before reading the next one.
+  --dropout_prob FLOAT  BPE-dropout probability (the probability of a merge being dropped).  [default: 0]
+  --help                Show this message and exit.
+```
+
+The `vocab` command prints the vocabulary, which can be useful for understanding the model.
+
+```
+$ yttm vocab --help
+
+Usage: yttm vocab [OPTIONS]
+
+ Print list of learned subwords.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --verbose     Add merging rules.
+  --help        Show this message and exit.
+```
+
+The `decode` command converts ids back to text. It reads from `stdin` and writes to `stdout`.
+
+```
+$ yttm decode --help
+
+Usage: yttm decode [OPTIONS]
+
+ Decode ids to text.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
+  --help        Show this message and exit.
+```
+
+
+
+
+
+
+
+
+
+
+
+%package -n python3-youtokentome
+Summary: Unsupervised text tokenizer focused on computational efficiency
+Provides: python-youtokentome
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-youtokentome
+
+![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
+[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
+![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
+[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)
+
+# YouTokenToMe
+
+YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
+Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
+ and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster.
+ Check out our [benchmark](benchmark.md) results.
+
+Key advantages:
+
+* Multithreading for training and tokenization
+* The algorithm has `O(N)` complexity, where `N` is the length of training data
+* Highly efficient implementation in C++
+* Python wrapper and command-line interface
+
+Extra features:
+* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))
+
+As in the algorithm from the original paper, ours does not consider tokens
+that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
+
+For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into
+
+`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
+
+## Installation
+
+```bash
+pip install youtokentome
+```
+## Python interface
+
+### Example
+Let's start with a self-contained example.
+
+```python
+import random
+
+import youtokentome as yttm
+
+train_data_path = "train_data.txt"
+model_path = "example.model"
+
+# Generating random file with training data
+# 10000 lines with 100 characters in each line
+n_lines = 10000
+n_characters = 100
+with open(train_data_path, "w") as fout:
+    for _ in range(n_lines):
+        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)
+
+# Generating random text
+test_text = "".join([random.choice("abcde ") for _ in range(100)])
+
+# Training model
+yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)
+
+# Loading model
+bpe = yttm.BPE(model=model_path)
+
+# Two types of tokenization
+print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
+print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
+```
+
+&nbsp;
+### Training model
+```python
+youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
+```
+Trains a BPE model and saves it to a file.
+
+**Args:**
+
+* `data`: string, path to file with training data
+* `model`: string, path to where the trained model will be saved
+* `vocab_size`: int, number of tokens in the final vocabulary
+* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
+* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see [benchmark](benchmark.md#number-of-threads)).
+* `pad_id`: int, reserved id for padding
+* `unk_id`: int, reserved id for unknown symbols
+* `bos_id`: int, reserved id for begin of sentence token
+* `eos_id`: int, reserved id for end of sentence token
+
+**Returns:** Class `youtokentome.BPE` with the loaded model.
+
+
+&nbsp;
+
+### Model loading
+
+```python
+youtokentome.BPE(model, n_threads=-1)
+```
+
+Class constructor. Loads the trained model.
+
+* `model`: string, path to the trained model
+* `n_threads`: int, number of parallel threads used to run.
+ If equal to -1, then the maximum number of threads available will be used.
+
+&nbsp;
+
+### Methods
+Class `youtokentome.BPE` has the following methods:
+#### encode
+```python
+encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
+```
+
+**Args:**
+
+* `sentences`: list of strings, sentences for tokenization.
+* `output_type`: enum, sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
+* `bos`: bool, if True then token “beginning of sentence” will be added
+* `eos`: bool, if True then token “end of sentence” will be added
+* `reverse`: bool, if True the output sequence of tokens will be reversed
+* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].
+
+
+**Returns:** A list of lists of integers if `output_type` is `youtokentome.OutputType.ID`,
+or a list of lists of strings if it is `youtokentome.OutputType.SUBWORD`.
+
+&nbsp;
+#### vocab
+
+```python
+vocab(self)
+```
+
+**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds
+to the i-th subword.
+
+&nbsp;
+#### vocab_size
+
+```python
+vocab_size(self)
+```
+
+**Returns:** int. Size of vocabulary.
+
+&nbsp;
+#### subword_to_id
+
+```python
+subword_to_id(self, subword)
+```
+**Args:**
+* `subword`: string.
+
+**Returns:**
+Integer from the range [0, vocab_size-1]: the id of the subword. If the
+subword is not present in the vocabulary, `unk_id` is returned.
+
+&nbsp;
+#### id_to_subword
+
+```python
+id_to_subword(self, id)
+```
+**Args:**
+* `id`: int, must be in the range [0, vocab_size-1]
+
+**Returns:** string. The subword corresponding to the given id.
+
+&nbsp;
+#### decode
+```python
+decode(self, ids, ignore_ids=None)
+```
+Converts each id back to its subword and concatenates the results, restoring the space symbols.
+
+**Args:**
+
+ * `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
+ * `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]
+
+
+**Returns:** List of strings.
+
+## Command line interface
+
+### Example
+
+```bash
+$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
+$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
+```
+
+
+### Supported commands
+
+`YouTokenToMe` supports the following commands:
+
+```
+$ yttm --help
+
+Usage: yttm [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  bpe     Train BPE model.
+  decode  Decode ids to text.
+  encode  Encode text to ids or subwords.
+  vocab   Print list of learned subwords.
+```
+
+The `bpe` command trains a Byte Pair Encoding model on a text file.
+
+```
+$ yttm bpe --help
+
+Usage: yttm bpe [OPTIONS]
+
+ Train BPE model.
+
+Options:
+  --data PATH           Training data file path.  [required]
+  --model PATH          Output model file path.  [required]
+  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
+  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --pad_id INTEGER      Padding token id.  [default: 0]
+  --unk_id INTEGER      Unknown token id.  [default: 1]
+  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
+  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
+  --help                Show this message and exit.
+```
+
+
+The `encode` command applies BPE encoding to a corpus of sentences, reading from `stdin` and writing to `stdout`.
+
+By default, encoding runs in parallel using `n_threads` threads. The number of threads is capped
+at 8 (see [benchmark](benchmark.md#number-of-threads)).
+
+With the `--stream` option, `--n_threads` is ignored and sentences are processed one by one:
+each sentence is tokenized and written to `stdout` before the next one is read.
+
+
+```
+$ yttm encode --help
+
+Usage: yttm encode [OPTIONS]
+
+ Encode text to ids or subwords.
+
+Options:
+  --model PATH          Path to file with learned model.  [required]
+  --output_type TEXT    'id' or 'subword'.  [required]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --bos                 Add 'begin of sentence' token.
+  --eos                 Add 'end of sentence' token.
+  --reverse             Reverse output sequence of tokens.
+  --stream              Process each line before reading the next one.
+  --dropout_prob FLOAT  BPE-dropout probability (the probability of a merge being dropped).  [default: 0]
+  --help                Show this message and exit.
+```
+
+The `vocab` command prints the vocabulary, which can be useful for understanding the model.
+
+```
+$ yttm vocab --help
+
+Usage: yttm vocab [OPTIONS]
+
+ Print list of learned subwords.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --verbose     Add merging rules.
+  --help        Show this message and exit.
+```
+
+The `decode` command converts ids back to text. It reads from `stdin` and writes to `stdout`.
+
+```
+$ yttm decode --help
+
+Usage: yttm decode [OPTIONS]
+
+ Decode ids to text.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
+  --help        Show this message and exit.
+```
+
+
+
+
+
+
+
+
+
+
+
+%package help
+Summary: Development documents and examples for youtokentome
+Provides: python3-youtokentome-doc
+%description help
+
+![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
+[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
+![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
+[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)
+
+# YouTokenToMe
+
+YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
+Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
+ and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster.
+ Check out our [benchmark](benchmark.md) results.
+
+Key advantages:
+
+* Multithreading for training and tokenization
+* The algorithm has `O(N)` complexity, where `N` is the length of training data
+* Highly efficient implementation in C++
+* Python wrapper and command-line interface
+
+Extra features:
+* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))
+
+As in the algorithm from the original paper, ours does not consider tokens
+that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
+
+For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into
+
+`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
+
+## Installation
+
+```bash
+pip install youtokentome
+```
+## Python interface
+
+### Example
+Let's start with a self-contained example.
+
+```python
+import random
+
+import youtokentome as yttm
+
+train_data_path = "train_data.txt"
+model_path = "example.model"
+
+# Generating random file with training data
+# 10000 lines with 100 characters in each line
+n_lines = 10000
+n_characters = 100
+with open(train_data_path, "w") as fout:
+    for _ in range(n_lines):
+        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)
+
+# Generating random text
+test_text = "".join([random.choice("abcde ") for _ in range(100)])
+
+# Training model
+yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)
+
+# Loading model
+bpe = yttm.BPE(model=model_path)
+
+# Two types of tokenization
+print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
+print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
+```
+
+&nbsp;
+### Training model
+```python
+youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
+```
+Trains a BPE model and saves it to a file.
+
+**Args:**
+
+* `data`: string, path to file with training data
+* `model`: string, path to where the trained model will be saved
+* `vocab_size`: int, number of tokens in the final vocabulary
+* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
+* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see [benchmark](benchmark.md#number-of-threads)).
+* `pad_id`: int, reserved id for padding
+* `unk_id`: int, reserved id for unknown symbols
+* `bos_id`: int, reserved id for begin of sentence token
+* `eos_id`: int, reserved id for end of sentence token
+
+**Returns:** Class `youtokentome.BPE` with the loaded model.
+
+
+&nbsp;
+
+### Model loading
+
+```python
+youtokentome.BPE(model, n_threads=-1)
+```
+
+Class constructor. Loads the trained model.
+
+* `model`: string, path to the trained model
+* `n_threads`: int, number of parallel threads used to run.
+ If equal to -1, then the maximum number of threads available will be used.
+
+&nbsp;
+
+### Methods
+Class `youtokentome.BPE` has the following methods:
+#### encode
+```python
+encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
+```
+
+**Args:**
+
+* `sentences`: list of strings, sentences for tokenization.
+* `output_type`: enum, sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
+* `bos`: bool, if True then token “beginning of sentence” will be added
+* `eos`: bool, if True then token “end of sentence” will be added
+* `reverse`: bool, if True the output sequence of tokens will be reversed
+* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].
+
+
+**Returns:** A list of lists of integers if `output_type` is `youtokentome.OutputType.ID`,
+or a list of lists of strings if it is `youtokentome.OutputType.SUBWORD`.
+
+&nbsp;
+#### vocab
+
+```python
+vocab(self)
+```
+
+**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds
+to the i-th subword.
+
+&nbsp;
+#### vocab_size
+
+```python
+vocab_size(self)
+```
+
+**Returns:** int. Size of vocabulary.
+
+&nbsp;
+#### subword_to_id
+
+```python
+subword_to_id(self, subword)
+```
+**Args:**
+* `subword`: string.
+
+**Returns:**
+Integer from the range [0, vocab_size-1]: the id of the subword. If the
+subword is not present in the vocabulary, `unk_id` is returned.
+
+&nbsp;
+#### id_to_subword
+
+```python
+id_to_subword(self, id)
+```
+**Args:**
+* `id`: int, must be in the range [0, vocab_size-1]
+
+**Returns:** string. The subword corresponding to the given id.
+
+&nbsp;
+#### decode
+```python
+decode(self, ids, ignore_ids=None)
+```
+Converts each id back to its subword and concatenates the results, restoring the space symbols.
+
+**Args:**
+
+ * `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
+ * `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]
+
+
+**Returns:** List of strings.
+
+## Command line interface
+
+### Example
+
+```bash
+$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
+$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
+```
+
+
+### Supported commands
+
+`YouTokenToMe` supports the following commands:
+
+```
+$ yttm --help
+
+Usage: yttm [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  bpe     Train BPE model.
+  decode  Decode ids to text.
+  encode  Encode text to ids or subwords.
+  vocab   Print list of learned subwords.
+```
+
+The `bpe` command trains a Byte Pair Encoding model on a text file.
+
+```
+$ yttm bpe --help
+
+Usage: yttm bpe [OPTIONS]
+
+ Train BPE model.
+
+Options:
+  --data PATH           Training data file path.  [required]
+  --model PATH          Output model file path.  [required]
+  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
+  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --pad_id INTEGER      Padding token id.  [default: 0]
+  --unk_id INTEGER      Unknown token id.  [default: 1]
+  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
+  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
+  --help                Show this message and exit.
+```
+
+
+The `encode` command applies BPE encoding to a corpus of sentences, reading from `stdin` and writing to `stdout`.
+
+By default, encoding runs in parallel using `n_threads` threads. The number of threads is capped
+at 8 (see [benchmark](benchmark.md#number-of-threads)).
+
+With the `--stream` option, `--n_threads` is ignored and sentences are processed one by one:
+each sentence is tokenized and written to `stdout` before the next one is read.
+
+
+```
+$ yttm encode --help
+
+Usage: yttm encode [OPTIONS]
+
+ Encode text to ids or subwords.
+
+Options:
+  --model PATH          Path to file with learned model.  [required]
+  --output_type TEXT    'id' or 'subword'.  [required]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --bos                 Add 'begin of sentence' token.
+  --eos                 Add 'end of sentence' token.
+  --reverse             Reverse output sequence of tokens.
+  --stream              Process each line before reading the next one.
+  --dropout_prob FLOAT  BPE-dropout probability (the probability of a merge being dropped).  [default: 0]
+  --help                Show this message and exit.
+```
+
+The `vocab` command prints the vocabulary, which can be useful for understanding the model.
+
+```
+$ yttm vocab --help
+
+Usage: yttm vocab [OPTIONS]
+
+ Print list of learned subwords.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --verbose     Add merging rules.
+  --help        Show this message and exit.
+```
+
+The `decode` command converts ids back to text. It reads from `stdin` and writes to `stdout`.
+
+```
+$ yttm decode --help
+
+Usage: yttm decode [OPTIONS]
+
+ Decode ids to text.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
+  --help        Show this message and exit.
+```
+
+
+
+
+
+
+
+
+
+
+
+%prep
+%autosetup -n youtokentome-1.0.6
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-youtokentome -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Apr 11 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.6-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..d3bf1e1
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+2b892f24fe358d5868b8324efea288ae youtokentome-1.0.6.tar.gz