author:    CoprDistGit <infra@openeuler.org>  2023-04-11 08:48:42 +0000
committer: CoprDistGit <infra@openeuler.org>  2023-04-11 08:48:42 +0000
commit:    233980910ec6cbc862fdd73b8aeda7bd6cd5fe63
tree:      59c6c82582a4f09fdec3127bc56cd09023eec05b
parent:    132ec4e9f645d34eb744e1342762456c6a3e6e39
automatic import of python-youtokentome
-rw-r--r--  .gitignore                  1
-rw-r--r--  python-youtokentome.spec  994
-rw-r--r--  sources                     1

3 files changed, 996 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+/youtokentome-1.0.6.tar.gz

diff --git a/python-youtokentome.spec b/python-youtokentome.spec
new file mode 100644
index 0000000..5b960b0
--- /dev/null
+++ b/python-youtokentome.spec
@@ -0,0 +1,994 @@
%global _empty_manifest_terminate_build 0
Name:           python-youtokentome
Version:        1.0.6
Release:        1
Summary:        Unsupervised text tokenizer focused on computational efficiency
License:        MIT
URL:            https://github.com/vkcom/youtokentome
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/9a/ae/f8b0d15696766eb35dda6cf84a23d42ae7f3ba37aa30e5e2287fd94ac053/youtokentome-1.0.6.tar.gz
BuildArch:      noarch

Requires:       python3-Click

%description
[Downloads](https://pepy.tech/project/youtokentome) · [Code style: black](https://github.com/python/black) · [Build status: Travis CI](https://travis-ci.org/VKCOM/YouTokenToMe)

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)]. Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE), and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster. Check out our [benchmark](benchmark.md) results.

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of the training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:

* BPE-dropout (as described in [Provilkov et al., 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text with word boundaries restored.

For example, the phrase `Blazingly fast tokenization!` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`

## Installation

```bash
pip install youtokentome
```

## Python interface

### Example

Let's start with a self-contained example.

```python
import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generate a random file with training data:
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generate a random test text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Train the model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Load the model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```
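The example above encodes but never decodes. As a follow-up, here is a minimal sketch reusing `bpe` and `test_text` from the snippet: `decode` maps the id lists back to strings. Note that the training lines contain only the alphabet "abcd " while `test_text` also contains "e", so the unseen character is encoded as `unk_id` and the round trip is not character-exact.

```python
# Continuation of the example above (same `bpe` and `test_text`).
ids = bpe.encode([test_text], output_type=yttm.OutputType.ID)

# decode() maps each id back to its subword and restores spaces
# from the "▁" meta symbol; it returns one string per input list.
decoded = bpe.decode(ids)
print(decoded[0])
# "e" never occurs in the training data, so it round-trips as the
# unknown-token placeholder rather than the original character.
```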
### Training model

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```

Trains a BPE model and saves it to file.

**Args:**

* `data`: string, path to the file with training data
* `model`: string, path where the trained model will be saved
* `vocab_size`: int, number of tokens in the final vocabulary
* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
* `n_threads`: int, number of parallel threads to use. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).
* `pad_id`: int, reserved id for padding
* `unk_id`: int, reserved id for unknown symbols
* `bos_id`: int, reserved id for the begin-of-sentence token
* `eos_id`: int, reserved id for the end-of-sentence token

**Returns**: Class `youtokentome.BPE` with the loaded model.

### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads to use. If equal to -1, the maximum number of available threads is used.

### Methods

Class `youtokentome.BPE` has the following methods:

#### encode

```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```

**Args:**

* `sentences`: list of strings, sentences for tokenization
* `output_type`: enum, a sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
* `bos`: bool, if True, the begin-of-sentence token is added
* `eos`: bool, if True, the end-of-sentence token is added
* `reverse`: bool, if True, the output sequence of tokens is reversed
* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

**Returns:** A list of lists of integers if `output_type` is `youtokentome.OutputType.ID`, or a list of lists of strings if it is `youtokentome.OutputType.SUBWORD`.

#### vocab

```python
vocab(self)
```

**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds to the i-th subword.

#### vocab_size

```python
vocab_size(self)
```

**Returns:** int. Size of the vocabulary.

#### subword_to_id

```python
subword_to_id(self, subword)
```

**Args:**

* `subword`: string

**Returns:** Integer in the range [0, vocab_size-1]: the id of the subword. If there is no such subword in the vocabulary, `unk_id` is returned.

#### id_to_subword

```python
id_to_subword(self, id)
```

**Args:**

* `id`: int, must be in the range [0, vocab_size-1]

**Returns:** string. The subword from the vocabulary with this id.

#### decode

```python
decode(self, ids, ignore_ids=None)
```

Converts each id to its subword and concatenates them, restoring the space symbols.

**Args:**

* `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1].
* `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1]. [default: None]

**Returns:** List of strings.
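A short sketch tying the methods above together, assuming the `example.model` file trained in the first example: with a nonzero `dropout_prob`, repeated `encode` calls can segment the same text differently (the BPE-dropout regularization effect), and `subword_to_id`/`id_to_subword` are inverse lookups into `vocab()`.

```python
import youtokentome as yttm

bpe = yttm.BPE(model="example.model")  # model from the training example

# With dropout_prob > 0, merges are randomly dropped, so the same
# sentence may be segmented differently on each call (BPE-dropout).
for _ in range(3):
    print(bpe.encode(["ab abc"], output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))

# subword_to_id and id_to_subword are inverse lookups into vocab().
subword = bpe.id_to_subword(10)
assert bpe.subword_to_id(subword) == 10
assert bpe.vocab()[10] == subword
assert bpe.vocab_size() == len(bpe.vocab())
```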
## Command line interface

### Example

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```

### Supported commands

`YouTokenToMe` supports the following commands:

```
$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.
```

The `bpe` command trains a Byte Pair Encoding model on a text file.

```
$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path. [required]
  --model PATH          Output model file path. [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary. [required]
  --coverage FLOAT      Fraction of characters covered by the model. [default: 1.0]
  --n_threads INTEGER   Number of threads. [default: -1]
  --pad_id INTEGER      Padding token id. [default: 0]
  --unk_id INTEGER      Unknown token id. [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id. [default: 2]
  --eos_id INTEGER      'End of sentence' token id. [default: 3]
  --help                Show this message and exit.
```

The `encode` command applies BPE encoding to a corpus of sentences. It reads from `stdin` and writes to `stdout`.

By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).

With the `--stream` option, `--n_threads` is ignored and all sentences are processed one by one: each sentence is tokenized and written to `stdout` before the next sentence is read.

```
$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model. [required]
  --output_type TEXT   'id' or 'subword'. [required]
  --n_threads INTEGER  Number of threads. [default: -1]
  --bos                Add the 'begin of sentence' token.
  --eos                Add the 'end of sentence' token.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.
```

The `vocab` command prints the vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model. [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.
```

The `decode` command converts ids back to text. It reads from `stdin` and writes to `stdout`.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model. [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
```
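For scripted use, the same encode/decode pipeline can also be driven from Python. A minimal sketch, assuming `yttm` is on `PATH` and the toy `example.model` from the Python example exists; it mirrors the shell pipeline `yttm encode ... | yttm decode ...`:

```python
import subprocess

model = "example.model"  # trained earlier; the path is an assumption
text = "a ab abc\n"      # matches the toy model's "abcd " alphabet

# Encode stdin to one line of ids per input sentence.
ids = subprocess.run(
    ["yttm", "encode", "--model", model, "--output_type", "id"],
    input=text, capture_output=True, text=True, check=True,
).stdout

# Feed the ids back through decode to restore the text.
restored = subprocess.run(
    ["yttm", "decode", "--model", model],
    input=ids, capture_output=True, text=True, check=True,
).stdout
print(restored, end="")
```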
%package -n python3-youtokentome
Summary:        Unsupervised text tokenizer focused on computational efficiency
Provides:       python-youtokentome
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-youtokentome
YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.].
%package help
Summary:        Development documents and examples for youtokentome
Provides:       python3-youtokentome-doc

%description help
Development documents and examples for youtokentome.
%prep
%autosetup -n youtokentome-1.0.6

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-youtokentome -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Apr 11 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.6-1
- Package Spec generated

diff --git a/sources b/sources
@@ -0,0 +1 @@
+2b892f24fe358d5868b8324efea288ae youtokentome-1.0.6.tar.gz
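The `sources` file pairs an MD5 digest with the tarball name, which is the usual look-aside-cache convention. A minimal sketch of checking such an entry locally, assuming the tarball has been downloaded into the current directory:

```python
import hashlib

# Digest and filename taken from the sources file above.
expected = "2b892f24fe358d5868b8324efea288ae"
path = "youtokentome-1.0.6.tar.gz"  # assumed to be in the current directory

# Hash the file in 1 MiB chunks to keep memory use flat.
h = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

print("OK" if h.hexdigest() == expected else "MISMATCH")
```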