%global _empty_manifest_terminate_build 0
Name:		python-youtokentome
Version:	1.0.6
Release:	1
Summary:	Unsupervised text tokenizer focused on computational efficiency
License:	MIT
URL:		https://github.com/vkcom/youtokentome
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/9a/ae/f8b0d15696766eb35dda6cf84a23d42ae7f3ba37aa30e5e2287fd94ac053/youtokentome-1.0.6.tar.gz
BuildArch:	noarch

Requires:	python3-Click

%description
![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)]. Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE) and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster. Check out our [benchmark](benchmark.md) results.

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:

* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`

## Installation

```bash
pip install youtokentome
```

## Python interface

### Example

Let's start with a self-contained example.

```python
import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generate a random file with training data:
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generate random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Train the model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Load the model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```

### Training model

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```

Trains a BPE model and saves it to file.

**Args:**

* `data`: string, path to file with training data
* `model`: string, path to where the trained model will be saved
* `vocab_size`: int, number of tokens in the final vocabulary
* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).
* `pad_id`: int, reserved id for padding
* `unk_id`: int, reserved id for unknown symbols
* `bos_id`: int, reserved id for the 'begin of sentence' token
* `eos_id`: int, reserved id for the 'end of sentence' token

**Returns**: Class `youtokentome.BPE` with the loaded model.
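For instance, a training call with every argument spelled out might look like the sketch below (the paths and `vocab_size` are illustrative, not prescribed values):

```python
import youtokentome as yttm

# A minimal sketch of BPE.train; any UTF-8 text file with one
# sentence per line works as training data.
yttm.BPE.train(
    data="train_data.txt",   # illustrative path to the training corpus
    model="example.model",   # illustrative path for the saved model
    vocab_size=5000,         # includes the four reserved ids below
    coverage=0.9999,         # leave the rarest characters uncovered
    n_threads=-1,            # use all available threads (capped at 8)
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
```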
### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

### Methods

Class `youtokentome.BPE` has the following methods:

#### encode

```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```

**Args:**

* `sentences`: list of strings, sentences for tokenization
* `output_type`: enum, sentences can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
* `bos`: bool, if True, the 'begin of sentence' token will be added
* `eos`: bool, if True, the 'end of sentence' token will be added
* `reverse`: bool, if True, the output sequence of tokens will be reversed
* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

**Returns:** If `output_type` is `youtokentome.OutputType.ID`, a list of lists of integers; if `youtokentome.OutputType.SUBWORD`, a list of lists of strings.

#### vocab

```python
vocab(self)
```

**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds to the i-th subword.

#### vocab_size

```python
vocab_size(self)
```

**Returns:** int. Size of vocabulary.

#### subword_to_id

```python
subword_to_id(self, subword)
```

**Args:**

* `subword`: string

**Returns:** Integer from the range [0, vocab_size-1]. Id of the subword or, if there is no such subword in the vocabulary, `unk_id`.

#### id_to_subword

```python
id_to_subword(self, id)
```

**Args:**

* `id`: int, must be in the range [0, vocab_size-1]

**Returns:** string. Subword from the vocabulary by id.

#### decode

```python
decode(self, ids, ignore_ids=None)
```

Converts each id to its subword and concatenates them, restoring word boundaries from the meta symbol.

**Args:**

* `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
* `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

**Returns:** List of strings.
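Putting the pieces together: a sketch of an encode/decode round trip, assuming a model trained as in the example above (ids 2 and 3 are the default `bos_id` and `eos_id`; concrete ids and subwords depend on your training data):

```python
import youtokentome as yttm

bpe = yttm.BPE(model="example.model")  # illustrative model path

# Encode with 'begin of sentence' and 'end of sentence' tokens added
ids = bpe.encode(["Blazingly fast tokenization!"],
                 output_type=yttm.OutputType.ID,
                 bos=True, eos=True)

# Decode back to text, skipping the reserved bos/eos ids (2 and 3 by default)
print(bpe.decode(ids, ignore_ids=[2, 3]))
```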
## Command line interface

### Example

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```

### Supported commands

`YouTokenToMe` supports the following commands:

```
$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.
```

The `bpe` command trains a Byte Pair Encoding model on a text file.

```
$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.
```

The `encode` command applies BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output. By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)). With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to `stdout` before the next sentence is read.

```
$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add 'begin of sentence' token.
  --eos                Add 'end of sentence' token.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge
                       being dropped).  [default: 0]
  --help               Show this message and exit.
```

The `vocab` command prints the vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.
```

The `decode` command converts ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
```
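Taken together, a full CLI round trip might look like this sketch (file names are illustrative):

```bash
# Train a model, encode a corpus to ids, then decode it back to text
$ yttm bpe --data train_data.txt --model example.model --vocab_size 5000
$ yttm encode --model example.model --output_type id < input.txt > ids.txt
$ yttm decode --model example.model < ids.txt > restored.txt
```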
%package -n python3-youtokentome
Summary:	Unsupervised text tokenizer focused on computational efficiency
Provides:	python-youtokentome
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip

%description -n python3-youtokentome
![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)]. Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE) and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster. Check out our [benchmark](benchmark.md) results.

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:

* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`

## Installation

```bash
pip install youtokentome
```

## Python interface

### Example

Let's start with a self-contained example.

```python
import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generate a random file with training data:
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generate random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Train the model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Load the model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```

### Training model

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```

Trains a BPE model and saves it to file.

**Args:**

* `data`: string, path to file with training data
* `model`: string, path to where the trained model will be saved
* `vocab_size`: int, number of tokens in the final vocabulary
* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).
* `pad_id`: int, reserved id for padding
* `unk_id`: int, reserved id for unknown symbols
* `bos_id`: int, reserved id for the 'begin of sentence' token
* `eos_id`: int, reserved id for the 'end of sentence' token

**Returns**: Class `youtokentome.BPE` with the loaded model.

### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.
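As a quick sketch of the constructor (the model path is illustrative), loading a model and inspecting its vocabulary:

```python
import youtokentome as yttm

bpe = yttm.BPE(model="example.model", n_threads=-1)  # illustrative path
print(bpe.vocab_size())   # size of the learned vocabulary
print(bpe.vocab()[:10])   # the first ten subwords
```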
### Methods

Class `youtokentome.BPE` has the following methods:

#### encode

```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```

**Args:**

* `sentences`: list of strings, sentences for tokenization
* `output_type`: enum, sentences can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
* `bos`: bool, if True, the 'begin of sentence' token will be added
* `eos`: bool, if True, the 'end of sentence' token will be added
* `reverse`: bool, if True, the output sequence of tokens will be reversed
* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

**Returns:** If `output_type` is `youtokentome.OutputType.ID`, a list of lists of integers; if `youtokentome.OutputType.SUBWORD`, a list of lists of strings.

#### vocab

```python
vocab(self)
```

**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds to the i-th subword.

#### vocab_size

```python
vocab_size(self)
```

**Returns:** int. Size of vocabulary.

#### subword_to_id

```python
subword_to_id(self, subword)
```

**Args:**

* `subword`: string

**Returns:** Integer from the range [0, vocab_size-1]. Id of the subword or, if there is no such subword in the vocabulary, `unk_id`.

#### id_to_subword

```python
id_to_subword(self, id)
```

**Args:**

* `id`: int, must be in the range [0, vocab_size-1]

**Returns:** string. Subword from the vocabulary by id.

#### decode

```python
decode(self, ids, ignore_ids=None)
```

Converts each id to its subword and concatenates them, restoring word boundaries from the meta symbol.

**Args:**

* `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
* `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

**Returns:** List of strings.
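BPE-dropout, for example, is applied purely at encoding time via `dropout_prob`; a sketch (model path illustrative) showing that repeated calls may segment the same sentence differently:

```python
import youtokentome as yttm

bpe = yttm.BPE(model="example.model")  # illustrative path

# Each merge is dropped with probability 0.1, so the segmentation
# below is stochastic and varies between calls.
for _ in range(3):
    print(bpe.encode(["Blazingly fast tokenization!"],
                     output_type=yttm.OutputType.SUBWORD,
                     dropout_prob=0.1))
```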
## Command line interface

### Example

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```

### Supported commands

`YouTokenToMe` supports the following commands:

```
$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.
```

The `bpe` command trains a Byte Pair Encoding model on a text file.

```
$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.
```

The `encode` command applies BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output. By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)). With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to `stdout` before the next sentence is read.

```
$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add 'begin of sentence' token.
  --eos                Add 'end of sentence' token.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge
                       being dropped).  [default: 0]
  --help               Show this message and exit.
```

The `vocab` command prints the vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.
```

The `decode` command converts ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
```
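For instance, streaming encoding and a decode that skips the reserved ids might look like this sketch (file names are illustrative; 2 and 3 are the default bos/eos ids):

```bash
# --stream tokenizes and prints each line as soon as it is read
$ yttm encode --model example.model --output_type subword --stream < input.txt

# Ignore the default 'begin/end of sentence' ids while decoding
$ yttm decode --model example.model --ignore_ids=2,3 < ids.txt
```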
%package help
Summary:	Development documents and examples for youtokentome
Provides:	python3-youtokentome-doc

%description help
![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)]. Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE) and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster. Check out our [benchmark](benchmark.md) results.

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:

* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`

## Installation

```bash
pip install youtokentome
```

## Python interface

### Example

Let's start with a self-contained example.

```python
import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generate a random file with training data:
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generate random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Train the model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Load the model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```

### Training model

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```

Trains a BPE model and saves it to file.

**Args:**

* `data`: string, path to file with training data
* `model`: string, path to where the trained model will be saved
* `vocab_size`: int, number of tokens in the final vocabulary
* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).
* `pad_id`: int, reserved id for padding
* `unk_id`: int, reserved id for unknown symbols
* `bos_id`: int, reserved id for the 'begin of sentence' token
* `eos_id`: int, reserved id for the 'end of sentence' token

**Returns**: Class `youtokentome.BPE` with the loaded model.

### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

### Methods

Class `youtokentome.BPE` has the following methods:

#### encode

```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```

**Args:**

* `sentences`: list of strings, sentences for tokenization
* `output_type`: enum, sentences can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
* `bos`: bool, if True, the 'begin of sentence' token will be added
* `eos`: bool, if True, the 'end of sentence' token will be added
* `reverse`: bool, if True, the output sequence of tokens will be reversed
* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

**Returns:** If `output_type` is `youtokentome.OutputType.ID`, a list of lists of integers; if `youtokentome.OutputType.SUBWORD`, a list of lists of strings.

#### vocab

```python
vocab(self)
```

**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds to the i-th subword.

#### vocab_size

```python
vocab_size(self)
```

**Returns:** int. Size of vocabulary.

#### subword_to_id

```python
subword_to_id(self, subword)
```

**Args:**

* `subword`: string

**Returns:** Integer from the range [0, vocab_size-1]. Id of the subword or, if there is no such subword in the vocabulary, `unk_id`.
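A sketch of the lookup behaviour (the subwords are illustrative; any string absent from the vocabulary maps to `unk_id`, which is 1 by default):

```python
import youtokentome as yttm

bpe = yttm.BPE(model="example.model")   # illustrative path
print(bpe.subword_to_id("▁token"))  # id of the subword, if it was learned
print(bpe.subword_to_id("xyz"))     # not in the vocabulary -> unk_id
```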
#### id_to_subword

```python
id_to_subword(self, id)
```

**Args:**

* `id`: int, must be in the range [0, vocab_size-1]

**Returns:** string. Subword from the vocabulary by id.

#### decode

```python
decode(self, ids, ignore_ids=None)
```

Converts each id to its subword and concatenates them, restoring word boundaries from the meta symbol.

**Args:**

* `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
* `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

**Returns:** List of strings.

## Command line interface

### Example

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```

### Supported commands

`YouTokenToMe` supports the following commands:

```
$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.
```

The `bpe` command trains a Byte Pair Encoding model on a text file.

```
$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.
```

The `encode` command applies BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output. By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)). With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to `stdout` before the next sentence is read.

```
$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add 'begin of sentence' token.
  --eos                Add 'end of sentence' token.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge
                       being dropped).  [default: 0]
  --help               Show this message and exit.
```

The `vocab` command prints the vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.
```

The `decode` command converts ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
```

%prep
%autosetup -n youtokentome-1.0.6

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-youtokentome -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot - 1.0.6-1
- Package Spec generated