Diffstat (limited to 'python-tokenizers.spec')
-rw-r--r-- | python-tokenizers.spec | 600
1 file changed, 600 insertions, 0 deletions
diff --git a/python-tokenizers.spec b/python-tokenizers.spec
new file mode 100644
index 0000000..3278522
--- /dev/null
+++ b/python-tokenizers.spec
@@ -0,0 +1,600 @@
+%global _empty_manifest_terminate_build 0
+Name: python-tokenizers
+Version: 0.13.3
+Release: 1
+Summary: Fast and Customizable Tokenizers
+License: Apache License 2.0
+URL: https://github.com/huggingface/tokenizers
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/29/9c/936ebad6dd963616189d6362f4c2c03a0314cf2a221ba15e48dd714d29cf/tokenizers-0.13.3.tar.gz
+
+Requires: python3-pytest
+Requires: python3-requests
+Requires: python3-numpy
+Requires: python3-datasets
+Requires: python3-black
+Requires: python3-sphinx
+Requires: python3-sphinx-rtd-theme
+Requires: python3-setuptools-rust
+
+%description
+<p align="center">
+    <br>
+    <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+    <br>
+</p>
+<p align="center">
+    <a href="https://badge.fury.io/py/tokenizers">
+        <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+    </a>
+    <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+    </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+   most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+   less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token (a short sketch follows the Hub example below).
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can use yours as well)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
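+
+Once a tokenizer is loaded, every encoding keeps track of the alignment with the original
+text, as mentioned in the features above. A minimal sketch (assuming network access to the
+Hub; the sample text is arbitrary):
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+
+text = "Hello, y'all!"
+encoded = tokenizer.encode(text)
+
+print(encoded.tokens)   # produced tokens, including the added [CLS]/[SEP]
+print(encoded.offsets)  # one (start, end) character span per token
+
+# Recover the original text behind the second token ([CLS] comes first)
+start, end = encoded.offsets[1]
+print(text[start:end])
+```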
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
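+
+For example, `BertWordPieceTokenizer` follows the same train/encode/save pattern as the
+`CharBPETokenizer` shown above. A minimal sketch (the file paths and training options are
+placeholders to adapt to your data):
+
+```python
+from tokenizers import BertWordPieceTokenizer
+
+# Initialize a WordPiece tokenizer, keeping the original casing
+tokenizer = BertWordPieceTokenizer(lowercase=False)
+
+# Train it on plain-text files
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ], vocab_size=30000)
+
+# Encode and save, exactly as with the BPE tokenizers
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.tokens)
+tokenizer.save("./path/to/directory/my-wordpiece.tokenizer.json")
+```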
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+    vocab_size=20000,
+    min_frequency=2,
+    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+    "./path/to/dataset/1.txt",
+    "./path/to/dataset/2.txt",
+    "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, using this tokenizer is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
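+
+Since the tokenizer also takes care of the pre-processing, truncation and padding can be
+enabled directly on it. A minimal sketch reusing the tokenizer saved above (the `<pad>`
+token and id 0 are assumptions here: pick a token and id that actually exist in your vocabulary):
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+# Truncate long inputs, and pad the shorter ones in a batch
+tokenizer.enable_truncation(max_length=128)
+tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
+
+batch = tokenizer.encode_batch(["I can feel the magic, can you?", "Magic!"])
+print([encoding.ids for encoding in batch])  # equal-length id sequences
+print(batch[1].attention_mask)               # 0s mark the padded positions
+```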
+
+
+%package -n python3-tokenizers
+Summary: Fast and Customizable Tokenizers
+Provides: python-tokenizers
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+BuildRequires: python3-cffi
+BuildRequires: python3-setuptools-rust
+# Building the bundled Rust extension needs the Rust toolchain (package names assumed)
+BuildRequires: rust
+BuildRequires: cargo
+BuildRequires: gcc
+BuildRequires: gdb
+%description -n python3-tokenizers
+<p align="center">
+    <br>
+    <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+    <br>
+</p>
+<p align="center">
+    <a href="https://badge.fury.io/py/tokenizers">
+        <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+    </a>
+    <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+    </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+   most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+   less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token.
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can use yours as well)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+    vocab_size=20000,
+    min_frequency=2,
+    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+    "./path/to/dataset/1.txt",
+    "./path/to/dataset/2.txt",
+    "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, using this tokenizer is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
+
+
+%package help
+Summary: Development documents and examples for tokenizers
+Provides: python3-tokenizers-doc
+%description help
+<p align="center">
+    <br>
+    <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+    <br>
+</p>
+<p align="center">
+    <a href="https://badge.fury.io/py/tokenizers">
+        <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+    </a>
+    <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+    </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+   most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+   less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token.
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can use yours as well)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+    vocab_size=20000,
+    min_frequency=2,
+    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+    "./path/to/dataset/1.txt",
+    "./path/to/dataset/2.txt",
+    "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, using this tokenizer is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
+
+
+%prep
+%autosetup -n tokenizers-0.13.3
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-tokenizers -f filelist.lst
+%dir %{python3_sitearch}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.13.3-1
+- Package Spec generated