author    CoprDistGit <infra@openeuler.org>  2023-04-10 08:43:31 +0000
committer CoprDistGit <infra@openeuler.org>  2023-04-10 08:43:31 +0000
commit    b837084082dcb3550e98570dae2a050a85dc0fcb (patch)
tree      f29aa99802a165a13e512dc6bed694c12f7455ed
parent    f8b88c2c79627dfd646ef69d988044cee3039747 (diff)
automatic import of python-tokenizers
-rw-r--r--  .gitignore              |   1
-rw-r--r--  python-tokenizers.spec  | 600
-rw-r--r--  sources                 |   1
3 files changed, 602 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..1a79e44 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/tokenizers-0.13.3.tar.gz
diff --git a/python-tokenizers.spec b/python-tokenizers.spec
new file mode 100644
index 0000000..3278522
--- /dev/null
+++ b/python-tokenizers.spec
@@ -0,0 +1,600 @@
+%global _empty_manifest_terminate_build 0
+Name: python-tokenizers
+Version: 0.13.3
+Release: 1
+Summary: Fast and Customizable Tokenizers
+License: Apache License 2.0
+URL: https://github.com/huggingface/tokenizers
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/29/9c/936ebad6dd963616189d6362f4c2c03a0314cf2a221ba15e48dd714d29cf/tokenizers-0.13.3.tar.gz
+
+Requires: python3-pytest
+Requires: python3-requests
+Requires: python3-numpy
+Requires: python3-datasets
+Requires: python3-black
+Requires: python3-sphinx
+Requires: python3-sphinx-rtd-theme
+Requires: python3-setuptools-rust
+
+%description
+<p align="center">
+ <br>
+ <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+ <br>
+</p>
+<p align="center">
+ <a href="https://badge.fury.io/py/tokenizers">
+ <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+ </a>
+ <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+ <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+ </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+ most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+ less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token (see the sketch after this list).
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
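+
+For example, the offsets stored on each `Encoding` let you map any token back to its span in the
+original sentence, and truncation/padding can be turned on directly on the tokenizer. Here is a
+minimal sketch (assuming a tokenizer saved as `byte-level-bpe.tokenizer.json`, as in the example
+further below; the pad id/token are illustrative):
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+# Pre-processing: truncate long inputs and pad short ones
+tokenizer.enable_truncation(max_length=512)
+tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
+
+sentence = "I can feel the magic, can you?"
+encoded = tokenizer.encode(sentence)
+
+# Alignment tracking: each token keeps its (start, end) character offsets
+for token, (start, end) in zip(encoded.tokens, encoded.offsets):
+    print(token, "->", sentence[start:end])
+```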
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile the bindings by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can also use an existing one)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of the BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+ vocab_size=20000,
+ min_frequency=2,
+ initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+ "./path/to/dataset/1.txt",
+ "./path/to/dataset/2.txt",
+ "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, when you want to use this tokenizer, it is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
+
+
+%package -n python3-tokenizers
+Summary: Fast and Customizable Tokenizers
+Provides: python-tokenizers
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+BuildRequires: python3-cffi
+BuildRequires: gcc
+BuildRequires: gdb
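+# The sdist builds a Rust extension; these build dependencies are assumed
+# package names on the target distribution and may need adjusting.
+BuildRequires: rust
+BuildRequires: cargo
+BuildRequires: python3-setuptools-rust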
+%description -n python3-tokenizers
+<p align="center">
+ <br>
+ <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+ <br>
+</p>
+<p align="center">
+ <a href="https://badge.fury.io/py/tokenizers">
+ <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+ </a>
+ <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+ <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+ </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+ most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+ less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token (see the sketch after this list).
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
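+
+For example, the offsets stored on each `Encoding` let you map any token back to its span in the
+original sentence, and truncation/padding can be turned on directly on the tokenizer. Here is a
+minimal sketch (assuming a tokenizer saved as `byte-level-bpe.tokenizer.json`, as in the example
+further below; the pad id/token are illustrative):
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+# Pre-processing: truncate long inputs and pad short ones
+tokenizer.enable_truncation(max_length=512)
+tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
+
+sentence = "I can feel the magic, can you?"
+encoded = tokenizer.encode(sentence)
+
+# Alignment tracking: each token keeps its (start, end) character offsets
+for token, (start, end) in zip(encoded.tokens, encoded.offsets):
+    print(token, "->", sentence[start:end])
+```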
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile the bindings by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can also use an existing one)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of the BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+ vocab_size=20000,
+ min_frequency=2,
+ initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+ "./path/to/dataset/1.txt",
+ "./path/to/dataset/2.txt",
+ "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, when you want to use this tokenizer, it is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
+
+
+%package help
+Summary: Development documents and examples for tokenizers
+Provides: python3-tokenizers-doc
+%description help
+<p align="center">
+ <br>
+ <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+ <br>
+</p>
+<p align="center">
+ <a href="https://badge.fury.io/py/tokenizers">
+ <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+ </a>
+ <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+ <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+ </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+ most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+ less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token (see the sketch after this list).
+ - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
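+
+For example, the offsets stored on each `Encoding` let you map any token back to its span in the
+original sentence, and truncation/padding can be turned on directly on the tokenizer. Here is a
+minimal sketch (assuming a tokenizer saved as `byte-level-bpe.tokenizer.json`, as in the example
+further below; the pad id/token are illustrative):
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+# Pre-processing: truncate long inputs and pad short ones
+tokenizer.enable_truncation(max_length=512)
+tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
+
+sentence = "I can feel the magic, can you?"
+encoded = tokenizer.encode(sentence)
+
+# Alignment tracking: each token keeps its (start, end) character offsets
+for token, (start, end) in zip(encoded.tokens, encoded.offsets):
+    print(token, "->", sentence[start:end])
+```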
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile the bindings by doing the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can also use an existing one)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install setuptools_rust
+python setup.py install
+```
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte-level version of the BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
+by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+ vocab_size=20000,
+ min_frequency=2,
+ initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+ "./path/to/dataset/1.txt",
+ "./path/to/dataset/2.txt",
+ "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, when you want to use this tokenizer, it is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
+
+
+%prep
+%autosetup -n tokenizers-0.13.3
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
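+# Generate the file lists consumed by the %files sections below: regular installed
+# files go to filelist.lst, man pages to doclist.lst.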
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-tokenizers -f filelist.lst
+%dir %{python3_sitearch}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.13.3-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..3fea867
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+47734c88552962e3b28a1b3705f6d32b tokenizers-0.13.3.tar.gz