%global debug_package %{nil}
%global _empty_manifest_terminate_build 0

Name:		python-tokenizers
Version:	0.15.1
Release:	1
Summary:	Fast State-of-the-Art Tokenizers optimized for Research and Production
License:	Apache-2.0
URL:		https://github.com/huggingface/tokenizers
Source0:	https://github.com/huggingface/tokenizers/archive/refs/tags/v%{version}.tar.gz

%description
A Tokenizer works as a pipeline: it takes some raw text as input and outputs an Encoding. The steps of the pipeline are:
The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization
    standards, such as NFD or NFKC. More details on how to use the Normalizers are available on the Hugging Face blog.
The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
The Model: in charge of doing the actual tokenization. Examples of a Model are BPE or WordPiece.
The PostProcessor: in charge of post-processing the Encoding to add anything relevant that a language model would need,
    such as special tokens.
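For illustration, a minimal sketch of this pipeline using the Python API (the
toy word-level vocabulary and the special-token id below are made up for the
example; a real setup would train or load a Model):

    from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
    from tokenizers.models import WordLevel

    # the Model: a toy vocabulary, hypothetical and for illustration only
    tok = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2},
                              unk_token="[UNK]"))
    tok.normalizer = normalizers.Lowercase()             # the Normalizer
    tok.pre_tokenizer = pre_tokenizers.Whitespace()      # the PreTokenizer
    tok.post_processor = processors.TemplateProcessing(  # the PostProcessor
        single="[CLS] $A", special_tokens=[("[CLS]", 3)])
    enc = tok.encode("Hello WORLD")                      # returns an Encoding
    print(enc.tokens)                       # ['[CLS]', 'hello', 'world']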

%package -n python3-tokenizers
Summary:	Fast State-of-the-Art Tokenizers optimized for Research and Production
Provides:	python3-tokenizers
Requires:	python3-numpy
Requires:	python3-pytorch
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-setuptools_scm
BuildRequires:	python3-pbr
BuildRequires:	python3-pip
BuildRequires:	python3-wheel
BuildRequires:	python3-hatchling

BuildRequires:	rust cargo
BuildRequires:	python3-maturin
BuildRequires:	python3-setuptools-rust

%description -n python3-tokenizers
A Tokenizer works as a pipeline: it takes some raw text as input and outputs an Encoding. The steps of the pipeline are:
The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization
    standards, such as NFD or NFKC. More details on how to use the Normalizers are available on the Hugging Face blog.
The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
The Model: in charge of doing the actual tokenization. Examples of a Model are BPE or WordPiece.
The PostProcessor: in charge of post-processing the Encoding to add anything relevant that a language model would need,
    such as special tokens.

%prep
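# Unpack the source; the GitHub tarball extracts to a tokenizers-<version> directory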
%autosetup -p1 -n tokenizers-%{version}

%build
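# The Python bindings live under bindings/python; build the wheel from there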
pushd ./bindings/python
%pyproject_build
popd
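# Stage bundled documentation and examples (if present) into the package docdir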
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi

%install
pushd ./bindings/python
%pyproject_install
popd

%files -n python3-tokenizers
%doc *.md
%license LICENSE
%{python3_sitearch}/*

%changelog
* Sun Jan 28 2024 Binshuo Zu <274620705z@gmail.com> - 0.15.1-1
- Package init