%global _empty_manifest_terminate_build 0
Name: python-razdel
Version: 0.5.0
Release: 1
Summary: Splits Russian text into tokens and sentences. Rule-based
License: MIT
URL: https://github.com/natasha/razdel
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/70/ea/0151ae55bd26699487e668a865ef43e49409025c7464569beffe1a5789f0/razdel-0.5.0.tar.gz
BuildArch: noarch
%description
`razdel` — a rule-based system for Russian sentence and word tokenization.
## Usage
```python
>>> from razdel import tokenize
>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
Substring(14, 16, 'на'),
Substring(17, 20, '0.5'),
Substring(20, 21, 'л'),
Substring(22, 23, '(')
...]
>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```
```python
>>> from razdel import sentenize
>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''
>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
Substring(24, 40, '- "Не ра-ду-ют".'),
Substring(41, 56, 'И т. д. и т. п.'),
Substring(57, 76, 'В общем, вся газета')]
```
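Each `Substring` carries character offsets into the original text, so results can be mapped back onto it. Here is a minimal sketch combining both functions; it assumes the `start`/`stop`/`text` attribute names implied by the `Substring(start, stop, text)` output above:
```python
from razdel import sentenize, tokenize

text = 'И т. д. и т. п. В общем, вся газета'
for sent in sentenize(text):
    # the offsets index into the original string
    assert text[sent.start:sent.stop] == sent.text
    # tokenize each sentence separately
    print([token.text for token in tokenize(sent.text)])
```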
## Installation
`razdel` supports Python 3.5+ and PyPy 3.
```bash
$ pip install razdel
```
## Quality, performance
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA, and RNC. These datasets consist mainly of news and fiction, and `razdel`'s rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles, or legal documents.
We measure the absolute number of errors because the tokenization task contains a large number of trivial cases. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в|5|часов|...`. Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95%, and 99.88%, so we report the absolute number of errors instead.
`errors` — the number of errors. For example, if the etalon (reference) segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.
`time` — total seconds taken.
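As an illustration of the `errors` metric, the count can be reproduced by comparing the sets of inner split positions. The sketch below is a hypothetical reimplementation of the idea, not the actual `naeval` code:
```python
def split_points(chunks):
    # inner boundary offsets, e.g. ['что-то', '?'] -> {6}
    points, offset = set(), 0
    for chunk in chunks[:-1]:
        offset += len(chunk)
        points.add(offset)
    return points

def count_errors(etalon, prediction):
    # every missing or extra boundary counts as one error
    return len(split_points(etalon) ^ split_points(prediction))

assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```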
`spacy_tokenize`, `aatimofeev`, and the others are defined in naeval/segment/models.py. The tables are computed in segment/main.ipynb.
### Tokens
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
## Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
## Development
Test:
```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int # 2000 integration tests
```
Package:
```bash
make version
git push
git push --tags
make clean wheel upload
```
`mystem` errors on `syntag`:
```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```
Non-trivial token tests:
```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```
Update integration tests:
```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```
`razdel` and `moses` diff:
```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```
`razdel` performance:
```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```
%package -n python3-razdel
Summary: Splits Russian text into tokens and sentences. Rule-based
Provides: python-razdel
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-razdel
`razdel` — a rule-based system for Russian sentence and word tokenization.
## Usage
```python
>>> from razdel import tokenize
>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
Substring(14, 16, 'на'),
Substring(17, 20, '0.5'),
Substring(20, 21, 'л'),
Substring(22, 23, '(')
...]
>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```
```python
>>> from razdel import sentenize
>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''
>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
Substring(24, 40, '- "Не ра-ду-ют".'),
Substring(41, 56, 'И т. д. и т. п.'),
Substring(57, 76, 'В общем, вся газета')]
```
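Each `Substring` carries character offsets into the original text, so results can be mapped back onto it. Here is a minimal sketch combining both functions; it assumes the `start`/`stop`/`text` attribute names implied by the `Substring(start, stop, text)` output above:
```python
from razdel import sentenize, tokenize

text = 'И т. д. и т. п. В общем, вся газета'
for sent in sentenize(text):
    # the offsets index into the original string
    assert text[sent.start:sent.stop] == sent.text
    # tokenize each sentence separately
    print([token.text for token in tokenize(sent.text)])
```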
## Installation
`razdel` supports Python 3.5+ and PyPy 3.
```bash
$ pip install razdel
```
## Quality, performance
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA, and RNC. These datasets consist mainly of news and fiction, and `razdel`'s rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles, or legal documents.
We measure the absolute number of errors because the tokenization task contains a large number of trivial cases. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в|5|часов|...`. Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95%, and 99.88%, so we report the absolute number of errors instead.
`errors` — the number of errors. For example, if the etalon (reference) segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.
`time` — total seconds taken.
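As an illustration of the `errors` metric, the count can be reproduced by comparing the sets of inner split positions. The sketch below is a hypothetical reimplementation of the idea, not the actual `naeval` code:
```python
def split_points(chunks):
    # inner boundary offsets, e.g. ['что-то', '?'] -> {6}
    points, offset = set(), 0
    for chunk in chunks[:-1]:
        offset += len(chunk)
        points.add(offset)
    return points

def count_errors(etalon, prediction):
    # every missing or extra boundary counts as one error
    return len(split_points(etalon) ^ split_points(prediction))

assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```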
`spacy_tokenize`, `aatimofeev`, and the others are defined in naeval/segment/models.py. The tables are computed in segment/main.ipynb.
### Tokens
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
## Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
## Development
Test:
```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int # 2000 integration tests
```
Package:
```bash
make version
git push
git push --tags
make clean wheel upload
```
`mystem` errors on `syntag`:
```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```
Non-trivial token tests:
```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```
Update integration tests:
```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```
`razdel` and `moses` diff:
```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```
`razdel` performance:
```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```
%package help
Summary: Development documents and examples for razdel
Provides: python3-razdel-doc
%description help
`razdel` — a rule-based system for Russian sentence and word tokenization.
## Usage
```python
>>> from razdel import tokenize
>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
Substring(14, 16, 'на'),
Substring(17, 20, '0.5'),
Substring(20, 21, 'л'),
Substring(22, 23, '(')
...]
>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```
```python
>>> from razdel import sentenize
>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''
>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
Substring(24, 40, '- "Не ра-ду-ют".'),
Substring(41, 56, 'И т. д. и т. п.'),
Substring(57, 76, 'В общем, вся газета')]
```
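Each `Substring` carries character offsets into the original text, so results can be mapped back onto it. Here is a minimal sketch combining both functions; it assumes the `start`/`stop`/`text` attribute names implied by the `Substring(start, stop, text)` output above:
```python
from razdel import sentenize, tokenize

text = 'И т. д. и т. п. В общем, вся газета'
for sent in sentenize(text):
    # the offsets index into the original string
    assert text[sent.start:sent.stop] == sent.text
    # tokenize each sentence separately
    print([token.text for token in tokenize(sent.text)])
```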
## Installation
`razdel` supports Python 3.5+ and PyPy 3.
```bash
$ pip install razdel
```
## Quality, performance
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA, and RNC. These datasets consist mainly of news and fiction, and `razdel`'s rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles, or legal documents.
We measure the absolute number of errors because the tokenization task contains a large number of trivial cases. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в|5|часов|...`. Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95%, and 99.88%, so we report the absolute number of errors instead.
`errors` — the number of errors. For example, if the etalon (reference) segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.
`time` — total seconds taken.
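As an illustration of the `errors` metric, the count can be reproduced by comparing the sets of inner split positions. The sketch below is a hypothetical reimplementation of the idea, not the actual `naeval` code:
```python
def split_points(chunks):
    # inner boundary offsets, e.g. ['что-то', '?'] -> {6}
    points, offset = set(), 0
    for chunk in chunks[:-1]:
        offset += len(chunk)
        points.add(offset)
    return points

def count_errors(etalon, prediction):
    # every missing or extra boundary counts as one error
    return len(split_points(etalon) ^ split_points(prediction))

assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```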
`spacy_tokenize`, `aatimofeev`, and the others are defined in naeval/segment/models.py. The tables are computed in segment/main.ipynb.
### Tokens
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences
| model | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
## Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
## Development
Test:
```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int # 2000 integration tests
```
Package:
```bash
make version
git push
git push --tags
make clean wheel upload
```
`mystem` errors on `syntag`:
```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```
Non-trivial token tests:
```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```
Update integration tests:
```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```
`razdel` and `moses` diff:
```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```
`razdel` performance:
```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```
%prep
%autosetup -n razdel-0.5.0
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-razdel -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot - 0.5.0-1
- Package Spec generated