%global _empty_manifest_terminate_build 0
Name: python-razdel
Version: 0.5.0
Release: 1
Summary: Splits Russian text into tokens and sentences. Rule-based
License: MIT
URL: https://github.com/natasha/razdel
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/70/ea/0151ae55bd26699487e668a865ef43e49409025c7464569beffe1a5789f0/razdel-0.5.0.tar.gz
BuildArch: noarch

%description
![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)

`razdel` — a rule-based system for Russian sentence and word tokenization.

## Usage

```python
>>> from razdel import tokenize

>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
 Substring(14, 16, 'на'),
 Substring(17, 20, '0.5'),
 Substring(20, 21, 'л'),
 Substring(22, 23, '(')
 ...]

>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```

```python
>>> from razdel import sentenize

>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''

>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
 Substring(24, 40, '- "Не ра-ду-ют".'),
 Substring(41, 56, 'И т. д. и т. п.'),
 Substring(57, 76, 'В общем, вся газета')]
```

## Installation

`razdel` supports Python 3.5+ and PyPy 3.

```bash
$ pip install razdel
```

## Quality, performance

Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.

`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles or legal documents.

We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`. Because of the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead.

`errors` — the number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то` (see the sketch below).

`time` — total seconds taken.

`spacy_tokenize`, `aatimofeev` and others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
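A minimal sketch of this error count, assuming the metric is the symmetric difference of split positions; `boundaries` and `count_errors` are hypothetical helper names, and the reference implementation lives in naeval:

```python
def boundaries(tokens):
    # Offsets of the internal split points in the concatenated text.
    # Assumes the tokens are contiguous (no whitespace between them).
    offsets, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets

def count_errors(etalon, prediction):
    # missing splits + extra splits = symmetric difference of split points
    a, b = boundaries(etalon), boundaries(prediction)
    return len(a - b) + len(b - a)

# 1 missing split (то?) + 2 extra splits (что|-|то) = 3 errors
assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```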
### Tokens

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
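As the usage example shows, every segment is returned as a `Substring`; assuming the attribute names match the `Substring(start, stop, text)` repr printed above, spans can always be mapped back onto the source string:

```python
from razdel import tokenize

text = 'Кружка-термос на 0.5л (50/64 см³, 516;...)'
for token in tokenize(text):
    # each token is an exact span of the original text
    assert text[token.start:token.stop] == token.text
```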
## Support

- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues

## Development

Test:

```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int  # 2000 integration tests
```

Package:

```bash
make version
git push
git push --tags
make clean wheel upload
```

`mystem` errors on `syntag`:

```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```

Non-trivial token tests:

```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```

Update integration tests:

```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```

`razdel` and `moses` diff:

```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```

`razdel` performance:

```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```

%package -n python3-razdel
Summary: Splits Russian text into tokens and sentences. Rule-based
Provides: python-razdel
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-razdel
![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)

`razdel` — a rule-based system for Russian sentence and word tokenization.

## Usage

```python
>>> from razdel import tokenize

>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
 Substring(14, 16, 'на'),
 Substring(17, 20, '0.5'),
 Substring(20, 21, 'л'),
 Substring(22, 23, '(')
 ...]

>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```

```python
>>> from razdel import sentenize

>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''

>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
 Substring(24, 40, '- "Не ра-ду-ют".'),
 Substring(41, 56, 'И т. д. и т. п.'),
 Substring(57, 76, 'В общем, вся газета')]
```

## Installation

`razdel` supports Python 3.5+ and PyPy 3.

```bash
$ pip install razdel
```

## Quality, performance

Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.

`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles or legal documents.

We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare.
The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`. Because of the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead.

`errors` — the number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.

`time` — total seconds taken (see the sketch below).

`spacy_tokenize`, `aatimofeev` and others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
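The `time` columns can be reproduced in spirit with a simple wall-clock measurement; this is only a sketch (the corpus string here is made up, and the real benchmarks are run by naeval):

```python
import time

from razdel import tokenize

# a made-up corpus: the usage example repeated many times
text = 'Кружка-термос на 0.5л (50/64 см³, 516;...) ' * 10_000

start = time.perf_counter()
count = sum(1 for _ in tokenize(text))  # tokenize returns a generator
elapsed = time.perf_counter() - start
print(f'{count} tokens in {elapsed:.2f} s')
```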
### Tokens

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
## Support

- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues

## Development

Test:

```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int  # 2000 integration tests
```

Package:

```bash
make version
git push
git push --tags
make clean wheel upload
```

`mystem` errors on `syntag`:

```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```

Non-trivial token tests:

```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```

Update integration tests:

```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```

`razdel` and `moses` diff:

```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```

`razdel` performance:

```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```

%package help
Summary: Development documents and examples for razdel
Provides: python3-razdel-doc

%description help
![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)

`razdel` — a rule-based system for Russian sentence and word tokenization.

## Usage

```python
>>> from razdel import tokenize

>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
 Substring(14, 16, 'на'),
 Substring(17, 20, '0.5'),
 Substring(20, 21, 'л'),
 Substring(22, 23, '(')
 ...]

>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```

```python
>>> from razdel import sentenize

>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''

>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
 Substring(24, 40, '- "Не ра-ду-ют".'),
 Substring(41, 56, 'И т. д. и т. п.'),
 Substring(57, 76, 'В общем, вся газета')]
```

## Installation

`razdel` supports Python 3.5+ and PyPy 3.

```bash
$ pip install razdel
```

## Quality, performance

Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.

`razdel` tries to mimic the segmentation of four datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles or legal documents.

We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`.
Because of the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead (a quick demonstration of such a trivial case follows below).

`errors` — the number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.

`time` — total seconds taken.

`spacy_tokenize`, `aatimofeev` and others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
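To see just how trivial the trivial cases are, the `str.split` example above needs nothing beyond the standard library:

```python
>>> 'в 5 часов ...'.split()
['в', '5', 'часов', '...']
```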
### Tokens

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
### Sentences

| | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
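The two segmenters compose naturally; a short end-to-end sketch using only the API shown in the usage section above (output omitted, since the exact token boundaries depend on the rules):

```python
from razdel import sentenize, tokenize

text = 'И т. д. и т. п. В общем, вся газета'
for sentence in sentenize(text):
    # tokenize each sentence span independently
    words = [token.text for token in tokenize(sentence.text)]
    print(sentence.text, '->', words)
```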
## Support

- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues

## Development

Test:

```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int  # 2000 integration tests
```

Package:

```bash
make version
git push
git push --tags
make clean wheel upload
```

`mystem` errors on `syntag`:

```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```

Non-trivial token tests:

```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```

Update integration tests:

```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```

`razdel` and `moses` diff:

```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```

`razdel` performance:

```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```

%prep
%autosetup -n razdel-0.5.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-razdel -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri May 05 2023 Python_Bot - 0.5.0-1
- Package Spec generated