From a8b3a2fd62c4679458c0719e277515938948f886 Mon Sep 17 00:00:00 2001
From: CoprDistGit
Date: Fri, 5 May 2023 06:20:58 +0000
Subject: automatic import of python-razdel

---
 .gitignore         |    1 +
 python-razdel.spec | 1107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 sources            |    1 +
 3 files changed, 1109 insertions(+)
 create mode 100644 python-razdel.spec
 create mode 100644 sources

diff --git a/.gitignore b/.gitignore
index e69de29..4ce610d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/razdel-0.5.0.tar.gz

diff --git a/python-razdel.spec b/python-razdel.spec
new file mode 100644
index 0000000..f01264c
--- /dev/null
+++ b/python-razdel.spec
@@ -0,0 +1,1107 @@
+%global _empty_manifest_terminate_build 0
+Name:           python-razdel
+Version:        0.5.0
+Release:        1
+Summary:        Splits Russian text into tokens, sentences, sections. Rule-based.
+License:        MIT
+URL:            https://github.com/natasha/razdel
+Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/70/ea/0151ae55bd26699487e668a865ef43e49409025c7464569beffe1a5789f0/razdel-0.5.0.tar.gz
+BuildArch:      noarch
+
+
+%description
+
+![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)
+
+`razdel` — rule-based system for Russian sentence and word tokenization.
+
+## Usage
+
+```python
+>>> from razdel import tokenize
+
+>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
+>>> tokens
+[Substring(0, 13, 'Кружка-термос'),
+ Substring(14, 16, 'на'),
+ Substring(17, 20, '0.5'),
+ Substring(20, 21, 'л'),
+ Substring(22, 23, '('),
+ ...]
+
+>>> [_.text for _ in tokens]
+['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
+```
+
+```python
+>>> from razdel import sentenize
+
+>>> text = '''
+... - "Так в чем же дело?" - "Не ра-ду-ют".
+... И т. д. и т. п. В общем, вся газета
+... '''
+
+>>> list(sentenize(text))
+[Substring(1, 23, '- "Так в чем же дело?"'),
+ Substring(24, 40, '- "Не ра-ду-ют".'),
+ Substring(41, 56, 'И т. д. и т. п.'),
+ Substring(57, 76, 'В общем, вся газета')]
+```
+
+## Installation
+
+`razdel` supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install razdel
+```
+
+## Quality, performance
+
+Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
+
+`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, so `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles and legal documents.
+
+We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!`, while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`.
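+
+To make the trivial/non-trivial distinction concrete, here is a small illustrative sketch (not from the library's docs; the expected outputs follow from the examples above):
+
+```python
+from razdel import tokenize
+
+text = 'в 5 часов ...'
+
+# Plain whitespace splitting already handles this trivial case...
+assert text.split() == ['в', '5', 'часов', '...']
+
+# ...and razdel yields the same tokens here, plus character offsets.
+assert [_.text for _ in tokenize(text)] == ['в', '5', 'часов', '...']
+```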
+
+Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors.
+
+`errors` — number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, then the number of errors is 3: 1 for the missing split `то?` plus 2 for the extra splits `что|-|то`.
+
+`time` — total seconds taken.
+
+`spacy_tokenize`, `aatimofeev` and the others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
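+
+As an illustration, here is a minimal sketch of this error count, assuming segmentations are compared by the character positions of their `|` boundaries (`split_positions` and `count_errors` are hypothetical helpers written for this example, not part of `razdel` or `naeval`):
+
+```python
+def split_positions(segmented):
+    """Character offsets of the '|' split boundaries, bars excluded."""
+    positions, offset = set(), 0
+    for chunk in segmented.split('|')[:-1]:
+        offset += len(chunk)
+        positions.add(offset)
+    return positions
+
+
+def count_errors(etalon, prediction):
+    """Count splits present in exactly one of the two segmentations."""
+    return len(split_positions(etalon) ^ split_positions(prediction))
+
+
+# 1 missing split ('то?') + 2 extra splits ('что|-|то') = 3 errors
+assert count_errors('что-то|?', 'что|-|то?') == 3
+```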
+
+### Tokens
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
+| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
+| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
+| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
+| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
+| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
+| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
+| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
+| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
+
+### Sentences
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
+| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
+| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
+| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
+| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
+| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/razdel/issues
+
+## Development
+
+Test:
+
+```bash
+pip install -e .
+pip install -r requirements/ci.txt
+make test
+make int # 2000 integration tests
+```
+
+Package:
+
+```bash
+make version
+git push
+git push --tags
+
+make clean wheel upload
+```
+
+`mystem` errors on `syntag`:
+
+```bash
+# see naeval/data
+cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
+```
+
+Non-trivial token tests:
+
+```bash
+pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
+pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
+```
+
+Update integration tests:
+
+```bash
+cd razdel/tests/data/
+pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
+```
+
+`razdel` and `moses` diff:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
+```
+
+`razdel` performance:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
+```
+
+
+%package -n python3-razdel
+Summary:        Splits Russian text into tokens, sentences, sections. Rule-based.
+Provides:       python-razdel
+BuildRequires:  python3-devel
+BuildRequires:  python3-setuptools
+BuildRequires:  python3-pip
+%description -n python3-razdel
+
+![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)
+
+`razdel` — rule-based system for Russian sentence and word tokenization.
+
+## Usage
+
+```python
+>>> from razdel import tokenize
+
+>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
+>>> tokens
+[Substring(0, 13, 'Кружка-термос'),
+ Substring(14, 16, 'на'),
+ Substring(17, 20, '0.5'),
+ Substring(20, 21, 'л'),
+ Substring(22, 23, '('),
+ ...]
+
+>>> [_.text for _ in tokens]
+['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
+```
+
+```python
+>>> from razdel import sentenize
+
+>>> text = '''
+... - "Так в чем же дело?" - "Не ра-ду-ют".
+... И т. д. и т. п. В общем, вся газета
+... '''
+
+>>> list(sentenize(text))
+[Substring(1, 23, '- "Так в чем же дело?"'),
+ Substring(24, 40, '- "Не ра-ду-ют".'),
+ Substring(41, 56, 'И т. д. и т. п.'),
+ Substring(57, 76, 'В общем, вся газета')]
+```
+
+## Installation
+
+`razdel` supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install razdel
+```
+
+## Quality, performance
+
+Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
+
+`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, so `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles and legal documents.
+
+We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task.
+For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!`, while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`.
+
+Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors.
+
+`errors` — number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, then the number of errors is 3: 1 for the missing split `то?` plus 2 for the extra splits `что|-|то`.
+
+`time` — total seconds taken.
+
+`spacy_tokenize`, `aatimofeev` and the others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
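+
+As an illustration, here is a minimal sketch of this error count, assuming segmentations are compared by the character positions of their `|` boundaries (`split_positions` and `count_errors` are hypothetical helpers written for this example, not part of `razdel` or `naeval`):
+
+```python
+def split_positions(segmented):
+    """Character offsets of the '|' split boundaries, bars excluded."""
+    positions, offset = set(), 0
+    for chunk in segmented.split('|')[:-1]:
+        offset += len(chunk)
+        positions.add(offset)
+    return positions
+
+
+def count_errors(etalon, prediction):
+    """Count splits present in exactly one of the two segmentations."""
+    return len(split_positions(etalon) ^ split_positions(prediction))
+
+
+# 1 missing split ('то?') + 2 extra splits ('что|-|то') = 3 errors
+assert count_errors('что-то|?', 'что|-|то?') == 3
+```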
+
+### Tokens
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
+| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
+| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
+| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
+| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
+| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
+| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
+| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
+| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
+
+### Sentences
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
+| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
+| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
+| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
+| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
+| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/razdel/issues
+
+## Development
+
+Test:
+
+```bash
+pip install -e .
+pip install -r requirements/ci.txt
+make test
+make int # 2000 integration tests
+```
+
+Package:
+
+```bash
+make version
+git push
+git push --tags
+
+make clean wheel upload
+```
+
+`mystem` errors on `syntag`:
+
+```bash
+# see naeval/data
+cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
+```
+
+Non-trivial token tests:
+
+```bash
+pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
+pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
+```
+
+Update integration tests:
+
+```bash
+cd razdel/tests/data/
+pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
+```
+
+`razdel` and `moses` diff:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
+```
+
+`razdel` performance:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
+```
+
+
+%package help
+Summary:        Development documents and examples for razdel
+Provides:       python3-razdel-doc
+%description help
+
+![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)
+
+`razdel` — rule-based system for Russian sentence and word tokenization.
+
+## Usage
+
+```python
+>>> from razdel import tokenize
+
+>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
+>>> tokens
+[Substring(0, 13, 'Кружка-термос'),
+ Substring(14, 16, 'на'),
+ Substring(17, 20, '0.5'),
+ Substring(20, 21, 'л'),
+ Substring(22, 23, '('),
+ ...]
+
+>>> [_.text for _ in tokens]
+['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
+```
+
+```python
+>>> from razdel import sentenize
+
+>>> text = '''
+... - "Так в чем же дело?" - "Не ра-ду-ют".
+... И т. д. и т. п. В общем, вся газета
+... '''
+
+>>> list(sentenize(text))
+[Substring(1, 23, '- "Так в чем же дело?"'),
+ Substring(24, 40, '- "Не ра-ду-ют".'),
+ Substring(41, 56, 'И т. д. и т. п.'),
+ Substring(57, 76, 'В общем, вся газета')]
+```
+
+## Installation
+
+`razdel` supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install razdel
+```
+
+## Quality, performance
+
+Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
+
+`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, so `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles and legal documents.
+
+We measure the absolute number of errors. There are a lot of trivial cases in the tokenization task.
+For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!`, while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: for example, the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`.
+
+Due to the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors.
+
+`errors` — number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, then the number of errors is 3: 1 for the missing split `то?` plus 2 for the extra splits `что|-|то`.
+
+`time` — total seconds taken.
+
+`spacy_tokenize`, `aatimofeev` and the others are defined in naeval/segment/models.py. Tables are computed in segment/main.ipynb.
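+
+As an illustration, here is a minimal sketch of this error count, assuming segmentations are compared by the character positions of their `|` boundaries (`split_positions` and `count_errors` are hypothetical helpers written for this example, not part of `razdel` or `naeval`):
+
+```python
+def split_positions(segmented):
+    """Character offsets of the '|' split boundaries, bars excluded."""
+    positions, offset = set(), 0
+    for chunk in segmented.split('|')[:-1]:
+        offset += len(chunk)
+        positions.add(offset)
+    return positions
+
+
+def count_errors(etalon, prediction):
+    """Count splits present in exactly one of the two segmentations."""
+    return len(split_positions(etalon) ^ split_positions(prediction))
+
+
+# 1 missing split ('то?') + 2 extra splits ('что|-|то') = 3 errors
+assert count_errors('что-то|?', 'что|-|то?') == 3
+```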
+
+### Tokens
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.findall(\w+\|\d+\|\p+)` | 4161 | 0.5 | 2660 | 0.5 | 2277 | 0.4 | 7606 | 0.4 |
+| `spacy` | 4388 | 6.2 | 2103 | 5.8 | 1740 | 4.1 | 4057 | 3.9 |
+| `nltk.word_tokenize` | 14245 | 3.4 | 60893 | 3.3 | 13496 | 2.7 | 41485 | 2.9 |
+| `mystem` | 4514 | 5.0 | 3153 | 4.7 | 2497 | 3.7 | 2028 | 3.9 |
+| `mosestokenizer` | 1886 | 2.1 | 1330 | 1.9 | 1796 | 1.6 | 2123 | 1.7 |
+| `segtok.word_tokenize` | 2772 | 2.3 | 1288 | 2.3 | 1759 | 1.8 | 1229 | 1.8 |
+| `aatimofeev/spacy_russian_tokenizer` | 2930 | 48.7 | 719 | 51.1 | 678 | 39.5 | 2681 | 52.2 |
+| `koziev/rutokenizer` | 2627 | 1.1 | 1386 | 1.0 | 2893 | 0.8 | 9411 | 0.9 |
+| `razdel.tokenize` | 1510 | 2.9 | 1483 | 2.8 | 322 | 2.0 | 2124 | 2.2 |
+
+### Sentences
+
+| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
+|---|---|---|---|---|---|---|---|---|
+| `re.split([.?!…])` | 20456 | 0.9 | 6576 | 0.6 | 10084 | 0.7 | 23356 | 1.0 |
+| `segtok.split_single` | 19008 | 17.8 | 4422 | 13.4 | 159738 | 1.1 | 164218 | 2.8 |
+| `mosestokenizer` | 41666 | 8.9 | 22082 | 5.7 | 12663 | 6.4 | 50560 | 7.4 |
+| `nltk.sent_tokenize` | 16420 | 10.1 | 4350 | 5.3 | 7074 | 5.6 | 32534 | 8.9 |
+| `deeppavlov/rusenttokenize` | 10192 | 10.9 | 1210 | 7.9 | 8910 | 6.8 | 21410 | 7.0 |
+| `razdel.sentenize` | 9274 | 6.1 | 824 | 3.9 | 11414 | 4.5 | 10594 | 7.5 |
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/razdel/issues
+
+## Development
+
+Test:
+
+```bash
+pip install -e .
+pip install -r requirements/ci.txt
+make test
+make int # 2000 integration tests
+```
+
+Package:
+
+```bash
+make version
+git push
+git push --tags
+
+make clean wheel upload
+```
+
+`mystem` errors on `syntag`:
+
+```bash
+# see naeval/data
+cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
+```
+
+Non-trivial token tests:
+
+```bash
+pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
+pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
+```
+
+Update integration tests:
+
+```bash
+cd razdel/tests/data/
+pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
+```
+
+`razdel` and `moses` diff:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
+```
+
+`razdel` performance:
+
+```bash
+cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
+```
+
+
+%prep
+%autosetup -n razdel-0.5.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-razdel -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Fri May 05 2023 Python_Bot - 0.5.0-1
+- Package Spec generated

diff --git a/sources b/sources
new file mode 100644
index 0000000..9705e9a
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+638852a3b703aaa57927e1e40a1a74dc razdel-0.5.0.tar.gz
--
cgit v1.2.3