%global _empty_manifest_terminate_build 0
Name: python-rusenttokenize
Version: 0.0.5
Release: 1
Summary: Rule-based sentence tokenizer for the Russian language
License: Apache Software License
URL: https://github.com/deepmipt/ru_sentence_tokenizer
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/6d/76/1226e1ddc11ad492a191664a4926c607bcbf1e5b352134ca6f83c4af8205/rusenttokenize-0.0.5.tar.gz
BuildArch: noarch
%description
# ru_sent_tokenize
A simple and fast rule-based sentence segmenter, tested on the OpenCorpora and SynTagRus datasets.
# Installation
```
pip install rusenttokenize
```
# Running
```ipython
>>> from rusenttokenize import ru_sent_tokenize
>>> ru_sent_tokenize('Эта шоколадка за 400р. ничего из себя не представляла. Артём решил больше не ходить в этот магазин')
['Эта шоколадка за 400р. ничего из себя не представляла.', 'Артём решил больше не ходить в этот магазин']
```
# Metrics
The tokenizer was evaluated on OpenCorpora and SynTagRus using two metrics.
Precision: single sentences were taken from the datasets, and we measured how often the tokenizer left them unsplit.
Recall: pairs of consecutive sentences were taken from the datasets and joined with a space character; we measured how often the tokenizer correctly split the joined text back into the two original sentences.
| tokenizer | OpenCorpora Precision | OpenCorpora Recall | OpenCorpora Execution Time (sec) | SynTagRus Precision | SynTagRus Recall | SynTagRus Execution Time (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| nltk.sent_tokenize | 94.30 | 86.06 | 8.67 | 98.15 | 94.95 | 5.07 |
| nltk.sent_tokenize(x, language='russian') | 95.53 | 88.37 | 8.54 | 98.44 | 95.45 | 5.68 |
| bureaucratic-labs.segmentator.split | 97.16 | 88.62 | 359 | 96.79 | 92.55 | 210 |
| ru_sent_tokenize | 98.73 | 93.45 | 4.92 | 99.81 | 98.59 | 2.87 |
The [notebook](https://github.com/deepmipt/ru_sentence_tokenizer/blob/master/metrics/calculate.ipynb) shows how the table above was calculated.
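The exact evaluation code lives in that notebook; as a rough sketch only (the helper names and the exact matching criterion below are illustrative assumptions, not taken from the notebook), the two metrics could be computed along these lines:
```python
from rusenttokenize import ru_sent_tokenize

def precision(sentences):
    # Share (%) of single gold sentences that the tokenizer leaves unsplit.
    kept = sum(1 for s in sentences if len(ru_sent_tokenize(s)) == 1)
    return 100.0 * kept / len(sentences)

def recall(sentences):
    # Share (%) of adjacent sentence pairs, joined with a space, that the
    # tokenizer splits back into exactly the two original sentences
    # (the notebook's criterion may differ in detail).
    pairs = list(zip(sentences, sentences[1:]))
    correct = sum(1 for a, b in pairs
                  if ru_sent_tokenize(a + ' ' + b) == [a, b])
    return 100.0 * correct / len(pairs)
```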
%package -n python3-rusenttokenize
Summary: Rule-based sentence tokenizer for the Russian language
Provides: python-rusenttokenize
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-rusenttokenize
# ru_sent_tokenize
A simple and fast rule-based sentence segmenter, tested on the OpenCorpora and SynTagRus datasets.
# Installation
```
pip install rusenttokenize
```
# Running
```ipython
>>> from rusenttokenize import ru_sent_tokenize
>>> ru_sent_tokenize('Эта шоколадка за 400р. ничего из себя не представляла. Артём решил больше не ходить в этот магазин')
['Эта шоколадка за 400р. ничего из себя не представляла.', 'Артём решил больше не ходить в этот магазин']
```
# Metrics
The tokenizer was evaluated on OpenCorpora and SynTagRus using two metrics.
Precision: single sentences were taken from the datasets, and we measured how often the tokenizer left them unsplit.
Recall: pairs of consecutive sentences were taken from the datasets and joined with a space character; we measured how often the tokenizer correctly split the joined text back into the two original sentences.
| tokenizer | OpenCorpora Precision | OpenCorpora Recall | OpenCorpora Execution Time (sec) | SynTagRus Precision | SynTagRus Recall | SynTagRus Execution Time (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| nltk.sent_tokenize | 94.30 | 86.06 | 8.67 | 98.15 | 94.95 | 5.07 |
| nltk.sent_tokenize(x, language='russian') | 95.53 | 88.37 | 8.54 | 98.44 | 95.45 | 5.68 |
| bureaucratic-labs.segmentator.split | 97.16 | 88.62 | 359 | 96.79 | 92.55 | 210 |
| ru_sent_tokenize | 98.73 | 93.45 | 4.92 | 99.81 | 98.59 | 2.87 |
The [notebook](https://github.com/deepmipt/ru_sentence_tokenizer/blob/master/metrics/calculate.ipynb) shows how the table above was calculated.
%package help
Summary: Development documents and examples for rusenttokenize
Provides: python3-rusenttokenize-doc
%description help
# ru_sent_tokenize
A simple and fast rule-based sentence segmenter, tested on the OpenCorpora and SynTagRus datasets.
# Installation
```
pip install rusenttokenize
```
# Running
```ipython
>>> from rusenttokenize import ru_sent_tokenize
>>> ru_sent_tokenize('Эта шоколадка за 400р. ничего из себя не представляла. Артём решил больше не ходить в этот магазин')
['Эта шоколадка за 400р. ничего из себя не представляла.', 'Артём решил больше не ходить в этот магазин']
```
# Metrics
The tokenizer was evaluated on OpenCorpora and SynTagRus using two metrics.
Precision: single sentences were taken from the datasets, and we measured how often the tokenizer left them unsplit.
Recall: pairs of consecutive sentences were taken from the datasets and joined with a space character; we measured how often the tokenizer correctly split the joined text back into the two original sentences.
| tokenizer | OpenCorpora Precision | OpenCorpora Recall | OpenCorpora Execution Time (sec) | SynTagRus Precision | SynTagRus Recall | SynTagRus Execution Time (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| nltk.sent_tokenize | 94.30 | 86.06 | 8.67 | 98.15 | 94.95 | 5.07 |
| nltk.sent_tokenize(x, language='russian') | 95.53 | 88.37 | 8.54 | 98.44 | 95.45 | 5.68 |
| bureaucratic-labs.segmentator.split | 97.16 | 88.62 | 359 | 96.79 | 92.55 | 210 |
| ru_sent_tokenize | 98.73 | 93.45 | 4.92 | 99.81 | 98.59 | 2.87 |
The [notebook](https://github.com/deepmipt/ru_sentence_tokenizer/blob/master/metrics/calculate.ipynb) shows how the table above was calculated.
%prep
%autosetup -n rusenttokenize-0.0.5
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-rusenttokenize -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot - 0.0.5-1
- Package Spec generated