author CoprDistGit <infra@openeuler.org> 2023-05-10 07:48:35 +0000
committer CoprDistGit <infra@openeuler.org> 2023-05-10 07:48:35 +0000
commit 4c4a0c4b39b55886bc48b3548a279a988d9387b8
tree c7aea2ca174e22ad4f9dc6d7702ca5f726d2c763
parent 36934aa7cf6dd1046ec0bb768ab5df44e8b0d08f
automatic import of python-slovnet
-rw-r--r-- .gitignore 1
-rw-r--r-- python-slovnet.spec 3129
-rw-r--r-- sources 1
3 files changed, 3131 insertions, 0 deletions
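
Note on the packaged README: its Usage section (in the `%description` text of the spec below) says that an annotator's `map` method takes a list of items, processes them internally in batches of `batch_size` (default 8), and yields one markup per item, with `__call__` delegating to `map` on a one-item list. The following is a minimal Python sketch of that batching pattern only. It is not Slovnet's implementation; `batched`, `map_in_batches`, `annotate_one` and the `annotate_batch` callable are hypothetical names introduced for illustration.

```python
from itertools import islice


def batched(items, batch_size=8):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def map_in_batches(annotate_batch, items, batch_size=8):
    """Lazily yield one result per input item, working batch by batch.

    annotate_batch is a hypothetical callable (an assumption, not a
    Slovnet API): it takes a list of items and returns a list of results
    of the same length.
    """
    for batch in batched(items, batch_size):
        yield from annotate_batch(batch)


def annotate_one(annotate_batch, item):
    """Single-item convenience: run map_in_batches on a one-item list."""
    return next(map_in_batches(annotate_batch, [item]))
```

With a stand-in `annotate_batch` that uppercases strings, `list(map_in_batches(annotate_batch, ['a', 'b', 'c'], batch_size=2))` processes batches of sizes 2 and 1 and returns `['A', 'B', 'C']`. A larger `batch_size` means fewer calls at the cost of holding more items in memory at once, which is the trade-off the README describes.
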
diff --git a/.gitignore b/.gitignore
index e69de29..2ea4263 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/slovnet-0.6.0.tar.gz
diff --git a/python-slovnet.spec b/python-slovnet.spec
new file mode 100644
index 0000000..803a668
--- /dev/null
+++ b/python-slovnet.spec
@@ -0,0 +1,3129 @@
+%global _empty_manifest_terminate_build 0
+Name: python-slovnet
+Version: 0.6.0
+Release: 1
+Summary: Deep-learning based NLP modeling for Russian language
+License: MIT
+URL: https://github.com/natasha/slovnet
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/3e/d1/bba34dec46f1fcb85ca35815268be427ee89d09728c85ae4ab294dd9db09/slovnet-0.6.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-razdel
+Requires: python3-navec
+
+%description
+
+<img src="https://github.com/natasha/natasha-logos/blob/master/slovnet.svg">
+
+![CI](https://github.com/natasha/slovnet/actions/workflows/test.yml/badge.svg)
+
+SlovNet is a Python library for deep-learning based NLP modeling for Russian language. Library is integrated with other <a href="https://github.com/natasha/">Natasha</a> projects: <a href="https://github.com/natasha/nerus">Nerus</a> — large automatically annotated corpus, <a href="https://github.com/natasha/razdel">Razdel</a> — sentence segmenter, tokenizer and <a href="https://github.com/natasha/navec">Navec</a> — compact Russian embeddings. Slovnet provides high quality practical models for Russian NER, morphology and syntax, see <a href="#evaluation">evaluation section</a> for more:
+
+* NER is 1-2% worse than current BERT SOTA by DeepPavlov but 60 times smaller in size (~30 MB) and works fast on CPU (~25 news articles/sec).
+* Morphology tagger and syntax parser have comparable accuracy on news dataset with large SOTA BERT models, take 50 times less space (~30 MB), work faster on CPU (~500 sentences/sec).
+
+## Downloads
+
+<table>
+
+<tr>
+<th>Model</th>
+<th>Size</th>
+<th>Description</th>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_ner_news_v1.tar">slovnet_ner_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian NER, standard PER, LOC, ORG annotation, trained on news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_morph_news_v1.tar">slovnet_morph_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian morphology tagger optimized for news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_syntax_news_v1.tar">slovnet_syntax_news_v1.tar</a>
+</td>
+<td>3MB</td>
+<td>
+ Russian syntax parser optimized for news articles.
+</td>
+</tr>
+
+</table>
+
+## Install
+
+During inference Slovnet depends only on NumPy. The library supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install slovnet
+```
+
+## Usage
+
+Download the model weights and vocabs package using the links in the <a href="#downloads">downloads section</a> and the <a href="https://github.com/natasha/navec#downloads">Navec download section</a>. Optionally install <a href="https://github.com/natasha/ipymarkup">Ipymarkup</a> to visualize NER markup.
+
+The `map` method of a Slovnet annotator takes a list of items and returns an iterator that yields one markup per item. Internally, items are processed in batches of size `batch_size`. The default is 8; a larger batch uses more RAM but makes better use of the CPU. The `__call__` method simply calls `map` with a one-item list.
+
+### NER
+
+```python
+>>> from navec import Navec
+>>> from slovnet import NER
+>>> from ipymarkup import show_span_ascii_markup as show_markup
+
+>>> text = 'Европейский союз добавил в санкционный список девять политических деятелей из самопровозглашенных республик Донбасса — Донецкой народной республики (ДНР) и Луганской народной республики (ЛНР) — в связи с прошедшими там выборами. Об этом говорится в документе, опубликованном в официальном журнале Евросоюза. В новом списке фигурирует Леонид Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там присутствуют Владимир Бидевка и Денис Мирошниченко, председатели законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена Кравченко, председатели ЦИК обеих республик. Выборы прошли в непризнанных республиках Донбасса 11 ноября. На них удержали лидерство действующие руководители и партии — Денис Пушилин и «Донецкая республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР. Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после встречи с украинским лидером Петром Порошенко осудили проведение выборов, заявив, что они нелегитимны и «подрывают территориальную целостность и суверенитет Украины». Позже к осуждению присоединились США с обещаниями новых санкций для России.'
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> ner = NER.load('slovnet_ner_news_v1.tar')
+>>> ner.navec(navec)
+
+>>> markup = ner(text)
+>>> show_markup(markup.text, markup.spans)
+Европейский союз добавил в санкционный список девять политических
+LOC─────────────
+деятелей из самопровозглашенных республик Донбасса — Донецкой народной
+ LOC───── LOC──────────────
+ республики (ДНР) и Луганской народной республики (ЛНР) — в связи с
+───────────────── LOC────────────────────────────────
+прошедшими там выборами. Об этом говорится в документе, опубликованном
+ в официальном журнале Евросоюза. В новом списке фигурирует Леонид
+ LOC────── PER────
+Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там
+──────── LOC
+присутствуют Владимир Бидевка и Денис Мирошниченко, председатели
+ PER───────────── PER───────────────
+законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена
+ LOC LOC PER───────────── PER───
+Кравченко, председатели ЦИК обеих республик. Выборы прошли в
+───────── ORG
+непризнанных республиках Донбасса 11 ноября. На них удержали лидерство
+ LOC─────
+ действующие руководители и партии — Денис Пушилин и «Донецкая
+ PER────────── ORG──────
+республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР.
+────────── LOC PER──────────── ORG────────── LOC
+ Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после
+ LOC──── PER───────────── LOC PER───────────
+ встречи с украинским лидером Петром Порошенко осудили проведение
+ PER─────────────
+выборов, заявив, что они нелегитимны и «подрывают территориальную
+целостность и суверенитет Украины». Позже к осуждению присоединились
+ LOC────
+США с обещаниями новых санкций для России.
+LOC LOC───
+
+```
+
+### Morphology
+
+The morphology annotator processes tokenized text. To split the input into sentences and tokens, use <a href="https://github.com/natasha/razdel">Razdel</a>.
+
+```python
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Morph
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> morph = Morph.load('slovnet_morph_news_v1.tar', batch_size=4)
+>>> morph.navec(navec)
+
+>>> markup = next(morph.map(chunk))
+>>> for token in markup.tokens:
+>>>     print(f'{token.text:>20} {token.tag}')
+         Европейский ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
+                союз NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing
+             добавил VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
+                   в ADP
+         санкционный ADJ|Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing
+              список NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing
+              девять NUM|Case=Nom
+        политических ADJ|Case=Gen|Degree=Pos|Number=Plur
+            деятелей NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
+                  из ADP
+ самопровозглашенных ADJ|Case=Gen|Degree=Pos|Number=Plur
+           республик NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur
+            Донбасса PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
+                   — PUNCT
+            Донецкой ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+            народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+          республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+                   ( PUNCT
+                 ДНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+                   ) PUNCT
+                   и CCONJ
+           Луганской ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+            народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+          республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+                   ( PUNCT
+                 ЛНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+                   ) PUNCT
+                   — PUNCT
+                   в ADP
+               связи NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
+                   с ADP
+          прошедшими VERB|Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act
+                 там ADV|Degree=Pos
+            выборами NOUN|Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur
+                   . PUNCT
+
+```
+
+### Syntax
+
+The syntax parser processes sentences split into tokens. Use <a href="https://github.com/natasha/razdel">Razdel</a> for segmentation.
+
+```python
+>>> from ipymarkup import show_dep_ascii_markup as show_markup
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Syntax
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> syntax = Syntax.load('slovnet_syntax_news_v1.tar')
+>>> syntax.navec(navec)
+
+>>> markup = next(syntax.map(chunk))
+
+# Convert CoNLL-style format to source, target indices
+>>> words, deps = [], []
+>>> for token in markup.tokens:
+>>>     words.append(token.text)
+>>>     source = int(token.head_id) - 1
+>>>     target = int(token.id) - 1
+>>>     if source >= 0 and source != target:  # skip root (head_id 0), loops
+>>>         deps.append([source, target, token.rel])
+>>> show_markup(words, deps)
+ ┌► Европейский amod
+ ┌►└─ союз nsubj
+┌───────┌─┌─└─── добавил
+│ │ │ ┌──► в case
+│ │ │ │ ┌► санкционный amod
+│ │ └►└─└─ список obl
+│ │ ┌──► девять nummod:gov
+│ │ │ ┌► политических amod
+│ ┌─────└►┌─└─└─ деятелей obj
+│ │ │ ┌──► из case
+│ │ │ │ ┌► самопровозглашенных amod
+│ │ └►└─└─ республик nmod
+│ │ └──► Донбасса nmod
+│ │ ┌──────────► — punct
+│ │ │ ┌──► Донецкой amod
+│ │ │ │ ┌► народной amod
+│ │ │ ┌─┌─┌─└─└─ республики
+│ │ │ │ │ │ ┌► ( punct
+│ │ │ │ │ └►┌─└─ ДНР parataxis
+│ │ │ │ │ └──► ) punct
+│ │ │ │ │ ┌────► и cc
+│ │ │ │ │ │ ┌──► Луганской amod
+│ │ │ │ │ │ │ ┌► народной amod
+│ │ └─│ └►└─└─└─ республики conj
+│ │ │ ┌► ( punct
+│ │ └────►┌─└─ ЛНР parataxis
+│ │ └──► ) punct
+│ │ ┌──────► — punct
+│ │ │ ┌►┌─┌─ в case
+│ │ │ │ │ └► связи fixed
+│ │ │ │ └──► с fixed
+│ │ │ │ ┌►┌─ прошедшими acl
+│ │ │ │ │ └► там advmod
+│ └────►└─└─└─── выборами nmod
+└──────────────► . punct
+
+```
+
+## Documentation
+
+Materials are in Russian:
+
+* <a href="https://natasha.github.io/ner">Article about distillation and quantization in Slovnet</a>
+* <a href="https://youtu.be/-7XT_U6hVvk?t=2034">Slovnet section of Datafest 2020 talk</a>
+
+## Evaluation
+
+In addition to quality metrics, we measure speed and model size, parameters that are important in production:
+
+* `init` — time between system launch and first response. It is convenient for testing and devops to have a model that starts quickly.
+* `disk` — file size of artefacts one needs to download before using the system: model weights, embeddings, binaries, vocabs. It is convenient to deploy compact models in production.
+* `ram` — average CPU/GPU RAM usage.
+* `speed` — number of input items processed per second: news articles, tokenized sentences.
+
+### NER
+
+4 datasets are used for evaluation: <a href="https://github.com/natasha/corus#load_factru"><code>factru</code></a>, <a href="https://github.com/natasha/corus#load_gareev"><code>gareev</code></a>, <a href="https://github.com/natasha/corus#load_ne5"><code>ne5</code></a> and <a href="https://github.com/natasha/corus#load_bsnlp"><code>bsnlp</code></a>. Slovnet is compared to <a href="https://github.com/natasha/naeval#deeppavlov_ner"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_ner"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_slavic_bert_ner"><code>deeppavlov_slavic</code></a>, <a href="https://github.com/natasha/naeval#pullenti"><code>pullenti</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>, <a href="https://github.com/natasha/naeval#texterra"><code>texterra</code></a>, <a href="https://github.com/natasha/naeval#tomita"><code>tomita</code></a>, <a href="https://github.com/natasha/naeval#mitie"><code>mitie</code></a>.
+
+For every column top 3 results are highlighted:
+
+<!--- ner1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="3" halign="left">factru</th>
+ <th colspan="2" halign="left">gareev</th>
+ <th colspan="3" halign="left">ne5</th>
+ <th colspan="3" halign="left">bsnlp</th>
+ </tr>
+ <tr>
+ <th>f1</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.959</b></td>
+ <td><b>0.915</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.977</b></td>
+ <td><b>0.899</b></td>
+ <td><b>0.984</b></td>
+ <td><b>0.973</b></td>
+ <td><b>0.951</b></td>
+ <td>0.944</td>
+ <td>0.834</td>
+ <td>0.718</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.973</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.831</b></td>
+ <td><b>0.991</b></td>
+ <td><b>0.911</b></td>
+ <td><b>0.996</b></td>
+ <td><b>0.989</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.960</b></td>
+ <td>0.838</td>
+ <td><b>0.733</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.910</td>
+ <td>0.886</td>
+ <td>0.742</td>
+ <td>0.944</td>
+ <td>0.798</td>
+ <td>0.942</td>
+ <td>0.919</td>
+ <td>0.881</td>
+ <td>0.866</td>
+ <td>0.767</td>
+ <td>0.624</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.971</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.980</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.997</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.954</b></td>
+ <td><b>0.840</b></td>
+ <td><b>0.741</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>0.956</td>
+ <td>0.884</td>
+ <td>0.714</td>
+ <td>0.976</td>
+ <td>0.776</td>
+ <td>0.984</td>
+ <td>0.817</td>
+ <td>0.761</td>
+ <td><b>0.965</b></td>
+ <td><b>0.925</b></td>
+ <td><b>0.831</b></td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td>0.905</td>
+ <td>0.814</td>
+ <td>0.686</td>
+ <td>0.939</td>
+ <td>0.639</td>
+ <td>0.952</td>
+ <td>0.862</td>
+ <td>0.683</td>
+ <td>0.900</td>
+ <td>0.769</td>
+ <td>0.566</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>0.901</td>
+ <td>0.886</td>
+ <td>0.765</td>
+ <td>0.970</td>
+ <td>0.883</td>
+ <td>0.967</td>
+ <td>0.928</td>
+ <td>0.918</td>
+ <td>0.919</td>
+ <td>0.823</td>
+ <td>0.693</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.943</td>
+ <td>0.865</td>
+ <td>0.687</td>
+ <td>0.953</td>
+ <td>0.827</td>
+ <td>0.923</td>
+ <td>0.753</td>
+ <td>0.734</td>
+ <td>0.938</td>
+ <td><b>0.838</b></td>
+ <td>0.724</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>0.900</td>
+ <td>0.800</td>
+ <td>0.597</td>
+ <td>0.888</td>
+ <td>0.561</td>
+ <td>0.901</td>
+ <td>0.777</td>
+ <td>0.594</td>
+ <td>0.858</td>
+ <td>0.783</td>
+ <td>0.548</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td>0.929</td>
+ <td></td>
+ <td></td>
+ <td>0.921</td>
+ <td></td>
+ <td>0.945</td>
+ <td></td>
+ <td></td>
+ <td>0.881</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>0.888</td>
+ <td>0.861</td>
+ <td>0.532</td>
+ <td>0.849</td>
+ <td>0.452</td>
+ <td>0.753</td>
+ <td>0.642</td>
+ <td>0.432</td>
+ <td>0.736</td>
+ <td>0.801</td>
+ <td>0.524</td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner1 --->
+
+`it/s` — news articles per second, 1 article ≈ 1KB.
+
+<!--- ner2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>205</b></td>
+ <td>25.3</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>473</td>
+ <td>9500</td>
+ <td><b>40.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>5.9</td>
+ <td>1024</td>
+ <td>3072</td>
+ <td>24.3 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.5</td>
+ <td>2048</td>
+ <td>6144</td>
+ <td>13.1 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>35.0</td>
+ <td>2048</td>
+ <td>4096</td>
+ <td>8.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td><b>2.9</b></td>
+ <td><b>16</b></td>
+ <td><b>253</b></td>
+ <td>6.0</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>625</td>
+ <td>8.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>3.0</td>
+ <td>591</td>
+ <td>11264</td>
+ <td>3.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>47.6</td>
+ <td>193</td>
+ <td>3379</td>
+ <td>4.0</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td><b>2.0</b></td>
+ <td><b>64</b></td>
+ <td><b>63</b></td>
+ <td><b>29.8</b></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>28.3</td>
+ <td>327</td>
+ <td>261</td>
+ <td><b>32.8</b></td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner2 --->
+
+### Morphology
+
+<a href="https://github.com/natasha/corus#load_gramru">Datasets from GramEval2020</a> are used for evaluation:
+
+* `news` — sample from Lenta.ru.
+* `wiki` — UD GSD.
+* `fiction` — SynTagRus + JZ.
+* `social`, `poetry` — social and poetry subsets of Taiga.
+
+Slovnet is compared to a number of existing morphology taggers: <a href="https://github.com/natasha/naeval#deeppavlov_morph"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_morph"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#rupostagger"><code>rupostagger</code></a>, <a href="https://github.com/natasha/naeval#rnnmorph"><code>rnnmorph</code></a>, <a href="https://github.com/natasha/naeval#mary"><code>maru</code></a>, <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+For every column top 3 results are highlighted. `slovnet` was trained only on news dataset:
+
+<!--- morph1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>news</th>
+ <th>wiki</th>
+ <th>fiction</th>
+ <th>social</th>
+ <th>poetry</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.961</b></td>
+ <td>0.815</td>
+ <td>0.905</td>
+ <td>0.807</td>
+ <td>0.664</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.982</b></td>
+ <td><b>0.884</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.890</b></td>
+ <td><b>0.856</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.940</td>
+ <td>0.841</td>
+ <td>0.944</td>
+ <td>0.870</td>
+ <td><b>0.857</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>0.951</td>
+ <td><b>0.868</b></td>
+ <td><b>0.964</b></td>
+ <td><b>0.892</b></td>
+ <td><b>0.865</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.918</td>
+ <td>0.811</td>
+ <td><b>0.957</b></td>
+ <td>0.870</td>
+ <td>0.776</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.964</b></td>
+ <td><b>0.849</b></td>
+ <td>0.942</td>
+ <td>0.857</td>
+ <td>0.784</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.934</td>
+ <td>0.831</td>
+ <td>0.940</td>
+ <td><b>0.873</b></td>
+ <td>0.825</td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>0.896</td>
+ <td>0.812</td>
+ <td>0.890</td>
+ <td>0.860</td>
+ <td>0.838</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>0.894</td>
+ <td>0.808</td>
+ <td>0.887</td>
+ <td>0.861</td>
+ <td>0.840</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>0.673</td>
+ <td>0.645</td>
+ <td>0.661</td>
+ <td>0.641</td>
+ <td>0.636</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph1 --->
+
+`it/s` — sentences per second.
+
+<!--- morph2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>115</b></td>
+ <td><b>532.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>475</td>
+ <td>8087</td>
+ <td><b>285.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td><b>4.0</b></td>
+ <td>32</td>
+ <td>10240</td>
+ <td>90.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>20.0</td>
+ <td>1393</td>
+ <td>8704</td>
+ <td>85.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td>45</td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>579</td>
+ <td>50.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>2.0</b></td>
+ <td>591</td>
+ <td>393</td>
+ <td><b>92.0</b></td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>8.7</td>
+ <td><b>10</b></td>
+ <td>289</td>
+ <td>16.6</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>15.8</td>
+ <td>44</td>
+ <td>370</td>
+ <td>36.4</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>4.8</td>
+ <td><b>3</b></td>
+ <td><b>118</b></td>
+ <td>48.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph2 --->
+
+### Syntax
+
+Slovnet is compared to several existing syntax parsers: <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_syntax"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+<!--- syntax1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="2" halign="left">news</th>
+ <th colspan="2" halign="left">wiki</th>
+ <th colspan="2" halign="left">fiction</th>
+ <th colspan="2" halign="left">social</th>
+ <th colspan="2" halign="left">poetry</th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td>0.907</td>
+ <td>0.880</td>
+ <td>0.775</td>
+ <td>0.718</td>
+ <td>0.806</td>
+ <td>0.776</td>
+ <td>0.726</td>
+ <td>0.656</td>
+ <td>0.542</td>
+ <td>0.469</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.965</b></td>
+ <td><b>0.936</b></td>
+ <td><b>0.891</b></td>
+ <td><b>0.828</b></td>
+ <td><b>0.958</b></td>
+ <td><b>0.940</b></td>
+ <td><b>0.846</b></td>
+ <td><b>0.782</b></td>
+ <td><b>0.776</b></td>
+ <td><b>0.706</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.962</b></td>
+ <td><b>0.910</b></td>
+ <td><b>0.882</b></td>
+ <td><b>0.786</b></td>
+ <td><b>0.963</b></td>
+ <td><b>0.929</b></td>
+ <td><b>0.844</b></td>
+ <td><b>0.761</b></td>
+ <td><b>0.784</b></td>
+ <td><b>0.691</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.873</td>
+ <td>0.823</td>
+ <td>0.622</td>
+ <td>0.531</td>
+ <td>0.910</td>
+ <td>0.876</td>
+ <td>0.700</td>
+ <td>0.624</td>
+ <td>0.625</td>
+ <td>0.534</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.943</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.851</b></td>
+ <td><b>0.783</b></td>
+ <td>0.901</td>
+ <td>0.874</td>
+ <td><b>0.804</b></td>
+ <td><b>0.737</b></td>
+ <td>0.704</td>
+ <td><b>0.616</b></td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.940</td>
+ <td>0.886</td>
+ <td>0.815</td>
+ <td>0.716</td>
+ <td><b>0.936</b></td>
+ <td><b>0.895</b></td>
+ <td>0.802</td>
+ <td>0.714</td>
+ <td><b>0.713</b></td>
+ <td>0.613</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax1 --->
+
+`it/s` — sentences per second.
+
+<!--- syntax2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>125</b></td>
+ <td><b>450.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>5.0</b></td>
+ <td>504</td>
+ <td>3427</td>
+ <td><b>200.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.0</td>
+ <td>1427</td>
+ <td>8704</td>
+ <td><b>75.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td><b>45</b></td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>9.0</td>
+ <td><b>140</b></td>
+ <td><b>579</b></td>
+ <td>41.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>3.0</b></td>
+ <td>591</td>
+ <td>890</td>
+ <td>12.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax2 --->
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/slovnet/issues
+- Commercial support — https://lab.alexkuk.ru
+
+## Development
+
+Dev env
+
+```bash
+python -m venv ~/.venvs/natasha-slovnet
+source ~/.venvs/natasha-slovnet/bin/activate
+
+pip install -r requirements/dev.txt
+pip install -e .
+```
+
+Test
+
+```bash
+make test
+```
+
+Rent GPU
+
+```bash
+yc compute instance create \
+ --name gpu \
+ --zone ru-central1-a \
+ --network-interface subnet-name=default,nat-ip-version=ipv4 \
+ --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804-lts-ngc,type=network-ssd,size=20 \
+ --cores=8 \
+ --memory=96 \
+ --gpus=1 \
+ --ssh-key ~/.ssh/id_rsa.pub \
+ --folder-name default \
+ --platform-id gpu-standard-v1 \
+ --preemptible
+
+yc compute instance delete --name gpu
+```
+
+Setup instance
+
+```bash
+sudo locale-gen ru_RU.UTF-8
+
+sudo apt-get update
+sudo apt-get install -y \
+ python3-pip
+
+# grpcio long install ~10m, not using prebuilt wheel
+# "it is not compatible with this Python"
+sudo pip3 install -v \
+ jupyter \
+ tensorboard
+
+mkdir runs
+nohup tensorboard \
+ --logdir=runs \
+ --host=localhost \
+ --port=6006 \
+ --reload_interval=1 &
+
+nohup jupyter notebook \
+ --no-browser \
+ --allow-root \
+ --ip=localhost \
+ --port=8888 \
+ --NotebookApp.token='' \
+ --NotebookApp.password='' &
+
+ssh -Nf gpu -L 8888:localhost:8888 -L 6006:localhost:6006
+
+scp ~/.slovnet.json gpu:~
+rsync --exclude data -rv . gpu:~/slovnet
+rsync -u --exclude data -rv 'gpu:~/slovnet/*' .
+```
+
+Install dev
+
+```bash
+pip3 install -r slovnet/requirements/dev.txt -r slovnet/requirements/gpu.txt
+pip3 install -e slovnet
+```
+
+Release
+
+```bash
+# Update setup.py version
+
+git commit -am 'Up version'
+git tag v0.6.0
+
+git push
+git push --tags
+
+# Github Action builds dist and publishes to PyPi
+```
+
+
+%package -n python3-slovnet
+Summary: Deep-learning based NLP modeling for Russian language
+Provides: python-slovnet
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-slovnet
+
+<img src="https://github.com/natasha/natasha-logos/blob/master/slovnet.svg">
+
+![CI](https://github.com/natasha/slovnet/actions/workflows/test.yml/badge.svg)
+
+SlovNet is a Python library for deep-learning based NLP modeling for Russian language. Library is integrated with other <a href="https://github.com/natasha/">Natasha</a> projects: <a href="https://github.com/natasha/nerus">Nerus</a> — large automatically annotated corpus, <a href="https://github.com/natasha/razdel">Razdel</a> — sentence segmenter, tokenizer and <a href="https://github.com/natasha/navec">Navec</a> — compact Russian embeddings. Slovnet provides high quality practical models for Russian NER, morphology and syntax, see <a href="#evaluation">evaluation section</a> for more:
+
+* NER is 1-2% worse than current BERT SOTA by DeepPavlov but 60 times smaller in size (~30 MB) and works fast on CPU (~25 news articles/sec).
+* Morphology tagger and syntax parser have comparable accuracy on news dataset with large SOTA BERT models, take 50 times less space (~30 MB), work faster on CPU (~500 sentences/sec).
+
+## Downloads
+
+<table>
+
+<tr>
+<th>Model</th>
+<th>Size</th>
+<th>Description</th>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_ner_news_v1.tar">slovnet_ner_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian NER, standard PER, LOC, ORG annotation, trained on news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_morph_news_v1.tar">slovnet_morph_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian morphology tagger optimized for news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_syntax_news_v1.tar">slovnet_syntax_news_v1.tar</a>
+</td>
+<td>3MB</td>
+<td>
+ Russian syntax parser optimized for news articles.
+</td>
+</tr>
+
+</table>
+
+## Install
+
+During inference Slovnet depends only on NumPy. The library supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install slovnet
+```
+
+## Usage
+
+Download the model weights and vocabs package using the links in the <a href="#downloads">downloads section</a> and the <a href="https://github.com/natasha/navec#downloads">Navec download section</a>. Optionally install <a href="https://github.com/natasha/ipymarkup">Ipymarkup</a> to visualize NER markup.
+
+The `map` method of a Slovnet annotator takes a list of items and returns an iterator that yields one markup per item. Internally, items are processed in batches of size `batch_size`. The default is 8; a larger batch uses more RAM but makes better use of the CPU. The `__call__` method simply calls `map` with a one-item list.
+
+### NER
+
+```python
+>>> from navec import Navec
+>>> from slovnet import NER
+>>> from ipymarkup import show_span_ascii_markup as show_markup
+
+>>> text = 'Европейский союз добавил в санкционный список девять политических деятелей из самопровозглашенных республик Донбасса — Донецкой народной республики (ДНР) и Луганской народной республики (ЛНР) — в связи с прошедшими там выборами. Об этом говорится в документе, опубликованном в официальном журнале Евросоюза. В новом списке фигурирует Леонид Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там присутствуют Владимир Бидевка и Денис Мирошниченко, председатели законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена Кравченко, председатели ЦИК обеих республик. Выборы прошли в непризнанных республиках Донбасса 11 ноября. На них удержали лидерство действующие руководители и партии — Денис Пушилин и «Донецкая республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР. Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после встречи с украинским лидером Петром Порошенко осудили проведение выборов, заявив, что они нелегитимны и «подрывают территориальную целостность и суверенитет Украины». Позже к осуждению присоединились США с обещаниями новых санкций для России.'
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> ner = NER.load('slovnet_ner_news_v1.tar')
+>>> ner.navec(navec)
+
+>>> markup = ner(text)
+>>> show_markup(markup.text, markup.spans)
+Европейский союз добавил в санкционный список девять политических
+LOC─────────────
+деятелей из самопровозглашенных республик Донбасса — Донецкой народной
+ LOC───── LOC──────────────
+ республики (ДНР) и Луганской народной республики (ЛНР) — в связи с
+───────────────── LOC────────────────────────────────
+прошедшими там выборами. Об этом говорится в документе, опубликованном
+ в официальном журнале Евросоюза. В новом списке фигурирует Леонид
+ LOC────── PER────
+Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там
+──────── LOC
+присутствуют Владимир Бидевка и Денис Мирошниченко, председатели
+ PER───────────── PER───────────────
+законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена
+ LOC LOC PER───────────── PER───
+Кравченко, председатели ЦИК обеих республик. Выборы прошли в
+───────── ORG
+непризнанных республиках Донбасса 11 ноября. На них удержали лидерство
+ LOC─────
+ действующие руководители и партии — Денис Пушилин и «Донецкая
+ PER────────── ORG──────
+республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР.
+────────── LOC PER──────────── ORG────────── LOC
+ Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после
+ LOC──── PER───────────── LOC PER───────────
+ встречи с украинским лидером Петром Порошенко осудили проведение
+ PER─────────────
+выборов, заявив, что они нелегитимны и «подрывают территориальную
+целостность и суверенитет Украины». Позже к осуждению присоединились
+ LOC────
+США с обещаниями новых санкций для России.
+LOC LOC───
+
+```
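+
+To work with the recognized entities programmatically rather than visually, the span offsets can be used to slice the source text. The sketch below assumes each span exposes `start`, `stop` and `type` attributes; the `Span` tuple and the offsets are hand-written stand-ins for illustration, not model output.
+
+```python
+from collections import namedtuple
+
+# Stand-in for a markup span; the attribute names are an assumption
+Span = namedtuple('Span', ['start', 'stop', 'type'])
+
+
+def span_texts(text, spans):
+    """Return (type, substring) pairs for every span."""
+    return [(span.type, text[span.start:span.stop]) for span in spans]
+
+
+text = 'Леонид Пасечник стал главой ЛНР.'
+spans = [Span(0, 15, 'PER'), Span(28, 31, 'LOC')]  # hand-written offsets
+entities = span_texts(text, spans)
+```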
+
+### Morphology
+
+The morphology annotator processes tokenized text. To split the input into sentences and tokens, use <a href="https://github.com/natasha/razdel">Razdel</a>.
+
+```python
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Morph
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> morph = Morph.load('slovnet_morph_news_v1.tar', batch_size=4)
+>>> morph.navec(navec)
+
+>>> markup = next(morph.map(chunk))
+>>> for token in markup.tokens:
+>>>     print(f'{token.text:>20} {token.tag}')
+ Европейский ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
+ союз NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing
+ добавил VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
+ в ADP
+ санкционный ADJ|Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing
+ список NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing
+ девять NUM|Case=Nom
+ политических ADJ|Case=Gen|Degree=Pos|Number=Plur
+ деятелей NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
+ из ADP
+ самопровозглашенных ADJ|Case=Gen|Degree=Pos|Number=Plur
+ республик NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur
+ Донбасса PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
+ — PUNCT
+ Донецкой ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ( PUNCT
+ ДНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ) PUNCT
+ и CCONJ
+ Луганской ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ( PUNCT
+ ЛНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ) PUNCT
+ — PUNCT
+ в ADP
+ связи NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
+ с ADP
+ прошедшими VERB|Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act
+ там ADV|Degree=Pos
+ выборами NOUN|Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur
+ . PUNCT
+
+```
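+
+The tags printed above pack the part of speech and the grammatical features into one `|`-separated string. A small helper can split such a string into its parts; `parse_tag` is a hypothetical name and this sketch relies only on the format visible in the output above.
+
+```python
+def parse_tag(tag):
+    """Split 'POS|Feat=Value|...' into a POS string and a feature dict."""
+    pos, _, rest = tag.partition('|')
+    feats = {}
+    if rest:
+        for pair in rest.split('|'):
+            key, value = pair.split('=', 1)
+            feats[key] = value
+    return pos, feats
+
+
+pos, feats = parse_tag('NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing')
+bare_pos, bare_feats = parse_tag('ADP')  # a tag without any features
+```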
+
+### Syntax
+
+The syntax parser processes sentences split into tokens. Use <a href="https://github.com/natasha/razdel">Razdel</a> for segmentation.
+
+```python
+>>> from ipymarkup import show_dep_ascii_markup as show_markup
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Syntax
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> syntax = Syntax.load('slovnet_syntax_news_v1.tar')
+>>> syntax.navec(navec)
+
+>>> markup = next(syntax.map(chunk))
+
+# Convert CoNLL-style format to source, target indices
+>>> words, deps = [], []
+>>> for token in markup.tokens:
+>>>     words.append(token.text)
+>>>     source = int(token.head_id) - 1
+>>>     target = int(token.id) - 1
+>>>     if source >= 0 and source != target:  # skip root (head_id 0), loops
+>>>         deps.append([source, target, token.rel])
+>>> show_markup(words, deps)
+ ┌► Европейский amod
+ ┌►└─ союз nsubj
+┌───────┌─┌─└─── добавил
+│ │ │ ┌──► в case
+│ │ │ │ ┌► санкционный amod
+│ │ └►└─└─ список obl
+│ │ ┌──► девять nummod:gov
+│ │ │ ┌► политических amod
+│ ┌─────└►┌─└─└─ деятелей obj
+│ │ │ ┌──► из case
+│ │ │ │ ┌► самопровозглашенных amod
+│ │ └►└─└─ республик nmod
+│ │ └──► Донбасса nmod
+│ │ ┌──────────► — punct
+│ │ │ ┌──► Донецкой amod
+│ │ │ │ ┌► народной amod
+│ │ │ ┌─┌─┌─└─└─ республики
+│ │ │ │ │ │ ┌► ( punct
+│ │ │ │ │ └►┌─└─ ДНР parataxis
+│ │ │ │ │ └──► ) punct
+│ │ │ │ │ ┌────► и cc
+│ │ │ │ │ │ ┌──► Луганской amod
+│ │ │ │ │ │ │ ┌► народной amod
+│ │ └─│ └►└─└─└─ республики conj
+│ │ │ ┌► ( punct
+│ │ └────►┌─└─ ЛНР parataxis
+│ │ └──► ) punct
+│ │ ┌──────► — punct
+│ │ │ ┌►┌─┌─ в case
+│ │ │ │ │ └► связи fixed
+│ │ │ │ └──► с fixed
+│ │ │ │ ┌►┌─ прошедшими acl
+│ │ │ │ │ └► там advmod
+│ └────►└─└─└─── выборами nmod
+└──────────────► . punct
+
+```
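+
+The same `id` / `head_id` pairs can also be turned into a tree structure. The sketch below assumes ids are 1-based strings and that the root token has `head_id` equal to `'0'`, as the conversion code above implies; the `Token` tuple and the three sample tokens are hand-written stand-ins, and `build_tree` is a hypothetical helper.
+
+```python
+from collections import defaultdict, namedtuple
+
+# Stand-in for a syntax token; ids are strings, as in CoNLL-style output
+Token = namedtuple('Token', ['id', 'head_id', 'text', 'rel'])
+
+
+def build_tree(tokens):
+    """Return the root id and a mapping head id -> list of child ids."""
+    root = None
+    children = defaultdict(list)
+    for token in tokens:
+        if token.head_id == '0':  # assumed marker of the root token
+            root = token.id
+        else:
+            children[token.head_id].append(token.id)
+    return root, dict(children)
+
+
+tokens = [
+    Token('1', '2', 'Европейский', 'amod'),
+    Token('2', '3', 'союз', 'nsubj'),
+    Token('3', '0', 'добавил', 'root'),
+]
+root, children = build_tree(tokens)
+```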
+
+## Documentation
+
+Materials are in Russian:
+
+* <a href="https://natasha.github.io/ner">Article about distillation and quantization in Slovnet</a>
+* <a href="https://youtu.be/-7XT_U6hVvk?t=2034">Slovnet section of Datafest 2020 talk</a>
+
+## Evaluation
+
+In addition to quality metrics we measure speed and model size, parameters that are important in production:
+
+* `init` — time between system launch and first response. It is convenient for testing and devops to have a model that starts quickly.
+* `disk` — file size of artefacts one needs to download before using the system: model weights, embeddings, binaries, vocabs. It is convenient to deploy compact models in production.
+* `ram` — average CPU/GPU RAM usage.
+* `speed` — number of input items processed per second: news articles, tokenized sentences.
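+
+As a rough illustration of how `init` and `speed` can be measured, the sketch below times a loader and a processing loop with `time.perf_counter`. It is not the benchmark code behind the tables that follow: `measure` and `load_dummy` are hypothetical names, and the dummy "model" only splits strings.
+
+```python
+import time
+
+
+def measure(load, items):
+    """Return (init seconds, items per second) for a loader and its inputs."""
+    start = time.perf_counter()
+    process = load()
+    process(items[0])  # first response after launch
+    init = time.perf_counter() - start
+
+    start = time.perf_counter()
+    for item in items:
+        process(item)
+    elapsed = time.perf_counter() - start
+    speed = len(items) / elapsed if elapsed > 0 else float('inf')
+    return init, speed
+
+
+def load_dummy():
+    # Hypothetical stand-in for loading a real model
+    return str.split
+
+
+init, speed = measure(load_dummy, ['a b c'] * 1000)
+```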
+
+### NER
+
+4 datasets are used for evaluation: <a href="https://github.com/natasha/corus#load_factru"><code>factru</code></a>, <a href="https://github.com/natasha/corus#load_gareev"><code>gareev</code></a>, <a href="https://github.com/natasha/corus#load_ne5"><code>ne5</code></a> and <a href="https://github.com/natasha/corus#load_bsnlp"><code>bsnlp</code></a>. Slovnet is compared to <a href="https://github.com/natasha/naeval#deeppavlov_ner"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_ner"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_slavic_bert_ner"><code>deeppavlov_slavic</code></a>, <a href="https://github.com/natasha/naeval#pullenti"><code>pullenti</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>, <a href="https://github.com/natasha/naeval#texterra"><code>texterra</code></a>, <a href="https://github.com/natasha/naeval#tomita"><code>tomita</code></a>, <a href="https://github.com/natasha/naeval#mitie"><code>mitie</code></a>.
+
+For every column top 3 results are highlighted:
+
+<!--- ner1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="3" halign="left">factru</th>
+ <th colspan="2" halign="left">gareev</th>
+ <th colspan="3" halign="left">ne5</th>
+ <th colspan="3" halign="left">bsnlp</th>
+ </tr>
+ <tr>
+ <th>f1</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.959</b></td>
+ <td><b>0.915</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.977</b></td>
+ <td><b>0.899</b></td>
+ <td><b>0.984</b></td>
+ <td><b>0.973</b></td>
+ <td><b>0.951</b></td>
+ <td>0.944</td>
+ <td>0.834</td>
+ <td>0.718</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.973</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.831</b></td>
+ <td><b>0.991</b></td>
+ <td><b>0.911</b></td>
+ <td><b>0.996</b></td>
+ <td><b>0.989</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.960</b></td>
+ <td>0.838</td>
+ <td><b>0.733</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.910</td>
+ <td>0.886</td>
+ <td>0.742</td>
+ <td>0.944</td>
+ <td>0.798</td>
+ <td>0.942</td>
+ <td>0.919</td>
+ <td>0.881</td>
+ <td>0.866</td>
+ <td>0.767</td>
+ <td>0.624</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.971</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.980</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.997</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.954</b></td>
+ <td><b>0.840</b></td>
+ <td><b>0.741</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>0.956</td>
+ <td>0.884</td>
+ <td>0.714</td>
+ <td>0.976</td>
+ <td>0.776</td>
+ <td>0.984</td>
+ <td>0.817</td>
+ <td>0.761</td>
+ <td><b>0.965</b></td>
+ <td><b>0.925</b></td>
+ <td><b>0.831</b></td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td>0.905</td>
+ <td>0.814</td>
+ <td>0.686</td>
+ <td>0.939</td>
+ <td>0.639</td>
+ <td>0.952</td>
+ <td>0.862</td>
+ <td>0.683</td>
+ <td>0.900</td>
+ <td>0.769</td>
+ <td>0.566</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>0.901</td>
+ <td>0.886</td>
+ <td>0.765</td>
+ <td>0.970</td>
+ <td>0.883</td>
+ <td>0.967</td>
+ <td>0.928</td>
+ <td>0.918</td>
+ <td>0.919</td>
+ <td>0.823</td>
+ <td>0.693</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.943</td>
+ <td>0.865</td>
+ <td>0.687</td>
+ <td>0.953</td>
+ <td>0.827</td>
+ <td>0.923</td>
+ <td>0.753</td>
+ <td>0.734</td>
+ <td>0.938</td>
+ <td><b>0.838</b></td>
+ <td>0.724</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>0.900</td>
+ <td>0.800</td>
+ <td>0.597</td>
+ <td>0.888</td>
+ <td>0.561</td>
+ <td>0.901</td>
+ <td>0.777</td>
+ <td>0.594</td>
+ <td>0.858</td>
+ <td>0.783</td>
+ <td>0.548</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td>0.929</td>
+ <td></td>
+ <td></td>
+ <td>0.921</td>
+ <td></td>
+ <td>0.945</td>
+ <td></td>
+ <td></td>
+ <td>0.881</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>0.888</td>
+ <td>0.861</td>
+ <td>0.532</td>
+ <td>0.849</td>
+ <td>0.452</td>
+ <td>0.753</td>
+ <td>0.642</td>
+ <td>0.432</td>
+ <td>0.736</td>
+ <td>0.801</td>
+ <td>0.524</td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner1 --->
+
+`it/s` — news articles per second, 1 article ≈ 1KB.
+
+<!--- ner2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>205</b></td>
+ <td>25.3</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>473</td>
+ <td>9500</td>
+ <td><b>40.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>5.9</td>
+ <td>1024</td>
+ <td>3072</td>
+ <td>24.3 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.5</td>
+ <td>2048</td>
+ <td>6144</td>
+ <td>13.1 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>35.0</td>
+ <td>2048</td>
+ <td>4096</td>
+ <td>8.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td><b>2.9</b></td>
+ <td><b>16</b></td>
+ <td><b>253</b></td>
+ <td>6.0</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>625</td>
+ <td>8.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>3.0</td>
+ <td>591</td>
+ <td>11264</td>
+ <td>3.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>47.6</td>
+ <td>193</td>
+ <td>3379</td>
+ <td>4.0</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td><b>2.0</b></td>
+ <td><b>64</b></td>
+ <td><b>63</b></td>
+ <td><b>29.8</b></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>28.3</td>
+ <td>327</td>
+ <td>261</td>
+ <td><b>32.8</b></td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner2 --->
+
+### Morphology
+
+<a href="https://github.com/natasha/corus#load_gramru">Datasets from GramEval2020</a> are used for evaluation:
+
+* `news` — sample from Lenta.ru.
+* `wiki` — UD GSD.
+* `fiction` — SynTagRus + JZ.
+* `social`, `poetry` — social, poetry subset of Taiga.
+
+Slovnet is compared to a number of existing morphology taggers: <a href="https://github.com/natasha/naeval#deeppavlov_morph"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_morph"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#rupostagger"><code>rupostagger</code></a>, <a href="https://github.com/natasha/naeval#rnnmorph"><code>rnnmorph</code></a>, <a href="https://github.com/natasha/naeval#mary"><code>maru</code></a>, <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+For every column top 3 results are highlighted. `slovnet` was trained only on news dataset:
+
+<!--- morph1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>news</th>
+ <th>wiki</th>
+ <th>fiction</th>
+ <th>social</th>
+ <th>poetry</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.961</b></td>
+ <td>0.815</td>
+ <td>0.905</td>
+ <td>0.807</td>
+ <td>0.664</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.982</b></td>
+ <td><b>0.884</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.890</b></td>
+ <td><b>0.856</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.940</td>
+ <td>0.841</td>
+ <td>0.944</td>
+ <td>0.870</td>
+ <td><b>0.857</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>0.951</td>
+ <td><b>0.868</b></td>
+ <td><b>0.964</b></td>
+ <td><b>0.892</b></td>
+ <td><b>0.865</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.918</td>
+ <td>0.811</td>
+ <td><b>0.957</b></td>
+ <td>0.870</td>
+ <td>0.776</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.964</b></td>
+ <td><b>0.849</b></td>
+ <td>0.942</td>
+ <td>0.857</td>
+ <td>0.784</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.934</td>
+ <td>0.831</td>
+ <td>0.940</td>
+ <td><b>0.873</b></td>
+ <td>0.825</td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>0.896</td>
+ <td>0.812</td>
+ <td>0.890</td>
+ <td>0.860</td>
+ <td>0.838</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>0.894</td>
+ <td>0.808</td>
+ <td>0.887</td>
+ <td>0.861</td>
+ <td>0.840</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>0.673</td>
+ <td>0.645</td>
+ <td>0.661</td>
+ <td>0.641</td>
+ <td>0.636</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph1 --->
+
+`it/s` — sentences per second.
+
+<!--- morph2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>115</b></td>
+ <td><b>532.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>475</td>
+ <td>8087</td>
+ <td><b>285.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td><b>4.0</b></td>
+ <td>32</td>
+ <td>10240</td>
+ <td>90.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>20.0</td>
+ <td>1393</td>
+ <td>8704</td>
+ <td>85.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td>45</td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>579</td>
+ <td>50.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>2.0</b></td>
+ <td>591</td>
+ <td>393</td>
+ <td><b>92.0</b></td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>8.7</td>
+ <td><b>10</b></td>
+ <td>289</td>
+ <td>16.6</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>15.8</td>
+ <td>44</td>
+ <td>370</td>
+ <td>36.4</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>4.8</td>
+ <td><b>3</b></td>
+ <td><b>118</b></td>
+ <td>48.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph2 --->
+
+### Syntax
+
+Slovnet is compared to several existing syntax parsers: <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_syntax"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+<!--- syntax1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="2" halign="left">news</th>
+ <th colspan="2" halign="left">wiki</th>
+ <th colspan="2" halign="left">fiction</th>
+ <th colspan="2" halign="left">social</th>
+ <th colspan="2" halign="left">poetry</th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td>0.907</td>
+ <td>0.880</td>
+ <td>0.775</td>
+ <td>0.718</td>
+ <td>0.806</td>
+ <td>0.776</td>
+ <td>0.726</td>
+ <td>0.656</td>
+ <td>0.542</td>
+ <td>0.469</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.965</b></td>
+ <td><b>0.936</b></td>
+ <td><b>0.891</b></td>
+ <td><b>0.828</b></td>
+ <td><b>0.958</b></td>
+ <td><b>0.940</b></td>
+ <td><b>0.846</b></td>
+ <td><b>0.782</b></td>
+ <td><b>0.776</b></td>
+ <td><b>0.706</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.962</b></td>
+ <td><b>0.910</b></td>
+ <td><b>0.882</b></td>
+ <td><b>0.786</b></td>
+ <td><b>0.963</b></td>
+ <td><b>0.929</b></td>
+ <td><b>0.844</b></td>
+ <td><b>0.761</b></td>
+ <td><b>0.784</b></td>
+ <td><b>0.691</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.873</td>
+ <td>0.823</td>
+ <td>0.622</td>
+ <td>0.531</td>
+ <td>0.910</td>
+ <td>0.876</td>
+ <td>0.700</td>
+ <td>0.624</td>
+ <td>0.625</td>
+ <td>0.534</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.943</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.851</b></td>
+ <td><b>0.783</b></td>
+ <td>0.901</td>
+ <td>0.874</td>
+ <td><b>0.804</b></td>
+ <td><b>0.737</b></td>
+ <td>0.704</td>
+ <td><b>0.616</b></td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.940</td>
+ <td>0.886</td>
+ <td>0.815</td>
+ <td>0.716</td>
+ <td><b>0.936</b></td>
+ <td><b>0.895</b></td>
+ <td>0.802</td>
+ <td>0.714</td>
+ <td><b>0.713</b></td>
+ <td>0.613</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax1 --->
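+
+The `uas` and `las` columns are the unlabeled and labeled attachment scores: the share of tokens with a correctly predicted head, and the share with both head and relation correct. A minimal sketch of these standard definitions follows; `attachment_scores` is a hypothetical helper and the sample data is hand-written, so details of the actual evaluation (for example, punctuation handling) may differ.
+
+```python
+def attachment_scores(gold, pred):
+    """gold, pred: equal-length lists of (head_id, rel) pairs, one per token."""
+    if not gold or len(gold) != len(pred):
+        raise ValueError('gold and pred must be non-empty and of equal length')
+    total = len(gold)
+    # UAS counts tokens whose head matches, LAS also requires the relation
+    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
+    las_hits = sum(g == p for g, p in zip(gold, pred))
+    return uas_hits / total, las_hits / total
+
+
+gold = [(2, 'amod'), (3, 'nsubj'), (0, 'root'), (3, 'obj')]
+pred = [(2, 'amod'), (3, 'obj'), (0, 'root'), (2, 'obj')]
+uas, las = attachment_scores(gold, pred)
+```
+
+In this toy sample three of four heads match and two of four head and relation pairs match, so `las` can never exceed `uas`.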
+
+`it/s` — sentences per second.
+
+<!--- syntax2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>125</b></td>
+ <td><b>450.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>5.0</b></td>
+ <td>504</td>
+ <td>3427</td>
+ <td><b>200.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.0</td>
+ <td>1427</td>
+ <td>8704</td>
+ <td><b>75.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td><b>45</b></td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>9.0</td>
+ <td><b>140</b></td>
+ <td><b>579</b></td>
+ <td>41.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>3.0</b></td>
+ <td>591</td>
+ <td>890</td>
+ <td>12.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax2 --->
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/slovnet/issues
+- Commercial support — https://lab.alexkuk.ru
+
+## Development
+
+Dev env
+
+```bash
+python -m venv ~/.venvs/natasha-slovnet
+source ~/.venvs/natasha-slovnet/bin/activate
+
+pip install -r requirements/dev.txt
+pip install -e .
+```
+
+Test
+
+```bash
+make test
+```
+
+Rent GPU
+
+```bash
+yc compute instance create \
+ --name gpu \
+ --zone ru-central1-a \
+ --network-interface subnet-name=default,nat-ip-version=ipv4 \
+ --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804-lts-ngc,type=network-ssd,size=20 \
+ --cores=8 \
+ --memory=96 \
+ --gpus=1 \
+ --ssh-key ~/.ssh/id_rsa.pub \
+ --folder-name default \
+ --platform-id gpu-standard-v1 \
+ --preemptible
+
+yc compute instance delete --name gpu
+```
+
+Setup instance
+
+```bash
+sudo locale-gen ru_RU.UTF-8
+
+sudo apt-get update
+sudo apt-get install -y \
+ python3-pip
+
+# grpcio long install ~10m, not using prebuilt wheel
+# "it is not compatible with this Python"
+sudo pip3 install -v \
+ jupyter \
+ tensorboard
+
+mkdir runs
+nohup tensorboard \
+ --logdir=runs \
+ --host=localhost \
+ --port=6006 \
+ --reload_interval=1 &
+
+nohup jupyter notebook \
+ --no-browser \
+ --allow-root \
+ --ip=localhost \
+ --port=8888 \
+ --NotebookApp.token='' \
+ --NotebookApp.password='' &
+
+ssh -Nf gpu -L 8888:localhost:8888 -L 6006:localhost:6006
+
+scp ~/.slovnet.json gpu:~
+rsync --exclude data -rv . gpu:~/slovnet
+rsync -u --exclude data -rv 'gpu:~/slovnet/*' .
+```
+
+Install dev
+
+```bash
+pip3 install -r slovnet/requirements/dev.txt -r slovnet/requirements/gpu.txt
+pip3 install -e slovnet
+```
+
+Release
+
+```bash
+# Update setup.py version
+
+git commit -am 'Up version'
+git tag v0.6.0
+
+git push
+git push --tags
+
+# Github Action builds dist and publishes to PyPi
+```
+
+
+%package help
+Summary: Development documents and examples for slovnet
+Provides: python3-slovnet-doc
+%description help
+
+<img src="https://github.com/natasha/natasha-logos/blob/master/slovnet.svg">
+
+![CI](https://github.com/natasha/slovnet/actions/workflows/test.yml/badge.svg)
+
+SlovNet is a Python library for deep-learning based NLP modeling for Russian language. Library is integrated with other <a href="https://github.com/natasha/">Natasha</a> projects: <a href="https://github.com/natasha/nerus">Nerus</a> — large automatically annotated corpus, <a href="https://github.com/natasha/razdel">Razdel</a> — sentence segmenter, tokenizer and <a href="https://github.com/natasha/navec">Navec</a> — compact Russian embeddings. Slovnet provides high quality practical models for Russian NER, morphology and syntax, see <a href="#evaluation">evaluation section</a> for more:
+
+* NER is 1-2% worse than current BERT SOTA by DeepPavlov but 60 times smaller in size (~30 MB) and works fast on CPU (~25 news articles/sec).
+* On the news dataset, the morphology tagger and syntax parser have accuracy comparable to large SOTA BERT models, take 50 times less space (~30 MB), and work faster on CPU (~500 sentences/sec).
+
+## Downloads
+
+<table>
+
+<tr>
+<th>Model</th>
+<th>Size</th>
+<th>Description</th>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_ner_news_v1.tar">slovnet_ner_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian NER, standard PER, LOC, ORG annotation, trained on news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_morph_news_v1.tar">slovnet_morph_news_v1.tar</a>
+</td>
+<td>2MB</td>
+<td>
+ Russian morphology tagger optimized for news articles.
+</td>
+</tr>
+
+<tr>
+<td>
+ <a href="https://storage.yandexcloud.net/natasha-slovnet/packs/slovnet_syntax_news_v1.tar">slovnet_syntax_news_v1.tar</a>
+</td>
+<td>3MB</td>
+<td>
+ Russian syntax parser optimized for news articles.
+</td>
+</tr>
+
+</table>
+
+## Install
+
+During inference Slovnet depends only on NumPy. The library supports Python 3.5+ and PyPy 3.
+
+```bash
+$ pip install slovnet
+```
+
+## Usage
+
+Download the model weights and vocabs package using the links from the <a href="#downloads">downloads section</a> and the <a href="https://github.com/natasha/navec#downloads">Navec download section</a>. Optionally install <a href="https://github.com/natasha/ipymarkup">Ipymarkup</a> to visualize NER markup.
+
+The `map` method of a Slovnet annotator takes a list of items and returns an iterator that yields one markup per item. Internally, items are processed in batches of size `batch_size`. The default size is 8; a larger batch uses more RAM but utilizes the CPU better. The `__call__` method simply calls `map` with a list of one item.
+
+### NER
+
+```python
+>>> from navec import Navec
+>>> from slovnet import NER
+>>> from ipymarkup import show_span_ascii_markup as show_markup
+
+>>> text = 'Европейский союз добавил в санкционный список девять политических деятелей из самопровозглашенных республик Донбасса — Донецкой народной республики (ДНР) и Луганской народной республики (ЛНР) — в связи с прошедшими там выборами. Об этом говорится в документе, опубликованном в официальном журнале Евросоюза. В новом списке фигурирует Леонид Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там присутствуют Владимир Бидевка и Денис Мирошниченко, председатели законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена Кравченко, председатели ЦИК обеих республик. Выборы прошли в непризнанных республиках Донбасса 11 ноября. На них удержали лидерство действующие руководители и партии — Денис Пушилин и «Донецкая республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР. Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после встречи с украинским лидером Петром Порошенко осудили проведение выборов, заявив, что они нелегитимны и «подрывают территориальную целостность и суверенитет Украины». Позже к осуждению присоединились США с обещаниями новых санкций для России.'
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> ner = NER.load('slovnet_ner_news_v1.tar')
+>>> ner.navec(navec)
+
+>>> markup = ner(text)
+>>> show_markup(markup.text, markup.spans)
+Европейский союз добавил в санкционный список девять политических
+LOC─────────────
+деятелей из самопровозглашенных республик Донбасса — Донецкой народной
+ LOC───── LOC──────────────
+ республики (ДНР) и Луганской народной республики (ЛНР) — в связи с
+───────────────── LOC────────────────────────────────
+прошедшими там выборами. Об этом говорится в документе, опубликованном
+ в официальном журнале Евросоюза. В новом списке фигурирует Леонид
+ LOC────── PER────
+Пасечник, который по итогам выборов стал главой ЛНР. Помимо него там
+──────── LOC
+присутствуют Владимир Бидевка и Денис Мирошниченко, председатели
+ PER───────────── PER───────────────
+законодательных органов ДНР и ЛНР, а также Ольга Позднякова и Елена
+ LOC LOC PER───────────── PER───
+Кравченко, председатели ЦИК обеих республик. Выборы прошли в
+───────── ORG
+непризнанных республиках Донбасса 11 ноября. На них удержали лидерство
+ LOC─────
+ действующие руководители и партии — Денис Пушилин и «Донецкая
+ PER────────── ORG──────
+республика» в ДНР и Леонид Пасечник с движением «Мир Луганщине» в ЛНР.
+────────── LOC PER──────────── ORG────────── LOC
+ Президент Франции Эмманюэль Макрон и канцлер ФРГ Ангела Меркель после
+ LOC──── PER───────────── LOC PER───────────
+ встречи с украинским лидером Петром Порошенко осудили проведение
+ PER─────────────
+выборов, заявив, что они нелегитимны и «подрывают территориальную
+целостность и суверенитет Украины». Позже к осуждению присоединились
+ LOC────
+США с обещаниями новых санкций для России.
+LOC LOC───
+
+```
+
+### Morphology
+
+The morphology annotator processes tokenized text. To split the input into sentences and tokens, use <a href="https://github.com/natasha/razdel">Razdel</a>.
+
+```python
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Morph
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> morph = Morph.load('slovnet_morph_news_v1.tar', batch_size=4)
+>>> morph.navec(navec)
+
+>>> markup = next(morph.map(chunk))
+>>> for token in markup.tokens:
+>>>     print(f'{token.text:>20} {token.tag}')
+ Европейский ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
+ союз NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing
+ добавил VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
+ в ADP
+ санкционный ADJ|Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing
+ список NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing
+ девять NUM|Case=Nom
+ политических ADJ|Case=Gen|Degree=Pos|Number=Plur
+ деятелей NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
+ из ADP
+ самопровозглашенных ADJ|Case=Gen|Degree=Pos|Number=Plur
+ республик NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur
+ Донбасса PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
+ — PUNCT
+ Донецкой ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ( PUNCT
+ ДНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ) PUNCT
+ и CCONJ
+ Луганской ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ народной ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing
+ республики NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ( PUNCT
+ ЛНР PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
+ ) PUNCT
+ — PUNCT
+ в ADP
+ связи NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
+ с ADP
+ прошедшими VERB|Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act
+ там ADV|Degree=Pos
+ выборами NOUN|Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur
+ . PUNCT
+
+```
+
+### Syntax
+
+The syntax parser processes sentences split into tokens. Use <a href="https://github.com/natasha/razdel">Razdel</a> for segmentation.
+
+```python
+>>> from ipymarkup import show_dep_ascii_markup as show_markup
+>>> from razdel import sentenize, tokenize
+>>> from navec import Navec
+>>> from slovnet import Syntax
+
+>>> chunk = []
+>>> for sent in sentenize(text):
+>>>     tokens = [_.text for _ in tokenize(sent.text)]
+>>>     chunk.append(tokens)
+>>> chunk[:1]
+[['Европейский', 'союз', 'добавил', 'в', 'санкционный', 'список', 'девять', 'политических', 'деятелей', 'из', 'самопровозглашенных', 'республик', 'Донбасса', '—', 'Донецкой', 'народной', 'республики', '(', 'ДНР', ')', 'и', 'Луганской', 'народной', 'республики', '(', 'ЛНР', ')', '—', 'в', 'связи', 'с', 'прошедшими', 'там', 'выборами', '.']]
+
+>>> navec = Navec.load('navec_news_v1_1B_250K_300d_100q.tar')
+>>> syntax = Syntax.load('slovnet_syntax_news_v1.tar')
+>>> syntax.navec(navec)
+
+>>> markup = next(syntax.map(chunk))
+
+# Convert CoNLL-style format to source, target indices
+>>> words, deps = [], []
+>>> for token in markup.tokens:
+>>>     words.append(token.text)
+>>>     source = int(token.head_id) - 1
+>>>     target = int(token.id) - 1
+>>>     if source >= 0 and source != target:  # skip root (head_id 0), loops
+>>>         deps.append([source, target, token.rel])
+>>> show_markup(words, deps)
+ ┌► Европейский amod
+ ┌►└─ союз nsubj
+┌───────┌─┌─└─── добавил
+│ │ │ ┌──► в case
+│ │ │ │ ┌► санкционный amod
+│ │ └►└─└─ список obl
+│ │ ┌──► девять nummod:gov
+│ │ │ ┌► политических amod
+│ ┌─────└►┌─└─└─ деятелей obj
+│ │ │ ┌──► из case
+│ │ │ │ ┌► самопровозглашенных amod
+│ │ └►└─└─ республик nmod
+│ │ └──► Донбасса nmod
+│ │ ┌──────────► — punct
+│ │ │ ┌──► Донецкой amod
+│ │ │ │ ┌► народной amod
+│ │ │ ┌─┌─┌─└─└─ республики
+│ │ │ │ │ │ ┌► ( punct
+│ │ │ │ │ └►┌─└─ ДНР parataxis
+│ │ │ │ │ └──► ) punct
+│ │ │ │ │ ┌────► и cc
+│ │ │ │ │ │ ┌──► Луганской amod
+│ │ │ │ │ │ │ ┌► народной amod
+│ │ └─│ └►└─└─└─ республики conj
+│ │ │ ┌► ( punct
+│ │ └────►┌─└─ ЛНР parataxis
+│ │ └──► ) punct
+│ │ ┌──────► — punct
+│ │ │ ┌►┌─┌─ в case
+│ │ │ │ │ └► связи fixed
+│ │ │ │ └──► с fixed
+│ │ │ │ ┌►┌─ прошедшими acl
+│ │ │ │ │ └► там advmod
+│ └────►└─└─└─── выборами nmod
+└──────────────► . punct
+
+```
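
The conversion from CoNLL-style ids to edge indices shown above can be tried without downloading a model. The following is a minimal sketch, assuming tokens carry 1-based `id` and `head_id` strings with `head_id == '0'` marking the root. The `Token` tuple and the `to_edges` helper are hypothetical stand-ins and are not part of Slovnet:

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token; not Slovnet's own class.
Token = namedtuple('Token', ['id', 'head_id', 'text', 'rel'])


def to_edges(tokens):
    """Turn 1-based CoNLL-style ids into 0-based (source, target, rel) edges."""
    words, deps = [], []
    for token in tokens:
        words.append(token.text)
        source = int(token.head_id) - 1  # -1 means the token hangs off the root
        target = int(token.id) - 1
        if source >= 0 and source != target:  # skip root and self-loops
            deps.append((source, target, token.rel))
    return words, deps


tokens = [
    Token('1', '2', 'Европейский', 'amod'),
    Token('2', '3', 'союз', 'nsubj'),
    Token('3', '0', 'добавил', 'root'),
]
words, deps = to_edges(tokens)
print(deps)
```

The sketch tests `source >= 0` so that dependents of the first token in a sentence are kept; only the root arc and self-loops are dropped.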
+
+## Documentation
+
+Materials are in Russian:
+
+* <a href="https://natasha.github.io/ner">Article about distillation and quantization in Slovnet</a>
+* <a href="https://youtu.be/-7XT_U6hVvk?t=2034">Slovnet section of Datafest 2020 talk</a>
+
+## Evaluation
+
+In addition to quality metrics, we measure speed and model size, which matter in production:
+
+* `init` — time between system launch and first response. A model that starts quickly is convenient for testing and devops.
+* `disk` — file size of the artefacts one needs to download before using the system: model weights, embeddings, binaries, vocabs. Compact models are convenient to deploy in production.
+* `ram` — average CPU/GPU RAM usage.
+* `speed` — number of input items processed per second: news articles, tokenized sentences.
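
A rough way to collect `init` and `speed` for any system is sketched below. It assumes a `load` callable that builds the system and a `process` callable that handles one input item; both names, and the `measure` helper itself, are hypothetical placeholders. This is not necessarily how the numbers in the tables below were obtained:

```python
import time


def measure(load, process, items):
    """Return (init seconds, items per second) for a system under test."""
    start = time.perf_counter()
    system = load()
    process(system, items[0])  # init runs until the first response
    init = time.perf_counter() - start

    start = time.perf_counter()
    for item in items:
        process(system, item)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return init, len(items) / max(elapsed, 1e-9)


# Toy system: "loading" builds a set, "processing" splits a sentence.
init, speed = measure(
    load=lambda: set(),
    process=lambda system, item: item.split(),
    items=['a b c'] * 1000,
)
print(f'init={init:.4f}s speed={speed:.0f} it/s')
```

`disk` and `ram` need external tooling (file sizes, process monitors) and are left out of the sketch.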
+
+### NER
+
+4 datasets are used for evaluation: <a href="https://github.com/natasha/corus#load_factru"><code>factru</code></a>, <a href="https://github.com/natasha/corus#load_gareev"><code>gareev</code></a>, <a href="https://github.com/natasha/corus#load_ne5"><code>ne5</code></a> and <a href="https://github.com/natasha/corus#load_bsnlp"><code>bsnlp</code></a>. Slovnet is compared to <a href="https://github.com/natasha/naeval#deeppavlov_ner"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_ner"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_slavic_bert_ner"><code>deeppavlov_slavic</code></a>, <a href="https://github.com/natasha/naeval#pullenti"><code>pullenti</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>, <a href="https://github.com/natasha/naeval#texterra"><code>texterra</code></a>, <a href="https://github.com/natasha/naeval#tomita"><code>tomita</code></a>, <a href="https://github.com/natasha/naeval#mitie"><code>mitie</code></a>.
+
+In every column, the top 3 results are highlighted in bold:
+
+<!--- ner1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="3" halign="left">factru</th>
+ <th colspan="2" halign="left">gareev</th>
+ <th colspan="3" halign="left">ne5</th>
+ <th colspan="3" halign="left">bsnlp</th>
+ </tr>
+ <tr>
+ <th>f1</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ <th>PER</th>
+ <th>LOC</th>
+ <th>ORG</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.959</b></td>
+ <td><b>0.915</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.977</b></td>
+ <td><b>0.899</b></td>
+ <td><b>0.984</b></td>
+ <td><b>0.973</b></td>
+ <td><b>0.951</b></td>
+ <td>0.944</td>
+ <td>0.834</td>
+ <td>0.718</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.973</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.831</b></td>
+ <td><b>0.991</b></td>
+ <td><b>0.911</b></td>
+ <td><b>0.996</b></td>
+ <td><b>0.989</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.960</b></td>
+ <td>0.838</td>
+ <td><b>0.733</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.910</td>
+ <td>0.886</td>
+ <td>0.742</td>
+ <td>0.944</td>
+ <td>0.798</td>
+ <td>0.942</td>
+ <td>0.919</td>
+ <td>0.881</td>
+ <td>0.866</td>
+ <td>0.767</td>
+ <td>0.624</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.971</b></td>
+ <td><b>0.928</b></td>
+ <td><b>0.825</b></td>
+ <td><b>0.980</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.997</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.976</b></td>
+ <td><b>0.954</b></td>
+ <td><b>0.840</b></td>
+ <td><b>0.741</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>0.956</td>
+ <td>0.884</td>
+ <td>0.714</td>
+ <td>0.976</td>
+ <td>0.776</td>
+ <td>0.984</td>
+ <td>0.817</td>
+ <td>0.761</td>
+ <td><b>0.965</b></td>
+ <td><b>0.925</b></td>
+ <td><b>0.831</b></td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td>0.905</td>
+ <td>0.814</td>
+ <td>0.686</td>
+ <td>0.939</td>
+ <td>0.639</td>
+ <td>0.952</td>
+ <td>0.862</td>
+ <td>0.683</td>
+ <td>0.900</td>
+ <td>0.769</td>
+ <td>0.566</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>0.901</td>
+ <td>0.886</td>
+ <td>0.765</td>
+ <td>0.970</td>
+ <td>0.883</td>
+ <td>0.967</td>
+ <td>0.928</td>
+ <td>0.918</td>
+ <td>0.919</td>
+ <td>0.823</td>
+ <td>0.693</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.943</td>
+ <td>0.865</td>
+ <td>0.687</td>
+ <td>0.953</td>
+ <td>0.827</td>
+ <td>0.923</td>
+ <td>0.753</td>
+ <td>0.734</td>
+ <td>0.938</td>
+ <td><b>0.838</b></td>
+ <td>0.724</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>0.900</td>
+ <td>0.800</td>
+ <td>0.597</td>
+ <td>0.888</td>
+ <td>0.561</td>
+ <td>0.901</td>
+ <td>0.777</td>
+ <td>0.594</td>
+ <td>0.858</td>
+ <td>0.783</td>
+ <td>0.548</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td>0.929</td>
+ <td></td>
+ <td></td>
+ <td>0.921</td>
+ <td></td>
+ <td>0.945</td>
+ <td></td>
+ <td></td>
+ <td>0.881</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>0.888</td>
+ <td>0.861</td>
+ <td>0.532</td>
+ <td>0.849</td>
+ <td>0.452</td>
+ <td>0.753</td>
+ <td>0.642</td>
+ <td>0.432</td>
+ <td>0.736</td>
+ <td>0.801</td>
+ <td>0.524</td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner1 --->
+
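The table above reports F1 per entity type. As an illustration only, the sketch below computes per-type F1 over entity spans, assuming exact-match scoring where a predicted span counts only if its start, stop and type all equal a gold span. The matching rule and the `f1_by_type` helper are assumptions made here, not a description of the evaluation code behind the table:

```python
def f1_by_type(gold, pred):
    """gold, pred: sets of (start, stop, type) spans, scored by exact match."""
    scores = {}
    for type_ in sorted({span[2] for span in gold | pred}):
        g = {span for span in gold if span[2] == type_}
        p = {span for span in pred if span[2] == type_}
        correct = len(g & p)
        precision = correct / len(p) if p else 0.0
        recall = correct / len(g) if g else 0.0
        total = precision + recall
        scores[type_] = 2 * precision * recall / total if total else 0.0
    return scores


gold = {(0, 5, 'PER'), (10, 16, 'LOC'), (20, 25, 'ORG')}
pred = {(0, 5, 'PER'), (10, 15, 'LOC'), (20, 25, 'ORG'), (30, 33, 'ORG')}
scores = f1_by_type(gold, pred)
print(scores)
```

In this toy input the `LOC` prediction misses the gold boundary by one character and scores zero, which shows how strict exact matching is.
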
+`it/s` — news articles per second, 1 article ≈ 1KB.
+
+<!--- ner2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>205</b></td>
+ <td>25.3</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>473</td>
+ <td>9500</td>
+ <td><b>40.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>5.9</td>
+ <td>1024</td>
+ <td>3072</td>
+ <td>24.3 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.5</td>
+ <td>2048</td>
+ <td>6144</td>
+ <td>13.1 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_slavic</th>
+ <td>35.0</td>
+ <td>2048</td>
+ <td>4096</td>
+ <td>8.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>pullenti</th>
+ <td><b>2.9</b></td>
+ <td><b>16</b></td>
+ <td><b>253</b></td>
+ <td>6.0</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>625</td>
+ <td>8.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>3.0</td>
+ <td>591</td>
+ <td>11264</td>
+ <td>3.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>texterra</th>
+ <td>47.6</td>
+ <td>193</td>
+ <td>3379</td>
+ <td>4.0</td>
+ </tr>
+ <tr>
+ <th>tomita</th>
+ <td><b>2.0</b></td>
+ <td><b>64</b></td>
+ <td><b>63</b></td>
+ <td><b>29.8</b></td>
+ </tr>
+ <tr>
+ <th>mitie</th>
+ <td>28.3</td>
+ <td>327</td>
+ <td>261</td>
+ <td><b>32.8</b></td>
+ </tr>
+ </tbody>
+</table>
+<!--- ner2 --->
+
+### Morphology
+
+<a href="https://github.com/natasha/corus#load_gramru">Datasets from GramEval2020</a> are used for evaluation:
+
+* `news` — sample from Lenta.ru.
+* `wiki` — UD GSD.
+* `fiction` — SynTagRus + JZ.
+* `social`, `poetry` — social and poetry subsets of Taiga.
+
+Slovnet is compared to a number of existing morphology taggers: <a href="https://github.com/natasha/naeval#deeppavlov_morph"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_morph"><code>deeppavlov_bert</code></a>, <a href="https://github.com/natasha/naeval#rupostagger"><code>rupostagger</code></a>, <a href="https://github.com/natasha/naeval#rnnmorph"><code>rnnmorph</code></a>, <a href="https://github.com/natasha/naeval#mary"><code>maru</code></a>, <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+In every column, the top 3 results are highlighted in bold. `slovnet` was trained only on the news dataset:
+
+<!--- morph1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>news</th>
+ <th>wiki</th>
+ <th>fiction</th>
+ <th>social</th>
+ <th>poetry</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>0.961</b></td>
+ <td>0.815</td>
+ <td>0.905</td>
+ <td>0.807</td>
+ <td>0.664</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.982</b></td>
+ <td><b>0.884</b></td>
+ <td><b>0.990</b></td>
+ <td><b>0.890</b></td>
+ <td><b>0.856</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td>0.940</td>
+ <td>0.841</td>
+ <td>0.944</td>
+ <td>0.870</td>
+ <td><b>0.857</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>0.951</td>
+ <td><b>0.868</b></td>
+ <td><b>0.964</b></td>
+ <td><b>0.892</b></td>
+ <td><b>0.865</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.918</td>
+ <td>0.811</td>
+ <td><b>0.957</b></td>
+ <td>0.870</td>
+ <td>0.776</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.964</b></td>
+ <td><b>0.849</b></td>
+ <td>0.942</td>
+ <td>0.857</td>
+ <td>0.784</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.934</td>
+ <td>0.831</td>
+ <td>0.940</td>
+ <td><b>0.873</b></td>
+ <td>0.825</td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>0.896</td>
+ <td>0.812</td>
+ <td>0.890</td>
+ <td>0.860</td>
+ <td>0.838</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>0.894</td>
+ <td>0.808</td>
+ <td>0.887</td>
+ <td>0.861</td>
+ <td>0.840</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>0.673</td>
+ <td>0.645</td>
+ <td>0.661</td>
+ <td>0.641</td>
+ <td>0.636</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph1 --->
+
+`it/s` — sentences per second.
+
+<!--- morph2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>115</b></td>
+ <td><b>532.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td>5.0</td>
+ <td>475</td>
+ <td>8087</td>
+ <td><b>285.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov</th>
+ <td><b>4.0</b></td>
+ <td>32</td>
+ <td>10240</td>
+ <td>90.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>20.0</td>
+ <td>1393</td>
+ <td>8704</td>
+ <td>85.0 (gpu)</td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td>45</td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>8.0</td>
+ <td>140</td>
+ <td>579</td>
+ <td>50.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>2.0</b></td>
+ <td>591</td>
+ <td>393</td>
+ <td><b>92.0</b></td>
+ </tr>
+ <tr>
+ <th>rnnmorph</th>
+ <td>8.7</td>
+ <td><b>10</b></td>
+ <td>289</td>
+ <td>16.6</td>
+ </tr>
+ <tr>
+ <th>maru</th>
+ <td>15.8</td>
+ <td>44</td>
+ <td>370</td>
+ <td>36.4</td>
+ </tr>
+ <tr>
+ <th>rupostagger</th>
+ <td>4.8</td>
+ <td><b>3</b></td>
+ <td><b>118</b></td>
+ <td>48.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- morph2 --->
+
+### Syntax
+
+Slovnet is compared to several existing syntax parsers: <a href="https://github.com/natasha/naeval#udpipe"><code>udpipe</code></a>, <a href="https://github.com/natasha/naeval#spacy"><code>spacy</code></a>, <a href="https://github.com/natasha/naeval#deeppavlov_bert_syntax"><code>deeppavlov</code></a>, <a href="https://github.com/natasha/naeval#stanza"><code>stanza</code></a>.
+
+<!--- syntax1 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr>
+ <th></th>
+ <th colspan="2" halign="left">news</th>
+ <th colspan="2" halign="left">wiki</th>
+ <th colspan="2" halign="left">fiction</th>
+ <th colspan="2" halign="left">social</th>
+ <th colspan="2" halign="left">poetry</th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ <th>uas</th>
+ <th>las</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td>0.907</td>
+ <td>0.880</td>
+ <td>0.775</td>
+ <td>0.718</td>
+ <td>0.806</td>
+ <td>0.776</td>
+ <td>0.726</td>
+ <td>0.656</td>
+ <td>0.542</td>
+ <td>0.469</td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>0.965</b></td>
+ <td><b>0.936</b></td>
+ <td><b>0.891</b></td>
+ <td><b>0.828</b></td>
+ <td><b>0.958</b></td>
+ <td><b>0.940</b></td>
+ <td><b>0.846</b></td>
+ <td><b>0.782</b></td>
+ <td><b>0.776</b></td>
+ <td><b>0.706</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td><b>0.962</b></td>
+ <td><b>0.910</b></td>
+ <td><b>0.882</b></td>
+ <td><b>0.786</b></td>
+ <td><b>0.963</b></td>
+ <td><b>0.929</b></td>
+ <td><b>0.844</b></td>
+ <td><b>0.761</b></td>
+ <td><b>0.784</b></td>
+ <td><b>0.691</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>0.873</td>
+ <td>0.823</td>
+ <td>0.622</td>
+ <td>0.531</td>
+ <td>0.910</td>
+ <td>0.876</td>
+ <td>0.700</td>
+ <td>0.624</td>
+ <td>0.625</td>
+ <td>0.534</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td><b>0.943</b></td>
+ <td><b>0.916</b></td>
+ <td><b>0.851</b></td>
+ <td><b>0.783</b></td>
+ <td>0.901</td>
+ <td>0.874</td>
+ <td><b>0.804</b></td>
+ <td><b>0.737</b></td>
+ <td>0.704</td>
+ <td><b>0.616</b></td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td>0.940</td>
+ <td>0.886</td>
+ <td>0.815</td>
+ <td>0.716</td>
+ <td><b>0.936</b></td>
+ <td><b>0.895</b></td>
+ <td>0.802</td>
+ <td>0.714</td>
+ <td><b>0.713</b></td>
+ <td>0.613</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax1 --->
+
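The `uas` and `las` columns above are presumably the standard unlabeled and labeled attachment scores: the share of tokens with the correct head, and the share with both the correct head and the correct relation. A minimal sketch under that assumption follows; the `attachment_scores` helper is hypothetical and not taken from Slovnet:

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_id, rel) pairs, one pair per token."""
    if not gold or len(gold) != len(pred):
        raise ValueError('gold and pred must be non-empty and aligned')
    total = len(gold)
    # UAS counts matching heads; LAS also requires a matching relation.
    heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled = sum(g == p for g, p in zip(gold, pred))
    return heads / total, labeled / total


gold = [(2, 'amod'), (3, 'nsubj'), (0, 'root'), (3, 'obj')]
pred = [(2, 'amod'), (3, 'obj'), (0, 'root'), (1, 'obj')]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Because a labeled match implies a head match, `las` can never exceed `uas`, which agrees with every row of the table.
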
+`it/s` — sentences per second.
+
+<!--- syntax2 --->
+<table border="0" class="dataframe">
+ <thead>
+ <tr style="text-align: right;">
+ <th></th>
+ <th>init, s</th>
+ <th>disk, mb</th>
+ <th>ram, mb</th>
+ <th>speed, it/s</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>slovnet</th>
+ <td><b>1.0</b></td>
+ <td><b>27</b></td>
+ <td><b>125</b></td>
+ <td><b>450.0</b></td>
+ </tr>
+ <tr>
+ <th>slovnet_bert</th>
+ <td><b>5.0</b></td>
+ <td>504</td>
+ <td>3427</td>
+ <td><b>200.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>deeppavlov_bert</th>
+ <td>34.0</td>
+ <td>1427</td>
+ <td>8704</td>
+ <td><b>75.0 (gpu)</b></td>
+ </tr>
+ <tr>
+ <th>udpipe</th>
+ <td>6.9</td>
+ <td><b>45</b></td>
+ <td><b>242</b></td>
+ <td>56.2</td>
+ </tr>
+ <tr>
+ <th>spacy</th>
+ <td>9.0</td>
+ <td><b>140</b></td>
+ <td><b>579</b></td>
+ <td>41.0</td>
+ </tr>
+ <tr>
+ <th>stanza</th>
+ <td><b>3.0</b></td>
+ <td>591</td>
+ <td>890</td>
+ <td>12.0</td>
+ </tr>
+ </tbody>
+</table>
+<!--- syntax2 --->
+
+## Support
+
+- Chat — https://telegram.me/natural_language_processing
+- Issues — https://github.com/natasha/slovnet/issues
+- Commercial support — https://lab.alexkuk.ru
+
+## Development
+
+Dev env
+
+```bash
+python -m venv ~/.venvs/natasha-slovnet
+source ~/.venvs/natasha-slovnet/bin/activate
+
+pip install -r requirements/dev.txt
+pip install -e .
+```
+
+Test
+
+```bash
+make test
+```
+
+Rent GPU
+
+```bash
+yc compute instance create \
+ --name gpu \
+ --zone ru-central1-a \
+ --network-interface subnet-name=default,nat-ip-version=ipv4 \
+ --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804-lts-ngc,type=network-ssd,size=20 \
+ --cores=8 \
+ --memory=96 \
+ --gpus=1 \
+ --ssh-key ~/.ssh/id_rsa.pub \
+ --folder-name default \
+ --platform-id gpu-standard-v1 \
+ --preemptible
+
+yc compute instance delete --name gpu
+```
+
+Setup instance
+
+```bash
+sudo locale-gen ru_RU.UTF-8
+
+sudo apt-get update
+sudo apt-get install -y \
+ python3-pip
+
+# grpcio long install ~10m, not using prebuilt wheel
+# "it is not compatible with this Python"
+sudo pip3 install -v \
+ jupyter \
+ tensorboard
+
+mkdir runs
+nohup tensorboard \
+ --logdir=runs \
+ --host=localhost \
+ --port=6006 \
+ --reload_interval=1 &
+
+nohup jupyter notebook \
+ --no-browser \
+ --allow-root \
+ --ip=localhost \
+ --port=8888 \
+ --NotebookApp.token='' \
+ --NotebookApp.password='' &
+
+ssh -Nf gpu -L 8888:localhost:8888 -L 6006:localhost:6006
+
+scp ~/.slovnet.json gpu:~
+rsync --exclude data -rv . gpu:~/slovnet
+rsync -u --exclude data -rv 'gpu:~/slovnet/*' .
+```
+
+Install dev
+
+```bash
+pip3 install -r slovnet/requirements/dev.txt -r slovnet/requirements/gpu.txt
+pip3 install -e slovnet
+```
+
+Release
+
+```bash
+# Update setup.py version
+
+git commit -am 'Up version'
+git tag v0.6.0
+
+git push
+git push --tags
+
+# GitHub Actions builds the dist and publishes it to PyPI
+```
+
+
+%prep
+%autosetup -n slovnet-0.6.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-slovnet -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.6.0-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..40e83b8
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+4e6d99673c377ff12f679dd08ac7749e slovnet-0.6.0.tar.gz