%global _empty_manifest_terminate_build 0
Name: python-mordl
Version: 2.0.12
Release: 1
Summary: Morphological parser (POS, lemmata, NER etc.)
License: BSD
URL: https://github.com/fostroll/mordl
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d5/23/a0c98ba2d3f8e6866ee1fca6cd8df73e7d3f8982223a0e215b201f7552d5/mordl-2.0.12.tar.gz
BuildArch: noarch
Requires: python3-corpuscula
Requires: python3-gensim
Requires: python3-junky
Requires: python3-morra
Requires: python3-numpy
Requires: python3-Levenshtein
Requires: python3-sklearn
Requires: python3-torch
Requires: python3-transformers
%description
MorDL: Morphological Tagger (POS, lemmata, NER etc.)
[![PyPI Version](https://img.shields.io/pypi/v/mordl?color=blue)](https://pypi.org/project/mordl/)
[![Python Version](https://img.shields.io/pypi/pyversions/mordl?color=blue)](https://www.python.org/)
[![License: BSD-3](https://img.shields.io/badge/License-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause)
***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.
Scores (accuracy) on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS: `98.87%`
(tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments, we used
`seed=42`; other `seed` values may help to achieve better results. The models'
hyperparameters can also be tuned.
Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html)
gave the following results:
* For inference on the *SynTagRus* test corpus, when the predicted fields were
emptied and all other fields were left intact, the scores are the same as
those outlined above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in the
following scores: UPOS: `99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.
For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).
## Installation
### pip
***MorDL*** supports *Python 3.6* and *Transformers 4.3.3* or later. To
install via *pip*, run:
```sh
$ pip install mordl
```
If you already have a previous version of ***MorDL*** installed, upgrade it with:
```sh
$ pip install mordl -U
```
### From Source
Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.
## Usage
Our taggers use separate models, so they can be used independently. However,
to achieve the best results, the FEATS tagger uses UPOS tags during training,
and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus, for a fully
untagged corpus, the tagging pipeline applies the taggers serially, as shown
below (assuming that our goal is NER and we already have trained taggers of
all types):
```python
from mordl import UposTagger, FeatsTagger, NeTagger
tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')
tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```
Any tagger in our pipeline may be replaced with a better one if you have it.
The weakness of separate taggers is that they take more space. If all the
models were created with BERT embeddings and you load them into memory
simultaneously, they may take up to 9 GB of GPU memory. If that does not fit
on your GPU, you can use the **device** and **dataset_device** parameters
during loading to distribute your models across several GPUs (see the sketch
after the next code block). Alternatively, if you just need to tag some corpus
once, you may load the models serially:
```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```
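Returning to the first option (keeping all three models loaded at once), below
is a minimal sketch of spreading them across two GPUs. It assumes `.load()`
accepts the **device** and **dataset_device** keyword arguments mentioned
above; check the signatures of your installed ***MorDL*** version before
relying on it.
```python
# Hedged sketch: distribute three BERT-based models over two GPUs so that
# they fit into memory together. The device strings are only an example.
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model', device='cuda:0', dataset_device='cuda:0')
tagger_f.load('feats_model', device='cuda:0', dataset_device='cuda:1')
tagger_n.load('misc-ne_model', device='cuda:1', dataset_device='cuda:1')
```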
Don't use identical input and output file names when you call the
`.predict()` methods. Normally there is no problem, because by default the
methods load the whole input file into memory before tagging. But if the input
file is large, you may want to use the **split** parameter so that the methods
process the file in parts. In that case, the first part of the tagged data is
saved before the next part is loaded, so identical names will entail data
loss.
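As a minimal sketch of that advice, the call below processes a large corpus in
parts and writes the result under a different name. The exact semantics of the
**split** value (assumed here to be the number of sentences per part) should
be checked against the documentation of your ***MorDL*** version.
```python
# Hedged sketch: chunked prediction with distinct input and output names.
# The value passed to `split` is an assumption, not a recommendation.
tagger.predict('large_untagged.conllu', split=2000,
               save_to='large_untagged_upos.conllu')
```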
The training process is also simple. If you have training corpora and don't
want to run any experiments, just run:
```python
from mordl import UposTagger
tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)
stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```
This is the training pipeline for the UPOS tagger; the pipelines for the other
taggers are identical.
For a more complete understanding of the ***MorDL*** toolkit, refer to the
Python notebook with the pipeline example in the `examples` directory of the
***MorDL*** GitHub repository. Detailed descriptions are also available in
the docs:
[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)
[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)
[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)
[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)
[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)
[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)
[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)
Also, you can find training pipelines for different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).
This project was developed with a focus on the Russian language, but the few
language-specific nuances we use are unlikely to worsen the quality of
processing for other languages.
***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) (if the input/output
is a file) or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
(if the input/output is an object). ***MorDL*** also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.
## License
***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for more
details.
%package -n python3-mordl
Summary: Morphological parser (POS, lemmata, NER etc.)
Provides: python-mordl
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-mordl
MorDL: Morphological Tagger (POS, lemmata, NER etc.)
[![PyPI Version](https://img.shields.io/pypi/v/mordl?color=blue)](https://pypi.org/project/mordl/)
[![Python Version](https://img.shields.io/pypi/pyversions/mordl?color=blue)](https://www.python.org/)
[![License: BSD-3](https://img.shields.io/badge/License-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause)
***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.
Scores (accuracy) on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS: `98.87%`
(tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments, we used
`seed=42`; other `seed` values may help to achieve better results. The models'
hyperparameters can also be tuned.
Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html)
gave the following results:
* For inference on the *SynTagRus* test corpus, when the predicted fields were
emptied and all other fields were left intact, the scores are the same as
those outlined above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in the
following scores: UPOS: `99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.
For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).
## Installation
### pip
***MorDL*** supports *Python 3.6* and *Transformers 4.3.3* or later. To
install via *pip*, run:
```sh
$ pip install mordl
```
If you already have a previous version of ***MorDL*** installed, upgrade it with:
```sh
$ pip install mordl -U
```
### From Source
Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.
## Usage
Our taggers use separate models, so they can be used independently. However,
to achieve the best results, the FEATS tagger uses UPOS tags during training,
and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus, for a fully
untagged corpus, the tagging pipeline applies the taggers serially, as shown
below (assuming that our goal is NER and we already have trained taggers of
all types):
```python
from mordl import UposTagger, FeatsTagger, NeTagger
tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')
tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```
Any tagger in our pipeline may be replaced with a better one if you have it.
The weakness of separate taggers is that they take more space. If all the
models were created with BERT embeddings and you load them into memory
simultaneously, they may take up to 9 GB of GPU memory. If that does not fit
on your GPU, you can use the **device** and **dataset_device** parameters
during loading to distribute your models across several GPUs (see the sketch
after the next code block). Alternatively, if you just need to tag some corpus
once, you may load the models serially:
```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```
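Returning to the first option (keeping all three models loaded at once), below
is a minimal sketch of spreading them across two GPUs. It assumes `.load()`
accepts the **device** and **dataset_device** keyword arguments mentioned
above; check the signatures of your installed ***MorDL*** version before
relying on it.
```python
# Hedged sketch: distribute three BERT-based models over two GPUs so that
# they fit into memory together. The device strings are only an example.
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model', device='cuda:0', dataset_device='cuda:0')
tagger_f.load('feats_model', device='cuda:0', dataset_device='cuda:1')
tagger_n.load('misc-ne_model', device='cuda:1', dataset_device='cuda:1')
```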
Don't use identical input and output file names when you call the
`.predict()` methods. Normally there is no problem, because by default the
methods load the whole input file into memory before tagging. But if the input
file is large, you may want to use the **split** parameter so that the methods
process the file in parts. In that case, the first part of the tagged data is
saved before the next part is loaded, so identical names will entail data
loss.
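As a minimal sketch of that advice, the call below processes a large corpus in
parts and writes the result under a different name. The exact semantics of the
**split** value (assumed here to be the number of sentences per part) should
be checked against the documentation of your ***MorDL*** version.
```python
# Hedged sketch: chunked prediction with distinct input and output names.
# The value passed to `split` is an assumption, not a recommendation.
tagger.predict('large_untagged.conllu', split=2000,
               save_to='large_untagged_upos.conllu')
```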
The training process is also simple. If you have training corpora and don't
want to run any experiments, just run:
```python
from mordl import UposTagger
tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)
stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```
This is the training pipeline for the UPOS tagger; the pipelines for the other
taggers are identical.
For a more complete understanding of the ***MorDL*** toolkit, refer to the
Python notebook with the pipeline example in the `examples` directory of the
***MorDL*** GitHub repository. Detailed descriptions are also available in
the docs:
[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)
[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)
[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)
[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)
[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)
[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)
[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)
Also, you can find training pipelines for different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).
This project was developed with a focus on the Russian language, but the few
language-specific nuances we use are unlikely to worsen the quality of
processing for other languages.
***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) (if the input/output
is a file) or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
(if the input/output is an object). ***MorDL*** also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.
## License
***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for more
details.
%package help
Summary: Development documents and examples for mordl
Provides: python3-mordl-doc
%description help
MorDL: Morphological Tagger (POS, lemmata, NER etc.)
[![PyPI Version](https://img.shields.io/pypi/v/mordl?color=blue)](https://pypi.org/project/mordl/)
[![Python Version](https://img.shields.io/pypi/pyversions/mordl?color=blue)](https://www.python.org/)
[![License: BSD-3](https://img.shields.io/badge/License-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause)
***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.
Scores (accuracy) on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS: `98.87%`
(tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments, we used
`seed=42`; other `seed` values may help to achieve better results. The models'
hyperparameters can also be tuned.
Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html)
gave the following results:
* For inference on the *SynTagRus* test corpus, when the predicted fields were
emptied and all other fields were left intact, the scores are the same as
those outlined above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in the
following scores: UPOS: `99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.
For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).
## Installation
### pip
***MorDL*** supports *Python 3.6* and *Transformers 4.3.3* or later. To
install via *pip*, run:
```sh
$ pip install mordl
```
If you already have a previous version of ***MorDL*** installed, upgrade it with:
```sh
$ pip install mordl -U
```
### From Source
Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.
## Usage
Our taggers use separate models, so they can be used independently. However,
to achieve the best results, the FEATS tagger uses UPOS tags during training,
and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus, for a fully
untagged corpus, the tagging pipeline applies the taggers serially, as shown
below (assuming that our goal is NER and we already have trained taggers of
all types):
```python
from mordl import UposTagger, FeatsTagger, NeTagger
tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')
tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```
Any tagger in our pipeline may be replaced with a better one if you have it.
The weakness of separate taggers is that they take more space. If all the
models were created with BERT embeddings and you load them into memory
simultaneously, they may take up to 9 GB of GPU memory. If that does not fit
on your GPU, you can use the **device** and **dataset_device** parameters
during loading to distribute your models across several GPUs (see the sketch
after the next code block). Alternatively, if you just need to tag some corpus
once, you may load the models serially:
```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```
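Returning to the first option (keeping all three models loaded at once), below
is a minimal sketch of spreading them across two GPUs. It assumes `.load()`
accepts the **device** and **dataset_device** keyword arguments mentioned
above; check the signatures of your installed ***MorDL*** version before
relying on it.
```python
# Hedged sketch: distribute three BERT-based models over two GPUs so that
# they fit into memory together. The device strings are only an example.
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model', device='cuda:0', dataset_device='cuda:0')
tagger_f.load('feats_model', device='cuda:0', dataset_device='cuda:1')
tagger_n.load('misc-ne_model', device='cuda:1', dataset_device='cuda:1')
```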
Don't use identical input and output file names when you call the
`.predict()` methods. Normally there is no problem, because by default the
methods load the whole input file into memory before tagging. But if the input
file is large, you may want to use the **split** parameter so that the methods
process the file in parts. In that case, the first part of the tagged data is
saved before the next part is loaded, so identical names will entail data
loss.
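As a minimal sketch of that advice, the call below processes a large corpus in
parts and writes the result under a different name. The exact semantics of the
**split** value (assumed here to be the number of sentences per part) should
be checked against the documentation of your ***MorDL*** version.
```python
# Hedged sketch: chunked prediction with distinct input and output names.
# The value passed to `split` is an assumption, not a recommendation.
tagger.predict('large_untagged.conllu', split=2000,
               save_to='large_untagged_upos.conllu')
```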
The training process is also simple. If you have training corpora and don't
want to run any experiments, just run:
```python
from mordl import UposTagger
tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)
stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```
This is the training pipeline for the UPOS tagger; the pipelines for the other
taggers are identical.
For a more complete understanding of the ***MorDL*** toolkit, refer to the
Python notebook with the pipeline example in the `examples` directory of the
***MorDL*** GitHub repository. Detailed descriptions are also available in
the docs:
[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)
[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)
[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)
[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)
[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)
[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)
[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)
Also, you can find training pipelines for different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).
This project was developed with a focus on the Russian language, but the few
language-specific nuances we use are unlikely to worsen the quality of
processing for other languages.
***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) (if the input/output
is a file) or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
(if the input/output is an object). ***MorDL*** also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.
## License
***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for more
details.
%prep
%autosetup -n mordl-2.0.12
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-mordl -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Mon May 15 2023 Python_Bot - 2.0.12-1
- Package Spec generated