| field     | value                                    | date                      |
|-----------|------------------------------------------|---------------------------|
| author    | CoprDistGit <infra@openeuler.org>        | 2023-05-15 05:27:39 +0000 |
| committer | CoprDistGit <infra@openeuler.org>        | 2023-05-15 05:27:39 +0000 |
| commit    | 3a4c53267d7adf3a9a4cfbe9a96a24759ec2e607 |                           |
| tree      | 099b53f400e157290f91cb0eafb04347e6204a07 |                           |
| parent    | cf817ca5c79c9c158bee050fa38c11119e183467 |                           |
automatic import of python-mordl
| mode       | file              | lines added |
|------------|-------------------|-------------|
| -rw-r--r-- | .gitignore        | 1           |
| -rw-r--r-- | python-mordl.spec | 591         |
| -rw-r--r-- | sources           | 1           |
3 files changed, 593 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
/mordl-2.0.12.tar.gz

diff --git a/python-mordl.spec b/python-mordl.spec
new file mode 100644
index 0000000..fb1b6fd
--- /dev/null
+++ b/python-mordl.spec
@@ -0,0 +1,591 @@
%global _empty_manifest_terminate_build 0
Name:           python-mordl
Version:        2.0.12
Release:        1
Summary:        Morphological parser (POS, lemmata, NER etc.)
License:        BSD
URL:            https://github.com/fostroll/mordl
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/d5/23/a0c98ba2d3f8e6866ee1fca6cd8df73e7d3f8982223a0e215b201f7552d5/mordl-2.0.12.tar.gz
BuildArch:      noarch

Requires:       python3-corpuscula
Requires:       python3-gensim
Requires:       python3-junky
Requires:       python3-morra
Requires:       python3-numpy
Requires:       python3-Levenshtein
Requires:       python3-sklearn
Requires:       python3-torch
Requires:       python3-transformers

%description
<h2 align="center">MorDL: Morphological Tagger (POS, lemmata, NER etc.)</h2>
<a name="start"></a>

[](https://pypi.org/project/mordl/)
[](https://www.python.org/)
[](https://opensource.org/licenses/BSD-3-Clause)

***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.

Accuracy scores on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS:
`98.87%` (tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments we
used `seed=42`; other `seed` values may help to achieve better results. The
models' hyperparameters can also be tuned.

Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html):
* For inference on the *SynTagRus* test corpus with the predicted fields
emptied and all other fields left intact, the scores are the same as outlined
above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in: UPOS:
`99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.

For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper for it,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).

## Installation

### pip

***MorDL*** supports *Python 3.6* or later and *Transformers 4.3.3* or later.
To install via *pip*, run:
```sh
$ pip install mordl
```

If you already have a previous version of ***MorDL*** installed, run:
```sh
$ pip install mordl -U
```

### From Source

Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.

## Usage

Each tagger uses its own model, so the taggers can be used independently.
However, to achieve the best results, the FEATS tagger uses UPOS tags during
training, and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus,
for a fully untagged corpus, the tagging pipeline applies the taggers
serially, as shown below (assuming that the goal is NER and trained taggers
of all types are already available):

```python
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')

tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```

Any tagger in the pipeline may be replaced with a better one if you have it.
The drawback of separate taggers is that they take more memory: if all models
were created with BERT embeddings and are loaded simultaneously, they may
consume up to 9 GB of GPU memory. If that does not fit on your GPU, you can
use the **device** and **dataset_device** parameters during loading to
distribute the models across several GPUs.

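For instance, a minimal sketch of such a two-GPU split might look like the
following (the device strings and the assignment of models to GPUs are
illustrative only, and the exact semantics of **dataset_device** should be
checked against the ***MorDL*** docs):

```python
from mordl import UposTagger, FeatsTagger, NeTagger

# Illustrative split: UPOS and FEATS models on the first GPU, the NER model
# on the second one; 'cuda:1' assumes a second GPU is available.
tagger_u = UposTagger()
tagger_u.load('upos_model', device='cuda:0', dataset_device='cuda:0')

tagger_f = FeatsTagger()
tagger_f.load('feats_model', device='cuda:0', dataset_device='cuda:0')

tagger_n = NeTagger()
tagger_n.load('misc-ne_model', device='cuda:1', dataset_device='cuda:1')
```
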
Alternatively, if you only need to tag a corpus once, you can load the models
one at a time:

```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free the memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```

Don't use the same name for the input and output files when you call the
`.predict()` methods. Normally this causes no problem, because by default the
methods load the whole input file into memory before tagging. But if the
input file is large, you may want to use the **split** parameter so that the
methods process the file in parts. In that case, the first part of the tagged
data is saved before the next part is loaded, so identical input and output
names would lead to data loss.

Training is also simple. If you have training corpora and don't want to
experiment, just run:

```python
from mordl import UposTagger

tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)

stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```

This is the training pipeline for the UPOS tagger; the pipelines for the
other taggers are identical.

For a more complete picture of how to use the ***MorDL*** toolkit, refer to
the Python notebook with the pipeline example in the `examples` directory of
the ***MorDL*** GitHub repository. Detailed descriptions are also available
in the docs:

[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)

[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)

[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)

[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)

[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)

[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)

[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)

Also, you can find training pipelines for the different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).

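For instance, training the FEATS tagger mirrors the UPOS pipeline shown
above; a rough sketch (the corpus variables and model names are placeholders)
would be:

```python
from mordl import FeatsTagger

# train_corpus and dev_corpus are assumed to be loaded already; since the
# FEATS tagger uses UPOS tags during training, they must contain UPOS values.
tagger = FeatsTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)

stat = tagger.train('feats_model', device='cuda:0',
                    stage3_params={'save_as': 'feats_bert_model'})
```
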
This project was developed with a focus on Russian, but the few
language-specific nuances we use for it are unlikely to degrade the quality
of processing for other languages.

***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) when the
input/output is a file, or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
when the input/output is an object. It also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.

## License

***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for
more details.


%package -n python3-mordl
Summary:        Morphological parser (POS, lemmata, NER etc.)
Provides:       python-mordl
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip
%description -n python3-mordl
<h2 align="center">MorDL: Morphological Tagger (POS, lemmata, NER etc.)</h2>
<a name="start"></a>

[](https://pypi.org/project/mordl/)
[](https://www.python.org/)
[](https://opensource.org/licenses/BSD-3-Clause)

***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.

Accuracy scores on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS:
`98.87%` (tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments we
used `seed=42`; other `seed` values may help to achieve better results. The
models' hyperparameters can also be tuned.

Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html):
* For inference on the *SynTagRus* test corpus with the predicted fields
emptied and all other fields left intact, the scores are the same as outlined
above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in: UPOS:
`99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.

For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper for it,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).

## Installation

### pip

***MorDL*** supports *Python 3.6* or later and *Transformers 4.3.3* or later.
To install via *pip*, run:
```sh
$ pip install mordl
```

If you already have a previous version of ***MorDL*** installed, run:
```sh
$ pip install mordl -U
```

### From Source

Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.

## Usage

Each tagger uses its own model, so the taggers can be used independently.
However, to achieve the best results, the FEATS tagger uses UPOS tags during
training, and the LEMMA and NER taggers use both UPOS and FEATS tags.
Thus, for a fully untagged corpus, the tagging pipeline applies the taggers
serially, as shown below (assuming that the goal is NER and trained taggers
of all types are already available):

```python
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')

tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```

Any tagger in the pipeline may be replaced with a better one if you have it.
The drawback of separate taggers is that they take more memory: if all models
were created with BERT embeddings and are loaded simultaneously, they may
consume up to 9 GB of GPU memory. If that does not fit on your GPU, you can
use the **device** and **dataset_device** parameters during loading to
distribute the models across several GPUs. Alternatively, if you only need to
tag a corpus once, you can load the models one at a time:

```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free the memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```

Don't use the same name for the input and output files when you call the
`.predict()` methods. Normally this causes no problem, because by default the
methods load the whole input file into memory before tagging. But if the
input file is large, you may want to use the **split** parameter so that the
methods process the file in parts. In that case, the first part of the tagged
data is saved before the next part is loaded, so identical input and output
names would lead to data loss.

Training is also simple. If you have training corpora and don't want to
experiment, just run:

```python
from mordl import UposTagger

tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)

stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```

This is the training pipeline for the UPOS tagger; the pipelines for the
other taggers are identical.

For a more complete picture of how to use the ***MorDL*** toolkit, refer to
the Python notebook with the pipeline example in the `examples` directory of
the ***MorDL*** GitHub repository. Detailed descriptions are also available
in the docs:

[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)

[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)

[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)

[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)

[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)

[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)

[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)

Also, you can find training pipelines for the different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).

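As noted above, when the input file is too large to load into memory at once,
the `.predict()` methods accept a **split** parameter so the corpus can be
processed in parts. A minimal sketch of that workflow (the chunk size and its
exact meaning are assumptions to verify against the docs; note the distinct
input and output names):

```python
from mordl import UposTagger

tagger = UposTagger()
tagger.load('upos_model')
# Hypothetical chunk size: the input is processed part by part, and each
# tagged part is written out before the next one is read.
tagger.predict('huge_corpus.conllu', split=1000,
               save_to='huge_corpus_upos.conllu')
```
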
This project was developed with a focus on Russian, but the few
language-specific nuances we use for it are unlikely to degrade the quality
of processing for other languages.

***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) when the
input/output is a file, or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
when the input/output is an object. It also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.

## License

***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for
more details.


%package help
Summary:        Development documents and examples for mordl
Provides:       python3-mordl-doc
%description help
<h2 align="center">MorDL: Morphological Tagger (POS, lemmata, NER etc.)</h2>
<a name="start"></a>

[](https://pypi.org/project/mordl/)
[](https://www.python.org/)
[](https://opensource.org/licenses/BSD-3-Clause)

***MorDL*** is a tool for organizing a pipeline for complete morphological
sentence parsing (POS tagging, lemmatization, morphological feature tagging)
and named-entity recognition.

Accuracy scores on the *SynTagRus* test dataset: UPOS: `99.35%`; FEATS:
`98.87%` (tokens), `99.31%` (tags); LEMMA: `99.50%`. In all experiments we
used `seed=42`; other `seed` values may help to achieve better results. The
models' hyperparameters can also be tuned.

Validation with the
[official evaluation script](http://universaldependencies.org/conll18/conll18_ud_eval.py)
of the
[CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html):
* For inference on the *SynTagRus* test corpus with the predicted fields
emptied and all other fields left intact, the scores are the same as outlined
above.
* Applying the UPOS, FEATS and LEMMA taggers serially resulted in: UPOS:
`99.35%`; UFeats: `98.36%`; AllTags: `98.21%`; Lemmas: `98.88%`.

For completeness, we include that script in our distribution, so you can use
it to evaluate your own models, too. To simplify its use, we also provide a
wrapper for it,
[`mordl.conll18_ud_eval`](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#conll18).

## Installation

### pip

***MorDL*** supports *Python 3.6* or later and *Transformers 4.3.3* or later.
To install via *pip*, run:
```sh
$ pip install mordl
```

If you already have a previous version of ***MorDL*** installed, run:
```sh
$ pip install mordl -U
```

### From Source

Alternatively, you can install ***MorDL*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
```
This gives you access to examples that are not included in the *PyPI* package.

## Usage

Each tagger uses its own model, so the taggers can be used independently.
However, to achieve the best results, the FEATS tagger uses UPOS tags during
training, and the LEMMA and NER taggers use both UPOS and FEATS tags.
Thus, for a fully untagged corpus, the tagging pipeline applies the taggers
serially, as shown below (assuming that the goal is NER and trained taggers
of all types are already available):

```python
from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')

tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)
```

Any tagger in the pipeline may be replaced with a better one if you have it.
The drawback of separate taggers is that they take more memory: if all models
were created with BERT embeddings and are loaded simultaneously, they may
consume up to 9 GB of GPU memory. If that does not fit on your GPU, you can
use the **device** and **dataset_device** parameters during loading to
distribute the models across several GPUs. Alternatively, if you only need to
tag a corpus once, you can load the models one at a time:

```python
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free the memory before loading the next model
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
```

Don't use the same name for the input and output files when you call the
`.predict()` methods. Normally this causes no problem, because by default the
methods load the whole input file into memory before tagging. But if the
input file is large, you may want to use the **split** parameter so that the
methods process the file in parts. In that case, the first part of the tagged
data is saved before the next part is loaded, so identical input and output
names would lead to data loss.

Training is also simple. If you have training corpora and don't want to
experiment, just run:

```python
from mordl import UposTagger

tagger = UposTagger()
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)

stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
```

This is the training pipeline for the UPOS tagger; the pipelines for the
other taggers are identical.

For a more complete picture of how to use the ***MorDL*** toolkit, refer to
the Python notebook with the pipeline example in the `examples` directory of
the ***MorDL*** GitHub repository. Detailed descriptions are also available
in the docs:

[***MorDL*** Basics](https://github.com/fostroll/mordl/blob/master/doc/README_BASICS.md#start)

[Part of Speech Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_POS.md#start)

[Single Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEAT.md#start)

[Multiple Feature Tagging](https://github.com/fostroll/mordl/blob/master/doc/README_FEATS.md#start)

[Lemmata Prediction](https://github.com/fostroll/mordl/blob/master/doc/README_LEMMA.md#start)

[Named-entity Recognition](https://github.com/fostroll/mordl/blob/master/doc/README_NER.md#start)

[Supplements](https://github.com/fostroll/mordl/blob/master/doc/README_SUPPLEMENTS.md#start)

Also, you can find training pipelines for the different taggers in our
[example notebook](https://github.com/fostroll/mordl/blob/master/examples/mordl.ipynb).

This project was developed with a focus on Russian, but the few
language-specific nuances we use for it are unlikely to degrade the quality
of processing for other languages.

***MorDL*** supports
[*CoNLL-U*](https://universaldependencies.org/format.html) when the
input/output is a file, or
[*Parsed CoNLL-U*](https://github.com/fostroll/corpuscula/blob/master/doc/README_PARSED_CONLLU.md)
when the input/output is an object. It also accepts
[***Corpuscula***'s corpora wrappers](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
as input.

## License

***MorDL*** is released under the BSD License. See the
[LICENSE](https://github.com/fostroll/mordl/blob/master/LICENSE) file for
more details.


%prep
%autosetup -n mordl-2.0.12

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-mordl -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 2.0.12-1
- Package Spec generated

diff --git a/sources b/sources
@@ -0,0 +1 @@
8e507916f59c5862da88a7448a3106ab mordl-2.0.12.tar.gz