From 88c37e128bad6454f07435ff7f14d13dd8e9e54f Mon Sep 17 00:00:00 2001
From: CoprDistGit
Date: Tue, 11 Apr 2023 02:41:30 +0000
Subject: automatic import of python-fugashi

---
 .gitignore          |   1 +
 python-fugashi.spec | 472 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 sources             |   1 +
 3 files changed, 474 insertions(+)
 create mode 100644 python-fugashi.spec
 create mode 100644 sources

diff --git a/.gitignore b/.gitignore
index e69de29..8defca4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/fugashi-1.2.1.tar.gz
diff --git a/python-fugashi.spec b/python-fugashi.spec
new file mode 100644
index 0000000..c44ca56
--- /dev/null
+++ b/python-fugashi.spec
@@ -0,0 +1,472 @@
+%global _empty_manifest_terminate_build 0
+Name: python-fugashi
+Version: 1.2.1
+Release: 1
+Summary: A Cython MeCab wrapper for fast, pythonic Japanese tokenization.
+License: MIT
+URL: https://github.com/polm/fugashi
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/4d/aa/008562fae5099633dfe87b68627f2a532b4f92f5348f75edaeec25c990f4/fugashi-1.2.1.tar.gz
+
+Requires: python3-unidic
+Requires: python3-unidic-lite
+
+%description
+[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py)
+[![Current PyPI packages](https://badge.fury.io/py/fugashi.svg)](https://pypi.org/project/fugashi/)
+![Test Status](https://github.com/polm/fugashi/workflows/test-manylinux/badge.svg)
+[![PyPI - Downloads](https://img.shields.io/pypi/dm/fugashi)](https://pypi.org/project/fugashi/)
+![Supported Platforms](https://img.shields.io/badge/platforms-linux%20macosx%20windows-blue)
+
+# fugashi
+
+fugashi by Irasutoya
+
+fugashi is a Cython wrapper for [MeCab](https://taku910.github.io/mecab/), a
+Japanese tokenizer and morphological analysis tool. Wheels are provided for
+Linux, OSX, and Win64, and UniDic is [easy to install](#installing-a-dictionary).
+
+**issueを英語で書く必要はありません。**
+
+Check out the [interactive demo][], see the [blog post](https://www.dampfkraft.com/nlp/fugashi.html) for background
+on why fugashi exists and some of the design decisions, or see [this
+guide][guide] for a basic introduction to Japanese tokenization.
+
+[guide]: https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html
+[interactive demo]: https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py
+
+If you are on an unsupported platform (like PowerPC), you'll need to install
+MeCab first. It's recommended you install [from
+source](https://github.com/taku910/mecab). If you need to build from source on
+Windows, [@chezou's fork](https://github.com/chezou/mecab) is recommended; see
+[issue #44](https://github.com/polm/fugashi/issues/44#issuecomment-954426115)
+for an explanation of the problems with the official repo.
+
+## Usage
+
+```python
+from fugashi import Tagger
+
+tagger = Tagger('-Owakati')
+text = "麩菓子は、麩を主材料とした日本の菓子。"
+tagger.parse(text)
+# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
+for word in tagger(text):
+    print(word, word.feature.lemma, word.pos, sep='\t')
+    # "feature" is the Unidic feature data as a named tuple
+```
+
+## Installing a Dictionary
+
+fugashi requires a dictionary. [UniDic](https://unidic.ninjal.ac.jp/) is
+recommended, and two easy-to-install versions are provided.
+
+ - [unidic-lite](https://github.com/polm/unidic-lite), a slightly modified version 2.1.2 of Unidic (from 2013) that's relatively small
+ - [unidic](https://github.com/polm/unidic-py), the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step
+
+If you just want to make sure things work you can start with `unidic-lite`, but
+for more serious processing `unidic` is recommended. For production use you'll
+generally want to generate your own dictionary too; for details see the [MeCab
+documentation](https://taku910.github.io/mecab/learn.html).
+
+To get either of these dictionaries, you can install them directly using `pip`
+or do the below:
+
+```sh
+pip install fugashi[unidic-lite]
+
+# The full version of UniDic requires a separate download step
+pip install fugashi[unidic]
+python -m unidic download
+```
+
+For more information on the different MeCab dictionaries available, see [this article](https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html).
+
+## Dictionary Use
+
+fugashi is written with the assumption you'll use Unidic to process Japanese,
+but it supports arbitrary dictionaries.
+
+If you're using a dictionary besides Unidic you can use the GenericTagger like this:
+
+```python
+from fugashi import GenericTagger
+tagger = GenericTagger()
+
+# parse can be used as normal
+tagger.parse('something')
+# features from the dictionary can be accessed by field numbers
+for word in tagger(text):
+    print(word.surface, word.feature[0])
+```
+
+You can also create a dictionary wrapper to get feature information as a named tuple.
+
+```python
+from fugashi import GenericTagger, create_feature_wrapper
+CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
+tagger = GenericTagger(wrapper=CustomFeatures)
+for word in tagger.parseToNodeList(text):
+    print(word.surface, word.feature.alpha)
+```
+
+## Citation
+
+If you use fugashi in research, it would be appreciated if you cite this paper. You can read it at [the ACL Anthology](https://www.aclweb.org/anthology/2020.nlposs-1.7/) or [on Arxiv](https://arxiv.org/abs/2010.06858).
+
+    @inproceedings{mccann-2020-fugashi,
+        title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
+        author = "McCann, Paul",
+        booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
+        month = nov,
+        year = "2020",
+        address = "Online",
+        publisher = "Association for Computational Linguistics",
+        url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
+        pages = "44--51",
+        abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
+    }
+
+## Alternatives
+
+If you have a problem with fugashi feel free to open an issue. However, there
+are some cases where it might be better to use a different library.
+
+- If you don't want to deal with installing MeCab at all, try [SudachiPy](https://github.com/WorksApplications/sudachi.rs).
+- If you need to work with Korean, try [pymecab-ko](https://github.com/NoUnique/pymecab-ko) or [KoNLPy](https://konlpy.org/en/latest/).
+
+## License and Copyright Notice
+
+fugashi is released under the terms of the [MIT license](./LICENSE). Please
+copy it far and wide.
+
+fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries.
+MeCab is copyrighted free software by Taku Kudo `` and Nippon
+Telegraph and Telephone Corporation, and is redistributed under the [BSD
+License](./LICENSE.mecab).
+
+
+%package -n python3-fugashi
+Summary: A Cython MeCab wrapper for fast, pythonic Japanese tokenization.
+Provides: python-fugashi
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+BuildRequires: python3-cffi
+BuildRequires: gcc
+BuildRequires: gdb
+%description -n python3-fugashi
+[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py)
+[![Current PyPI packages](https://badge.fury.io/py/fugashi.svg)](https://pypi.org/project/fugashi/)
+![Test Status](https://github.com/polm/fugashi/workflows/test-manylinux/badge.svg)
+[![PyPI - Downloads](https://img.shields.io/pypi/dm/fugashi)](https://pypi.org/project/fugashi/)
+![Supported Platforms](https://img.shields.io/badge/platforms-linux%20macosx%20windows-blue)
+
+# fugashi
+
+fugashi by Irasutoya
+
+fugashi is a Cython wrapper for [MeCab](https://taku910.github.io/mecab/), a
+Japanese tokenizer and morphological analysis tool. Wheels are provided for
+Linux, OSX, and Win64, and UniDic is [easy to install](#installing-a-dictionary).
+
+**issueを英語で書く必要はありません。**
+
+Check out the [interactive demo][], see the [blog post](https://www.dampfkraft.com/nlp/fugashi.html) for background
+on why fugashi exists and some of the design decisions, or see [this
+guide][guide] for a basic introduction to Japanese tokenization.
+
+[guide]: https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html
+[interactive demo]: https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py
+
+If you are on an unsupported platform (like PowerPC), you'll need to install
+MeCab first. It's recommended you install [from
+source](https://github.com/taku910/mecab). If you need to build from source on
+Windows, [@chezou's fork](https://github.com/chezou/mecab) is recommended; see
+[issue #44](https://github.com/polm/fugashi/issues/44#issuecomment-954426115)
+for an explanation of the problems with the official repo.
+
+## Usage
+
+```python
+from fugashi import Tagger
+
+tagger = Tagger('-Owakati')
+text = "麩菓子は、麩を主材料とした日本の菓子。"
+tagger.parse(text)
+# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
+for word in tagger(text):
+    print(word, word.feature.lemma, word.pos, sep='\t')
+    # "feature" is the Unidic feature data as a named tuple
+```
+
+## Installing a Dictionary
+
+fugashi requires a dictionary. [UniDic](https://unidic.ninjal.ac.jp/) is
+recommended, and two easy-to-install versions are provided.
+
+ - [unidic-lite](https://github.com/polm/unidic-lite), a slightly modified version 2.1.2 of Unidic (from 2013) that's relatively small
+ - [unidic](https://github.com/polm/unidic-py), the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step
+
+If you just want to make sure things work you can start with `unidic-lite`, but
+for more serious processing `unidic` is recommended. For production use you'll
+generally want to generate your own dictionary too; for details see the [MeCab
+documentation](https://taku910.github.io/mecab/learn.html).
+
+To get either of these dictionaries, you can install them directly using `pip`
+or do the below:
+
+```sh
+pip install fugashi[unidic-lite]
+
+# The full version of UniDic requires a separate download step
+pip install fugashi[unidic]
+python -m unidic download
+```
+
+For more information on the different MeCab dictionaries available, see [this article](https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html).
+
+## Dictionary Use
+
+fugashi is written with the assumption you'll use Unidic to process Japanese,
+but it supports arbitrary dictionaries.
+
+If you're using a dictionary besides Unidic you can use the GenericTagger like this:
+
+```python
+from fugashi import GenericTagger
+tagger = GenericTagger()
+
+# parse can be used as normal
+tagger.parse('something')
+# features from the dictionary can be accessed by field numbers
+for word in tagger(text):
+    print(word.surface, word.feature[0])
+```
+
+You can also create a dictionary wrapper to get feature information as a named tuple.
+
+```python
+from fugashi import GenericTagger, create_feature_wrapper
+CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
+tagger = GenericTagger(wrapper=CustomFeatures)
+for word in tagger.parseToNodeList(text):
+    print(word.surface, word.feature.alpha)
+```
+
+## Citation
+
+If you use fugashi in research, it would be appreciated if you cite this paper. You can read it at [the ACL Anthology](https://www.aclweb.org/anthology/2020.nlposs-1.7/) or [on Arxiv](https://arxiv.org/abs/2010.06858).
+
+    @inproceedings{mccann-2020-fugashi,
+        title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
+        author = "McCann, Paul",
+        booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
+        month = nov,
+        year = "2020",
+        address = "Online",
+        publisher = "Association for Computational Linguistics",
+        url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
+        pages = "44--51",
+        abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
+    }
+
+## Alternatives
+
+If you have a problem with fugashi feel free to open an issue. However, there
+are some cases where it might be better to use a different library.
+
+- If you don't want to deal with installing MeCab at all, try [SudachiPy](https://github.com/WorksApplications/sudachi.rs).
+- If you need to work with Korean, try [pymecab-ko](https://github.com/NoUnique/pymecab-ko) or [KoNLPy](https://konlpy.org/en/latest/).
+
+## License and Copyright Notice
+
+fugashi is released under the terms of the [MIT license](./LICENSE). Please
+copy it far and wide.
+
+fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries.
+MeCab is copyrighted free software by Taku Kudo `` and Nippon
+Telegraph and Telephone Corporation, and is redistributed under the [BSD
+License](./LICENSE.mecab).
+
+
+%package help
+Summary: Development documents and examples for fugashi
+Provides: python3-fugashi-doc
+%description help
+[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py)
+[![Current PyPI packages](https://badge.fury.io/py/fugashi.svg)](https://pypi.org/project/fugashi/)
+![Test Status](https://github.com/polm/fugashi/workflows/test-manylinux/badge.svg)
+[![PyPI - Downloads](https://img.shields.io/pypi/dm/fugashi)](https://pypi.org/project/fugashi/)
+![Supported Platforms](https://img.shields.io/badge/platforms-linux%20macosx%20windows-blue)
+
+# fugashi
+
+fugashi by Irasutoya
+
+fugashi is a Cython wrapper for [MeCab](https://taku910.github.io/mecab/), a
+Japanese tokenizer and morphological analysis tool. Wheels are provided for
+Linux, OSX, and Win64, and UniDic is [easy to install](#installing-a-dictionary).
+
+**issueを英語で書く必要はありません。**
+
+Check out the [interactive demo][], see the [blog post](https://www.dampfkraft.com/nlp/fugashi.html) for background
+on why fugashi exists and some of the design decisions, or see [this
+guide][guide] for a basic introduction to Japanese tokenization.
+
+[guide]: https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html
+[interactive demo]: https://share.streamlit.io/polm/fugashi-streamlit-demo/main/demo.py
+
+If you are on an unsupported platform (like PowerPC), you'll need to install
+MeCab first. It's recommended you install [from
+source](https://github.com/taku910/mecab). If you need to build from source on
+Windows, [@chezou's fork](https://github.com/chezou/mecab) is recommended; see
+[issue #44](https://github.com/polm/fugashi/issues/44#issuecomment-954426115)
+for an explanation of the problems with the official repo.
+
+## Usage
+
+```python
+from fugashi import Tagger
+
+tagger = Tagger('-Owakati')
+text = "麩菓子は、麩を主材料とした日本の菓子。"
+tagger.parse(text)
+# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
+for word in tagger(text):
+    print(word, word.feature.lemma, word.pos, sep='\t')
+    # "feature" is the Unidic feature data as a named tuple
+```
+
+## Installing a Dictionary
+
+fugashi requires a dictionary. [UniDic](https://unidic.ninjal.ac.jp/) is
+recommended, and two easy-to-install versions are provided.
+
+ - [unidic-lite](https://github.com/polm/unidic-lite), a slightly modified version 2.1.2 of Unidic (from 2013) that's relatively small
+ - [unidic](https://github.com/polm/unidic-py), the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step
+
+If you just want to make sure things work you can start with `unidic-lite`, but
+for more serious processing `unidic` is recommended. For production use you'll
+generally want to generate your own dictionary too; for details see the [MeCab
+documentation](https://taku910.github.io/mecab/learn.html).
+
+To get either of these dictionaries, you can install them directly using `pip`
+or do the below:
+
+```sh
+pip install fugashi[unidic-lite]
+
+# The full version of UniDic requires a separate download step
+pip install fugashi[unidic]
+python -m unidic download
+```
+
+For more information on the different MeCab dictionaries available, see [this article](https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html).
+
+## Dictionary Use
+
+fugashi is written with the assumption you'll use Unidic to process Japanese,
+but it supports arbitrary dictionaries.
+
+If you're using a dictionary besides Unidic you can use the GenericTagger like this:
+
+```python
+from fugashi import GenericTagger
+tagger = GenericTagger()
+
+# parse can be used as normal
+tagger.parse('something')
+# features from the dictionary can be accessed by field numbers
+for word in tagger(text):
+    print(word.surface, word.feature[0])
+```
+
+You can also create a dictionary wrapper to get feature information as a named tuple.
+
+```python
+from fugashi import GenericTagger, create_feature_wrapper
+CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
+tagger = GenericTagger(wrapper=CustomFeatures)
+for word in tagger.parseToNodeList(text):
+    print(word.surface, word.feature.alpha)
+```
+
+## Citation
+
+If you use fugashi in research, it would be appreciated if you cite this paper. You can read it at [the ACL Anthology](https://www.aclweb.org/anthology/2020.nlposs-1.7/) or [on Arxiv](https://arxiv.org/abs/2010.06858).
+
+    @inproceedings{mccann-2020-fugashi,
+        title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
+        author = "McCann, Paul",
+        booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
+        month = nov,
+        year = "2020",
+        address = "Online",
+        publisher = "Association for Computational Linguistics",
+        url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
+        pages = "44--51",
+        abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
+    }
+
+## Alternatives
+
+If you have a problem with fugashi feel free to open an issue. However, there
+are some cases where it might be better to use a different library.
+
+- If you don't want to deal with installing MeCab at all, try [SudachiPy](https://github.com/WorksApplications/sudachi.rs).
+- If you need to work with Korean, try [pymecab-ko](https://github.com/NoUnique/pymecab-ko) or [KoNLPy](https://konlpy.org/en/latest/).
+
+## License and Copyright Notice
+
+fugashi is released under the terms of the [MIT license](./LICENSE). Please
+copy it far and wide.
+
+fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries.
+MeCab is copyrighted free software by Taku Kudo `` and Nippon
+Telegraph and Telephone Corporation, and is redistributed under the [BSD
+License](./LICENSE.mecab).
+
+
+%prep
+%autosetup -n fugashi-1.2.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-fugashi -f filelist.lst
+%dir %{python3_sitearch}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Apr 11 2023 Python_Bot - 1.2.1-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..225a5d2
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+9edab5c67c3258c8a5724fafae6bebc4 fugashi-1.2.1.tar.gz
--
cgit v1.2.3
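
Note on the `%install` scriptlet in the spec above: it walks the build root with `find` and writes every regular file under `usr/lib`, `usr/lib64`, `usr/bin` and `usr/sbin` to `filelist.lst`, and every man page (with `.gz` appended) to `doclist.lst`, so that the `%files -f` sections can pick them up. Below is a minimal Python sketch of that same logic, written only to make the behavior easier to follow. It is not part of the package or of rpmbuild. The function name `collect_file_lists`, the temporary directory standing in for `%{buildroot}`, and the sample paths are all hypothetical. The results are sorted for reproducibility, which `find` itself does not guarantee.

```python
import os
import tempfile


def collect_file_lists(buildroot):
    """Return (files, docs) in the style of the spec's find commands.

    files: absolute install paths of regular files under usr/lib,
           usr/lib64, usr/bin and usr/sbin.
    docs:  man page paths with ".gz" appended, as the scriptlet does
           (presumably because man pages are compressed later in the build).
    """
    def regular_files(subdir):
        top = os.path.join(buildroot, subdir)
        found = []
        # os.walk yields nothing for a missing directory, which plays
        # the role of the scriptlet's `if [ -d ... ]` guards.
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # `find -type f` matches regular files only, not symlinks.
                if os.path.isfile(full) and not os.path.islink(full):
                    rel = os.path.relpath(full, buildroot)
                    found.append("/" + rel.replace(os.sep, "/"))
        return sorted(found)

    files = []
    for subdir in ("usr/lib", "usr/lib64", "usr/bin", "usr/sbin"):
        files.extend(regular_files(subdir))
    docs = [path + ".gz" for path in regular_files("usr/share/man")]
    return files, docs


# Hypothetical build root contents, for illustration only.
with tempfile.TemporaryDirectory() as root:
    for rel in (
        "usr/lib64/python3/site-packages/fugashi/__init__.py",
        "usr/bin/fugashi",
        "usr/share/man/man1/fugashi.1",
    ):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    files, docs = collect_file_lists(root)

print(files)
print(docs)
```

Because a missing directory simply contributes no entries, `doclist.lst` can legitimately end up empty, which is why the scriptlet runs `touch doclist.lst` before the man page check.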