%global _empty_manifest_terminate_build 0
Name:		python-g2pM
Version:	0.1.2.5
Release:	1
Summary:	g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese
License:	Apache License 2.0
URL:		https://github.com/kakaobrain/g2pM
Source0:	https://mirrors.aliyun.com/pypi/web/packages/2e/d6/06b20ffa5ea2e2a6c55ada6bf9503c1ee7bae2c64b3f6aa6107396a0a657/g2pM-0.1.2.5.tar.gz
BuildArch:	noarch

%description
# g2pM
[![Release](https://img.shields.io/badge/release-v0.1.2.5-green)](https://pypi.org/project/g2pM/)
[![Downloads](https://pepy.tech/badge/g2pm)](https://pepy.tech/project/g2pm)
[![license](https://img.shields.io/badge/license-Apache%202.0-red)](https://github.com/kakaobrain/g2pM/blob/master/LICENSE)

This is the official repository of our paper [A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136) (**Interspeech 2020**).

## Install
```
pip install g2pM
```

## The CPP Dataset
The data folder contains [train/dev/test].sent and [train/dev/test].lb files. In a *.sent file, each line corresponds to one sentence, and a special symbol ▁ (U+2581) is placed to the left and right of one polyphonic character. The pronunciation of that character is on the same line of the corresponding *.lb file. A sentence may contain several polyphonic characters, but we randomly choose only one of them to annotate.

## Requirements
* python >= 3.6
* numpy

## Usage
If you want to remove the digits that denote tones, set tone=False; the default is tone=True.
If you want to split non-Chinese characters (e.g. digits) into single characters, set char_split=True; the default is char_split=False.
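Since a tone digit is always the last character of a toned syllable, toned output can also be converted to the toneless form after the fact. The sketch below is a hypothetical helper, not part of the g2pM API:

```python
import re

def strip_tones(syllables):
    """Remove a trailing tone digit (1-5) from each pinyin syllable.

    Tokens with no letters (punctuation, digit strings like '20') are
    left unchanged. Hypothetical helper, not part of the g2pM package.
    """
    return [re.sub(r"^([a-z]+)[1-5]$", r"\1", s) for s in syllables]

print(strip_tones(["ran2", "er2", ",", "20", "le5"]))
# → ['ran', 'er', ',', '20', 'le']
```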
```
>>> from g2pM import G2pM
>>> model = G2pM()
>>> sentence = "然而,他红了20年以后,他竟退出了大家的视线。"
>>> model(sentence, tone=True, char_split=False)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
>>> model(sentence, tone=False, char_split=False)
['ran', 'er', ',', 'ta', 'hong', 'le', '2', '0', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']
>>> model(sentence, tone=True, char_split=True)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '2', '0', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
```

## Model Size
| Layer                 | Size    |
|-----------------------|---------|
| Embedding             | 64      |
| LSTM x1               | 64      |
| Fully-Connected x2    | 64      |
| Total # of parameters | 477,228 |
| Model size            | 1.7 MB  |
| Package size          | 2.1 MB  |

## Evaluation Result
Accuracy (%) on the dev and test sets:

| Model             | Dev.      | Test      |
| :---------------- | --------: | :-------: |
| g2pC              | 84.84     | 84.45     |
| xpinyin (0.5.6)   | 78.74     | 78.56     |
| pypinyin (0.36.0) | 85.44     | 86.13     |
| Majority Vote     | 92.15     | 92.08     |
| Chinese BERT      | **97.95** | **97.85** |
| Ours              | 97.36     | 97.31     |

## Reference
To cite the code/data/paper, please use this BibTeX:
```bibtex
@article{park2020g2pm,
  author  = {Park, Kyubyong and Lee, Seanie},
  title   = {A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset},
  journal = {Proc. Interspeech 2020},
  url     = {https://arxiv.org/abs/2004.03136},
  year    = {2020}
}
```

%package -n python3-g2pM
Summary:	g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese
Provides:	python-g2pM
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip

%description -n python3-g2pM
# g2pM
[![Release](https://img.shields.io/badge/release-v0.1.2.5-green)](https://pypi.org/project/g2pM/)
[![Downloads](https://pepy.tech/badge/g2pm)](https://pepy.tech/project/g2pm)
[![license](https://img.shields.io/badge/license-Apache%202.0-red)](https://github.com/kakaobrain/g2pM/blob/master/LICENSE)

This is the official repository of our paper [A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136) (**Interspeech 2020**).

## Install
```
pip install g2pM
```

## The CPP Dataset
The data folder contains [train/dev/test].sent and [train/dev/test].lb files. In a *.sent file, each line corresponds to one sentence, and a special symbol ▁ (U+2581) is placed to the left and right of one polyphonic character. The pronunciation of that character is on the same line of the corresponding *.lb file. A sentence may contain several polyphonic characters, but we randomly choose only one of them to annotate.

## Requirements
* python >= 3.6
* numpy

## Usage
If you want to remove the digits that denote tones, set tone=False; the default is tone=True.
If you want to split non-Chinese characters (e.g. digits) into single characters, set char_split=True; the default is char_split=False.
```
>>> from g2pM import G2pM
>>> model = G2pM()
>>> sentence = "然而,他红了20年以后,他竟退出了大家的视线。"
>>> model(sentence, tone=True, char_split=False)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
>>> model(sentence, tone=False, char_split=False)
['ran', 'er', ',', 'ta', 'hong', 'le', '2', '0', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']
>>> model(sentence, tone=True, char_split=True)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '2', '0', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
```

## Model Size
| Layer                 | Size    |
|-----------------------|---------|
| Embedding             | 64      |
| LSTM x1               | 64      |
| Fully-Connected x2    | 64      |
| Total # of parameters | 477,228 |
| Model size            | 1.7 MB  |
| Package size          | 2.1 MB  |

## Evaluation Result
Accuracy (%) on the dev and test sets:

| Model             | Dev.      | Test      |
| :---------------- | --------: | :-------: |
| g2pC              | 84.84     | 84.45     |
| xpinyin (0.5.6)   | 78.74     | 78.56     |
| pypinyin (0.36.0) | 85.44     | 86.13     |
| Majority Vote     | 92.15     | 92.08     |
| Chinese BERT      | **97.95** | **97.85** |
| Ours              | 97.36     | 97.31     |

## Reference
To cite the code/data/paper, please use this BibTeX:
```bibtex
@article{park2020g2pm,
  author  = {Park, Kyubyong and Lee, Seanie},
  title   = {A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset},
  journal = {Proc. Interspeech 2020},
  url     = {https://arxiv.org/abs/2004.03136},
  year    = {2020}
}
```

%package help
Summary:	Development documents and examples for g2pM
Provides:	python3-g2pM-doc

%description help
# g2pM
[![Release](https://img.shields.io/badge/release-v0.1.2.5-green)](https://pypi.org/project/g2pM/)
[![Downloads](https://pepy.tech/badge/g2pm)](https://pepy.tech/project/g2pm)
[![license](https://img.shields.io/badge/license-Apache%202.0-red)](https://github.com/kakaobrain/g2pM/blob/master/LICENSE)

This is the official repository of our paper [A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136) (**Interspeech 2020**).

## Install
```
pip install g2pM
```

## The CPP Dataset
The data folder contains [train/dev/test].sent and [train/dev/test].lb files. In a *.sent file, each line corresponds to one sentence, and a special symbol ▁ (U+2581) is placed to the left and right of one polyphonic character. The pronunciation of that character is on the same line of the corresponding *.lb file. A sentence may contain several polyphonic characters, but we randomly choose only one of them to annotate.

## Requirements
* python >= 3.6
* numpy

## Usage
If you want to remove the digits that denote tones, set tone=False; the default is tone=True.
If you want to split non-Chinese characters (e.g. digits) into single characters, set char_split=True; the default is char_split=False.
```
>>> from g2pM import G2pM
>>> model = G2pM()
>>> sentence = "然而,他红了20年以后,他竟退出了大家的视线。"
>>> model(sentence, tone=True, char_split=False)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
>>> model(sentence, tone=False, char_split=False)
['ran', 'er', ',', 'ta', 'hong', 'le', '2', '0', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']
>>> model(sentence, tone=True, char_split=True)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '2', '0', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
```

## Model Size
| Layer                 | Size    |
|-----------------------|---------|
| Embedding             | 64      |
| LSTM x1               | 64      |
| Fully-Connected x2    | 64      |
| Total # of parameters | 477,228 |
| Model size            | 1.7 MB  |
| Package size          | 2.1 MB  |

## Evaluation Result
Accuracy (%) on the dev and test sets:

| Model             | Dev.      | Test      |
| :---------------- | --------: | :-------: |
| g2pC              | 84.84     | 84.45     |
| xpinyin (0.5.6)   | 78.74     | 78.56     |
| pypinyin (0.36.0) | 85.44     | 86.13     |
| Majority Vote     | 92.15     | 92.08     |
| Chinese BERT      | **97.95** | **97.85** |
| Ours              | 97.36     | 97.31     |

## Reference
To cite the code/data/paper, please use this BibTeX:
```bibtex
@article{park2020g2pm,
  author  = {Park, Kyubyong and Lee, Seanie},
  title   = {A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset},
  journal = {Proc. Interspeech 2020},
  url     = {https://arxiv.org/abs/2004.03136},
  year    = {2020}
}
```

%prep
%autosetup -n g2pM-0.1.2.5

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-g2pM -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Thu Jun 08 2023 Python_Bot - 0.1.2.5-1
- Package Spec generated