From 765383f872ddff69bfbe8e09cfa76b486e817fe3 Mon Sep 17 00:00:00 2001
From: CoprDistGit
Date: Thu, 18 May 2023 05:48:45 +0000
Subject: automatic import of python-word2word
---
 .gitignore            |   1 +
 python-word2word.spec | 490 ++++++++++++++++++++++++++++++++++++++++++++++++++
 sources               |   1 +
 3 files changed, 492 insertions(+)
 create mode 100644 python-word2word.spec
 create mode 100644 sources

diff --git a/.gitignore b/.gitignore
index e69de29..5ff5cb4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/word2word-1.0.0.tar.gz
diff --git a/python-word2word.spec b/python-word2word.spec
new file mode 100644
index 0000000..2289dee
--- /dev/null
+++ b/python-word2word.spec
@@ -0,0 +1,490 @@
+%global _empty_manifest_terminate_build 0
+Name:           python-word2word
+Version:        1.0.0
+Release:        1
+Summary:        Easy-to-use word translations for 3,564 language pairs
+License:        Apache License 2.0
+URL:            https://github.com/kakaobrain/word2word
+Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/1b/c8/6aa4d029236e5e021552ccaa6a01daadaf3d9a4b5b8f9babfb73db589134/word2word-1.0.0.tar.gz
+BuildArch:      noarch
+
+Requires:       python3-requests
+Requires:       python3-wget
+Requires:       python3-numpy
+Requires:       python3-tqdm
+
+%description
+[![image](https://img.shields.io/pypi/v/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/l/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/pyversions/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/kimdwkimdw)
+
+# word2word
+
+Easy-to-use word translations for 3,564 language pairs.
+
+This is the official code accompanying [our LREC 2020 paper](https://arxiv.org/abs/1911.12019).
+
+## Summary
+
+* A large collection of freely & publicly available bilingual lexicons
+  **for 3,564 language pairs across 62 unique languages.**
+* Easy-to-use Python interface for accessing top-k word translations and
+  for building a new bilingual lexicon from a custom parallel corpus.
+* Constructed using a simple approach that yields bilingual lexicons with
+  high coverage and competitive translation quality.
+
+## Usage
+
+First, install the package using `pip`:
+```shell script
+pip install word2word
+```
+
+OR
+
+```shell script
+git clone https://github.com/kakaobrain/word2word
+python setup.py install
+```
+
+Then, in Python, download the model and retrieve the top-5 word translations
+of any given word in the desired language:
+```python
+from word2word import Word2word
+en2fr = Word2word("en", "fr")
+print(en2fr("apple"))
+# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
+```
+
+![gif](./word2word.gif)
+
+## Supported Languages
+
+We provide top-k word-to-word translations across all available pairs
+ from [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php).
+This amounts to a total of 3,564 language pairs across 62 unique languages.
+
+The full list is provided [here](word2word/supporting_languages.txt).
+
+## Methodology
+
+Our approach computes top-k word translations based on
+the co-occurrence statistics between cross-lingual word pairs in a parallel corpus.
+We additionally introduce a correction term that controls for any confounding effect
+coming from other source words within the same sentence.
+The resulting method is an efficient and scalable approach that allows us to
+construct large bilingual dictionaries from any given parallel corpus.
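+
+As a rough illustration of the counting idea, here is a minimal sketch — not
+the library's implementation: the correction term is omitted, and the corpus,
+function, and variable names are made up for this example — of ranking
+candidate translations by co-occurrence counts:
+```python
+from collections import Counter, defaultdict
+
+def topk_translations(pairs, k=5):
+    """Rank target words by co-occurrence with each source word.
+
+    pairs: iterable of (source_tokens, target_tokens) sentence pairs.
+    """
+    cooc = defaultdict(Counter)  # cooc[x][y]: sentences containing both x and y
+    for src, tgt in pairs:
+        for x in set(src):
+            for y in set(tgt):
+                cooc[x][y] += 1
+    # For a fixed source word x, ranking by raw counts is equivalent to
+    # ranking by the conditional probability p(y|x) = count(x, y) / count(x).
+    return {x: [y for y, _ in ys.most_common(k)] for x, ys in cooc.items()}
+
+toy_corpus = [
+    ("i ate an apple".split(), "j ai mangé une pomme".split()),
+    ("an apple a day".split(), "une pomme par jour".split()),
+]
+print(topk_translations(toy_corpus)["apple"])  # 'pomme' and 'une' rank first
+```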
+
+For more details, see the Methodology section of [our paper](https://arxiv.org/abs/1911.12019).
+
+
+## Building a Bilingual Lexicon on a Custom Parallel Corpus
+
+The `word2word` package also provides an interface for
+building a custom bilingual lexicon using a different parallel corpus.
+Here, we show an example of building one from
+the [Medline English-French dataset](https://drive.google.com/drive/folders/0B3UxRWA52hBjQjZmYlRZWHQ4SUE):
+```python
+from word2word import Word2word
+
+# custom parallel data: data/pubmed.en-fr.en, data/pubmed.en-fr.fr
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr")
+# ...building...
+print(my_en2fr("mitochondrial"))
+# out: ['mitochondriale', 'mitochondriales', 'mitochondrial',
+#       'cytopathies', 'mitochondriaux']
+```
+
+When built from source, the bilingual lexicon can also be constructed from the command line as follows:
+```shell script
+python make.py --lang1 en --lang2 fr --datapref data/pubmed.en-fr
+```
+
+In both cases, the custom lexicon (saved to `datapref/` by default) can be re-loaded in Python:
+```python
+from word2word import Word2word
+my_en2fr = Word2word.load("en", "fr", "data/pubmed.en-fr")
+# Loaded word2word custom bilingual lexicon from data/pubmed.en-fr/en-fr.pkl
+```
+
+### Multiprocessing
+
+In both the Python interface and the command line interface,
+`make` uses multiprocessing with 16 CPUs by default.
+The number of CPU workers can be adjusted by setting
+`num_workers=N` (Python) or `--num_workers N` (command line).
+
+## References
+
+If you use word2word for research, please cite [our paper](https://arxiv.org/abs/1911.12019):
+```bibtex
+@inproceedings{choe2020word2word,
+  author = {Yo Joong Choe and Kyubyong Park and Dongwoo Kim},
+  title = {word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs},
+  booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)},
+  year = {2020}
+}
+```
+
+All of our pre-computed bilingual lexicons were constructed from the publicly available
+ [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset:
+```bibtex
+@inproceedings{lison-etal-2018-opensubtitles2018,
+    title = "{O}pen{S}ubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora",
+    author = {Lison, Pierre and
+      Tiedemann, J{\"o}rg and
+      Kouylekov, Milen},
+    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
+    month = may,
+    year = "2018",
+    address = "Miyazaki, Japan",
+    publisher = "European Language Resources Association (ELRA)",
+    url = "https://www.aclweb.org/anthology/L18-1275",
+}
+```
+
+## Authors
+
+[Kyubyong Park](https://github.com/Kyubyong),
+[Dongwoo Kim](https://github.com/kimdwkimdw), and
+[YJ Choe](https://github.com/yjchoe)
+
+
+
+
+
+%package -n python3-word2word
+Summary:        Easy-to-use word translations for 3,564 language pairs
+Provides:       python-word2word
+BuildRequires:  python3-devel
+BuildRequires:  python3-setuptools
+BuildRequires:  python3-pip
+%description -n python3-word2word
+[![image](https://img.shields.io/pypi/v/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/l/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/pyversions/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/kimdwkimdw)
+
+# word2word
+
+Easy-to-use word translations for 3,564 language pairs.
+
+This is the official code accompanying [our LREC 2020 paper](https://arxiv.org/abs/1911.12019).
+
+## Summary
+
+* A large collection of freely & publicly available bilingual lexicons
+  **for 3,564 language pairs across 62 unique languages.**
+* Easy-to-use Python interface for accessing top-k word translations and
+  for building a new bilingual lexicon from a custom parallel corpus.
+* Constructed using a simple approach that yields bilingual lexicons with
+  high coverage and competitive translation quality.
+
+## Usage
+
+First, install the package using `pip`:
+```shell script
+pip install word2word
+```
+
+OR
+
+```shell script
+git clone https://github.com/kakaobrain/word2word
+python setup.py install
+```
+
+Then, in Python, download the model and retrieve the top-5 word translations
+of any given word in the desired language:
+```python
+from word2word import Word2word
+en2fr = Word2word("en", "fr")
+print(en2fr("apple"))
+# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
+```
+
+![gif](./word2word.gif)
+
+## Supported Languages
+
+We provide top-k word-to-word translations across all available pairs
+ from [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php).
+This amounts to a total of 3,564 language pairs across 62 unique languages.
+
+The full list is provided [here](word2word/supporting_languages.txt).
+
+## Methodology
+
+Our approach computes top-k word translations based on
+the co-occurrence statistics between cross-lingual word pairs in a parallel corpus.
+We additionally introduce a correction term that controls for any confounding effect
+coming from other source words within the same sentence.
+The resulting method is an efficient and scalable approach that allows us to
+construct large bilingual dictionaries from any given parallel corpus.
+
+For more details, see the Methodology section of [our paper](https://arxiv.org/abs/1911.12019).
+
+
+## Building a Bilingual Lexicon on a Custom Parallel Corpus
+
+The `word2word` package also provides an interface for
+building a custom bilingual lexicon using a different parallel corpus.
+Here, we show an example of building one from
+the [Medline English-French dataset](https://drive.google.com/drive/folders/0B3UxRWA52hBjQjZmYlRZWHQ4SUE):
+```python
+from word2word import Word2word
+
+# custom parallel data: data/pubmed.en-fr.en, data/pubmed.en-fr.fr
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr")
+# ...building...
+print(my_en2fr("mitochondrial"))
+# out: ['mitochondriale', 'mitochondriales', 'mitochondrial',
+#       'cytopathies', 'mitochondriaux']
+```
+
+When built from source, the bilingual lexicon can also be constructed from the command line as follows:
+```shell script
+python make.py --lang1 en --lang2 fr --datapref data/pubmed.en-fr
+```
+
+In both cases, the custom lexicon (saved to `datapref/` by default) can be re-loaded in Python:
+```python
+from word2word import Word2word
+my_en2fr = Word2word.load("en", "fr", "data/pubmed.en-fr")
+# Loaded word2word custom bilingual lexicon from data/pubmed.en-fr/en-fr.pkl
+```
+
+### Multiprocessing
+
+In both the Python interface and the command line interface,
+`make` uses multiprocessing with 16 CPUs by default.
+The number of CPU workers can be adjusted by setting
+`num_workers=N` (Python) or `--num_workers N` (command line).
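+
+For example, the worker count can be lowered on machines with fewer CPUs — a
+usage sketch based on the documented `num_workers` option, reusing the
+`data/pubmed.en-fr` corpus prefix from the example above:
+```python
+from word2word import Word2word
+
+# build the custom lexicon with 4 worker processes instead of the default 16
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr", num_workers=4)
+```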
+
+## References
+
+If you use word2word for research, please cite [our paper](https://arxiv.org/abs/1911.12019):
+```bibtex
+@inproceedings{choe2020word2word,
+  author = {Yo Joong Choe and Kyubyong Park and Dongwoo Kim},
+  title = {word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs},
+  booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)},
+  year = {2020}
+}
+```
+
+All of our pre-computed bilingual lexicons were constructed from the publicly available
+ [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset:
+```bibtex
+@inproceedings{lison-etal-2018-opensubtitles2018,
+    title = "{O}pen{S}ubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora",
+    author = {Lison, Pierre and
+      Tiedemann, J{\"o}rg and
+      Kouylekov, Milen},
+    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
+    month = may,
+    year = "2018",
+    address = "Miyazaki, Japan",
+    publisher = "European Language Resources Association (ELRA)",
+    url = "https://www.aclweb.org/anthology/L18-1275",
+}
+```
+
+## Authors
+
+[Kyubyong Park](https://github.com/Kyubyong),
+[Dongwoo Kim](https://github.com/kimdwkimdw), and
+[YJ Choe](https://github.com/yjchoe)
+
+
+
+
+
+%package help
+Summary:        Development documents and examples for word2word
+Provides:       python3-word2word-doc
+%description help
+[![image](https://img.shields.io/pypi/v/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/l/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/pyversions/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/kimdwkimdw)
+
+# word2word
+
+Easy-to-use word translations for 3,564 language pairs.
+
+This is the official code accompanying [our LREC 2020 paper](https://arxiv.org/abs/1911.12019).
+
+## Summary
+
+* A large collection of freely & publicly available bilingual lexicons
+  **for 3,564 language pairs across 62 unique languages.**
+* Easy-to-use Python interface for accessing top-k word translations and
+  for building a new bilingual lexicon from a custom parallel corpus.
+* Constructed using a simple approach that yields bilingual lexicons with
+  high coverage and competitive translation quality.
+
+## Usage
+
+First, install the package using `pip`:
+```shell script
+pip install word2word
+```
+
+OR
+
+```shell script
+git clone https://github.com/kakaobrain/word2word
+python setup.py install
+```
+
+Then, in Python, download the model and retrieve the top-5 word translations
+of any given word in the desired language:
+```python
+from word2word import Word2word
+en2fr = Word2word("en", "fr")
+print(en2fr("apple"))
+# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
+```
+
+![gif](./word2word.gif)
+
+## Supported Languages
+
+We provide top-k word-to-word translations across all available pairs
+ from [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php).
+This amounts to a total of 3,564 language pairs across 62 unique languages.
+
+The full list is provided [here](word2word/supporting_languages.txt).
+
+## Methodology
+
+Our approach computes top-k word translations based on
+the co-occurrence statistics between cross-lingual word pairs in a parallel corpus.
+We additionally introduce a correction term that controls for any confounding effect
+coming from other source words within the same sentence.
+The resulting method is an efficient and scalable approach that allows us to
+construct large bilingual dictionaries from any given parallel corpus.
+
+For more details, see the Methodology section of [our paper](https://arxiv.org/abs/1911.12019).
+
+
+## Building a Bilingual Lexicon on a Custom Parallel Corpus
+
+The `word2word` package also provides an interface for
+building a custom bilingual lexicon using a different parallel corpus.
+Here, we show an example of building one from
+the [Medline English-French dataset](https://drive.google.com/drive/folders/0B3UxRWA52hBjQjZmYlRZWHQ4SUE):
+```python
+from word2word import Word2word
+
+# custom parallel data: data/pubmed.en-fr.en, data/pubmed.en-fr.fr
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr")
+# ...building...
+print(my_en2fr("mitochondrial"))
+# out: ['mitochondriale', 'mitochondriales', 'mitochondrial',
+#       'cytopathies', 'mitochondriaux']
+```
+
+When built from source, the bilingual lexicon can also be constructed from the command line as follows:
+```shell script
+python make.py --lang1 en --lang2 fr --datapref data/pubmed.en-fr
+```
+
+In both cases, the custom lexicon (saved to `datapref/` by default) can be re-loaded in Python:
+```python
+from word2word import Word2word
+my_en2fr = Word2word.load("en", "fr", "data/pubmed.en-fr")
+# Loaded word2word custom bilingual lexicon from data/pubmed.en-fr/en-fr.pkl
+```
+
+### Multiprocessing
+
+In both the Python interface and the command line interface,
+`make` uses multiprocessing with 16 CPUs by default.
+The number of CPU workers can be adjusted by setting
+`num_workers=N` (Python) or `--num_workers N` (command line).
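+
+For instance, from the command line — reusing the documented `--num_workers`
+flag and the example corpus prefix from above:
+```shell script
+python make.py --lang1 en --lang2 fr --datapref data/pubmed.en-fr --num_workers 4
+```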
+ +## References + +If you use word2word for research, please cite [our paper](https://arxiv.org/abs/1911.12019): +```bibtex +@inproceedings{choe2020word2word, + author = {Yo Joong Choe and Kyubyong Park and Dongwoo Kim}, + title = {word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs}, + booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)}, + year = {2020} +} +``` + +All of our pre-computed bilingual lexicons were constructed from the publicly available + [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset: +```bibtex +@inproceedings{lison-etal-2018-opensubtitles2018, + title = "{O}pen{S}ubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora", + author = {Lison, Pierre and + Tiedemann, J{\"o}rg and + Kouylekov, Milen}, + booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)", + month = may, + year = "2018", + address = "Miyazaki, Japan", + publisher = "European Language Resources Association (ELRA)", + url = "https://www.aclweb.org/anthology/L18-1275", +} +``` + +## Authors + +[Kyubyong Park](https://github.com/Kyubyong), +[Dongwoo Kim](https://github.com/kimdwkimdw), and +[YJ Choe](https://github.com/yjchoe) + + + + + +%prep +%autosetup -n word2word-1.0.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-word2word -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Thu May 18 2023 Python_Bot - 1.0.0-1 +- Package Spec generated diff --git a/sources b/sources new file mode 100644 index 0000000..657b5f6 --- /dev/null +++ b/sources @@ -0,0 +1 @@ +1e2379056df7790c4e117e8e92a9347a word2word-1.0.0.tar.gz -- cgit v1.2.3