-rw-r--r--  .gitignore              1
-rw-r--r--  python-word2word.spec   490
-rw-r--r--  sources                 1
3 files changed, 492 insertions(+), 0 deletions(-)
diff --git a/.gitignore b/.gitignore
index e69de29..5ff5cb4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/word2word-1.0.0.tar.gz
diff --git a/python-word2word.spec b/python-word2word.spec
new file mode 100644
index 0000000..2289dee
--- /dev/null
+++ b/python-word2word.spec
@@ -0,0 +1,490 @@
+%global _empty_manifest_terminate_build 0
+Name: python-word2word
+Version: 1.0.0
+Release: 1
+Summary: Easy-to-use word translations for 3,564 language pairs
+License: Apache License 2.0
+URL: https://github.com/kakaobrain/word2word
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/1b/c8/6aa4d029236e5e021552ccaa6a01daadaf3d9a4b5b8f9babfb73db589134/word2word-1.0.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-requests
+Requires: python3-wget
+Requires: python3-numpy
+Requires: python3-tqdm
+
+%description
+[![image](https://img.shields.io/pypi/v/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/l/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/pypi/pyversions/word2word.svg)](https://pypi.org/project/word2word/)
+[![image](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/kimdwkimdw)
+
+# word2word
+
+Easy-to-use word translations for 3,564 language pairs.
+
+This is the official code accompanying [our LREC 2020 paper](https://arxiv.org/abs/1911.12019).
+
+## Summary
+
+* A large collection of freely & publicly available bilingual lexicons
+ **for 3,564 language pairs across 62 unique languages.**
+* Easy-to-use Python interface for accessing top-k word translations and
+ for building a new bilingual lexicon from a custom parallel corpus.
+* Constructed using a simple approach that yields bilingual lexicons with
+ high coverage and competitive translation quality.
+
+## Usage
+
+First, install the package using `pip`:
+```shell script
+pip install word2word
+```
+
+OR
+
+```shell script
+git clone https://github.com/kakaobrain/word2word
+python setup.py install
+```
+
+Then, in Python, download the model and retrieve the top-5 word translations
+of a given word in the desired target language:
+```python
+from word2word import Word2word
+en2fr = Word2word("en", "fr")
+print(en2fr("apple"))
+# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
+```
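+
+The lookup is not limited to five candidates. Below is a minimal sketch of
+requesting a different number of translations; it assumes the `n_best`
+keyword argument from the PyPI release (check your installed version):
+```python
+from word2word import Word2word
+
+en2fr = Word2word("en", "fr")
+# `n_best` is assumed to control how many candidates are returned.
+print(en2fr("apple", n_best=3))
+# expected: the first three entries of the top-5 list shown above
+```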
+
+![gif](./word2word.gif)
+
+## Supported Languages
+
+We provide top-k word-to-word translations across all available pairs
+ from [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php).
+This amounts to a total of 3,564 language pairs across 62 unique languages.
+
+The full list is provided [here](word2word/supporting_languages.txt).
+
+## Methodology
+
+Our approach computes top-k word translations based on
+the co-occurrence statistics between cross-lingual word pairs in a parallel corpus.
+We additionally introduce a correction term that controls for any confounding effect
+coming from other source words within the same sentence.
+The result is an efficient, scalable approach for constructing
+large bilingual dictionaries from any given parallel corpus.
+
+For more details, see the Methodology section of [our paper](https://arxiv.org/abs/1911.12019).
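+
+As a rough, self-contained illustration of the co-occurrence step only
+(a toy sketch with made-up data; the actual estimator also applies the
+correction term described above, so treat this as intuition rather than
+the paper's method):
+```python
+from collections import Counter
+from itertools import product
+
+# Tiny toy parallel corpus: (source sentence, target sentence) pairs.
+pairs = [
+    ("i ate an apple", "j'ai mangé une pomme"),
+    ("the apple is red", "la pomme est rouge"),
+]
+
+# Count how often each (source word, target word) pair co-occurs
+# within an aligned sentence pair.
+cooc = Counter()
+for src, tgt in pairs:
+    for s, t in product(src.split(), tgt.split()):
+        cooc[(s, t)] += 1
+
+def top_k(word, k=5):
+    """Rank target words by raw co-occurrence with `word`."""
+    scores = {t: c for (s, t), c in cooc.items() if s == word}
+    return sorted(scores, key=scores.get, reverse=True)[:k]
+
+print(top_k("apple"))  # 'pomme' ranks first (it co-occurs twice)
+```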
+
+
+## Building a Bilingual Lexicon on a Custom Parallel Corpus
+
+The `word2word` package also provides an interface for
+building a custom bilingual lexicon from a different parallel corpus.
+Here, we show an example of building one from
+the [Medline English-French dataset](https://drive.google.com/drive/folders/0B3UxRWA52hBjQjZmYlRZWHQ4SUE):
+```python
+from word2word import Word2word
+
+# custom parallel data: data/pubmed.en-fr.en, data/pubmed.en-fr.fr
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr")
+# ...building...
+print(my_en2fr("mitochondrial"))
+# out: ['mitochondriale', 'mitochondriales', 'mitochondrial',
+# 'cytopathies', 'mitochondriaux']
+```
+
+If word2word was installed from source, the bilingual lexicon can also be built from the command line as follows:
+```shell script
+python make.py --lang1 en --lang2 fr --datapref data/pubmed.en-fr
+```
+
+In both cases, the custom lexicon (saved under the `datapref` directory by default) can be reloaded in Python:
+```python
+from word2word import Word2word
+my_en2fr = Word2word.load("en", "fr", "data/pubmed.en-fr")
+# Loaded word2word custom bilingual lexicon from data/pubmed.en-fr/en-fr.pkl
+```
+
+### Multiprocessing
+
+In both the Python interface and the command line interface,
+`make` uses multiprocessing with 16 CPUs by default.
+The number of CPU workers can be adjusted by setting
+`num_workers=N` (Python) or `--num_workers N` (command line).
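+
+For example, to cap the pool at four worker processes (a sketch reusing
+the paths from the example above):
+```python
+from word2word import Word2word
+
+# num_workers caps the size of the multiprocessing pool (default: 16).
+my_en2fr = Word2word.make("en", "fr", "data/pubmed.en-fr", num_workers=4)
+```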
+
+## References
+
+If you use word2word for research, please cite [our paper](https://arxiv.org/abs/1911.12019):
+```bibtex
+@inproceedings{choe2020word2word,
+ author = {Yo Joong Choe and Kyubyong Park and Dongwoo Kim},
+ title = {word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs},
+ booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)},
+ year = {2020}
+}
+```
+
+All of our pre-computed bilingual lexicons were constructed from the publicly available
+ [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset:
+```bibtex
+@inproceedings{lison-etal-2018-opensubtitles2018,
+ title = "{O}pen{S}ubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora",
+ author = {Lison, Pierre and
+ Tiedemann, J{\"o}rg and
+ Kouylekov, Milen},
+ booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
+ month = may,
+ year = "2018",
+ address = "Miyazaki, Japan",
+ publisher = "European Language Resources Association (ELRA)",
+ url = "https://www.aclweb.org/anthology/L18-1275",
+}
+```
+
+## Authors
+
+[Kyubyong Park](https://github.com/Kyubyong),
+[Dongwoo Kim](https://github.com/kimdwkimdw), and
+[YJ Choe](https://github.com/yjchoe)
+
+%package -n python3-word2word
+Summary: Easy-to-use word translations for 3,564 language pairs
+Provides: python-word2word
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-word2word
+Easy-to-use word translations for 3,564 language pairs. This is the
+Python 3 subpackage for word2word; see the base package description
+above for full details.
+
+%package help
+Summary: Development documents and examples for word2word
+Provides: python3-word2word-doc
+%description help
+Development documents and examples for word2word. See the base
+package description above for full details.
+
+%prep
+%autosetup -n word2word-1.0.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-word2word -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.0-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..657b5f6
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+1e2379056df7790c4e117e8e92a9347a word2word-1.0.0.tar.gz