Diffstat (limited to 'python-expanda.spec')
-rw-r--r-- | python-expanda.spec | 501 |
1 file changed, 501 insertions, 0 deletions
diff --git a/python-expanda.spec b/python-expanda.spec
new file mode 100644
index 0000000..39b2f2b
--- /dev/null
+++ b/python-expanda.spec
@@ -0,0 +1,501 @@
+%global _empty_manifest_terminate_build 0
+Name: python-Expanda
+Version: 1.3.1
+Release: 1
+Summary: Integrated Corpus-Building Environment
+License: Apache-2.0
+URL: https://github.com/affjljoo3581/Expanda
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d1/3f/37d91da7db21350d7e4885070c746819f0c6500d52bbcdf64d31a5d86eda/Expanda-1.3.1.tar.gz
+BuildArch: noarch
+
+Requires: python3-nltk
+Requires: python3-ijson
+Requires: python3-tqdm
+Requires: python3-mwparserfromhell
+Requires: python3-tokenizers
+Requires: python3-kss
+
+%description
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[PyPI version](https://badge.fury.io/py/Expanda)
+
+[Documentation status](https://expanda.readthedocs.io/en/latest/?badge=latest)
+
+[codecov](https://codecov.io/gh/affjljoo3581/Expanda)
+[CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment**. Building a
+corpus dataset requires several complicated stages, such as parsing,
+shuffling, and tokenization, and corpora gathered from different sources
+arrive in formats that each need their own parsing. Expanda provides
+integrated pipelines for all of these stages, so a corpus can be built in
+a single pass from one build configuration.
+
+For more information, see the
+[documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written
+  in Python)
+* Supports multiprocessing
+* Extremely low memory usage
+* No need to write new code for each corpus; adding one takes a single
+  configuration line.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset with Expanda. First of all, install
+Expanda:
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then, download a Wikipedia dump file from
+[here](https://dumps.wikimedia.org/). In this example, we are going to
+test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`. Alternatively, run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration
+file. Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+                 </s>
+                 <pad>
+
+[build]
+input-files =
+    --expanda.ext.wikipedia src/wiki.xml.bz2
+```
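+Each line under `input-files` pairs an extension with a source file,
+which is why adding another corpus costs exactly one configuration line.
+As a hedged sketch, the `[build]` section with a second, hypothetical
+dump would look like this:
+```ini
+[build]
+; src/wiki2.xml.bz2 is a hypothetical second dump, processed by the same
+; extension; it is not part of this walkthrough.
+input-files =
+    --expanda.ext.wikipedia src/wiki.xml.bz2
+    --expanda.ext.wikipedia src/wiki2.xml.bz2
+```
+The walkthrough below continues with the single-file configuration.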
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda with:
+```console
+$ expanda build
+```
+The build produces output like the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│   ├── corpus.raw.txt
+│   ├── corpus.train.txt
+│   ├── corpus.test.txt
+│   └── vocab.txt
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
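+The build artifacts are plain text files, so a quick sanity check needs
+nothing beyond standard shell tools. A minimal sketch, assuming only the
+file names shown in the tree above:
+```console
+$ wc -l build/corpus.train.txt build/corpus.test.txt
+$ head -n 5 build/vocab.txt
+```
+The line counts show how the corpus was split between train and test,
+and the head of `vocab.txt` lists the first few tokens learned by the
+tokenizer.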
+
+
+
+
+%package -n python3-Expanda
+Summary: Integrated Corpus-Building Environment
+Provides: python-Expanda
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-Expanda
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[PyPI version](https://badge.fury.io/py/Expanda)
+
+[Documentation status](https://expanda.readthedocs.io/en/latest/?badge=latest)
+
+[codecov](https://codecov.io/gh/affjljoo3581/Expanda)
+[CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment**. Building a
+corpus dataset requires several complicated stages, such as parsing,
+shuffling, and tokenization, and corpora gathered from different sources
+arrive in formats that each need their own parsing. Expanda provides
+integrated pipelines for all of these stages, so a corpus can be built in
+a single pass from one build configuration.
+
+For more information, see the
+[documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written
+  in Python)
+* Supports multiprocessing
+* Extremely low memory usage
+* No need to write new code for each corpus; adding one takes a single
+  configuration line.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset with Expanda. First of all, install
+Expanda:
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then, download a Wikipedia dump file from
+[here](https://dumps.wikimedia.org/). In this example, we are going to
+test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`. Alternatively, run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration
+file. Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+                 </s>
+                 <pad>
+
+[build]
+input-files =
+    --expanda.ext.wikipedia src/wiki.xml.bz2
+```
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda with:
+```console
+$ expanda build
+```
+The build produces output like the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│   ├── corpus.raw.txt
+│   ├── corpus.train.txt
+│   ├── corpus.test.txt
+│   └── vocab.txt
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
+
+
+
+
+%package help
+Summary: Development documents and examples for Expanda
+Provides: python3-Expanda-doc
+%description help
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[PyPI version](https://badge.fury.io/py/Expanda)
+
+[Documentation status](https://expanda.readthedocs.io/en/latest/?badge=latest)
+
+[codecov](https://codecov.io/gh/affjljoo3581/Expanda)
+[CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment**. Building a
+corpus dataset requires several complicated stages, such as parsing,
+shuffling, and tokenization, and corpora gathered from different sources
+arrive in formats that each need their own parsing. Expanda provides
+integrated pipelines for all of these stages, so a corpus can be built in
+a single pass from one build configuration.
+
+For more information, see the
+[documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written
+  in Python)
+* Supports multiprocessing
+* Extremely low memory usage
+* No need to write new code for each corpus; adding one takes a single
+  configuration line.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset with Expanda. First of all, install
+Expanda:
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then, download a Wikipedia dump file from
+[here](https://dumps.wikimedia.org/). In this example, we are going to
+test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`. Alternatively, run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration
+file. Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+                 </s>
+                 <pad>
+
+[build]
+input-files =
+    --expanda.ext.wikipedia src/wiki.xml.bz2
+```
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda with:
+```console
+$ expanda build
+```
+The build produces output like the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│   ├── corpus.raw.txt
+│   ├── corpus.train.txt
+│   ├── corpus.test.txt
+│   └── vocab.txt
+├── src
+│   └── wiki.xml.bz2
+└── expanda.cfg
+```
+
+
+
+
+%prep
+%autosetup -n Expanda-1.3.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+# Collect every installed file under the buildroot into filelist.lst,
+# which the files list below consumes via the -f flag.
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	# Man pages are compressed during the build, hence the .gz suffix.
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-Expanda -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.3.1-1
+- Package Spec generated