%global _empty_manifest_terminate_build 0
Name:		python-Expanda
Version:	1.3.1
Release:	1
Summary:	Integrated Corpus-Building Environment
License:	Apache-2.0
URL:		https://github.com/affjljoo3581/Expanda
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/d1/3f/37d91da7db21350d7e4885070c746819f0c6500d52bbcdf64d31a5d86eda/Expanda-1.3.1.tar.gz
BuildArch:	noarch

Requires:	python3-nltk
Requires:	python3-ijson
Requires:	python3-tqdm
Requires:	python3-mwparserfromhell
Requires:	python3-tokenizers
Requires:	python3-kss

%description
# Expanda

**The universal integrated corpus-building environment.**

[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)

## Introduction
**Expanda** is an **integrated corpus-building environment**. Building a corpus
dataset requires several complicated stages, such as parsing, shuffling, and
tokenization, and corpora gathered from different sources each arrive in their
own format, which ordinarily means writing a separate parser for each one.
Expanda builds the whole corpus in a single pass, driven by one build
configuration. For more information, see the
[documentation](https://expanda.readthedocs.io/en/latest/).

## Main Features
* Easy to use, and simple to extend with new extensions
* Manages the build environment systematically
* Fast builds through performance optimization (even though it is written in Python)
* Supports multiprocessing
* Extremely low memory usage
* No new code needed for each corpus; adding one is a single line in the build configuration

## Dependencies
* nltk
* ijson
* tqdm>=4.46.0
* mwparserfromhell>=0.5.4
* tokenizers>=0.7.0
* kss==1.3.1

## Installation
### With pip
Expanda can be installed with pip:

```console
$ pip install expanda
```

### From source
You can also install from source by cloning the repository and running:

```console
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
```

## Build your first dataset
Let's build a **Wikipedia** dataset with Expanda. First of all, install Expanda:

```console
$ pip install expanda
```

Next, create a workspace for the build:

```console
$ mkdir workspace
$ cd workspace
```

Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
In this example, we will test with
[part of the English wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
Download the file through your browser, move it to `workspace/src`, and rename
it to `wiki.xml.bz2`. Alternatively, run the following:

```console
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
```

After downloading the dump file, we need to set up the configuration file.
Create `expanda.cfg` with the following contents:

```ini
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
                 </s>
                 <pad>

[build]
input-files =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
```
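Since `expanda.cfg` uses standard INI syntax, you can sanity-check it before
building. The snippet below is just an optional convenience using Python's
stdlib `configparser`; it is not part of Expanda itself:

```python
# Optional sanity check (not part of Expanda): expanda.cfg is standard INI
# syntax, so the stdlib configparser can read it back before you build.
import configparser

config = configparser.ConfigParser()
config.read("expanda.cfg")

for section in config.sections():
    print(f"[{section}]")
    for key, value in config[section].items():
        print(f"  {key} = {value!r}")
```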
The current directory structure of `workspace` should now be as follows:

```
workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```

Now we are ready to build! Run Expanda:

```console
$ expanda build
```

You should see output similar to the following:

```
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files            ████████████████████ 100
[00:00:04] Tokenize words           ████████████████████ 405802 / 405802
[00:00:00] Count pairs              ████████████████████ 405802 / 405802
[00:00:01] Compute merges           ████████████████████ 6332 / 6332
[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
```
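The `optimum stride` and `buckets` lines above come from the shuffling stage,
which keeps memory usage low by never loading the whole corpus at once. The
sketch below illustrates the general bucket-shuffle technique implied by that
log; it is an assumption-laden illustration, not Expanda's actual code, and
`external_shuffle` is a hypothetical helper name:

```python
# Sketch of a bucket-based external shuffle (not Expanda's implementation):
# stream lines into random temporary buckets, shuffle each bucket in memory,
# then concatenate the buckets into the output file.
import os
import random
import tempfile

def external_shuffle(input_path: str, output_path: str, num_buckets: int = 34) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Phase 1: distribute every line into a randomly chosen bucket file.
        bucket_paths = [os.path.join(tmp, f"bucket{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in bucket_paths]
        with open(input_path, encoding="utf-8") as src:
            for line in src:
                random.choice(buckets).write(line)
        for f in buckets:
            f.close()

        # Phase 2: each bucket holds roughly 1/num_buckets of the corpus, so
        # it fits in memory; shuffle it and append it to the output.
        with open(output_path, "w", encoding="utf-8") as dst:
            for p in bucket_paths:
                with open(p, encoding="utf-8") as f:
                    lines = f.readlines()
                random.shuffle(lines)
                dst.writelines(lines)
```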
If the build succeeds, you will get the following directory tree:

```
workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```
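To quickly verify the artifacts, you can inspect the trained vocabulary and
the size of the train/test split with plain Python. This is a minimal sketch;
the paths assume the `workspace` layout shown above:

```python
# Minimal verification sketch; paths assume the workspace layout above.
with open("build/vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print("vocabulary size:", len(vocab))
print("first entries:", vocab[:10])

# Compare the sizes of the train and test portions of the corpus.
for split in ("build/corpus.train.txt", "build/corpus.test.txt"):
    with open(split, encoding="utf-8") as f:
        print(split, "lines:", sum(1 for _ in f))
```

As the build log indicates, `corpus.raw.txt` holds the untokenized text, while
the train and test files are produced from the tokenized corpus.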
%package -n python3-Expanda
Summary:	Integrated Corpus-Building Environment
Provides:	python-Expanda
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-Expanda
# Expanda

**The universal integrated corpus-building environment.**

[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)

## Introduction
**Expanda** is an **integrated corpus-building environment**. Building a corpus
dataset requires several complicated stages, such as parsing, shuffling, and
tokenization, and corpora gathered from different sources each arrive in their
own format, which ordinarily means writing a separate parser for each one.
Expanda builds the whole corpus in a single pass, driven by one build
configuration. For more information, see the
[documentation](https://expanda.readthedocs.io/en/latest/).

## Main Features
* Easy to use, and simple to extend with new extensions
* Manages the build environment systematically
* Fast builds through performance optimization (even though it is written in Python)
* Supports multiprocessing
* Extremely low memory usage
* No new code needed for each corpus; adding one is a single line in the build configuration

## Dependencies
* nltk
* ijson
* tqdm>=4.46.0
* mwparserfromhell>=0.5.4
* tokenizers>=0.7.0
* kss==1.3.1

## Installation
### With pip
Expanda can be installed with pip:

```console
$ pip install expanda
```

### From source
You can also install from source by cloning the repository and running:

```console
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
```

## Build your first dataset
Let's build a **Wikipedia** dataset with Expanda. First of all, install Expanda:

```console
$ pip install expanda
```

Next, create a workspace for the build:

```console
$ mkdir workspace
$ cd workspace
```

Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
In this example, we will test with
[part of the English wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
Download the file through your browser, move it to `workspace/src`, and rename
it to `wiki.xml.bz2`. Alternatively, run the following:

```console
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
```

After downloading the dump file, we need to set up the configuration file.
Create `expanda.cfg` with the following contents:

```ini
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
                 </s>
                 <pad>

[build]
input-files =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
```

The current directory structure of `workspace` should now be as follows:

```
workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```

Now we are ready to build! Run Expanda:

```console
$ expanda build
```

You should see output similar to the following:

```
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files            ████████████████████ 100
[00:00:04] Tokenize words           ████████████████████ 405802 / 405802
[00:00:00] Count pairs              ████████████████████ 405802 / 405802
[00:00:01] Compute merges           ████████████████████ 6332 / 6332
[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
```

If the build succeeds, you will get the following directory tree:

```
workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```

%package help
Summary:	Development documents and examples for Expanda
Provides:	python3-Expanda-doc
%description help
# Expanda

**The universal integrated corpus-building environment.**

[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)

## Introduction
**Expanda** is an **integrated corpus-building environment**. Building a corpus
dataset requires several complicated stages, such as parsing, shuffling, and
tokenization, and corpora gathered from different sources each arrive in their
own format, which ordinarily means writing a separate parser for each one.
Expanda builds the whole corpus in a single pass, driven by one build
configuration. For more information, see the
[documentation](https://expanda.readthedocs.io/en/latest/).

## Main Features
* Easy to use, and simple to extend with new extensions
* Manages the build environment systematically
* Fast builds through performance optimization (even though it is written in Python)
* Supports multiprocessing
* Extremely low memory usage
* No new code needed for each corpus; adding one is a single line in the build configuration

## Dependencies
* nltk
* ijson
* tqdm>=4.46.0
* mwparserfromhell>=0.5.4
* tokenizers>=0.7.0
* kss==1.3.1

## Installation
### With pip
Expanda can be installed with pip:

```console
$ pip install expanda
```

### From source
You can also install from source by cloning the repository and running:

```console
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
```

## Build your first dataset
Let's build a **Wikipedia** dataset with Expanda. First of all, install Expanda:

```console
$ pip install expanda
```

Next, create a workspace for the build:

```console
$ mkdir workspace
$ cd workspace
```

Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
In this example, we will test with
[part of the English wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
Download the file through your browser, move it to `workspace/src`, and rename
it to `wiki.xml.bz2`. Alternatively, run the following:

```console
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
```

After downloading the dump file, we need to set up the configuration file.
Create `expanda.cfg` with the following contents:

```ini
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
                 </s>
                 <pad>

[build]
input-files =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
```

The current directory structure of `workspace` should now be as follows:

```
workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```

Now we are ready to build! Run Expanda:

```console
$ expanda build
```

You should see output similar to the following:

```
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files            ████████████████████ 100
[00:00:04] Tokenize words           ████████████████████ 405802 / 405802
[00:00:00] Count pairs              ████████████████████ 405802 / 405802
[00:00:01] Compute merges           ████████████████████ 6332 / 6332
[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
```

If the build succeeds, you will get the following directory tree:

```
workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```
%prep
%autosetup -n Expanda-1.3.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-Expanda -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.3.1-1
- Package Spec generated