path: root/python-expanda.spec
Diffstat (limited to 'python-expanda.spec')
-rw-r--r--  python-expanda.spec  501
1 file changed, 501 insertions(+), 0 deletions(-)
diff --git a/python-expanda.spec b/python-expanda.spec
new file mode 100644
index 0000000..39b2f2b
--- /dev/null
+++ b/python-expanda.spec
@@ -0,0 +1,501 @@
+%global _empty_manifest_terminate_build 0
+Name: python-Expanda
+Version: 1.3.1
+Release: 1
+Summary: Integrated Corpus-Building Environment
+License: Apache-2.0
+URL: https://github.com/affjljoo3581/Expanda
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d1/3f/37d91da7db21350d7e4885070c746819f0c6500d52bbcdf64d31a5d86eda/Expanda-1.3.1.tar.gz
+BuildArch: noarch
+
+Requires: python3-nltk
+Requires: python3-ijson
+Requires: python3-tqdm
+Requires: python3-mwparserfromhell
+Requires: python3-tokenizers
+Requires: python3-kss
+
+%description
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
+![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
+[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
+![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
+[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
+[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment** that provides
+integrated pipelines for building a corpus dataset. Building a corpus dataset
+requires several complicated steps, such as parsing, shuffling, and
+tokenization, and when corpora are gathered from different sources, parsing
+their various formats becomes a problem. Expanda lets you build a corpus in
+a single pass, driven by one build configuration file.
+
+For more information, see also [documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written in Python)
+* Supports multi-processing
+* Extremely low memory usage
+* No need to write new code for each corpus: just add one line to the
+  configuration for each new corpus.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
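After installing by either method, a quick way to confirm the install worked is to check that the module is importable and the console script is on your `PATH`. This is a minimal sketch; the import name `expanda` is an assumption based on the pip package name above.

```python
import importlib.util
import shutil

def is_importable(name: str) -> bool:
    """Return True if `name` can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

# "expanda" as the module name is assumed from the pip package name.
print(is_importable("expanda"))
# The console script installed by setup.py, if present, should be on PATH:
print(shutil.which("expanda"))
```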
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset using Expanda. First of all, install
+Expanda.
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build by running:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
+In this example, we are going to test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Either download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`, or run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration file.
+Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+ </s>
+ <pad>
+
+[build]
+input-files =
+ --expanda.ext.wikipedia src/wiki.xml.bz2
+```
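Expanda parses this file itself, but since it is standard INI syntax you can sanity-check it beforehand with Python's `configparser`. The sketch below just parses the configuration shown above; how Expanda interprets the values internally is not part of it.

```python
import configparser

# The same configuration as above, embedded as a string for illustration.
CFG = """
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
    </s>
    <pad>

[build]
input-files =
    --expanda.ext.wikipedia src/wiki.xml.bz2
"""

parser = configparser.ConfigParser()
parser.read_string(CFG)

# Indented continuation lines form one multi-line value.
control_tokens = parser["tokenization"]["control-tokens"].split()
print(control_tokens)                                 # ['<s>', '</s>', '<pad>']
print(parser["expanda.ext.wikipedia"]["num-cores"])   # 6
```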
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda by using:
+```console
+$ expanda build
+```
+You should see output similar to the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│ ├── corpus.raw.txt
+│ ├── corpus.train.txt
+│ ├── corpus.test.txt
+│ └── vocab.txt
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
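Once the build finishes, consuming the artifacts is plain file I/O. The sketch below uses stand-in files in a temporary directory; the formats are assumptions (a plain-text corpus with one sequence per line and a vocabulary with one token per line), which is what the tree above suggests.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    build = pathlib.Path(tmp) / "build"
    build.mkdir()
    # Stand-ins for the real build artifacts produced by `expanda build`:
    (build / "vocab.txt").write_text("<unk>\n<s>\n</s>\n<pad>\nthe\n")
    (build / "corpus.train.txt").write_text("the cat sat\nthe dog ran\n")

    # One token per line -> token-to-id mapping by line index.
    vocab = (build / "vocab.txt").read_text().splitlines()
    token_to_id = {token: i for i, token in enumerate(vocab)}
    # One training sequence per line.
    train_lines = (build / "corpus.train.txt").read_text().splitlines()

    print(len(vocab), len(train_lines))  # 5 2
```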
+
+
+
+
+%package -n python3-Expanda
+Summary: Integrated Corpus-Building Environment
+Provides: python-Expanda
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-Expanda
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
+![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
+[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
+![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
+[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
+[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment** that provides
+integrated pipelines for building a corpus dataset. Building a corpus dataset
+requires several complicated steps, such as parsing, shuffling, and
+tokenization, and when corpora are gathered from different sources, parsing
+their various formats becomes a problem. Expanda lets you build a corpus in
+a single pass, driven by one build configuration file.
+
+For more information, see also [documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written in Python)
+* Supports multi-processing
+* Extremely low memory usage
+* No need to write new code for each corpus: just add one line to the
+  configuration for each new corpus.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset using Expanda. First of all, install
+Expanda.
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build by running:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
+In this example, we are going to test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Either download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`, or run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration file.
+Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+ </s>
+ <pad>
+
+[build]
+input-files =
+ --expanda.ext.wikipedia src/wiki.xml.bz2
+```
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda by using:
+```console
+$ expanda build
+```
+You should see output similar to the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│ ├── corpus.raw.txt
+│ ├── corpus.train.txt
+│ ├── corpus.test.txt
+│ └── vocab.txt
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
+
+
+
+
+%package help
+Summary: Development documents and examples for Expanda
+Provides: python3-Expanda-doc
+%description help
+# Expanda
+
+**The universal integrated corpus-building environment.**
+
+[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
+![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
+[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
+![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
+[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
+[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)
+
+## Introduction
+**Expanda** is an **integrated corpus-building environment** that provides
+integrated pipelines for building a corpus dataset. Building a corpus dataset
+requires several complicated steps, such as parsing, shuffling, and
+tokenization, and when corpora are gathered from different sources, parsing
+their various formats becomes a problem. Expanda lets you build a corpus in
+a single pass, driven by one build configuration file.
+
+For more information, see also [documentation](https://expanda.readthedocs.io/en/latest/).
+
+## Main Features
+* Easy to build, simple to extend with new extensions
+* Manages the build environment systematically
+* Fast builds through performance optimization (even though it is written in Python)
+* Supports multi-processing
+* Extremely low memory usage
+* No need to write new code for each corpus: just add one line to the
+  configuration for each new corpus.
+
+## Dependencies
+* nltk
+* ijson
+* tqdm>=4.46.0
+* mwparserfromhell>=0.5.4
+* tokenizers>=0.7.0
+* kss==1.3.1
+
+## Installation
+
+### With pip
+Expanda can be installed using pip as follows:
+
+```console
+$ pip install expanda
+```
+
+### From source
+You can install from source by cloning the repository and running:
+
+```console
+$ git clone https://github.com/affjljoo3581/Expanda.git
+$ cd Expanda
+$ python setup.py install
+```
+
+## Build your first dataset
+Let's build a **Wikipedia** dataset using Expanda. First of all, install
+Expanda.
+```console
+$ pip install expanda
+```
+Next, create a workspace for the build by running:
+```console
+$ mkdir workspace
+$ cd workspace
+```
+Then download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
+In this example, we are going to test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
+Either download the file through your browser, move it to `workspace/src`, and
+rename it to `wiki.xml.bz2`, or run the commands below:
+```console
+$ mkdir src
+$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
+```
+After downloading the dump file, we need to set up the configuration file.
+Create an `expanda.cfg` file with the following contents:
+```ini
+[expanda.ext.wikipedia]
+num-cores = 6
+
+[tokenization]
+unk-token = <unk>
+control-tokens = <s>
+ </s>
+ <pad>
+
+[build]
+input-files =
+ --expanda.ext.wikipedia src/wiki.xml.bz2
+```
+The current directory structure of `workspace` should be as follows:
+```
+workspace
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
+Now we are ready to build! Run Expanda by using:
+```console
+$ expanda build
+```
+You should see output similar to the following:
+```
+[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
+[nltk_data] Downloading package punkt to /home/user/nltk_data...
+[nltk_data] Unzipping tokenizers/punkt.zip.
+[*] merge extracted texts.
+[*] start shuffling merged corpus...
+[*] optimum stride: 17, buckets: 34
+[*] create temporary bucket files.
+[*] successfully shuffle offsets. total offsets: 102936
+[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
+[*] start copying buckets to the output file.
+[*] finish copying buckets. remove the buckets...
+[*] complete preparing corpus. start training tokenizer...
+[00:00:59] Reading files ████████████████████ 100
+[00:00:04] Tokenize words ████████████████████ 405802 / 405802
+[00:00:00] Count pairs ████████████████████ 405802 / 405802
+[00:00:01] Compute merges ████████████████████ 6332 / 6332
+
+[*] create tokenized corpus.
+[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
+[*] split the corpus into train and test dataset.
+[*] remove temporary directory.
+[*] finish building corpus.
+```
+If the build succeeds, you will get the following directory tree:
+```
+workspace
+├── build
+│ ├── corpus.raw.txt
+│ ├── corpus.train.txt
+│ ├── corpus.test.txt
+│ └── vocab.txt
+├── src
+│ └── wiki.xml.bz2
+└── expanda.cfg
+```
+
+
+
+
+%prep
+%autosetup -n Expanda-1.3.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-Expanda -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.3.1-1
+- Package Spec generated