| author | CoprDistGit <infra@openeuler.org> | 2023-05-15 06:19:56 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-15 06:19:56 +0000 |
| commit | 1c64dd5646c8e4734aa3606c882bcb2d17c10dbe (patch) | |
| tree | be5d210250a77d29ddb889645151d771f8600ac3 | |
| parent | f391aad14b8770eec4674980111da8c1d520bcc7 (diff) | |
automatic import of python-textaugment
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-textaugment.spec | 851 |
| -rw-r--r-- | sources | 1 |
3 files changed, 853 insertions, 0 deletions

diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
/textaugment-1.3.4.tar.gz
diff --git a/python-textaugment.spec b/python-textaugment.spec
new file mode 100644
index 0000000..4f61cda
--- /dev/null
+++ b/python-textaugment.spec
@@ -0,0 +1,851 @@
%global _empty_manifest_terminate_build 0
Name: python-textaugment
Version: 1.3.4
Release: 1
Summary: A library for augmenting text for natural language processing applications.
License: MIT
URL: https://github.com/dsfsi/textaugment
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/fd/5b/287bc5b562dbee88376472d98701e7cbc68ea4bbdf68a71f12e53d13348a/textaugment-1.3.4.tar.gz
BuildArch: noarch

Requires: python3-nltk
Requires: python3-gensim
Requires: python3-textblob
Requires: python3-numpy
Requires: python3-googletrans

%description

# [TextAugment: Improving Short Text Classification through Global Augmentation Methods](https://arxiv.org/abs/1907.03752)

[Licence](https://github.com/dsfsi/textaugment/blob/master/LICENCE) · [Releases](https://github.com/dsfsi/textaugment/releases) · [PyPI](https://pypi.python.org/pypi/textaugment) · [Project page](https://pypi.org/project/textaugment/) · [Springer](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) · [arXiv](https://arxiv.org/abs/1907.03752)

## You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.

# Table of Contents

- [Features](#Features)
- [Citation Paper](#citation-paper)
  - [Requirements](#Requirements)
  - [Installation](#Installation)
  - [How to use](#How-to-use)
    - [Word2vec-based augmentation](#Word2vec-based-augmentation)
    - [WordNet-based augmentation](#WordNet-based-augmentation)
    - [RTT-based augmentation](#RTT-based-augmentation)
- [Easy data augmentation (EDA)](#eda-easy-data-augmentation-techniques-for-boosting-performance-on-text-classification-tasks)
- [Mixup augmentation](#mixup-augmentation)
  - [Implementation](#Implementation)
- [Acknowledgements](#Acknowledgements)

## Features

- Generates synthetic data for improving model performance without manual effort
- Simple, lightweight, easy-to-use library
- Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, scikit-learn)
- Supports textual data

## Citation Paper

**[Improving short text classification through global augmentation methods](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21)**.

### Requirements

* Python 3

The following software packages are dependencies and will be installed automatically.

```shell
$ pip install numpy nltk gensim textblob googletrans
```

The following code downloads the NLTK [WordNet](http://www.nltk.org/howto/wordnet.html) corpus.

```python
import nltk
nltk.download('wordnet')
```

The following code downloads the [NLTK tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html). This tokenizer divides a text into a list of sentences using an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences.

```python
nltk.download('punkt')
```

The following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

```python
nltk.download('averaged_perceptron_tagger')
```
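
For convenience, the three downloads above can be fetched in one pass; a minimal sketch using only the resource names shown in this section:

```python
import nltk

# Fetch the WordNet corpus, the punkt sentence tokenizer, and the
# default part-of-speech tagger model in one loop.
for resource in ('wordnet', 'punkt', 'averaged_perceptron_tagger'):
    nltk.download(resource)
```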

Use gensim to load a pre-trained word2vec model, such as [Google News from Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

```python
import gensim
# In gensim >= 1.0, the word2vec binary format is loaded via KeyedVectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
```

You can also use gensim to load Facebook's FastText [English](https://fasttext.cc/docs/en/english-vectors.html) and [multilingual models](https://fasttext.cc/docs/en/crawl-vectors.html).

```python
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
```

Or train one from scratch using your own data or one of the following public datasets:

- [Text8 Wiki](http://mattmahoney.net/dc/enwik9.zip)
- [Dataset from "One Billion Word Language Modeling Benchmark"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)

### Installation

Install from pip [Recommended]:

```sh
$ pip install textaugment
```

or install the latest release from GitHub:

```sh
$ pip install git+git@github.com:dsfsi/textaugment.git
```

Install from source:

```sh
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
```

### How to use

Three types of augmentation can be used:

- word2vec

```python
from textaugment import Word2vec
```

- wordnet

```python
from textaugment import Wordnet
```

- translate (this requires internet access)

```python
from textaugment import Translate
```

#### Word2vec-based augmentation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/word2vec_example.ipynb)

**Basic example**

```python
>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # a path, or a loaded gensim model itself
>>> t.augment('The stories are good')
The films are good
```

**Advanced example**

```python
>>> runs = 1  # the default
>>> v = False  # verbose mode replaces all the words; when enabled, runs has no effect. Used in https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> t.augment('The stories are good')
The movies are excellent
```
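
Since the model parameter accepts a loaded gensim model as well as a path, vectors loaded once (as in the Requirements section) can be reused across augmenters. This is a hedged sketch under that assumption; whether a bare KeyedVectors object is accepted may depend on the textaugment version:

```python
import gensim
from textaugment import Word2vec

# Load the Google News vectors once and hand the object to the augmenter.
wv = gensim.models.KeyedVectors.load_word2vec_format(
    './GoogleNews-vectors-negative300.bin', binary=True)
t = Word2vec(model=wv, runs=5, p=0.5)
print(t.augment('The stories are good'))
```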

#### WordNet-based augmentation

**Basic example**

```python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town
```

**Advanced example**

```python
>>> v = True  # enable verb augmentation (default: True)
>>> n = False  # enable noun augmentation (default: False)
>>> runs = 1  # number of times to augment a sentence (default: 1)
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.
```

#### RTT-based augmentation

RTT (round-trip translation) translates a sentence into a target language and back into the source language, producing a paraphrase.

**Example**

```python
>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
```
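
The same round trip can be reproduced directly with googletrans, which this package depends on; a rough sketch (network access required, and the exact paraphrase returned may vary):

```python
from googletrans import Translator

translator = Translator()
text = 'In the afternoon, John is going to town'
# Translate into the target language, then back into the source language.
french = translator.translate(text, src='en', dest='fr').text
round_trip = translator.translate(french, src='fr', dest='en').text
print(round_trip)  # e.g. "In the afternoon John goes to town"
```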

# EDA: Easy data augmentation techniques for boosting performance on text classification tasks

## This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/eda_example.ipynb)

#### Synonym Replacement

Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town")
John is give out to town
```

#### Random Deletion

Randomly remove each word in the sentence with probability *p*.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
```

#### Random Swap

Randomly choose two words in the sentence and swap their positions. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
```

#### Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
```
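
The four EDA operations compose naturally when several augmented variants of one sentence are needed; a small sketch using only the calls shown above (outputs are random):

```python
from textaugment import EDA

t = EDA()
sentence = "John is going to town"
# One augmented variant per EDA operation.
variants = [
    t.synonym_replacement(sentence),
    t.random_deletion(sentence, p=0.2),
    t.random_swap(sentence),
    t.random_insertion(sentence),
]
print(variants)
```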

# Mixup augmentation

This is the implementation of mixup augmentation by [Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz](https://openreview.net/forum?id=r1Ddp1-Rb) adapted to NLP.

Used in [Augmenting Data with Mixup for Sentence Classification: An Empirical Study](https://arxiv.org/abs/1905.08941).

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.
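
As a concrete illustration of the principle (not the library's exact implementation), mixing two vectorised examples with a Beta-distributed weight looks roughly like this, assuming numeric feature vectors such as embedded sentences and one-hot labels:

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two examples and their labels."""
    # The mixing weight is drawn from Beta(alpha, alpha); the same weight
    # is applied to the inputs and to the labels.
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```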

## Implementation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/mixup_example_using_IMDB_sentiment.ipynb)

## Built with ❤ on

* [Python](http://python.org/)

## Authors

* [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)
* [Vukosi Marivate](http://www.vima.co.za) (http://www.vima.co.za)

## Acknowledgements

Cite this [paper](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) when using this library. [arXiv version](https://arxiv.org/abs/1907.03752)

```
@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}
```

## Licence

MIT licensed. See the bundled [LICENCE](https://github.com/dsfsi/textaugment/blob/master/LICENCE) file for more details.

%package -n python3-textaugment
Summary: A library for augmenting text for natural language processing applications.
Provides: python-textaugment
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-textaugment

# [TextAugment: Improving Short Text Classification through Global Augmentation Methods](https://arxiv.org/abs/1907.03752)

[Licence](https://github.com/dsfsi/textaugment/blob/master/LICENCE) · [Releases](https://github.com/dsfsi/textaugment/releases) · [PyPI](https://pypi.python.org/pypi/textaugment) · [Project page](https://pypi.org/project/textaugment/) · [Springer](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) · [arXiv](https://arxiv.org/abs/1907.03752)

## You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.

# Table of Contents

- [Features](#Features)
- [Citation Paper](#citation-paper)
  - [Requirements](#Requirements)
  - [Installation](#Installation)
  - [How to use](#How-to-use)
    - [Word2vec-based augmentation](#Word2vec-based-augmentation)
    - [WordNet-based augmentation](#WordNet-based-augmentation)
    - [RTT-based augmentation](#RTT-based-augmentation)
- [Easy data augmentation (EDA)](#eda-easy-data-augmentation-techniques-for-boosting-performance-on-text-classification-tasks)
- [Mixup augmentation](#mixup-augmentation)
  - [Implementation](#Implementation)
- [Acknowledgements](#Acknowledgements)

## Features

- Generates synthetic data for improving model performance without manual effort
- Simple, lightweight, easy-to-use library
- Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, scikit-learn)
- Supports textual data

## Citation Paper

**[Improving short text classification through global augmentation methods](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21)**.

### Requirements

* Python 3

The following software packages are dependencies and will be installed automatically.

```shell
$ pip install numpy nltk gensim textblob googletrans
```

The following code downloads the NLTK [WordNet](http://www.nltk.org/howto/wordnet.html) corpus.

```python
import nltk
nltk.download('wordnet')
```

The following code downloads the [NLTK tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html). This tokenizer divides a text into a list of sentences using an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences.

```python
nltk.download('punkt')
```

The following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

```python
nltk.download('averaged_perceptron_tagger')
```

Use gensim to load a pre-trained word2vec model, such as [Google News from Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

```python
import gensim
# In gensim >= 1.0, the word2vec binary format is loaded via KeyedVectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
```

You can also use gensim to load Facebook's FastText [English](https://fasttext.cc/docs/en/english-vectors.html) and [multilingual models](https://fasttext.cc/docs/en/crawl-vectors.html).

```python
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
```

Or train one from scratch using your own data or one of the following public datasets:

- [Text8 Wiki](http://mattmahoney.net/dc/enwik9.zip)
- [Dataset from "One Billion Word Language Modeling Benchmark"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)

### Installation

Install from pip [Recommended]:

```sh
$ pip install textaugment
```

or install the latest release from GitHub:

```sh
$ pip install git+git@github.com:dsfsi/textaugment.git
```

Install from source:

```sh
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
```

### How to use

Three types of augmentation can be used:

- word2vec

```python
from textaugment import Word2vec
```

- wordnet

```python
from textaugment import Wordnet
```

- translate (this requires internet access)

```python
from textaugment import Translate
```

#### Word2vec-based augmentation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/word2vec_example.ipynb)

**Basic example**

```python
>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # a path, or a loaded gensim model itself
>>> t.augment('The stories are good')
The films are good
```

**Advanced example**

```python
>>> runs = 1  # the default
>>> v = False  # verbose mode replaces all the words; when enabled, runs has no effect. Used in https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> t.augment('The stories are good')
The movies are excellent
```

#### WordNet-based augmentation

**Basic example**

```python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town
```

**Advanced example**

```python
>>> v = True  # enable verb augmentation (default: True)
>>> n = False  # enable noun augmentation (default: False)
>>> runs = 1  # number of times to augment a sentence (default: 1)
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.
```

#### RTT-based augmentation

RTT (round-trip translation) translates a sentence into a target language and back into the source language, producing a paraphrase.

**Example**

```python
>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
```

# EDA: Easy data augmentation techniques for boosting performance on text classification tasks

## This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/eda_example.ipynb)

#### Synonym Replacement

Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town")
John is give out to town
```

#### Random Deletion

Randomly remove each word in the sentence with probability *p*.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
```

#### Random Swap

Randomly choose two words in the sentence and swap their positions. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
```

#### Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
```

# Mixup augmentation

This is the implementation of mixup augmentation by [Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz](https://openreview.net/forum?id=r1Ddp1-Rb) adapted to NLP.

Used in [Augmenting Data with Mixup for Sentence Classification: An Empirical Study](https://arxiv.org/abs/1905.08941).

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.

## Implementation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/mixup_example_using_IMDB_sentiment.ipynb)

## Built with ❤ on

* [Python](http://python.org/)

## Authors

* [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)
* [Vukosi Marivate](http://www.vima.co.za) (http://www.vima.co.za)

## Acknowledgements

Cite this [paper](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) when using this library. [arXiv version](https://arxiv.org/abs/1907.03752)

```
@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}
```

## Licence

MIT licensed. See the bundled [LICENCE](https://github.com/dsfsi/textaugment/blob/master/LICENCE) file for more details.

%package help
Summary: Development documents and examples for textaugment
Provides: python3-textaugment-doc
%description help

# [TextAugment: Improving Short Text Classification through Global Augmentation Methods](https://arxiv.org/abs/1907.03752)

[Licence](https://github.com/dsfsi/textaugment/blob/master/LICENCE) · [Releases](https://github.com/dsfsi/textaugment/releases) · [PyPI](https://pypi.python.org/pypi/textaugment) · [Project page](https://pypi.org/project/textaugment/) · [Springer](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) · [arXiv](https://arxiv.org/abs/1907.03752)

## You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.

# Table of Contents

- [Features](#Features)
- [Citation Paper](#citation-paper)
  - [Requirements](#Requirements)
  - [Installation](#Installation)
  - [How to use](#How-to-use)
    - [Word2vec-based augmentation](#Word2vec-based-augmentation)
    - [WordNet-based augmentation](#WordNet-based-augmentation)
    - [RTT-based augmentation](#RTT-based-augmentation)
- [Easy data augmentation (EDA)](#eda-easy-data-augmentation-techniques-for-boosting-performance-on-text-classification-tasks)
- [Mixup augmentation](#mixup-augmentation)
  - [Implementation](#Implementation)
- [Acknowledgements](#Acknowledgements)

## Features

- Generates synthetic data for improving model performance without manual effort
- Simple, lightweight, easy-to-use library
- Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, scikit-learn)
- Supports textual data

## Citation Paper

**[Improving short text classification through global augmentation methods](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21)**.

### Requirements

* Python 3

The following software packages are dependencies and will be installed automatically.

```shell
$ pip install numpy nltk gensim textblob googletrans
```

The following code downloads the NLTK [WordNet](http://www.nltk.org/howto/wordnet.html) corpus.

```python
import nltk
nltk.download('wordnet')
```

The following code downloads the [NLTK tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html). This tokenizer divides a text into a list of sentences using an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences.

```python
nltk.download('punkt')
```

The following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

```python
nltk.download('averaged_perceptron_tagger')
```

Use gensim to load a pre-trained word2vec model, such as [Google News from Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

```python
import gensim
# In gensim >= 1.0, the word2vec binary format is loaded via KeyedVectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
```

You can also use gensim to load Facebook's FastText [English](https://fasttext.cc/docs/en/english-vectors.html) and [multilingual models](https://fasttext.cc/docs/en/crawl-vectors.html).

```python
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
```

Or train one from scratch using your own data or one of the following public datasets:

- [Text8 Wiki](http://mattmahoney.net/dc/enwik9.zip)
- [Dataset from "One Billion Word Language Modeling Benchmark"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)

### Installation

Install from pip [Recommended]:

```sh
$ pip install textaugment
```

or install the latest release from GitHub:

```sh
$ pip install git+git@github.com:dsfsi/textaugment.git
```

Install from source:

```sh
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
```

### How to use

Three types of augmentation can be used:

- word2vec

```python
from textaugment import Word2vec
```

- wordnet

```python
from textaugment import Wordnet
```

- translate (this requires internet access)

```python
from textaugment import Translate
```

#### Word2vec-based augmentation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/word2vec_example.ipynb)

**Basic example**

```python
>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # a path, or a loaded gensim model itself
>>> t.augment('The stories are good')
The films are good
```

**Advanced example**

```python
>>> runs = 1  # the default
>>> v = False  # verbose mode replaces all the words; when enabled, runs has no effect. Used in https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> t.augment('The stories are good')
The movies are excellent
```

#### WordNet-based augmentation

**Basic example**

```python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town
```

**Advanced example**

```python
>>> v = True  # enable verb augmentation (default: True)
>>> n = False  # enable noun augmentation (default: False)
>>> runs = 1  # number of times to augment a sentence (default: 1)
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0, default 0.5); used by the geometric distribution to select words from a sentence

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.
```

#### RTT-based augmentation

RTT (round-trip translation) translates a sentence into a target language and back into the source language, producing a paraphrase.

**Example**

```python
>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
```

# EDA: Easy data augmentation techniques for boosting performance on text classification tasks

## This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/eda_example.ipynb)

#### Synonym Replacement

Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town")
John is give out to town
```

#### Random Deletion

Randomly remove each word in the sentence with probability *p*.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
```

#### Random Swap

Randomly choose two words in the sentence and swap their positions. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
```

#### Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.

**Basic example**

```python
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
```

# Mixup augmentation

This is the implementation of mixup augmentation by [Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz](https://openreview.net/forum?id=r1Ddp1-Rb) adapted to NLP.

Used in [Augmenting Data with Mixup for Sentence Classification: An Empirical Study](https://arxiv.org/abs/1905.08941).

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.

## Implementation

[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/mixup_example_using_IMDB_sentiment.ipynb)

## Built with ❤ on

* [Python](http://python.org/)

## Authors

* [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)
* [Vukosi Marivate](http://www.vima.co.za) (http://www.vima.co.za)

## Acknowledgements

Cite this [paper](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) when using this library. [arXiv version](https://arxiv.org/abs/1907.03752)

```
@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}
```

## Licence

MIT licensed. See the bundled [LICENCE](https://github.com/dsfsi/textaugment/blob/master/LICENCE) file for more details.

%prep
%autosetup -n textaugment-1.3.4

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
# Record every installed file so the generated lists can drive the files sections below.
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-textaugment -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 1.3.4-1
- Package Spec generated
diff --git a/sources b/sources
@@ -0,0 +1 @@
76b7a9253ba385fe7df25437402b578f textaugment-1.3.4.tar.gz