%global _empty_manifest_terminate_build 0
Name:		python-neattext
Version:	0.1.3
Release:	1
Summary:	Neattext - a simple NLP package for cleaning text
License:	MIT
URL:		https://github.com/Jcharis/neattext
Source0:	https://mirrors.aliyun.com/pypi/web/packages/49/cd/b1224076d1b370a0172643f96155f97017ac5aba2c8b56f2999ba4061056/neattext-0.1.3.tar.gz
BuildArch:	noarch

%description
# neattext
NeatText: a simple NLP package for cleaning textual data and text preprocessing.

Simplifying Text Cleaning For NLP & ML

[![Build Status](https://travis-ci.org/Jcharis/neattext.svg?branch=master)](https://travis-ci.org/Jcharis/neattext)
[![GitHub license](https://img.shields.io/github/license/Jcharis/neattext)](https://github.com/Jcharis/neattext/blob/master/LICENSE)

#### Problem
+ Cleaning of unstructured text data
+ Reducing noise (special characters, stopwords)
+ Reducing repetition of the same text-preprocessing code across projects

#### Solution
+ Convert the known solutions for cleaning text into a reusable package

#### Docs
+ Check out the full docs [here](https://jcharis.github.io/neattext/)

#### Installation
```bash
pip install neattext
```

### Usage
+ The OOP Way (Object Oriented Way)
+ NeatText offers 5 main classes for working with text data
    - TextFrame: a frame-like object for cleaning text
    - TextCleaner: remove or replace specifics
    - TextExtractor: extract unwanted text data
    - TextMetrics: word stats and metrics
    - TextPipeline: combine multiple functions in a pipeline

### Overall Components of NeatText
![](images/neattext_features_jcharistech.png)

### Using TextFrame
+ Keeps the text as a `TextFrame` object, which allows us to do more with our text.
+ It inherits the benefits of TextCleaner and TextMetrics out of the box, with some additional features for handling text data.
+ This is the simplest way to preprocess text with this library; alternatively, you can use the other classes.
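The cleaning that these classes perform is largely regex-driven. As a rough illustration of the idea only (a hypothetical, stdlib-only sketch; the patterns below are illustrative assumptions, not NeatText's actual implementation):

```python
import re

# Illustrative patterns only -- NeatText's real patterns may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+")

def remove_emails_sketch(text: str) -> str:
    # Drop every email-like token from the text.
    return EMAIL_RE.sub("", text)

def remove_urls_sketch(text: str) -> str:
    # Drop every http/https URL from the text.
    return URL_RE.sub("", text)

text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(remove_urls_sketch(remove_emails_sketch(text)))
```

Each `remove_*` step is just a substitution of one pattern with an empty string, which is why such steps compose naturally into chains and pipelines, as shown in the examples that follow.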
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx = nt.TextFrame(text=mytext)
>>> docx.text
"This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>>
>>> docx.describe()
Key                 Value
Length            : 73
vowels            : 21
consonants        : 34
stopwords         : 4
punctuations      : 8
special_char      : 8
tokens(whitespace): 10
tokens(words)     : 14
>>>
>>> docx.length
73
>>> # Scan the percentage of noise (unclean data) in the text
>>> docx.noise_scan()
{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14}
>>>
>>> docx.head(16)
'This is the mail'
>>> docx.tail()
>>> docx.count_vowels()
>>> docx.count_stopwords()
>>> docx.count_consonants()
>>> docx.nlongest()
>>> docx.nshortest()
>>> docx.readability()
```

#### Basic NLP Tasks (Tokenization, Ngrams, Text Generation)
```python
>>> docx.word_tokens()
>>>
>>> docx.sent_tokens()
>>>
>>> docx.term_freq()
>>>
>>> docx.bow()
```

#### Basic Text Preprocessing
```python
>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com 😊.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '
>>> docx.remove_puncts()
>>> docx.remove_stopwords()
>>> docx.remove_html_tags()
>>> docx.remove_special_characters()
>>> docx.remove_emojis()
>>> docx.fix_contractions()
```

##### Handling Files with NeatText
+ Read a txt file directly into a TextFrame
```python
>>> import neattext as nt
>>> docx_df = nt.read_txt('file.txt')
```
+ Alternatively, you can instantiate a TextFrame and read a text file into it
```python
>>> import neattext as nt
>>> docx_df = nt.TextFrame().read_txt('file.txt')
```

##### Chaining Methods on a TextFrame
```python
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = nt.TextFrame(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.'
```

#### Clean Text
+ Clean text by removing emails, numbers, stopwords, emojis, etc.
+ A simplified method for cleaning text by specifying True/False for what to clean from a text
```python
>>> from neattext.functions import clean_text
>>>
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>>
>>> clean_text(mytext)
'mail example@gmail.com ,our website https://example.com .'
```
+ You can remove punctuations, stopwords, urls, emojis, multiple whitespaces, etc. by setting them to True.
+ You can choose to remove or keep punctuations by setting puncts to True or False respectively
```python
>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>>
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
```
+ You can also remove the other non-needed items accordingly
```python
>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>>
```

#### Removing Punctuations [A Very Common Text Preprocessing Step]
+ You can remove the most common punctuations (fullstop, comma, exclamation mark and question mark) by setting most_common=True, which is the default
+ Alternatively, you can remove all known punctuations from a text.
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ")
>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ")
```

#### Removing Stopwords [A Very Common Text Preprocessing Step]
+ You can remove stopwords from a text by specifying the language. The default language is English.
+ Supported languages include English (en), Spanish (es), French (fr), Russian (ru), Yoruba (yo) and German (de)
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_stopwords(lang='en')
TextFrame(text="mail example@gmail.com ,our WEBSITE https://example.com 😊. forget email enter !!!!!")
```

#### Remove Emails, Numbers, Phone Numbers, Dates, Btc Addresses, VisaCard Addresses, etc.
```python
>>> print(docx.remove_emails())
'This is the mail ,our WEBSITE is https://example.com 😊.'
>>>
>>> print(docx.remove_stopwords())
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> print(docx.remove_numbers())
>>> docx.remove_phone_numbers()
>>> docx.remove_btc_address()
```

#### Remove Special Characters
```python
>>> docx.remove_special_characters()
```

#### Remove Emojis
```python
>>> print(docx.remove_emojis())
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
```

#### Remove Custom Patterns
+ You can also specify your own custom pattern, in case you cannot find what you need in the built-in functions, using the `remove_custom_pattern()` function
```python
>>> import neattext.functions as nfx
>>> ex = "Last !RT tweeter multiple ṡ"
>>>
>>> nfx.remove_custom_pattern(ex,r'&#\d+')
'Last !RT tweeter multiple '
```

#### Replace Emails, Numbers, Phone Numbers
```python
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
```

#### Chain Multiple Methods
```python
>>> from neattext import TextCleaner
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = TextCleaner(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.'
```

### Using TextExtractor
+ Extract emails, phone numbers, numbers, urls and emojis from text
```python
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['😊']
```

### Using TextMetrics
+ Find word stats such as counts of vowels, consonants and stopwords
```python
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
>>> docx.memory_usage()
```

### Usage
+ The MOP (method/function oriented) Way
```python
>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> extract_emails(t1)
['example@gmail.com']
```
+ Alternatively, you can also use this approach
```python
>>> import neattext.functions as nfx
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> nfx.clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> nfx.extract_emails(t1)
['example@gmail.com']
```

### Explainer
+ Explain an emoji or the unicode for an emoji
    - emoji_explainer()
    - emojify()
    - unicode_2_emoji()
```python
>>> from neattext.explainer import emojify
>>> emojify('Smiley')
'😃'
```
```python
>>> from neattext.explainer import emoji_explainer
>>> emoji_explainer('😃')
'SMILING FACE WITH OPEN MOUTH'
```
```python
>>> from neattext.explainer import unicode_2_emoji
>>> unicode_2_emoji('0x1f49b')
'FLUSHED FACE'
```

### Usage
+ The Pipeline Way
```python
>>> from neattext.pipeline import TextPipeline
>>> from neattext.functions import remove_emails,remove_numbers,remove_emojis
>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU"""
>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis])
>>> p.fit(t1)
'This is the mail ,our WEBSITE is https://example.com . This is visa and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard .
Send it to PO Box , KNU'
```
+ Check the steps and named steps
```python
>>> p.steps
>>> p.named_steps
```

### Documentation
Please read the [documentation](https://github.com/Jcharis/neattext/wiki) for more information on what neattext does and how to use it for your needs. You can also check out our readthedocs page [here](https://jcharis.github.io/neattext/)

### More Features To Add
+ basic nlp tasks
+ currency normalizer

#### Acknowledgements
+ Inspired by packages like `clean-text` from Johannes Filter and `textify` by JCharisTech

#### NB
+ Contributions are welcome
+ Notice a bug? Please let us know.
+ Thanks a lot

#### By
+ Jesse E. Agbe (JCharis)
+ Jesus Saves @JCharisTech

%package -n python3-neattext
Summary:	Neattext - a simple NLP package for cleaning text
Provides:	python-neattext
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-neattext
NeatText: a simple NLP package for cleaning textual data and text preprocessing.
See the full README at https://github.com/Jcharis/neattext for usage examples.

%package help
Summary:	Development documents and examples for neattext
Provides:	python3-neattext-doc
%description help
Development documents and examples for neattext, a simple NLP package for
cleaning textual data and text preprocessing.

%prep
%autosetup -n neattext-0.1.3

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-neattext -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri Jun 09 2023 Python_Bot - 0.1.3-1
- Package Spec generated