-rw-r--r--   .gitignore             1
-rw-r--r--   python-neattext.spec   1179
-rw-r--r--   sources                1
3 files changed, 1181 insertions, 0 deletions
@@ -0,0 +1 @@ +/neattext-0.1.3.tar.gz diff --git a/python-neattext.spec b/python-neattext.spec new file mode 100644 index 0000000..736162e --- /dev/null +++ b/python-neattext.spec @@ -0,0 +1,1179 @@ +%global _empty_manifest_terminate_build 0 +Name: python-neattext +Version: 0.1.3 +Release: 1 +Summary: Neattext - a simple NLP package for cleaning text +License: MIT +URL: https://github.com/Jcharis/neattext +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/49/cd/b1224076d1b370a0172643f96155f97017ac5aba2c8b56f2999ba4061056/neattext-0.1.3.tar.gz +BuildArch: noarch + + +%description +# neattext +NeatText:a simple NLP package for cleaning textual data and text preprocessing. +Simplifying Text Cleaning For NLP & ML + +[](https://travis-ci.org/Jcharis/neattext) + +[](https://github.com/Jcharis/neattext/blob/master/LICENSE) + +#### Problem ++ Cleaning of unstructured text data ++ Reduce noise [special characters,stopwords] ++ Reducing repetition of using the same code for text preprocessing + +#### Solution ++ convert the already known solution for cleaning text into a reuseable package + +#### Docs ++ Check out the full docs [here](https://jcharis.github.io/neattext/) + +#### Installation +```bash +pip install neattext +``` + +### Usage ++ The OOP Way(Object Oriented Way) ++ NeatText offers 5 main classes for working with text data + - TextFrame : a frame-like object for cleaning text + - TextCleaner: remove or replace specifics + - TextExtractor: extract unwanted text data + - TextMetrics: word stats and metrics + - TextPipeline: combine multiple functions in a pipeline + +### Overall Components of NeatText + + +### Using TextFrame ++ Keeps the text as `TextFrame` object. This allows us to do more with our text. ++ It inherits the benefits of the TextCleaner and the TextMetrics out of the box with some additional features for handling text data. ++ This is the simplest way for text preprocessing with this library alternatively you can utilize the other classes too. + + +```python +>>> import neattext as nt +>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx = nt.TextFrame(text=mytext) +>>> docx.text +"This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> docx.describe() +Key Value +Length : 73 +vowels : 21 +consonants: 34 +stopwords: 4 +punctuations: 8 +special_char: 8 +tokens(whitespace): 10 +tokens(words): 14 +>>> +>>> docx.length +73 +>>> # Scan Percentage of Noise(Unclean data) in text +>>> d.noise_scan() +{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14} +>>> +>>> docs.head(16) +'This is the mail' +>>> docx.tail() +>>> docx.count_vowels() +>>> docx.count_stopwords() +>>> docx.count_consonants() +>>> docx.nlongest() +>>> docx.nshortest() +>>> docx.readability() +``` +#### Basic NLP Task (Tokenization,Ngram,Text Generation) +```python +>>> docx.word_tokens() +>>> +>>> docx.sent_tokens() +>>> +>>> docx.term_freq() +>>> +>>> docx.bow() +``` +#### Basic Text Preprocessing +```python +>>> docx.normalize() +'this is the mail example@gmail.com ,our website is https://example.com 😊.' 
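+>>> # Note (illustrative): level='deep', shown next, goes beyond the default
+>>> # normalization and also strips punctuation and special characters.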
+>>> docx.normalize(level='deep') +'this is the mail examplegmailcom our website is httpsexamplecom ' + +>>> docx.remove_puncts() +>>> docx.remove_stopwords() +>>> docx.remove_html_tags() +>>> docx.remove_special_characters() +>>> docx.remove_emojis() +>>> docx.fix_contractions() +``` + +##### Handling Files with NeatText ++ Read txt file directly into TextFrame +```python +>>> import neattext as nt +>>> docx_df = nt.read_txt('file.txt') +``` ++ Alternatively you can instantiate a TextFrame and read a text file into it +```python +>>> import neattext as nt +>>> docx_df = nt.TextFrame().read_txt('file.txt') +``` + +##### Chaining Methods on TextFrame +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextFrame(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' +``` + + + +#### Clean Text ++ Clean text by removing emails,numbers,stopwords,emojis,etc ++ A simplified method for cleaning text by specifying as True/False what to clean from a text +```python +>>> from neattext.functions import clean_text +>>> +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> clean_text(mytext) +'mail example@gmail.com ,our website https://example.com .' +``` ++ You can remove punctuations,stopwords,urls,emojis,multiple_whitespaces,etc by setting them to True. + ++ You can choose to remove or not remove punctuations by setting to True/False respectively + +```python +>>> clean_text(mytext,puncts=True) +'mail example@gmailcom website https://examplecom ' +>>> +>>> clean_text(mytext,puncts=False) +'mail example@gmail.com ,our website https://example.com .' +>>> +>>> clean_text(mytext,puncts=False,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +``` ++ You can also remove the other non-needed items accordingly +```python +>>> clean_text(mytext,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +>>> clean_text(mytext,urls=False) +'mail example@gmail.com ,our website https://example.com .' +>>> +>>> clean_text(mytext,urls=True) +'mail example@gmail.com ,our website .' +>>> + +``` + +#### Removing Punctuations [A Very Common Text Preprocessing Step] ++ You remove the most common punctuations such as fullstop,comma,exclamation marks and question marks by setting most_common=True which is the default ++ Alternatively you can also remove all known punctuations from a text. +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_puncts() +TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ") + +>>> docx.remove_puncts(most_common=False) +TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ") +``` + +#### Removing Stopwords [A Very Common Text Preprocessing Step] ++ You can remove stopwords from a text by specifying the language. 
The default language is English ++ Supported Languages include English(en),Spanish(es),French(fr)|Russian(ru)|Yoruba(yo)|German(de) + +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_stopwords(lang='en') +TextFrame(text="mail example@gmail.com ,our WEBSITE https://example.com 😊. forget email enter !!!!!") +``` + + +#### Remove Emails,Numbers,Phone Numbers,Dates,Btc Address,VisaCard Address,etc +```python +>>> print(docx.remove_emails()) +>>> 'This is the mail ,our WEBSITE is https://example.com 😊.' +>>> +>>> print(docx.remove_stopwords()) +>>> 'This mail example@gmail.com ,our WEBSITE https://example.com 😊.' +>>> +>>> print(docx.remove_numbers()) +>>> docx.remove_phone_numbers() +>>> docx.remove_btc_address() +``` + + +#### Remove Special Characters +```python +>>> docx.remove_special_characters() +``` + +#### Remove Emojis +```python +>>> print(docx.remove_emojis()) +>>> 'This is the mail example@gmail.com ,our WEBSITE is https://example.com .' +``` + + +#### Remove Custom Pattern ++ You can also specify your own custom pattern, incase you cannot find what you need in the functions using the `remove_custom_pattern()` function +```python +>>> import neattext.functions as nfx +>>> ex = "Last !RT tweeter multiple ṡ" +>>> +>>> nfx.remove_custom_pattern(e,r'&#\d+') +'Last !RT tweeter multiple ' + + + +``` + +#### Replace Emails,Numbers,Phone Numbers +```python +>>> docx.replace_emails() +>>> docx.replace_numbers() +>>> docx.replace_phone_numbers() +``` + +#### Chain Multiple Methods +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextCleaner(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' + +``` + +### Using TextExtractor ++ To Extract emails,phone numbers,numbers,urls,emojis from text +```python +>>> from neattext import TextExtractor +>>> docx = TextExtractor() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx.extract_emails() +>>> ['example@gmail.com'] +>>> +>>> docx.extract_emojis() +>>> ['😊'] +``` + + +### Using TextMetrics ++ To Find the Words Stats such as counts of vowels,consonants,stopwords,word-stats +```python +>>> from neattext import TextMetrics +>>> docx = TextMetrics() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx.count_vowels() +>>> docx.count_consonants() +>>> docx.count_stopwords() +>>> docx.word_stats() +>>> docx.memory_usage() +``` + +### Usage ++ The MOP(method/function oriented way) Way + +```python +>>> from neattext.functions import clean_text,extract_emails +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." +>>> clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> extract_emails(t1) +>>> ['example@gmail.com'] +``` + ++ Alternatively you can also use this approach +```python +>>> import neattext.functions as nfx +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." 
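+>>> # Same functions as the direct import above, only accessed through the nfx alias,
+>>> # so the results below match the earlier clean_text/extract_emails output.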
+>>> nfx.clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> nfx.extract_emails(t1) +>>> ['example@gmail.com'] +``` + +### Explainer ++ Explain an emoji or unicode for emoji + - emoji_explainer() + - emojify() + - unicode_2_emoji() + + +```python +>>> from neattext.explainer import emojify +>>> emojify('Smiley') +>>> '😃' +``` + +```python +>>> from neattext.explainer import emoji_explainer +>>> emoji_explainer('😃') +>>> 'SMILING FACE WITH OPEN MOUTH' +``` + +```python +>>> from neattext.explainer import unicode_2_emoji +>>> unicode_2_emoji('0x1f49b') + 'FLUSHED FACE' +``` + +### Usage ++ The Pipeline Way + +```python +>>> from neattext.pipeline import TextPipeline +>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU""" + +>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis]) +>>> p.fit(t1) +'This is the mail ,our WEBSITE is https://example.com . This is visa and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard . Send it to PO Box , KNU' + +``` ++ Check For steps and named steps +```python +>>> p.steps +>>> p.named_steps +``` + ++ Alternatively you can also use this approach + + + + +### Documentation +Please read the [documentation](https://github.com/Jcharis/neattext/wiki) for more information on what neattext does and how to use is for your needs.You can also check +out our readthedocs page [here](https://jcharis.github.io/neattext/) + + +### More Features To Add ++ basic nlp task ++ currency normalizer + +#### Acknowledgements ++ Inspired by packages like `clean-text` from Johannes Fillter and `textify` by JCharisTech + + +#### NB ++ Contributions Are Welcomed ++ Notice a bug, please let us know. ++ Thanks A lot + + +#### By ++ Jesse E.Agbe(JCharis) ++ Jesus Saves @JCharisTech + + + + + +%package -n python3-neattext +Summary: Neattext - a simple NLP package for cleaning text +Provides: python-neattext +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-neattext +# neattext +NeatText:a simple NLP package for cleaning textual data and text preprocessing. +Simplifying Text Cleaning For NLP & ML + +[](https://travis-ci.org/Jcharis/neattext) + +[](https://github.com/Jcharis/neattext/blob/master/LICENSE) + +#### Problem ++ Cleaning of unstructured text data ++ Reduce noise [special characters,stopwords] ++ Reducing repetition of using the same code for text preprocessing + +#### Solution ++ convert the already known solution for cleaning text into a reuseable package + +#### Docs ++ Check out the full docs [here](https://jcharis.github.io/neattext/) + +#### Installation +```bash +pip install neattext +``` + +### Usage ++ The OOP Way(Object Oriented Way) ++ NeatText offers 5 main classes for working with text data + - TextFrame : a frame-like object for cleaning text + - TextCleaner: remove or replace specifics + - TextExtractor: extract unwanted text data + - TextMetrics: word stats and metrics + - TextPipeline: combine multiple functions in a pipeline + +### Overall Components of NeatText + + +### Using TextFrame ++ Keeps the text as `TextFrame` object. This allows us to do more with our text. ++ It inherits the benefits of the TextCleaner and the TextMetrics out of the box with some additional features for handling text data. 
++ This is the simplest way for text preprocessing with this library alternatively you can utilize the other classes too. + + +```python +>>> import neattext as nt +>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx = nt.TextFrame(text=mytext) +>>> docx.text +"This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> docx.describe() +Key Value +Length : 73 +vowels : 21 +consonants: 34 +stopwords: 4 +punctuations: 8 +special_char: 8 +tokens(whitespace): 10 +tokens(words): 14 +>>> +>>> docx.length +73 +>>> # Scan Percentage of Noise(Unclean data) in text +>>> d.noise_scan() +{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14} +>>> +>>> docs.head(16) +'This is the mail' +>>> docx.tail() +>>> docx.count_vowels() +>>> docx.count_stopwords() +>>> docx.count_consonants() +>>> docx.nlongest() +>>> docx.nshortest() +>>> docx.readability() +``` +#### Basic NLP Task (Tokenization,Ngram,Text Generation) +```python +>>> docx.word_tokens() +>>> +>>> docx.sent_tokens() +>>> +>>> docx.term_freq() +>>> +>>> docx.bow() +``` +#### Basic Text Preprocessing +```python +>>> docx.normalize() +'this is the mail example@gmail.com ,our website is https://example.com 😊.' +>>> docx.normalize(level='deep') +'this is the mail examplegmailcom our website is httpsexamplecom ' + +>>> docx.remove_puncts() +>>> docx.remove_stopwords() +>>> docx.remove_html_tags() +>>> docx.remove_special_characters() +>>> docx.remove_emojis() +>>> docx.fix_contractions() +``` + +##### Handling Files with NeatText ++ Read txt file directly into TextFrame +```python +>>> import neattext as nt +>>> docx_df = nt.read_txt('file.txt') +``` ++ Alternatively you can instantiate a TextFrame and read a text file into it +```python +>>> import neattext as nt +>>> docx_df = nt.TextFrame().read_txt('file.txt') +``` + +##### Chaining Methods on TextFrame +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextFrame(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' +``` + + + +#### Clean Text ++ Clean text by removing emails,numbers,stopwords,emojis,etc ++ A simplified method for cleaning text by specifying as True/False what to clean from a text +```python +>>> from neattext.functions import clean_text +>>> +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> clean_text(mytext) +'mail example@gmail.com ,our website https://example.com .' +``` ++ You can remove punctuations,stopwords,urls,emojis,multiple_whitespaces,etc by setting them to True. + ++ You can choose to remove or not remove punctuations by setting to True/False respectively + +```python +>>> clean_text(mytext,puncts=True) +'mail example@gmailcom website https://examplecom ' +>>> +>>> clean_text(mytext,puncts=False) +'mail example@gmail.com ,our website https://example.com .' +>>> +>>> clean_text(mytext,puncts=False,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +``` ++ You can also remove the other non-needed items accordingly +```python +>>> clean_text(mytext,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +>>> clean_text(mytext,urls=False) +'mail example@gmail.com ,our website https://example.com .' 
+>>> +>>> clean_text(mytext,urls=True) +'mail example@gmail.com ,our website .' +>>> + +``` + +#### Removing Punctuations [A Very Common Text Preprocessing Step] ++ You remove the most common punctuations such as fullstop,comma,exclamation marks and question marks by setting most_common=True which is the default ++ Alternatively you can also remove all known punctuations from a text. +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_puncts() +TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ") + +>>> docx.remove_puncts(most_common=False) +TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ") +``` + +#### Removing Stopwords [A Very Common Text Preprocessing Step] ++ You can remove stopwords from a text by specifying the language. The default language is English ++ Supported Languages include English(en),Spanish(es),French(fr)|Russian(ru)|Yoruba(yo)|German(de) + +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_stopwords(lang='en') +TextFrame(text="mail example@gmail.com ,our WEBSITE https://example.com 😊. forget email enter !!!!!") +``` + + +#### Remove Emails,Numbers,Phone Numbers,Dates,Btc Address,VisaCard Address,etc +```python +>>> print(docx.remove_emails()) +>>> 'This is the mail ,our WEBSITE is https://example.com 😊.' +>>> +>>> print(docx.remove_stopwords()) +>>> 'This mail example@gmail.com ,our WEBSITE https://example.com 😊.' +>>> +>>> print(docx.remove_numbers()) +>>> docx.remove_phone_numbers() +>>> docx.remove_btc_address() +``` + + +#### Remove Special Characters +```python +>>> docx.remove_special_characters() +``` + +#### Remove Emojis +```python +>>> print(docx.remove_emojis()) +>>> 'This is the mail example@gmail.com ,our WEBSITE is https://example.com .' +``` + + +#### Remove Custom Pattern ++ You can also specify your own custom pattern, incase you cannot find what you need in the functions using the `remove_custom_pattern()` function +```python +>>> import neattext.functions as nfx +>>> ex = "Last !RT tweeter multiple ṡ" +>>> +>>> nfx.remove_custom_pattern(e,r'&#\d+') +'Last !RT tweeter multiple ' + + + +``` + +#### Replace Emails,Numbers,Phone Numbers +```python +>>> docx.replace_emails() +>>> docx.replace_numbers() +>>> docx.replace_phone_numbers() +``` + +#### Chain Multiple Methods +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextCleaner(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' + +``` + +### Using TextExtractor ++ To Extract emails,phone numbers,numbers,urls,emojis from text +```python +>>> from neattext import TextExtractor +>>> docx = TextExtractor() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." 
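+>>> # Each extract_* method scans docx.text and returns a list of the matches it finds.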
+>>> docx.extract_emails() +>>> ['example@gmail.com'] +>>> +>>> docx.extract_emojis() +>>> ['😊'] +``` + + +### Using TextMetrics ++ To Find the Words Stats such as counts of vowels,consonants,stopwords,word-stats +```python +>>> from neattext import TextMetrics +>>> docx = TextMetrics() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx.count_vowels() +>>> docx.count_consonants() +>>> docx.count_stopwords() +>>> docx.word_stats() +>>> docx.memory_usage() +``` + +### Usage ++ The MOP(method/function oriented way) Way + +```python +>>> from neattext.functions import clean_text,extract_emails +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." +>>> clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> extract_emails(t1) +>>> ['example@gmail.com'] +``` + ++ Alternatively you can also use this approach +```python +>>> import neattext.functions as nfx +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." +>>> nfx.clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> nfx.extract_emails(t1) +>>> ['example@gmail.com'] +``` + +### Explainer ++ Explain an emoji or unicode for emoji + - emoji_explainer() + - emojify() + - unicode_2_emoji() + + +```python +>>> from neattext.explainer import emojify +>>> emojify('Smiley') +>>> '😃' +``` + +```python +>>> from neattext.explainer import emoji_explainer +>>> emoji_explainer('😃') +>>> 'SMILING FACE WITH OPEN MOUTH' +``` + +```python +>>> from neattext.explainer import unicode_2_emoji +>>> unicode_2_emoji('0x1f49b') + 'FLUSHED FACE' +``` + +### Usage ++ The Pipeline Way + +```python +>>> from neattext.pipeline import TextPipeline +>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU""" + +>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis]) +>>> p.fit(t1) +'This is the mail ,our WEBSITE is https://example.com . This is visa and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard . Send it to PO Box , KNU' + +``` ++ Check For steps and named steps +```python +>>> p.steps +>>> p.named_steps +``` + ++ Alternatively you can also use this approach + + + + +### Documentation +Please read the [documentation](https://github.com/Jcharis/neattext/wiki) for more information on what neattext does and how to use is for your needs.You can also check +out our readthedocs page [here](https://jcharis.github.io/neattext/) + + +### More Features To Add ++ basic nlp task ++ currency normalizer + +#### Acknowledgements ++ Inspired by packages like `clean-text` from Johannes Fillter and `textify` by JCharisTech + + +#### NB ++ Contributions Are Welcomed ++ Notice a bug, please let us know. ++ Thanks A lot + + +#### By ++ Jesse E.Agbe(JCharis) ++ Jesus Saves @JCharisTech + + + + + +%package help +Summary: Development documents and examples for neattext +Provides: python3-neattext-doc +%description help +# neattext +NeatText:a simple NLP package for cleaning textual data and text preprocessing. 
+Simplifying Text Cleaning For NLP & ML + +[](https://travis-ci.org/Jcharis/neattext) + +[](https://github.com/Jcharis/neattext/blob/master/LICENSE) + +#### Problem ++ Cleaning of unstructured text data ++ Reduce noise [special characters,stopwords] ++ Reducing repetition of using the same code for text preprocessing + +#### Solution ++ convert the already known solution for cleaning text into a reuseable package + +#### Docs ++ Check out the full docs [here](https://jcharis.github.io/neattext/) + +#### Installation +```bash +pip install neattext +``` + +### Usage ++ The OOP Way(Object Oriented Way) ++ NeatText offers 5 main classes for working with text data + - TextFrame : a frame-like object for cleaning text + - TextCleaner: remove or replace specifics + - TextExtractor: extract unwanted text data + - TextMetrics: word stats and metrics + - TextPipeline: combine multiple functions in a pipeline + +### Overall Components of NeatText + + +### Using TextFrame ++ Keeps the text as `TextFrame` object. This allows us to do more with our text. ++ It inherits the benefits of the TextCleaner and the TextMetrics out of the box with some additional features for handling text data. ++ This is the simplest way for text preprocessing with this library alternatively you can utilize the other classes too. + + +```python +>>> import neattext as nt +>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx = nt.TextFrame(text=mytext) +>>> docx.text +"This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> docx.describe() +Key Value +Length : 73 +vowels : 21 +consonants: 34 +stopwords: 4 +punctuations: 8 +special_char: 8 +tokens(whitespace): 10 +tokens(words): 14 +>>> +>>> docx.length +73 +>>> # Scan Percentage of Noise(Unclean data) in text +>>> d.noise_scan() +{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14} +>>> +>>> docs.head(16) +'This is the mail' +>>> docx.tail() +>>> docx.count_vowels() +>>> docx.count_stopwords() +>>> docx.count_consonants() +>>> docx.nlongest() +>>> docx.nshortest() +>>> docx.readability() +``` +#### Basic NLP Task (Tokenization,Ngram,Text Generation) +```python +>>> docx.word_tokens() +>>> +>>> docx.sent_tokens() +>>> +>>> docx.term_freq() +>>> +>>> docx.bow() +``` +#### Basic Text Preprocessing +```python +>>> docx.normalize() +'this is the mail example@gmail.com ,our website is https://example.com 😊.' +>>> docx.normalize(level='deep') +'this is the mail examplegmailcom our website is httpsexamplecom ' + +>>> docx.remove_puncts() +>>> docx.remove_stopwords() +>>> docx.remove_html_tags() +>>> docx.remove_special_characters() +>>> docx.remove_emojis() +>>> docx.fix_contractions() +``` + +##### Handling Files with NeatText ++ Read txt file directly into TextFrame +```python +>>> import neattext as nt +>>> docx_df = nt.read_txt('file.txt') +``` ++ Alternatively you can instantiate a TextFrame and read a text file into it +```python +>>> import neattext as nt +>>> docx_df = nt.TextFrame().read_txt('file.txt') +``` + +##### Chaining Methods on TextFrame +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextFrame(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' 
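+>>> # Chaining works because each remove_* call returns the modified TextFrame
+>>> # (as the TextFrame(...) outputs above suggest), so further calls can be applied to it.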
+``` + + + +#### Clean Text ++ Clean text by removing emails,numbers,stopwords,emojis,etc ++ A simplified method for cleaning text by specifying as True/False what to clean from a text +```python +>>> from neattext.functions import clean_text +>>> +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> +>>> clean_text(mytext) +'mail example@gmail.com ,our website https://example.com .' +``` ++ You can remove punctuations,stopwords,urls,emojis,multiple_whitespaces,etc by setting them to True. + ++ You can choose to remove or not remove punctuations by setting to True/False respectively + +```python +>>> clean_text(mytext,puncts=True) +'mail example@gmailcom website https://examplecom ' +>>> +>>> clean_text(mytext,puncts=False) +'mail example@gmail.com ,our website https://example.com .' +>>> +>>> clean_text(mytext,puncts=False,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +``` ++ You can also remove the other non-needed items accordingly +```python +>>> clean_text(mytext,stopwords=False) +'this is the mail example@gmail.com ,our website is https://example.com .' +>>> +>>> clean_text(mytext,urls=False) +'mail example@gmail.com ,our website https://example.com .' +>>> +>>> clean_text(mytext,urls=True) +'mail example@gmail.com ,our website .' +>>> + +``` + +#### Removing Punctuations [A Very Common Text Preprocessing Step] ++ You remove the most common punctuations such as fullstop,comma,exclamation marks and question marks by setting most_common=True which is the default ++ Alternatively you can also remove all known punctuations from a text. +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_puncts() +TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ") + +>>> docx.remove_puncts(most_common=False) +TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ") +``` + +#### Removing Stopwords [A Very Common Text Preprocessing Step] ++ You can remove stopwords from a text by specifying the language. The default language is English ++ Supported Languages include English(en),Spanish(es),French(fr)|Russian(ru)|Yoruba(yo)|German(de) + +```python +>>> import neattext as nt +>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!" +>>> docx = nt.TextFrame(mytext) +>>> docx.remove_stopwords(lang='en') +TextFrame(text="mail example@gmail.com ,our WEBSITE https://example.com 😊. forget email enter !!!!!") +``` + + +#### Remove Emails,Numbers,Phone Numbers,Dates,Btc Address,VisaCard Address,etc +```python +>>> print(docx.remove_emails()) +>>> 'This is the mail ,our WEBSITE is https://example.com 😊.' +>>> +>>> print(docx.remove_stopwords()) +>>> 'This mail example@gmail.com ,our WEBSITE https://example.com 😊.' +>>> +>>> print(docx.remove_numbers()) +>>> docx.remove_phone_numbers() +>>> docx.remove_btc_address() +``` + + +#### Remove Special Characters +```python +>>> docx.remove_special_characters() +``` + +#### Remove Emojis +```python +>>> print(docx.remove_emojis()) +>>> 'This is the mail example@gmail.com ,our WEBSITE is https://example.com .' 
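+>>> # As the output above shows, only the 😊 emoji is stripped; the rest of the text is left intact.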
+``` + + +#### Remove Custom Pattern ++ You can also specify your own custom pattern, incase you cannot find what you need in the functions using the `remove_custom_pattern()` function +```python +>>> import neattext.functions as nfx +>>> ex = "Last !RT tweeter multiple ṡ" +>>> +>>> nfx.remove_custom_pattern(e,r'&#\d+') +'Last !RT tweeter multiple ' + + + +``` + +#### Replace Emails,Numbers,Phone Numbers +```python +>>> docx.replace_emails() +>>> docx.replace_numbers() +>>> docx.replace_phone_numbers() +``` + +#### Chain Multiple Methods +```python +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe." +>>> docx = TextCleaner(t1) +>>> result = docx.remove_emails().remove_urls().remove_emojis() +>>> print(result) +'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.' + +``` + +### Using TextExtractor ++ To Extract emails,phone numbers,numbers,urls,emojis from text +```python +>>> from neattext import TextExtractor +>>> docx = TextExtractor() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx.extract_emails() +>>> ['example@gmail.com'] +>>> +>>> docx.extract_emojis() +>>> ['😊'] +``` + + +### Using TextMetrics ++ To Find the Words Stats such as counts of vowels,consonants,stopwords,word-stats +```python +>>> from neattext import TextMetrics +>>> docx = TextMetrics() +>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊." +>>> docx.count_vowels() +>>> docx.count_consonants() +>>> docx.count_stopwords() +>>> docx.word_stats() +>>> docx.memory_usage() +``` + +### Usage ++ The MOP(method/function oriented way) Way + +```python +>>> from neattext.functions import clean_text,extract_emails +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." +>>> clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> extract_emails(t1) +>>> ['example@gmail.com'] +``` + ++ Alternatively you can also use this approach +```python +>>> import neattext.functions as nfx +>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ." +>>> nfx.clean_text(t1,puncts=True,stopwords=True) +>>>'this mail examplegmailcom website httpsexamplecom' +>>> nfx.extract_emails(t1) +>>> ['example@gmail.com'] +``` + +### Explainer ++ Explain an emoji or unicode for emoji + - emoji_explainer() + - emojify() + - unicode_2_emoji() + + +```python +>>> from neattext.explainer import emojify +>>> emojify('Smiley') +>>> '😃' +``` + +```python +>>> from neattext.explainer import emoji_explainer +>>> emoji_explainer('😃') +>>> 'SMILING FACE WITH OPEN MOUTH' +``` + +```python +>>> from neattext.explainer import unicode_2_emoji +>>> unicode_2_emoji('0x1f49b') + 'FLUSHED FACE' +``` + +### Usage ++ The Pipeline Way + +```python +>>> from neattext.pipeline import TextPipeline +>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU""" + +>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis]) +>>> p.fit(t1) +'This is the mail ,our WEBSITE is https://example.com . This is visa and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard . 
Send it to PO Box , KNU' + +``` ++ Check For steps and named steps +```python +>>> p.steps +>>> p.named_steps +``` + ++ Alternatively you can also use this approach + + + + +### Documentation +Please read the [documentation](https://github.com/Jcharis/neattext/wiki) for more information on what neattext does and how to use is for your needs.You can also check +out our readthedocs page [here](https://jcharis.github.io/neattext/) + + +### More Features To Add ++ basic nlp task ++ currency normalizer + +#### Acknowledgements ++ Inspired by packages like `clean-text` from Johannes Fillter and `textify` by JCharisTech + + +#### NB ++ Contributions Are Welcomed ++ Notice a bug, please let us know. ++ Thanks A lot + + +#### By ++ Jesse E.Agbe(JCharis) ++ Jesus Saves @JCharisTech + + + + + +%prep +%autosetup -n neattext-0.1.3 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-neattext -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 0.1.3-1 +- Package Spec generated @@ -0,0 +1 @@ +43dc7e1df9d75a1fa0ef42c6339a91a1 neattext-0.1.3.tar.gz |
