%global _empty_manifest_terminate_build 0
Name:		python-neattext
Version:	0.1.3
Release:	1
Summary:	Neattext - a simple NLP package for cleaning text
License:	MIT
URL:		https://github.com/Jcharis/neattext
Source0:	https://mirrors.aliyun.com/pypi/web/packages/49/cd/b1224076d1b370a0172643f96155f97017ac5aba2c8b56f2999ba4061056/neattext-0.1.3.tar.gz
BuildArch:	noarch

%description
# neattext
NeatText: a simple NLP package for cleaning textual data and text preprocessing.

Simplifying Text Cleaning For NLP & ML

[![Build Status](https://travis-ci.org/Jcharis/neattext.svg?branch=master)](https://travis-ci.org/Jcharis/neattext)
[![GitHub license](https://img.shields.io/github/license/Jcharis/neattext)](https://github.com/Jcharis/neattext/blob/master/LICENSE)

#### Problem
+ Cleaning of unstructured text data
+ Reducing noise (special characters, stopwords)
+ Reducing repetition of the same text-preprocessing code across projects

#### Solution
+ Convert the known solutions for cleaning text into a reusable package

#### Docs
+ Check out the full docs [here](https://jcharis.github.io/neattext/)

#### Installation
```bash
pip install neattext
```

### Usage
+ The OOP Way (Object Oriented Way)
+ NeatText offers 5 main classes for working with text data
    - TextFrame: a frame-like object for cleaning text
    - TextCleaner: remove or replace specifics
    - TextExtractor: extract unwanted text data
    - TextMetrics: word stats and metrics
    - TextPipeline: combine multiple functions in a pipeline

### Overall Components of NeatText
![](images/neattext_features_jcharistech.png)

### Using TextFrame
+ Keeps the text as a `TextFrame` object, which allows us to do more with our text.
+ It inherits the benefits of TextCleaner and TextMetrics out of the box, with some additional features for handling text data.
+ This is the simplest way to preprocess text with this library; alternatively, you can use the other classes.
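The cleaning that these classes perform is largely regex-driven. As a rough illustration of the idea only (a hypothetical, stdlib-only sketch; the patterns below are illustrative assumptions, not NeatText's actual implementation):

```python
import re

# Illustrative patterns only -- NeatText's real patterns may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+")

def remove_emails_sketch(text: str) -> str:
    # Drop every email-like token from the text.
    return EMAIL_RE.sub("", text)

def remove_urls_sketch(text: str) -> str:
    # Drop every http/https URL from the text.
    return URL_RE.sub("", text)

text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(remove_urls_sketch(remove_emails_sketch(text)))
```

Each `remove_*` step is just a substitution of one pattern with an empty string, which is why such steps compose naturally into chains and pipelines, as shown in the examples that follow.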
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx = nt.TextFrame(text=mytext)
>>> docx.text
"This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>>
>>> docx.describe()
Key                 Value
Length            : 73
vowels            : 21
consonants        : 34
stopwords         : 4
punctuations      : 8
special_char      : 8
tokens(whitespace): 10
tokens(words)     : 14
>>>
>>> docx.length
73
>>> # Scan the percentage of noise (unclean data) in the text
>>> docx.noise_scan()
{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14}
>>>
>>> docx.head(16)
'This is the mail'
>>> docx.tail()
>>> docx.count_vowels()
>>> docx.count_stopwords()
>>> docx.count_consonants()
>>> docx.nlongest()
>>> docx.nshortest()
>>> docx.readability()
```

#### Basic NLP Tasks (Tokenization, Ngrams, Text Generation)
```python
>>> docx.word_tokens()
>>>
>>> docx.sent_tokens()
>>>
>>> docx.term_freq()
>>>
>>> docx.bow()
```

#### Basic Text Preprocessing
```python
>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com 😊.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '
>>> docx.remove_puncts()
>>> docx.remove_stopwords()
>>> docx.remove_html_tags()
>>> docx.remove_special_characters()
>>> docx.remove_emojis()
>>> docx.fix_contractions()
```

##### Handling Files with NeatText
+ Read a txt file directly into a TextFrame
```python
>>> import neattext as nt
>>> docx_df = nt.read_txt('file.txt')
```
+ Alternatively, you can instantiate a TextFrame and read a text file into it
```python
>>> import neattext as nt
>>> docx_df = nt.TextFrame().read_txt('file.txt')
```

##### Chaining Methods on a TextFrame
```python
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = nt.TextFrame(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.'
```

#### Clean Text
+ Clean text by removing emails, numbers, stopwords, emojis, etc.
+ A simplified method for cleaning text by specifying True/False for what to clean from a text
```python
>>> from neattext.functions import clean_text
>>>
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>>
>>> clean_text(mytext)
'mail example@gmail.com ,our website https://example.com .'
```
+ You can remove punctuations, stopwords, urls, emojis, multiple whitespaces, etc. by setting them to True.
+ You can choose to remove or keep punctuations by setting puncts to True or False respectively
```python
>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>>
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
```
+ You can also remove the other non-needed items accordingly
```python
>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>>
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>>
```

#### Removing Punctuations [A Very Common Text Preprocessing Step]
+ You can remove the most common punctuations (fullstop, comma, exclamation mark and question mark) by setting most_common=True, which is the default
+ Alternatively, you can remove all known punctuations from a text.
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ")
>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ")
```

#### Removing Stopwords [A Very Common Text Preprocessing Step]
+ You can remove stopwords from a text by specifying the language. The default language is English.
+ Supported languages include English (en), Spanish (es), French (fr), Russian (ru), Yoruba (yo) and German (de)
```python
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_stopwords(lang='en')
TextFrame(text="mail example@gmail.com ,our WEBSITE https://example.com 😊. forget email enter !!!!!")
```

#### Remove Emails, Numbers, Phone Numbers, Dates, Btc Addresses, VisaCard Addresses, etc.
```python
>>> print(docx.remove_emails())
'This is the mail ,our WEBSITE is https://example.com 😊.'
>>>
>>> print(docx.remove_stopwords())
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> print(docx.remove_numbers())
>>> docx.remove_phone_numbers()
>>> docx.remove_btc_address()
```

#### Remove Special Characters
```python
>>> docx.remove_special_characters()
```

#### Remove Emojis
```python
>>> print(docx.remove_emojis())
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
```

#### Remove Custom Patterns
+ You can also specify your own custom pattern, in case you cannot find what you need in the built-in functions, using the `remove_custom_pattern()` function
```python
>>> import neattext.functions as nfx
>>> ex = "Last !RT tweeter multiple ṡ"
>>>
>>> nfx.remove_custom_pattern(ex,r'&#\d+')
'Last !RT tweeter multiple '
```

#### Replace Emails, Numbers, Phone Numbers
```python
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
```

#### Chain Multiple Methods
```python
>>> from neattext import TextCleaner
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = TextCleaner(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail ,our WEBSITE is and it will cost $100 to subscribe.'
```

### Using TextExtractor
+ Extract emails, phone numbers, numbers, urls and emojis from text
```python
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['😊']
```

### Using TextMetrics
+ Find word stats such as counts of vowels, consonants and stopwords
```python
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
>>> docx.memory_usage()
```

### Usage
+ The MOP (method/function oriented) Way
```python
>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> extract_emails(t1)
['example@gmail.com']
```
+ Alternatively, you can also use this approach
```python
>>> import neattext.functions as nfx
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> nfx.clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> nfx.extract_emails(t1)
['example@gmail.com']
```

### Explainer
+ Explain an emoji or the unicode for an emoji
    - emoji_explainer()
    - emojify()
    - unicode_2_emoji()
```python
>>> from neattext.explainer import emojify
>>> emojify('Smiley')
'😃'
```
```python
>>> from neattext.explainer import emoji_explainer
>>> emoji_explainer('😃')
'SMILING FACE WITH OPEN MOUTH'
```
```python
>>> from neattext.explainer import unicode_2_emoji
>>> unicode_2_emoji('0x1f49b')
'FLUSHED FACE'
```

### Usage
+ The Pipeline Way
```python
>>> from neattext.pipeline import TextPipeline
>>> from neattext.functions import remove_emails,remove_numbers,remove_emojis
>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU"""
>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis])
>>> p.fit(t1)
'This is the mail ,our WEBSITE is https://example.com . This is visa and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard .
Send it to PO Box , KNU'
```
+ Check the steps and named steps
```python
>>> p.steps
>>> p.named_steps
```

### Documentation
Please read the [documentation](https://github.com/Jcharis/neattext/wiki) for more information on what neattext does and how to use it for your needs. You can also check out our readthedocs page [here](https://jcharis.github.io/neattext/)

### More Features To Add
+ basic nlp tasks
+ currency normalizer

#### Acknowledgements
+ Inspired by packages like `clean-text` from Johannes Filter and `textify` by JCharisTech

#### NB
+ Contributions are welcome
+ Notice a bug? Please let us know.
+ Thanks a lot

#### By
+ Jesse E. Agbe (JCharis)
+ Jesus Saves @JCharisTech

%package -n python3-neattext
Summary:	Neattext - a simple NLP package for cleaning text
Provides:	python-neattext
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-neattext
NeatText: a simple NLP package for cleaning textual data and text preprocessing.
See the full README at https://github.com/Jcharis/neattext for usage examples.

%package help
Summary:	Development documents and examples for neattext
Provides:	python3-neattext-doc
%description help
Development documents and examples for neattext, a simple NLP package for
cleaning textual data and text preprocessing.

%prep
%autosetup -n neattext-0.1.3

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-neattext -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri Jun 09 2023 Python_Bot - 0.1.3-1
- Package Spec generated