 .gitignore                       |   1 +
 python-contextualspellcheck.spec | 811 +
 sources                          |   1 +
 3 files changed, 813 insertions(+), 0 deletions(-)
diff --git a/.gitignore b/.gitignore
new file mode 100644
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+/contextualSpellCheck-0.4.3.tar.gz
diff --git a/python-contextualspellcheck.spec b/python-contextualspellcheck.spec
new file mode 100644
index 0000000..985cdc0
--- /dev/null
+++ b/python-contextualspellcheck.spec
@@ -0,0 +1,811 @@
%global _empty_manifest_terminate_build 0
Name: python-contextualSpellCheck
Version: 0.4.3
Release: 1
Summary: Contextual spell correction using BERT (bidirectional representations)
License: MIT License
URL: https://github.com/R1j1t/contextualSpellCheck
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/09/1d/96d5d6be2c4d5eda07eb1d5f21a3480259dd0c2f066edb466f0d9b9e7309/contextualSpellCheck-0.4.3.tar.gz
BuildArch: noarch

Requires: python3-torch
Requires: python3-editdistance
Requires: python3-transformers
Requires: python3-spacy

%description
# spellCheck
<a href="https://github.com/R1j1t/contextualSpellCheck"><img src="https://user-images.githubusercontent.com/22280243/82138959-2852cd00-9842-11ea-918a-49b2a7873ef6.png" width="276" height="120" align="right" /></a>

Contextual word checker for better suggestions

[License](https://github.com/R1j1t/contextualSpellCheck/blob/master/LICENSE) | [PyPI](https://pypi.org/project/contextualSpellCheck/) | [Install](https://github.com/R1j1t/contextualSpellCheck#install) | [Downloads](https://pepy.tech/project/contextualspellcheck) | [Contributors](https://github.com/R1j1t/contextualSpellCheck/graphs/contributors) | [Task list](https://github.com/R1j1t/contextualSpellCheck#task-list) | [DOI](https://zenodo.org/badge/latestdoi/254703118)

## Types of spelling mistakes

Identifying whether a candidate word is actually a spelling error is itself a hard problem.

> Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
>
> -- [Monojit Choudhury et al. (2007)][1]

This package currently focuses on out-of-vocabulary (OOV), i.e. non-word error (NWE), correction using a BERT model. The idea behind using BERT is to exploit sentence context when correcting OOV words. Planned improvements include identifying RWE, optimising the package, and improving the documentation.

## Install

The package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/). It requires Python 3.6+.

```bash
pip install contextualSpellCheck
```

## Usage

**Note:** For use in other languages, check the [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.

### How to load the package in the spaCy pipeline

```python
>>> import contextualSpellCheck
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>>
>>> ## We require NER to identify if a token is a PERSON
>>> ## also require parser because we use `Token.sent` for context
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'
```

Or you can add it to the spaCy pipeline manually!

```python
>>> import spacy
>>> import contextualSpellCheck
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> # You can pass optional parameters to contextualSpellCheck,
>>> # e.g. to set the maximum edit distance, use config={"max_edit_dist": 3}
>>> nlp.add_pipe("contextual spellchecker")
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0>
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
```

After adding `contextual spellchecker` to the pipeline, you use the pipeline normally. The spell-check suggestions and other data can be accessed using [extensions](#extensions).

### Using the pipeline

```python
>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>>
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>>
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
'million'
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>>
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}
```

## Extensions

To make usage easy, `contextual spellchecker` provides custom spaCy extensions which your code can consume, so you can get the desired data directly. contextualSpellCheck provides extensions at the `doc`, `span` and `token` level. The tables below summarise the extensions.
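As a rough illustration of what these extensions carry, the sketch below mimics how `doc._.outcome_spellCheck` could be derived from `doc._.suggestions_spellCheck` (plain strings stand in for the `spaCy.Token` keys, and the function name is hypothetical, not part of the package API):

```python
def apply_suggestions(text: str, suggestions: dict) -> str:
    """Swap each misspelled word for its suggested replacement,
    approximating what the pipeline stores in outcome_spellCheck."""
    return " ".join(suggestions.get(word, word) for word in text.split(" "))

text = "Income was $9.4 milion compared to the prior year of $2.7 milion."
print(apply_suggestions(text, {"milion": "million", "milion.": "million."}))
# Income was $9.4 million compared to the prior year of $2.7 million.
```

Note that the real pipeline keys the mapping by token, so two identical misspellings stay distinct; the string-keyed sketch above collapses them.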

### `spaCy.Doc` level extensions

| Extension | Type | Description | Default |
|------------------------------|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|---------|
| doc._.contextual_spellCheck  | `Boolean` | Whether contextualSpellCheck is added as an extension | `True` |
| doc._.performed_spellCheck   | `Boolean` | Whether contextualSpellCheck identified any misspellings and performed corrections | `False` |
| doc._.suggestions_spellCheck | `{spaCy.Token: str}` | If corrections were performed, the mapping of each misspelled token (`spaCy.Token`) to its suggested word (`str`) | `{}` |
| doc._.outcome_spellCheck     | `str` | The corrected sentence (`str`) | `""` |
| doc._.score_spellCheck       | `{spaCy.Token: List(str, float)}` | If corrections were identified, the mapping of each misspelled token (`spaCy.Token`) to its suggested words (`str`) and the probability of each correction | `None` |

### `spaCy.Span` level extensions

| Extension | Type | Description | Default |
|---------------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|
| span._.get_has_spellCheck | `Boolean` | Whether contextualSpellCheck identified any misspellings and performed corrections in this span | `False` |
| span._.score_spellCheck   | `{spaCy.Token: List(str, float)}` | If corrections were identified, the mapping of each misspelled token (`spaCy.Token`) in this `span` to its suggested words (`str`) and the probability of each correction | `{spaCy.Token: []}` |

### `spaCy.Token` level extensions

| Extension | Type | Description | Default |
+|-----------------------------------|-----------------|-------------------------------------------------------------------------------------------------------------|---------| +| token._.get_require_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction on this `token` | `False` | +| token._.get_suggestion_spellCheck | `str` | if corrections are performed, it returns the suggested word(`str`) | `""` | +| token._.score_spellCheck | `[(str,float)]` | if corrections are identified, it returns suggested words(`str`) and probability(`float`) of that correction | `[]` | + +## API + +At present, there is a simple GET API to get you started. You can run the app in your local and play with it. + +Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY +Note: Your browser can handle the text encoding + +``` +GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion. +``` + +Response: + +```json +{ + "success": true, + "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.", + "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.", + "suggestion_score": { + "milion": [ + [ + "million", + 0.59422 + ], + [ + "billion", + 0.24349 + ], + ... + ], + "milion:1": [ + [ + "billion", + 0.65934 + ], + [ + "million", + 0.26185 + ], + ... 
+ ] + } +} +``` + +## Task List + +- [ ] use cython for part of the code to improve performance ([#39](https://github.com/R1j1t/contextualSpellCheck/issues/39)) +- [ ] Improve metric for candidate selection ([#40](https://github.com/R1j1t/contextualSpellCheck/issues/40)) +- [ ] Add examples for other langauges ([#41](https://github.com/R1j1t/contextualSpellCheck/issues/41)) +- [ ] Update the logic of misspell identification (OOV) ([#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)) +- [ ] better candidate generation (solved by [#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)?) +- [ ] add metric by testing on datasets +- [ ] Improve documentation +- [ ] Improve logging in code +- [ ] Add support for Real Word Error (RWE) (Big Task) +- [ ] add multi mask out capability + +<details><summary>Completed Task</summary> +<p> + +- [x] specify maximum edit distance for `candidateRanking` +- [x] allow user to specify bert model +- [x] Include transformers deTokenizer to get better suggestions +- [x] dependency version in setup.py ([#38](https://github.com/R1j1t/contextualSpellCheck/issues/38)) + +</p> +</details> + +## Support and contribution + +If you like the project, please ⭑ the project and show your support! Also, if you feel, the current behaviour is not as expected, please feel free to raise an [issue](https://github.com/R1j1t/contextualSpellCheck/issues). If you can help with any of the above tasks, please open a [PR](https://github.com/R1j1t/contextualSpellCheck/pulls) with necessary changes to documentation and tests. 

## Cite

If you use contextualSpellCheck in your academic work, please consider citing the library using the BibTeX entry below:

```bibtex
@misc{Goel_Contextual_Spell_Check_2021,
author = {Goel, Rajat},
doi = {10.5281/zenodo.4642379},
month = {3},
title = {{Contextual Spell Check}},
url = {https://github.com/R1j1t/contextualSpellCheck},
year = {2021}
}
```

## Reference

Below are some of the projects/works I referred to while developing this package:

1. Explosion AI. Architecture. May 2020. URL: https://spacy.io/api.
2. Monojit Choudhury et al. "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach". In: arXiv preprint physics/0703198 (2007).
3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv:1810.04805 [cs.CL].
4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. URL: https://github.com/huggingface/neuralcoref.
5. Ines. Chapter 3: Processing Pipelines. May 2020. URL: https://course.spacy.io/en/chapter3.
6. Eric Mays, Fred J Damerau, and Robert L Mercer. "Context based spelling correction". In: Information Processing & Management 27.5 (1991), pp. 517-522.
7. Peter Norvig. How to Write a Spelling Corrector. May 2020. URL: http://norvig.com/spell-correct.html.
8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv:1910.14080 [cs.CL].
9. Thomas Wolf et al. "Transformers: State-of-the-Art Natural Language Processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
+ +[1]: <http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf> + + +%package -n python3-contextualSpellCheck +Summary: Contextual spell correction using BERT (bidirectional representations) +Provides: python-contextualSpellCheck +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-contextualSpellCheck +# spellCheck +<a href="https://github.com/R1j1t/contextualSpellCheck"><img src="https://user-images.githubusercontent.com/22280243/82138959-2852cd00-9842-11ea-918a-49b2a7873ef6.png" width="276" height="120" align="right" /></a> + +Contextual word checker for better suggestions + +[](https://github.com/R1j1t/contextualSpellCheck/blob/master/LICENSE) +[](https://pypi.org/project/contextualSpellCheck/) +[](https://github.com/R1j1t/contextualSpellCheck#install) +[](https://pepy.tech/project/contextualspellcheck) +[](https://github.com/R1j1t/contextualSpellCheck/graphs/contributors) +[](https://github.com/R1j1t/contextualSpellCheck#task-list) +[](https://zenodo.org/badge/latestdoi/254703118) + +## Types of spelling mistakes + +It is essential to understand that identifying whether a candidate is a spelling error is a big task. + +> Spelling errors are broadly classified as non- word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE. +> +> -- [Monojit Choudhury et. al. (2007)][1] + +This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model. The idea of using BERT was to use the context when correcting OOV. To improve this package, I would like to extend the functionality to identify RWE, optimising the package, and improving the documentation. + +## Install + +The package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/). 
You would require python 3.6+ + +```bash +pip install contextualSpellCheck +``` + +## Usage + +**Note:** For use in other languages check [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder. + +### How to load the package in spacy pipeline + +```python +>>> import contextualSpellCheck +>>> import spacy +>>> nlp = spacy.load("en_core_web_sm") +>>> +>>> ## We require NER to identify if a token is a PERSON +>>> ## also require parser because we use `Token.sent` for context +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'] +>>> contextualSpellCheck.add_to_pipe(nlp) +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker'] +>>> +>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.') +>>> doc._.outcome_spellCheck +'Income was $9.4 million compared to the prior year of $2.7 million.' +``` + +Or you can add to spaCy pipeline manually! + +```python +>>> import spacy +>>> import contextualSpellCheck +>>> +>>> nlp = spacy.load("en_core_web_sm") +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'] +>>> # You can pass the optional parameters to the contextualSpellCheck +>>> # eg. pass max edit distance use config={"max_edit_dist": 3} +>>> nlp.add_pipe("contextual spellchecker") +<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0> +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker'] +>>> +>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.") +>>> print(doc._.performed_spellCheck) +True +>>> print(doc._.outcome_spellCheck) +Income was $9.4 million compared to the prior year of $2.7 million. +``` + +After adding `contextual spellchecker` in the pipeline, you use the pipeline normally. 
The spell check suggestions and other data can be accessed using [extensions](#Extensions). + +### Using the pipeline + +```python +>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.') +>>> +>>> # Doc Extention +>>> print(doc._.contextual_spellCheck) +True +>>> print(doc._.performed_spellCheck) +True +>>> print(doc._.suggestions_spellCheck) +{milion: 'million', milion: 'million'} +>>> print(doc._.outcome_spellCheck) +Income was $9.4 million compared to the prior year of $2.7 million. +>>> print(doc._.score_spellCheck) +{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]} +>>> +>>> # Token Extention +>>> print(doc[4]._.get_require_spellCheck) +True +>>> print(doc[4]._.get_suggestion_spellCheck) +'million' +>>> print(doc[4]._.score_spellCheck) +[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)] +>>> +>>> # Span Extention +>>> print(doc[2:6]._.get_has_spellCheck) +True +>>> print(doc[2:6]._.score_spellCheck) +{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []} +``` + +## Extensions + +To make the usage easy, `contextual spellchecker` provides custom spacy extensions which your code can consume. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. 
The below tables summarise the extensions. + +### `spaCy.Doc` level extensions + +| Extension | Type | Description | Default | +|------------------------------|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|---------| +| doc._.contextual_spellCheck | `Boolean` | To check whether contextualSpellCheck is added as extension | `True` | +| doc._.performed_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction | `False` | +| doc._.suggestions_spellCheck | `{Spacy.Token:str}` | if corrections are performed, it returns the mapping of misspell token (`spaCy.Token`) with suggested word(`str`) | `{}` | +| doc._.outcome_spellCheck | `str` | corrected sentence(`str`) as output | `""` | +| doc._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction | `None` | + +### `spaCy.Span` level extensions +| Extension | Type | Description | Default | +|-------------------------------|---------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------| +| span._.get_has_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction in this span | `False` | +| span._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction for tokens in this `span` | `{spaCy.Token: []}` | + +### `spaCy.Token` level extensions + +| Extension | Type | Description | Default | 
+|-----------------------------------|-----------------|-------------------------------------------------------------------------------------------------------------|---------| +| token._.get_require_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction on this `token` | `False` | +| token._.get_suggestion_spellCheck | `str` | if corrections are performed, it returns the suggested word(`str`) | `""` | +| token._.score_spellCheck | `[(str,float)]` | if corrections are identified, it returns suggested words(`str`) and probability(`float`) of that correction | `[]` | + +## API + +At present, there is a simple GET API to get you started. You can run the app in your local and play with it. + +Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY +Note: Your browser can handle the text encoding + +``` +GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion. +``` + +Response: + +```json +{ + "success": true, + "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.", + "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.", + "suggestion_score": { + "milion": [ + [ + "million", + 0.59422 + ], + [ + "billion", + 0.24349 + ], + ... + ], + "milion:1": [ + [ + "billion", + 0.65934 + ], + [ + "million", + 0.26185 + ], + ... 
+ ] + } +} +``` + +## Task List + +- [ ] use cython for part of the code to improve performance ([#39](https://github.com/R1j1t/contextualSpellCheck/issues/39)) +- [ ] Improve metric for candidate selection ([#40](https://github.com/R1j1t/contextualSpellCheck/issues/40)) +- [ ] Add examples for other langauges ([#41](https://github.com/R1j1t/contextualSpellCheck/issues/41)) +- [ ] Update the logic of misspell identification (OOV) ([#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)) +- [ ] better candidate generation (solved by [#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)?) +- [ ] add metric by testing on datasets +- [ ] Improve documentation +- [ ] Improve logging in code +- [ ] Add support for Real Word Error (RWE) (Big Task) +- [ ] add multi mask out capability + +<details><summary>Completed Task</summary> +<p> + +- [x] specify maximum edit distance for `candidateRanking` +- [x] allow user to specify bert model +- [x] Include transformers deTokenizer to get better suggestions +- [x] dependency version in setup.py ([#38](https://github.com/R1j1t/contextualSpellCheck/issues/38)) + +</p> +</details> + +## Support and contribution + +If you like the project, please ⭑ the project and show your support! Also, if you feel, the current behaviour is not as expected, please feel free to raise an [issue](https://github.com/R1j1t/contextualSpellCheck/issues). If you can help with any of the above tasks, please open a [PR](https://github.com/R1j1t/contextualSpellCheck/pulls) with necessary changes to documentation and tests. 
+ +## Cite + +If you are using contextualSpellCheck in your academic work, please consider citing the library using the below BibTex entry: + +```bibtex +@misc{Goel_Contextual_Spell_Check_2021, +author = {Goel, Rajat}, +doi = {10.5281/zenodo.4642379}, +month = {3}, +title = {{Contextual Spell Check}}, +url = {https://github.com/R1j1t/contextualSpellCheck}, +year = {2021} +} +``` + + + +## Reference + +Below are some of the projects/work I referred to while developing this package + +1. Explosion AI.Architecture. May 2020. url:https://spacy.io/api. +2. Monojit Choudhury et al. “How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach”. In:arXiv preprint physics/0703198(2007). +3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transform-ers for Language Understanding. 2019. arXiv:1810.04805 [cs.CL]. +4. Hugging Face.Fast Coreference Resolution in spaCy with Neural Net-works. May 2020. url:https://github.com/huggingface/neuralcoref. +5. Ines.Chapter 3: Processing Pipelines. May 20202. url:https://course.spacy.io/en/chapter3. +6. Eric Mays, Fred J Damerau, and Robert L Mercer. “Context based spellingcorrection”. In:Information Processing & Management27.5 (1991), pp. 517–522. +7. Peter Norvig. How to Write a Spelling Corrector. May 2020. url:http://norvig.com/spell-correct.html. +8. Yifu Sun and Haoming Jiang.Contextual Text Denoising with MaskedLanguage Models. 2019. arXiv:1910.14080 [cs.CL]. +9. Thomas Wolf et al. “Transformers: State-of-the-Art Natural LanguageProcessing”. In:Proceedings of the 2020 Conference on Empirical Methodsin Natural Language Processing: System Demonstrations. Online: Associ-ation for Computational Linguistics, Oct. 2020, pp. 38–45. url:https://www.aclweb.org/anthology/2020.emnlp-demos.6. 
+ +[1]: <http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf> + + +%package help +Summary: Development documents and examples for contextualSpellCheck +Provides: python3-contextualSpellCheck-doc +%description help +# spellCheck +<a href="https://github.com/R1j1t/contextualSpellCheck"><img src="https://user-images.githubusercontent.com/22280243/82138959-2852cd00-9842-11ea-918a-49b2a7873ef6.png" width="276" height="120" align="right" /></a> + +Contextual word checker for better suggestions + +[](https://github.com/R1j1t/contextualSpellCheck/blob/master/LICENSE) +[](https://pypi.org/project/contextualSpellCheck/) +[](https://github.com/R1j1t/contextualSpellCheck#install) +[](https://pepy.tech/project/contextualspellcheck) +[](https://github.com/R1j1t/contextualSpellCheck/graphs/contributors) +[](https://github.com/R1j1t/contextualSpellCheck#task-list) +[](https://zenodo.org/badge/latestdoi/254703118) + +## Types of spelling mistakes + +It is essential to understand that identifying whether a candidate is a spelling error is a big task. + +> Spelling errors are broadly classified as non- word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE. +> +> -- [Monojit Choudhury et. al. (2007)][1] + +This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model. The idea of using BERT was to use the context when correcting OOV. To improve this package, I would like to extend the functionality to identify RWE, optimising the package, and improving the documentation. + +## Install + +The package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/). 
You would require python 3.6+ + +```bash +pip install contextualSpellCheck +``` + +## Usage + +**Note:** For use in other languages check [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder. + +### How to load the package in spacy pipeline + +```python +>>> import contextualSpellCheck +>>> import spacy +>>> nlp = spacy.load("en_core_web_sm") +>>> +>>> ## We require NER to identify if a token is a PERSON +>>> ## also require parser because we use `Token.sent` for context +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'] +>>> contextualSpellCheck.add_to_pipe(nlp) +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker'] +>>> +>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.') +>>> doc._.outcome_spellCheck +'Income was $9.4 million compared to the prior year of $2.7 million.' +``` + +Or you can add to spaCy pipeline manually! + +```python +>>> import spacy +>>> import contextualSpellCheck +>>> +>>> nlp = spacy.load("en_core_web_sm") +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'] +>>> # You can pass the optional parameters to the contextualSpellCheck +>>> # eg. pass max edit distance use config={"max_edit_dist": 3} +>>> nlp.add_pipe("contextual spellchecker") +<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0> +>>> nlp.pipe_names +['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker'] +>>> +>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.") +>>> print(doc._.performed_spellCheck) +True +>>> print(doc._.outcome_spellCheck) +Income was $9.4 million compared to the prior year of $2.7 million. +``` + +After adding `contextual spellchecker` in the pipeline, you use the pipeline normally. 
The spell check suggestions and other data can be accessed using [extensions](#Extensions). + +### Using the pipeline + +```python +>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.') +>>> +>>> # Doc Extention +>>> print(doc._.contextual_spellCheck) +True +>>> print(doc._.performed_spellCheck) +True +>>> print(doc._.suggestions_spellCheck) +{milion: 'million', milion: 'million'} +>>> print(doc._.outcome_spellCheck) +Income was $9.4 million compared to the prior year of $2.7 million. +>>> print(doc._.score_spellCheck) +{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]} +>>> +>>> # Token Extention +>>> print(doc[4]._.get_require_spellCheck) +True +>>> print(doc[4]._.get_suggestion_spellCheck) +'million' +>>> print(doc[4]._.score_spellCheck) +[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)] +>>> +>>> # Span Extention +>>> print(doc[2:6]._.get_has_spellCheck) +True +>>> print(doc[2:6]._.score_spellCheck) +{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []} +``` + +## Extensions + +To make the usage easy, `contextual spellchecker` provides custom spacy extensions which your code can consume. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. 
The below tables summarise the extensions. + +### `spaCy.Doc` level extensions + +| Extension | Type | Description | Default | +|------------------------------|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|---------| +| doc._.contextual_spellCheck | `Boolean` | To check whether contextualSpellCheck is added as extension | `True` | +| doc._.performed_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction | `False` | +| doc._.suggestions_spellCheck | `{Spacy.Token:str}` | if corrections are performed, it returns the mapping of misspell token (`spaCy.Token`) with suggested word(`str`) | `{}` | +| doc._.outcome_spellCheck | `str` | corrected sentence(`str`) as output | `""` | +| doc._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction | `None` | + +### `spaCy.Span` level extensions +| Extension | Type | Description | Default | +|-------------------------------|---------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------| +| span._.get_has_spellCheck | `Boolean` | To check whether contextualSpellCheck identified any misspells and performed correction in this span | `False` | +| span._.score_spellCheck | `{Spacy.Token:List(str,float)}` | if corrections are identified, it returns the mapping of misspell token (`spaCy.Token`) with suggested words(`str`) and probability of that correction for tokens in this `span` | `{spaCy.Token: []}` | + +### `spaCy.Token` level extensions + +| Extension | Type | Description | Default | 
+|---|---|---|---|
+| token._.get_require_spellCheck | `Boolean` | Whether contextualSpellCheck identified a misspelling and performed correction on this `token` | `False` |
+| token._.get_suggestion_spellCheck | `str` | If a correction is performed, returns the suggested word (`str`) | `""` |
+| token._.score_spellCheck | `[(str, float)]` | If corrections are identified, returns the suggested words (`str`) and the probability (`float`) of each correction | `[]` |
+
+## API
+
+At present, there is a simple GET API to get you started. You can run the app locally and play with it.
+
+Query: use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY
+Note: your browser takes care of URL-encoding the query text.
+
+```
+GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
+```
+
+Response:
+
+```json
+{
+  "success": true,
+  "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
+  "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
+  "suggestion_score": {
+    "milion": [
+      [
+        "million",
+        0.59422
+      ],
+      [
+        "billion",
+        0.24349
+      ],
+      ...
+    ],
+    "milion:1": [
+      [
+        "billion",
+        0.65934
+      ],
+      [
+        "million",
+        0.26185
+      ],
+      ...
+    ]
+  }
+}
+```
+
+## Task List
+
+- [ ] Use Cython for parts of the code to improve performance ([#39](https://github.com/R1j1t/contextualSpellCheck/issues/39))
+- [ ] Improve the metric for candidate selection ([#40](https://github.com/R1j1t/contextualSpellCheck/issues/40))
+- [ ] Add examples for other languages ([#41](https://github.com/R1j1t/contextualSpellCheck/issues/41))
+- [ ] Update the logic of misspelling identification (OOV) ([#44](https://github.com/R1j1t/contextualSpellCheck/issues/44))
+- [ ] Better candidate generation (solved by [#44](https://github.com/R1j1t/contextualSpellCheck/issues/44)?)
+- [ ] Add metrics by testing on datasets
+- [ ] Improve documentation
+- [ ] Improve logging in the code
+- [ ] Add support for Real Word Errors (RWE) (big task)
+- [ ] Add multi-mask-out capability
+
+<details><summary>Completed Tasks</summary>
+<p>
+
+- [x] Specify maximum edit distance for `candidateRanking`
+- [x] Allow the user to specify the BERT model
+- [x] Include the transformers detokenizer to get better suggestions
+- [x] Pin dependency versions in setup.py ([#38](https://github.com/R1j1t/contextualSpellCheck/issues/38))
+
+</p>
+</details>
+
+## Support and contribution
+
+If you like the project, please ⭑ it to show your support! If you feel the current behaviour is not as expected, feel free to raise an [issue](https://github.com/R1j1t/contextualSpellCheck/issues). If you can help with any of the above tasks, please open a [PR](https://github.com/R1j1t/contextualSpellCheck/pulls) with the necessary changes to documentation and tests.
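Two of the items above (the candidate-selection metric, and the completed maximum-edit-distance option for `candidateRanking`) revolve around edit distance, which the package pulls in via `editdistance`. The sketch below illustrates the general idea only; `edit_distance`, `rank_candidates` and the `max_distance` cutoff are hypothetical names for this example, not part of contextualSpellCheck's API:

```python
# Hypothetical sketch of edit-distance-based candidate ranking; not
# contextualSpellCheck's actual implementation or API.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]


def rank_candidates(misspelling: str, candidates, max_distance: int = 2):
    """Keep candidates within max_distance edits of the misspelling, closest first."""
    scored = [(c, edit_distance(misspelling, c)) for c in candidates]
    return sorted(((c, d) for c, d in scored if d <= max_distance),
                  key=lambda pair: pair[1])


print(rank_candidates("milion", ["million", "billion", "trillion", "Million"]))
# → [('million', 1), ('billion', 2), ('Million', 2)]
```

In the actual pipeline, edit distance is only one signal for ranking; the BERT masked-language-model probabilities (as shown in `score_spellCheck` above) determine which candidates are considered in the first place.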
+
+## Cite
+
+If you are using contextualSpellCheck in your academic work, please consider citing the library using the BibTeX entry below:
+
+```bibtex
+@misc{Goel_Contextual_Spell_Check_2021,
+author = {Goel, Rajat},
+doi = {10.5281/zenodo.4642379},
+month = {3},
+title = {{Contextual Spell Check}},
+url = {https://github.com/R1j1t/contextualSpellCheck},
+year = {2021}
+}
+```
+
+## Reference
+
+Below are some of the projects/work I referred to while developing this package:
+
+1. Explosion AI. Architecture. May 2020. url: https://spacy.io/api.
+2. Monojit Choudhury et al. "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach". In: arXiv preprint physics/0703198 (2007).
+3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
+4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. url: https://github.com/huggingface/neuralcoref.
+5. Ines Montani. Chapter 3: Processing Pipelines. May 2020. url: https://course.spacy.io/en/chapter3.
+6. Eric Mays, Fred J Damerau, and Robert L Mercer. "Context based spelling correction". In: Information Processing & Management 27.5 (1991), pp. 517–522.
+7. Peter Norvig. How to Write a Spelling Corrector. May 2020. url: http://norvig.com/spell-correct.html.
+8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv: 1910.14080 [cs.CL].
+9. Thomas Wolf et al. "Transformers: State-of-the-Art Natural Language Processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. url: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
+ +[1]: <http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf> + + +%prep +%autosetup -n contextualSpellCheck-0.4.3 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-contextualSpellCheck -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.4.3-1 +- Package Spec generated @@ -0,0 +1 @@ +3a060695fddd9b2fb0865e312f0c9f43 contextualSpellCheck-0.4.3.tar.gz |
