author: CoprDistGit <infra@openeuler.org> 2023-05-30 17:18:27 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-30 17:18:27 +0000
commit: 1000be9b7e7651492a1e22713838f657e2a08919
tree: 156e6d3a466067a1b5ad75e71aac423913465f11 /python-ocrfixr.spec
parent: 10ae2f915516b7a42d2862e2868691cd4e538b6a
automatic import of python-ocrfixr
Diffstat (limited to 'python-ocrfixr.spec')
-rw-r--r-- python-ocrfixr.spec 415
1 files changed, 415 insertions, 0 deletions
diff --git a/python-ocrfixr.spec b/python-ocrfixr.spec
new file mode 100644
index 0000000..0070335
--- /dev/null
+++ b/python-ocrfixr.spec
@@ -0,0 +1,415 @@
+%global _empty_manifest_terminate_build 0
+Name: python-OCRfixr
+Version: 1.5.1
+Release: 1
+Summary: A contextual spellchecker for OCR output
+License: GNU General Public License v3
+URL: https://github.com/ja-mcm/ocrfixr
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz
+BuildArch: noarch
+
+Requires: python3-transformers
+Requires: python3-tensorflow
+Requires: python3-numpy
+Requires: python3-symspellpy
+Requires: python3-importlib-resources
+Requires: python3-metaphone
+Requires: python3-tqdm
+
+%description
+<img src=https://img.shields.io/badge/Python-3.6%2B-blue alt="python versions supported">
+
+# OCRfixr
+
+## OVERVIEW
+This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects
+
+
+## Correcting OCR Misreads
+OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".
+
+As written in book:
+> _"The birds flevv south"_
+
+Corrected text:
+> _"The birds flew south"_
+
+### How OCRfixr Works:
+OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:
+
+As written in book:
+> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"
+
+| Method | Plausible Replacements |
+| --------------- | --------------- |
+| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
+| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |
+
+Since there is a match between a plausible spellcheck replacement and a word that reasonably fits the context of the sentence, OCRfixr updates the word.
+
+Corrected text:
+> _"Days there were when small trade came to the __store__. Then the young clerk read._"
+
+For very common scanning errors where it is clear what the word should have been (e.g., 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").
+
+### Using OCRfixr
+
+The package can be installed using [pip](https://pypi.org/project/OCRfixr/).
+
+```bash
+pip install OCRfixr
+```
+
+By default, OCRfixr only returns the original string, with all changes incorporated:
+```python
+>>> from ocrfixr import spellcheck
+
+>>> text = "The birds flevv south"
+>>> spellcheck(text).fix()
+'The birds flew south'
+```
+
+Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
+```python
+>>> spellcheck(text, return_fixes = "T").fix()
+['The birds flew south', {("flevv","flew"):1}]
+```
+
+_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_
+
+
+### Interactive Mode
+OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
+
+```python
+>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
+>>> spellcheck(text, interactive = "T").fix()
+```
+
+<img width="723" alt="Suggestion 1" src="https://user-images.githubusercontent.com/67446041/107133270-7918c300-68b4-11eb-9de5-5b6282510c16.png">
+
+Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.
+
+<img width="723" alt="Suggestion 2" src="https://user-images.githubusercontent.com/67446041/115068768-af7c4b00-9ec0-11eb-9c7a-65b518718ec4.png">
+
+```python
+>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
+'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
+```
+
+This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.
+
+### Command-Line
+OCRfixr is also callable via command-line (intended for Guiguts use):
+
+```bash
+ocrfixr input_text.txt output_filename.txt
+```
+
+The output file will list the line number and position of all suggested changes.
+
+
+### Avoiding "Damn You, Autocorrect!"
+By design, OCRfixr is change-averse:
+- If spellcheck/context do not line up, no update is made.
+- Likewise, if there is >1 word that lines up for spellcheck/context, no update is made.
+- Only the top 15 context suggestions are considered, to limit low-probability matches.
+- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings
+- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.
+
+Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.
+
+
+
+## Credits
+
+- __symspellpy__ powers spellcheck suggestions
+- __transformers__ does the heavy lifting for BERT context modelling
+- __DataMunging__ provided a very useful list of common scanning errors
+- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
+- This project was created to help __Distributed Proofreaders__. Support them here: <https://www.pgdp.net/c/>
+
+
+
+%package -n python3-OCRfixr
+Summary: A contextual spellchecker for OCR output
+Provides: python-OCRfixr
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-OCRfixr
+<img src=https://img.shields.io/badge/Python-3.6%2B-blue alt="python versions supported">
+
+# OCRfixr
+
+## OVERVIEW
+This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects
+
+
+## Correcting OCR Misreads
+OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".
+
+As written in book:
+> _"The birds flevv south"_
+
+Corrected text:
+> _"The birds flew south"_
+
+### How OCRfixr Works:
+OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:
+
+As written in book:
+> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"
+
+| Method | Plausible Replacements |
+| --------------- | --------------- |
+| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
+| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |
+
+Since there is a match between a plausible spellcheck replacement and a word that reasonably fits the context of the sentence, OCRfixr updates the word.
+
+Corrected text:
+> _"Days there were when small trade came to the __store__. Then the young clerk read._"
+
+For very common scanning errors where it is clear what the word should have been (e.g., 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").
+
+### Using OCRfixr
+
+The package can be installed using [pip](https://pypi.org/project/OCRfixr/).
+
+```bash
+pip install OCRfixr
+```
+
+By default, OCRfixr only returns the original string, with all changes incorporated:
+```python
+>>> from ocrfixr import spellcheck
+
+>>> text = "The birds flevv south"
+>>> spellcheck(text).fix()
+'The birds flew south'
+```
+
+Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
+```python
+>>> spellcheck(text, return_fixes = "T").fix()
+['The birds flew south', {("flevv","flew"):1}]
+```
+
+_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_
+
+
+### Interactive Mode
+OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
+
+```python
+>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
+>>> spellcheck(text, interactive = "T").fix()
+```
+
+<img width="723" alt="Suggestion 1" src="https://user-images.githubusercontent.com/67446041/107133270-7918c300-68b4-11eb-9de5-5b6282510c16.png">
+
+Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.
+
+<img width="723" alt="Suggestion 2" src="https://user-images.githubusercontent.com/67446041/115068768-af7c4b00-9ec0-11eb-9c7a-65b518718ec4.png">
+
+```python
+>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
+'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
+```
+
+This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.
+
+### Command-Line
+OCRfixr is also callable via command-line (intended for Guiguts use):
+
+```bash
+ocrfixr input_text.txt output_filename.txt
+```
+
+The output file will list the line number and position of all suggested changes.
+
+
+### Avoiding "Damn You, Autocorrect!"
+By design, OCRfixr is change-averse:
+- If spellcheck/context do not line up, no update is made.
+- Likewise, if there is >1 word that lines up for spellcheck/context, no update is made.
+- Only the top 15 context suggestions are considered, to limit low-probability matches.
+- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings
+- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.
+
+Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.
+
+
+
+## Credits
+
+- __symspellpy__ powers spellcheck suggestions
+- __transformers__ does the heavy lifting for BERT context modelling
+- __DataMunging__ provided a very useful list of common scanning errors
+- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
+- This project was created to help __Distributed Proofreaders__. Support them here: <https://www.pgdp.net/c/>
+
+
+
+%package help
+Summary: Development documents and examples for OCRfixr
+Provides: python3-OCRfixr-doc
+%description help
+<img src=https://img.shields.io/badge/Python-3.6%2B-blue alt="python versions supported">
+
+# OCRfixr
+
+## OVERVIEW
+This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects
+
+
+## Correcting OCR Misreads
+OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".
+
+As written in book:
+> _"The birds flevv south"_
+
+Corrected text:
+> _"The birds flew south"_
+
+### How OCRfixr Works:
+OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:
+
+As written in book:
+> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"
+
+| Method | Plausible Replacements |
+| --------------- | --------------- |
+| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
+| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |
+
+Since there is a match between a plausible spellcheck replacement and a word that reasonably fits the context of the sentence, OCRfixr updates the word.
+
+Corrected text:
+> _"Days there were when small trade came to the __store__. Then the young clerk read._"
+
+For very common scanning errors where it is clear what the word should have been (e.g., 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").
+
+### Using OCRfixr
+
+The package can be installed using [pip](https://pypi.org/project/OCRfixr/).
+
+```bash
+pip install OCRfixr
+```
+
+By default, OCRfixr only returns the original string, with all changes incorporated:
+```python
+>>> from ocrfixr import spellcheck
+
+>>> text = "The birds flevv south"
+>>> spellcheck(text).fix()
+'The birds flew south'
+```
+
+Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
+```python
+>>> spellcheck(text, return_fixes = "T").fix()
+['The birds flew south', {("flevv","flew"):1}]
+```
+
+_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_
+
+
+### Interactive Mode
+OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
+
+```python
+>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
+>>> spellcheck(text, interactive = "T").fix()
+```
+
+<img width="723" alt="Suggestion 1" src="https://user-images.githubusercontent.com/67446041/107133270-7918c300-68b4-11eb-9de5-5b6282510c16.png">
+
+Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.
+
+<img width="723" alt="Suggestion 2" src="https://user-images.githubusercontent.com/67446041/115068768-af7c4b00-9ec0-11eb-9c7a-65b518718ec4.png">
+
+```python
+>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
+'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
+```
+
+This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.
+
+### Command-Line
+OCRfixr is also callable via command-line (intended for Guiguts use):
+
+```bash
+ocrfixr input_text.txt output_filename.txt
+```
+
+The output file will list the line number and position of all suggested changes.
+
+
+### Avoiding "Damn You, Autocorrect!"
+By design, OCRfixr is change-averse:
+- If spellcheck/context do not line up, no update is made.
+- Likewise, if there is >1 word that lines up for spellcheck/context, no update is made.
+- Only the top 15 context suggestions are considered, to limit low-probability matches.
+- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings
+- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.
+
+Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.
+
+
+
+## Credits
+
+- __symspellpy__ powers spellcheck suggestions
+- __transformers__ does the heavy lifting for BERT context modelling
+- __DataMunging__ provided a very useful list of common scanning errors
+- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
+- This project was created to help __Distributed Proofreaders__. Support them here: <https://www.pgdp.net/c/>
+
+
+
+%prep
+%autosetup -n OCRfixr-1.5.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-OCRfixr -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue May 30 2023 Python_Bot <Python_Bot@openeuler.org> - 1.5.1-1
+- Package Spec generated
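
The package description in the spec above says a correction is applied only when the spellcheck and context suggestions agree on exactly one word, and that only the top 15 context suggestions are considered. The following is a minimal sketch of that rule, not OCRfixr's actual implementation: the function name `choose_fix` and its parameters are hypothetical, and the candidate lists are copied from the "stoie" example in the description rather than produced by symspellpy or BERT.

```python
def choose_fix(spell_candidates, context_candidates, top_k=15):
    """Return the single word both sources agree on, or None (hypothetical helper)."""
    # Only the first top_k context suggestions are considered.
    context_top = set(context_candidates[:top_k])
    # Keep spellcheck order and drop duplicate candidates.
    overlap = [w for w in dict.fromkeys(spell_candidates) if w in context_top]
    # Change-averse: zero matches or more than one match means no update.
    return overlap[0] if len(overlap) == 1 else None


# Candidate lists taken from the "stoie" example in the description.
spell = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context = ["market", "shop", "town", "city", "store", "table", "village",
           "door", "light", "markets", "surface", "place", "window",
           "docks", "area"]

print(choose_fix(spell, context))  # prints: store
```

Returning `None` for both the no-match and multiple-match cases mirrors the description's "no update is made" behaviour. The homophone and proper-noun checks it also mentions are left out of this sketch.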