%global _empty_manifest_terminate_build 0
Name: python-OCRfixr
Version: 1.5.1
Release: 1
Summary: A contextual spellchecker for OCR output
License: GNU General Public License v3
URL: https://github.com/ja-mcm/ocrfixr
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/5a/ee/40f5fcb864530ebecb7393e59db95aea92ac187730ff478cf7b23a35f390/OCRfixr-1.5.1.tar.gz
BuildArch: noarch

Requires: python3-transformers
Requires: python3-tensorflow
Requires: python3-numpy
Requires: python3-symspellpy
Requires: python3-importlib-resources
Requires: python3-metaphone
Requires: python3-tqdm

%description
# OCRfixr

## OVERVIEW
This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.

## Correcting OCR Misreads
OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".

As written in book:
> _"The birds flevv south"_

Corrected text:
> _"The birds flew south"_

### How OCRfixr Works:
OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:

As written in book:
> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"

| Method | Plausible Replacements |
| --------------- | --------------- |
| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |

Since __store__ is both a plausible spellcheck replacement and a reasonable match for the context of the sentence, OCRfixr updates the word.

Corrected text:
> _"Days there were when small trade came to the __store__.
Then the young clerk read._"

For very common scanning errors where it is clear what the word should have been (ex: 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").

### Using OCRfixr
The package can be installed using [pip](https://pypi.org/project/OCRfixr/).

```bash
pip install OCRfixr
```

By default, OCRfixr returns only the input string, with all changes incorporated:
```python
>>> from ocrfixr import spellcheck

>>> text = "The birds flevv south"
>>> spellcheck(text).fix()
'The birds flew south'
```

Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
```python
>>> spellcheck(text, return_fixes = "T").fix()
['The birds flew south', {("flevv","flew"):1}]
```
_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_

### Interactive Mode
OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
```python
>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, interactive = "T").fix()
```
Suggestion 1

Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.

Suggestion 2

```python
>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
```
This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.
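
The matching rule described under "How OCRfixr Works" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OCRfixr's actual code: the function name `pick_replacement` and its `top_n` parameter are hypothetical, and the candidate lists are copied by hand from the table above rather than produced by symspellpy or BERT.

```python
from typing import Optional, Sequence


def pick_replacement(
    spell_candidates: Sequence[str],
    context_candidates: Sequence[str],
    top_n: int = 15,
) -> Optional[str]:
    """Return the one word suggested by both methods, or None.

    Hypothetical helper for illustration only; not part of the OCRfixr API.
    """
    # Only the highest-ranked context suggestions are considered.
    context = set(context_candidates[:top_n])
    # Keep each distinct spellcheck candidate that also fits the context.
    matches = [w for w in dict.fromkeys(spell_candidates) if w in context]
    # Change-averse: act only when exactly one word lines up.
    return matches[0] if len(matches) == 1 else None


# Candidate lists taken from the "stoie" example.
spell = ["stone", "store", "stoke", "stove", "stowe", "stole", "soie"]
context = ["market", "shop", "town", "city", "store", "table", "village",
           "door", "light", "markets", "surface", "place", "window",
           "docks", "area"]

print(pick_replacement(spell, context))  # store
```

Returning `None` for both "no overlap" and "more than one overlap" mirrors the conservative behaviour the README describes: when the evidence is ambiguous, the original word is left alone.
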

### Command-Line
OCRfixr is also callable via command-line (intended for Guiguts use):
```bash
ocrfixr input_text.txt output_filename.txt
```
The output file will list the line number and position of all suggested changes.

### Avoiding "Damn You, Autocorrect!"
By design, OCRfixr is change-averse:
- If spellcheck/context do not line up, no update is made.
- Likewise, if more than one word lines up for spellcheck/context, no update is made.
- Only the top 15 context suggestions are considered, to limit low-probability matches.
- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple). These are assumed to be 'stylistic' or phonetic misspellings.
- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.

Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.

## Credits
- __symspellpy__ powers spellcheck suggestions
- __transformers__ does the heavy lifting for BERT context modelling
- __DataMunging__ provided a very useful list of common scanning errors
- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
- This project was created to help __Distributed Proofreaders__. Support them here:

%package -n python3-OCRfixr
Summary: A contextual spellchecker for OCR output
Provides: python-OCRfixr
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-OCRfixr
# OCRfixr

## OVERVIEW
This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.

## Correcting OCR Misreads
OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".

As written in book:
> _"The birds flevv south"_

Corrected text:
> _"The birds flew south"_

### How OCRfixr Works:
OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:

As written in book:
> _"Days there were when small trade came to the __stoie__. Then the young clerk read._"

| Method | Plausible Replacements |
| --------------- | --------------- |
| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |

Since __store__ is both a plausible spellcheck replacement and a reasonable match for the context of the sentence, OCRfixr updates the word.

Corrected text:
> _"Days there were when small trade came to the __store__. Then the young clerk read._"

For very common scanning errors where it is clear what the word should have been (ex: 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").

### Using OCRfixr
The package can be installed using [pip](https://pypi.org/project/OCRfixr/).

```bash
pip install OCRfixr
```

By default, OCRfixr returns only the input string, with all changes incorporated:
```python
>>> from ocrfixr import spellcheck

>>> text = "The birds flevv south"
>>> spellcheck(text).fix()
'The birds flew south'
```

Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
```python
>>> spellcheck(text, return_fixes = "T").fix()
['The birds flew south', {("flevv","flew"):1}]
```
_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_

### Interactive Mode
OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
```python
>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, interactive = "T").fix()
```
Suggestion 1

Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.

Suggestion 2

```python
>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
```
This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.

### Command-Line
OCRfixr is also callable via command-line (intended for Guiguts use):
```bash
ocrfixr input_text.txt output_filename.txt
```
The output file will list the line number and position of all suggested changes.

### Avoiding "Damn You, Autocorrect!"
By design, OCRfixr is change-averse:
- If spellcheck/context do not line up, no update is made.
- Likewise, if more than one word lines up for spellcheck/context, no update is made.
- Only the top 15 context suggestions are considered, to limit low-probability matches.
- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple).
These are assumed to be 'stylistic' or phonetic misspellings.
- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.

Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.

## Credits
- __symspellpy__ powers spellcheck suggestions
- __transformers__ does the heavy lifting for BERT context modelling
- __DataMunging__ provided a very useful list of common scanning errors
- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
- This project was created to help __Distributed Proofreaders__. Support them here:

%package help
Summary: Development documents and examples for OCRfixr
Provides: python3-OCRfixr-doc

%description help
# OCRfixr

## OVERVIEW
This project aims to help automate the challenging work of manually correcting OCR output from Distributed Proofreaders' book digitization projects.

## Correcting OCR Misreads
OCRs can sometimes mistake similar-looking characters when scanning a book. For example, "l" and "1" are easily confused, potentially causing the OCR to misread the word "learn" as "1earn".

As written in book:
> _"The birds flevv south"_

Corrected text:
> _"The birds flew south"_

### How OCRfixr Works:
OCRfixr fixes misreads by checking __1) possible spell corrections__ against the __2) local context__ of the word. For example, here's how OCRfixr would evaluate the following OCR mistake:

As written in book:
> _"Days there were when small trade came to the __stoie__.
Then the young clerk read._"

| Method | Plausible Replacements |
| --------------- | --------------- |
| Spellcheck (symspellpy) | stone, __store__, stoke, stove, stowe, stole, soie |
| Context (BERT) | market, shop, town, city, __store__, table, village, door, light, markets, surface, place, window, docks, area |

Since __store__ is both a plausible spellcheck replacement and a reasonable match for the context of the sentence, OCRfixr updates the word.

Corrected text:
> _"Days there were when small trade came to the __store__. Then the young clerk read._"

For very common scanning errors where it is clear what the word should have been (ex: 'onlv' --> 'only'), OCRfixr skips the context check and relies solely on a static mapping of common corrections. This helps to maximize the number of successful edits and decrease compute time. (You can disable this by setting common_scannos to "F").

### Using OCRfixr
The package can be installed using [pip](https://pypi.org/project/OCRfixr/).

```bash
pip install OCRfixr
```

By default, OCRfixr returns only the input string, with all changes incorporated:
```python
>>> from ocrfixr import spellcheck

>>> text = "The birds flevv south"
>>> spellcheck(text).fix()
'The birds flew south'
```

Use __return_fixes__ to also include all corrections made to the text, with associated counts for each:
```python
>>> spellcheck(text, return_fixes = "T").fix()
['The birds flew south', {("flevv","flew"):1}]
```
_(Note: OCRfixr resets its BERT context window at the start of each new paragraph, so splitting by paragraph may be a useful debug feature)_

### Interactive Mode
OCRfixr also has an option for the user to interactively accept/reject suggested changes to the text:
```python
>>> text = "The birds flevv down\n south, but wefe quickly apprehended\n by border patrol agents"
>>> spellcheck(text, interactive = "T").fix()
```
Suggestion 1

Each suggestion provides the local context around the garbled text, so that the user can determine if the suggestion fits.

Suggestion 2

```python
>>> ### User accepts change to "flevv", but rejects change to "wefe" in GUI
'The birds flew down\n south, but wefe quickly apprehended\n by border patrol agents'
```
This returns the text with all accepted changes reflected. All rejected suggestions are left as-is in the text.

### Command-Line
OCRfixr is also callable via command-line (intended for Guiguts use):
```bash
ocrfixr input_text.txt output_filename.txt
```
The output file will list the line number and position of all suggested changes.

### Avoiding "Damn You, Autocorrect!"
By design, OCRfixr is change-averse:
- If spellcheck/context do not line up, no update is made.
- Likewise, if more than one word lines up for spellcheck/context, no update is made.
- Only the top 15 context suggestions are considered, to limit low-probability matches.
- If the suggestion is a homophone of the original word, it is ignored (original: coupla --> suggestion: couple).
These are assumed to be 'stylistic' or phonetic misspellings.
- Proper nouns (anything starting with a capital letter) are not evaluated for spelling.

Word context is drawn from all sentences in the current paragraph (designated by a '\n'), to maximize available information, while also not bogging down the BERT model.

## Credits
- __symspellpy__ powers spellcheck suggestions
- __transformers__ does the heavy lifting for BERT context modelling
- __DataMunging__ provided a very useful list of common scanning errors
- __SCOWL__ word list is Copyright 2000-2019 by Kevin Atkinson.
- This project was created to help __Distributed Proofreaders__. Support them here:

%prep
%autosetup -n OCRfixr-1.5.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-OCRfixr -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue May 30 2023 Python_Bot - 1.5.1-1
- Package Spec generated