| author | CoprDistGit <infra@openeuler.org> | 2023-05-18 03:46:40 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-18 03:46:40 +0000 |
| commit | 8788b77b00dd7e4dd14444f3db59bd1b3c1a106a | |
| tree | fa86bfb8cd0a3812daf718203e0f3ba9e15d4f46 | |
| parent | 53fcf37f0733545402bf7ea3a626bfa610517be7 | |
automatic import of python-pypdfocr
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-pypdfocr.spec | 753 |
| -rw-r--r-- | sources | 1 |
3 files changed, 755 insertions, 0 deletions
@@ -0,0 +1 @@ +/pypdfocr-0.9.1.tar.gz diff --git a/python-pypdfocr.spec b/python-pypdfocr.spec new file mode 100644 index 0000000..a4416bb --- /dev/null +++ b/python-pypdfocr.spec @@ -0,0 +1,753 @@ +%global _empty_manifest_terminate_build 0 +Name: python-pypdfocr +Version: 0.9.1 +Release: 1 +Summary: Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript +License: ASL 2.0 +URL: https://pypi.org/project/pypdfocr/ +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c3/23/1bf42cb12af63d498fcd425882815c21efef37800514dbad9fa28918df5e/pypdfocr-0.9.1.tar.gz +BuildArch: noarch + + +%description +|image0| |image1| |image2| |passing| |quality| |Coverage Status| +This program will help manage your scanned PDFs by doing the following: +- Take a scanned PDF file and run OCR on it (using the Tesseract OCR + software from Google), generating a searchable PDF +- Optionally, watch a folder for incoming scanned PDFs and + automatically run OCR on them +- Optionally, file the scanned PDFs into directories based on simple + keyword matching that you specify +- Evernote auto-upload and filing based on keyword search +- Email status when it files your PDF +More links: +- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__ +- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__ +- `Source @ github <https://www.github.com/virantha/pypdfocr>`__ +Usage: +###### +Single conversion: +~~~~~~~~~~~~~~~~~~ + pypdfocr filename.pdf + --> filename_ocr.pdf will be generated +If you have a language pack installed, then you can specify it with the +``-l`` option: + pypdfocr -l spa filename.pdf +Folder monitoring: +~~~~~~~~~~~~~~~~~~ + pypdfocr -w watch_directory + --> Every time a pdf file is added to `watch_directory` it will be OCR'ed +Automatic filing: +~~~~~~~~~~~~~~~~~ +To automatically move the OCR'ed pdf to a directory based on a keyword, +use the -f option and specify a configuration file (described below): + pypdfocr filename.pdf -f 
-c config.yaml +You can also do this in folder monitoring mode: + pypdfocr -w watch_directory -f -c config.yaml +Filing based on filename match: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If no keywords match the contents of the filename, you can optionally +allow it to fallback to trying to find keyword matches with the PDF +filename using the -n option. For example, you may have receipts always +named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move +this to a folder called 'receipts'. Assuming you have a keyword +``receipt`` matching to folder ``receipts`` in your configuration file +as described below, you can run the following and have this filed even +if the content of the pdf does not contain the text 'receipt': + pypdfocr filename.pdf -f -c config.yaml -n +Configuration file for automatic PDF filing +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config.yaml file above is a simple folder to keyword matching text +file. It determines where your OCR'ed PDFs (and optionally, the original +scanned PDF) are placed after processing. An example is given below: + target_folder: "docs/filed" + default_folder: "docs/filed/manual_sort" + original_move_folder: "docs/originals" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +The ``target_folder`` is the root of your filing cabinet. Any PDF moving +will happen in sub-directories under this directory. +The ``folders`` section defines your filing directories and the keywords +associated with them. In this example, we have three filing directories +(finances, travl, receipts), and some associated keywords for each +filing directory. For example, if your OCR'ed PDF contains the phrase +"american express" (in any upper/lower case), it will be filed into +``docs/filed/finances`` +The ``default_folder`` is where the OCR'ed PDF is moved to if there is +no keyword match. 
+The ``original_move_folder`` is optional (you can comment it out with +``#`` in front of that line), but if specified, the original scanned PDF +is moved into this directory after OCR is done. Otherwise, if this field +is not present or commented out, your original PDF will stay where it +was found. +If there is any naming conflict during filing, the program will add an +underscore followed by a number to each filename, in order to avoid +overwriting files that may already be present. +Evernote upload: +~~~~~~~~~~~~~~~~ +Evernote authentication token +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To enable Evernote support, you will need to `get a developer token for +your Evernote +account. <https://www.evernote.com/api/DeveloperToken.action>`__. You +should note that this script will never delete or modify existing notes +in your account, and limits itself to creating new Notebooks and Notes. +Once you get that token, you copy and paste it into your configuration +file as shown below +Evernote filing usage +^^^^^^^^^^^^^^^^^^^^^ +To automatically upload the OCR'ed pdf to a folder based on a keyword, +use the ``-e`` option instead of the ``-f`` auto filing option. + pypdfocr filename.pdf -e -c config.yaml +Similarly, you can also do this in folder monitoring mode: + pypdfocr -w watch_directory -e -c config.yaml +Evernote filing configuration file +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config file shown above only needs to change slightly. The folders +section is completely unchanged, but note that ``target_folder`` is the +name of your "Notebook stack" in Evernote, and the ``default_folder`` +should just be the default Evernote upload notebook name. 
+ target_folder: "evernote_stack" + default_folder: "default" + original_move_folder: "docs/originals" + evernote_developer_token: "YOUR_TOKEN" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +Auto email +~~~~~~~~~~ +You can have PyPDFOCR email you everytime it converts a file and files +it. You need to first specify the following lines in the configuration +file and then use the ``-m`` option when invoking ``pypdfocr``: + mail_smtp_server: "smtp.gmail.com:587" + mail_smtp_login: "virantha@gmail.com" + mail_smtp_password: "PASSWORD" + mail_from_addr: "virantha@gmail.com" + mail_to_list: + - "virantha@gmail.com" + - "person2@gmail.com" +Advanced options +################ +Fine-tuning Tesseract/Ghostscript/others +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You can specify Tesseract and Ghostscript executable locations manually, as +well as the number of concurrent processes allowed during preprocessing and +tesseract. Use the following in your configuration file: + tesseract: + binary: "/usr/bin/tesseract" + threads: 8 + ghostscript: + binary: "/usr/local/bin/gs" + preprocess: + threads: 8 +Handling disk time-outs +~~~~~~~~~~~~~~~~~~~~~~~ +If you need to increase the time interval (default 3 seconds) between new +document scans when pypdfocr is watching a directory, you can specify the following +option in the configuration file: + watch: + scan_interval: 6 +Installation +############ +Using pip +~~~~~~~~~ +PyPDFOCR is available in PyPI, so you can just run: + pip install pypdfocr +Please note that some of the 3rd-party libraries required by PyPDFOCR wiill +require some build tools, especially on a default Ubuntu system. 
If you run +into any issues using pip install, you may want to install the +following packages on Ubuntu and try again: +- gcc +- libjpeg-dev +- zlib-bin +- zlib1g-dev +- python-dev +For those on **Windows**, because it's such a pain to get all the PIL +and PDF dependencies installed, I've gone ahead and made an executable +called +`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__ +You still need to install Tesseract, GhostScript, etc. as detailed below in +the external dependencies list. +Manual install +~~~~~~~~~~~~~~ +Clone the source directly from github (you need to have git installed): + git clone https://github.com/virantha/pypdfocr.git +Then, install the following third-party python libraries: +- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/ +- ReportLab (PDF generation library) + http://www.reportlab.com/opensource/ +- Watchdog (Cross-platform fhlesystem events monitoring) + https://pypi.python.org/pypi/watchdog +- PyPDF2 (Pure python pdf library) +These can all be installed via pip: + pip install Pillow + pip install reportlab + pip install watchdog + pip install pypdf2 +You will also need to install the external dependencies listed below. +External Dependencies +~~~~~~~~~~~~~~~~~~~~~ +PyPDFOCR relies on the following (free) programs being installed and in +the path: +- Tesseract OCR software https://code.google.com/p/tesseract-ocr/ +- GhostScript http://www.ghostscript.com/ +- ImageMagick http://www.imagemagick.org/ +- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__) +Poppler is only required if you want pypdfocr to figure out the original PDF resolution +automatically; just make sure you have ``pdfimages`` in your path. Note that the +`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, +because it does not support the ``-list`` option to list the table of images in a PDF file. 
+On Mac OS X, you can install these using homebrew: + brew install tesseract + brew install ghostscript + brew install poppler + brew install imagemagick +On Windows, please use the installers provided on their download pages. +\*\* Important \*\* Tesseract version 3.02.02 or newer required +(apparently 3.02.01-6 and possibly others do not work due to a hocr +output format change that I'm not planning to address). On Ubuntu, you +may need to compile and install it manually by following `these +instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__ +Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) +then you need to find your tessdata directory and do the following: + cd /usr/local/share/tessdata + cp eng.traineddata osd.traineddata +``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata +for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step, +then any landscape document will produce garbage +Disclaimer +########## +While test coverage is at 84% right now, Sphinx docs generation is at an +early stage. 
The software is distributed on an "AS IS" BASIS, WITHOUT + +%package -n python3-pypdfocr +Summary: Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript +Provides: python-pypdfocr +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-pypdfocr +|image0| |image1| |image2| |passing| |quality| |Coverage Status| +This program will help manage your scanned PDFs by doing the following: +- Take a scanned PDF file and run OCR on it (using the Tesseract OCR + software from Google), generating a searchable PDF +- Optionally, watch a folder for incoming scanned PDFs and + automatically run OCR on them +- Optionally, file the scanned PDFs into directories based on simple + keyword matching that you specify +- Evernote auto-upload and filing based on keyword search +- Email status when it files your PDF +More links: +- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__ +- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__ +- `Source @ github <https://www.github.com/virantha/pypdfocr>`__ +Usage: +###### +Single conversion: +~~~~~~~~~~~~~~~~~~ + pypdfocr filename.pdf + --> filename_ocr.pdf will be generated +If you have a language pack installed, then you can specify it with the +``-l`` option: + pypdfocr -l spa filename.pdf +Folder monitoring: +~~~~~~~~~~~~~~~~~~ + pypdfocr -w watch_directory + --> Every time a pdf file is added to `watch_directory` it will be OCR'ed +Automatic filing: +~~~~~~~~~~~~~~~~~ +To automatically move the OCR'ed pdf to a directory based on a keyword, +use the -f option and specify a configuration file (described below): + pypdfocr filename.pdf -f -c config.yaml +You can also do this in folder monitoring mode: + pypdfocr -w watch_directory -f -c config.yaml +Filing based on filename match: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If no keywords match the contents of the filename, you can optionally +allow it to fallback to trying to find 
keyword matches with the PDF +filename using the -n option. For example, you may have receipts always +named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move +this to a folder called 'receipts'. Assuming you have a keyword +``receipt`` matching to folder ``receipts`` in your configuration file +as described below, you can run the following and have this filed even +if the content of the pdf does not contain the text 'receipt': + pypdfocr filename.pdf -f -c config.yaml -n +Configuration file for automatic PDF filing +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config.yaml file above is a simple folder to keyword matching text +file. It determines where your OCR'ed PDFs (and optionally, the original +scanned PDF) are placed after processing. An example is given below: + target_folder: "docs/filed" + default_folder: "docs/filed/manual_sort" + original_move_folder: "docs/originals" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +The ``target_folder`` is the root of your filing cabinet. Any PDF moving +will happen in sub-directories under this directory. +The ``folders`` section defines your filing directories and the keywords +associated with them. In this example, we have three filing directories +(finances, travl, receipts), and some associated keywords for each +filing directory. For example, if your OCR'ed PDF contains the phrase +"american express" (in any upper/lower case), it will be filed into +``docs/filed/finances`` +The ``default_folder`` is where the OCR'ed PDF is moved to if there is +no keyword match. +The ``original_move_folder`` is optional (you can comment it out with +``#`` in front of that line), but if specified, the original scanned PDF +is moved into this directory after OCR is done. Otherwise, if this field +is not present or commented out, your original PDF will stay where it +was found. 
+If there is any naming conflict during filing, the program will add an +underscore followed by a number to each filename, in order to avoid +overwriting files that may already be present. +Evernote upload: +~~~~~~~~~~~~~~~~ +Evernote authentication token +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To enable Evernote support, you will need to `get a developer token for +your Evernote +account. <https://www.evernote.com/api/DeveloperToken.action>`__. You +should note that this script will never delete or modify existing notes +in your account, and limits itself to creating new Notebooks and Notes. +Once you get that token, you copy and paste it into your configuration +file as shown below +Evernote filing usage +^^^^^^^^^^^^^^^^^^^^^ +To automatically upload the OCR'ed pdf to a folder based on a keyword, +use the ``-e`` option instead of the ``-f`` auto filing option. + pypdfocr filename.pdf -e -c config.yaml +Similarly, you can also do this in folder monitoring mode: + pypdfocr -w watch_directory -e -c config.yaml +Evernote filing configuration file +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config file shown above only needs to change slightly. The folders +section is completely unchanged, but note that ``target_folder`` is the +name of your "Notebook stack" in Evernote, and the ``default_folder`` +should just be the default Evernote upload notebook name. + target_folder: "evernote_stack" + default_folder: "default" + original_move_folder: "docs/originals" + evernote_developer_token: "YOUR_TOKEN" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +Auto email +~~~~~~~~~~ +You can have PyPDFOCR email you everytime it converts a file and files +it. 
You need to first specify the following lines in the configuration +file and then use the ``-m`` option when invoking ``pypdfocr``: + mail_smtp_server: "smtp.gmail.com:587" + mail_smtp_login: "virantha@gmail.com" + mail_smtp_password: "PASSWORD" + mail_from_addr: "virantha@gmail.com" + mail_to_list: + - "virantha@gmail.com" + - "person2@gmail.com" +Advanced options +################ +Fine-tuning Tesseract/Ghostscript/others +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You can specify Tesseract and Ghostscript executable locations manually, as +well as the number of concurrent processes allowed during preprocessing and +tesseract. Use the following in your configuration file: + tesseract: + binary: "/usr/bin/tesseract" + threads: 8 + ghostscript: + binary: "/usr/local/bin/gs" + preprocess: + threads: 8 +Handling disk time-outs +~~~~~~~~~~~~~~~~~~~~~~~ +If you need to increase the time interval (default 3 seconds) between new +document scans when pypdfocr is watching a directory, you can specify the following +option in the configuration file: + watch: + scan_interval: 6 +Installation +############ +Using pip +~~~~~~~~~ +PyPDFOCR is available in PyPI, so you can just run: + pip install pypdfocr +Please note that some of the 3rd-party libraries required by PyPDFOCR wiill +require some build tools, especially on a default Ubuntu system. If you run +into any issues using pip install, you may want to install the +following packages on Ubuntu and try again: +- gcc +- libjpeg-dev +- zlib-bin +- zlib1g-dev +- python-dev +For those on **Windows**, because it's such a pain to get all the PIL +and PDF dependencies installed, I've gone ahead and made an executable +called +`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__ +You still need to install Tesseract, GhostScript, etc. as detailed below in +the external dependencies list. 
+Manual install +~~~~~~~~~~~~~~ +Clone the source directly from github (you need to have git installed): + git clone https://github.com/virantha/pypdfocr.git +Then, install the following third-party python libraries: +- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/ +- ReportLab (PDF generation library) + http://www.reportlab.com/opensource/ +- Watchdog (Cross-platform fhlesystem events monitoring) + https://pypi.python.org/pypi/watchdog +- PyPDF2 (Pure python pdf library) +These can all be installed via pip: + pip install Pillow + pip install reportlab + pip install watchdog + pip install pypdf2 +You will also need to install the external dependencies listed below. +External Dependencies +~~~~~~~~~~~~~~~~~~~~~ +PyPDFOCR relies on the following (free) programs being installed and in +the path: +- Tesseract OCR software https://code.google.com/p/tesseract-ocr/ +- GhostScript http://www.ghostscript.com/ +- ImageMagick http://www.imagemagick.org/ +- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__) +Poppler is only required if you want pypdfocr to figure out the original PDF resolution +automatically; just make sure you have ``pdfimages`` in your path. Note that the +`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, +because it does not support the ``-list`` option to list the table of images in a PDF file. +On Mac OS X, you can install these using homebrew: + brew install tesseract + brew install ghostscript + brew install poppler + brew install imagemagick +On Windows, please use the installers provided on their download pages. +\*\* Important \*\* Tesseract version 3.02.02 or newer required +(apparently 3.02.01-6 and possibly others do not work due to a hocr +output format change that I'm not planning to address). 
On Ubuntu, you +may need to compile and install it manually by following `these +instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__ +Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) +then you need to find your tessdata directory and do the following: + cd /usr/local/share/tessdata + cp eng.traineddata osd.traineddata +``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata +for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step, +then any landscape document will produce garbage +Disclaimer +########## +While test coverage is at 84% right now, Sphinx docs generation is at an +early stage. The software is distributed on an "AS IS" BASIS, WITHOUT + +%package help +Summary: Development documents and examples for pypdfocr +Provides: python3-pypdfocr-doc +%description help +|image0| |image1| |image2| |passing| |quality| |Coverage Status| +This program will help manage your scanned PDFs by doing the following: +- Take a scanned PDF file and run OCR on it (using the Tesseract OCR + software from Google), generating a searchable PDF +- Optionally, watch a folder for incoming scanned PDFs and + automatically run OCR on them +- Optionally, file the scanned PDFs into directories based on simple + keyword matching that you specify +- Evernote auto-upload and filing based on keyword search +- Email status when it files your PDF +More links: +- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__ +- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__ +- `Source @ github <https://www.github.com/virantha/pypdfocr>`__ +Usage: +###### +Single conversion: +~~~~~~~~~~~~~~~~~~ + pypdfocr filename.pdf + --> filename_ocr.pdf will be generated +If you have a language pack installed, then you can specify it with the +``-l`` option: + pypdfocr -l spa filename.pdf +Folder monitoring: 
+~~~~~~~~~~~~~~~~~~ + pypdfocr -w watch_directory + --> Every time a pdf file is added to `watch_directory` it will be OCR'ed +Automatic filing: +~~~~~~~~~~~~~~~~~ +To automatically move the OCR'ed pdf to a directory based on a keyword, +use the -f option and specify a configuration file (described below): + pypdfocr filename.pdf -f -c config.yaml +You can also do this in folder monitoring mode: + pypdfocr -w watch_directory -f -c config.yaml +Filing based on filename match: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If no keywords match the contents of the filename, you can optionally +allow it to fallback to trying to find keyword matches with the PDF +filename using the -n option. For example, you may have receipts always +named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move +this to a folder called 'receipts'. Assuming you have a keyword +``receipt`` matching to folder ``receipts`` in your configuration file +as described below, you can run the following and have this filed even +if the content of the pdf does not contain the text 'receipt': + pypdfocr filename.pdf -f -c config.yaml -n +Configuration file for automatic PDF filing +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config.yaml file above is a simple folder to keyword matching text +file. It determines where your OCR'ed PDFs (and optionally, the original +scanned PDF) are placed after processing. An example is given below: + target_folder: "docs/filed" + default_folder: "docs/filed/manual_sort" + original_move_folder: "docs/originals" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +The ``target_folder`` is the root of your filing cabinet. Any PDF moving +will happen in sub-directories under this directory. +The ``folders`` section defines your filing directories and the keywords +associated with them. 
In this example, we have three filing directories +(finances, travl, receipts), and some associated keywords for each +filing directory. For example, if your OCR'ed PDF contains the phrase +"american express" (in any upper/lower case), it will be filed into +``docs/filed/finances`` +The ``default_folder`` is where the OCR'ed PDF is moved to if there is +no keyword match. +The ``original_move_folder`` is optional (you can comment it out with +``#`` in front of that line), but if specified, the original scanned PDF +is moved into this directory after OCR is done. Otherwise, if this field +is not present or commented out, your original PDF will stay where it +was found. +If there is any naming conflict during filing, the program will add an +underscore followed by a number to each filename, in order to avoid +overwriting files that may already be present. +Evernote upload: +~~~~~~~~~~~~~~~~ +Evernote authentication token +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To enable Evernote support, you will need to `get a developer token for +your Evernote +account. <https://www.evernote.com/api/DeveloperToken.action>`__. You +should note that this script will never delete or modify existing notes +in your account, and limits itself to creating new Notebooks and Notes. +Once you get that token, you copy and paste it into your configuration +file as shown below +Evernote filing usage +^^^^^^^^^^^^^^^^^^^^^ +To automatically upload the OCR'ed pdf to a folder based on a keyword, +use the ``-e`` option instead of the ``-f`` auto filing option. + pypdfocr filename.pdf -e -c config.yaml +Similarly, you can also do this in folder monitoring mode: + pypdfocr -w watch_directory -e -c config.yaml +Evernote filing configuration file +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The config file shown above only needs to change slightly. 
The folders +section is completely unchanged, but note that ``target_folder`` is the +name of your "Notebook stack" in Evernote, and the ``default_folder`` +should just be the default Evernote upload notebook name. + target_folder: "evernote_stack" + default_folder: "default" + original_move_folder: "docs/originals" + evernote_developer_token: "YOUR_TOKEN" + folders: + finances: + - american express + - chase card + - internal revenue service + travel: + - boarding pass + - airlines + - expedia + - orbitz + receipts: + - receipt +Auto email +~~~~~~~~~~ +You can have PyPDFOCR email you everytime it converts a file and files +it. You need to first specify the following lines in the configuration +file and then use the ``-m`` option when invoking ``pypdfocr``: + mail_smtp_server: "smtp.gmail.com:587" + mail_smtp_login: "virantha@gmail.com" + mail_smtp_password: "PASSWORD" + mail_from_addr: "virantha@gmail.com" + mail_to_list: + - "virantha@gmail.com" + - "person2@gmail.com" +Advanced options +################ +Fine-tuning Tesseract/Ghostscript/others +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You can specify Tesseract and Ghostscript executable locations manually, as +well as the number of concurrent processes allowed during preprocessing and +tesseract. Use the following in your configuration file: + tesseract: + binary: "/usr/bin/tesseract" + threads: 8 + ghostscript: + binary: "/usr/local/bin/gs" + preprocess: + threads: 8 +Handling disk time-outs +~~~~~~~~~~~~~~~~~~~~~~~ +If you need to increase the time interval (default 3 seconds) between new +document scans when pypdfocr is watching a directory, you can specify the following +option in the configuration file: + watch: + scan_interval: 6 +Installation +############ +Using pip +~~~~~~~~~ +PyPDFOCR is available in PyPI, so you can just run: + pip install pypdfocr +Please note that some of the 3rd-party libraries required by PyPDFOCR wiill +require some build tools, especially on a default Ubuntu system. 
If you run +into any issues using pip install, you may want to install the +following packages on Ubuntu and try again: +- gcc +- libjpeg-dev +- zlib-bin +- zlib1g-dev +- python-dev +For those on **Windows**, because it's such a pain to get all the PIL +and PDF dependencies installed, I've gone ahead and made an executable +called +`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__ +You still need to install Tesseract, GhostScript, etc. as detailed below in +the external dependencies list. +Manual install +~~~~~~~~~~~~~~ +Clone the source directly from github (you need to have git installed): + git clone https://github.com/virantha/pypdfocr.git +Then, install the following third-party python libraries: +- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/ +- ReportLab (PDF generation library) + http://www.reportlab.com/opensource/ +- Watchdog (Cross-platform fhlesystem events monitoring) + https://pypi.python.org/pypi/watchdog +- PyPDF2 (Pure python pdf library) +These can all be installed via pip: + pip install Pillow + pip install reportlab + pip install watchdog + pip install pypdf2 +You will also need to install the external dependencies listed below. +External Dependencies +~~~~~~~~~~~~~~~~~~~~~ +PyPDFOCR relies on the following (free) programs being installed and in +the path: +- Tesseract OCR software https://code.google.com/p/tesseract-ocr/ +- GhostScript http://www.ghostscript.com/ +- ImageMagick http://www.imagemagick.org/ +- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__) +Poppler is only required if you want pypdfocr to figure out the original PDF resolution +automatically; just make sure you have ``pdfimages`` in your path. Note that the +`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, +because it does not support the ``-list`` option to list the table of images in a PDF file. 
+On Mac OS X, you can install these using homebrew: + brew install tesseract + brew install ghostscript + brew install poppler + brew install imagemagick +On Windows, please use the installers provided on their download pages. +\*\* Important \*\* Tesseract version 3.02.02 or newer required +(apparently 3.02.01-6 and possibly others do not work due to a hocr +output format change that I'm not planning to address). On Ubuntu, you +may need to compile and install it manually by following `these +instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__ +Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) +then you need to find your tessdata directory and do the following: + cd /usr/local/share/tessdata + cp eng.traineddata osd.traineddata +``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata +for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step, +then any landscape document will produce garbage +Disclaimer +########## +While test coverage is at 84% right now, Sphinx docs generation is at an +early stage. 
The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%prep
+%autosetup -n pypdfocr-0.9.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pypdfocr -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.1-1
+- Package Spec generated
@@ -0,0 +1 @@
+23d7deb772e6fa9aa89fef257efd68a0 pypdfocr-0.9.1.tar.gz
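Reviewer note on the imported description: the automatic filing rules it documents (case-insensitive keyword match against the OCR'ed text, optional fallback to the filename with ``-n``, and ``default_folder`` when nothing matches) can be sketched roughly as below. This is only an illustration of the documented behaviour, not pypdfocr's actual code. The names `FOLDERS`, `pick_folder`, and `file_destination` are hypothetical, and the "first matching folder wins" ordering is an assumption, since the description does not say how ties are resolved.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory form of the `folders` section of config.yaml.
FOLDERS: Dict[str, List[str]] = {
    "finances": ["american express", "chase card", "internal revenue service"],
    "travel": ["boarding pass", "airlines", "expedia", "orbitz"],
    "receipts": ["receipt"],
}


def pick_folder(text: str, folders: Dict[str, List[str]]) -> Optional[str]:
    """Return the first folder with a keyword found in `text`, ignoring case."""
    lowered = text.lower()
    for folder, keywords in folders.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return folder
    return None


def file_destination(
    ocr_text: str,
    filename: str,
    folders: Dict[str, List[str]],
    target_folder: str,
    default_folder: str,
    match_filename: bool = False,
) -> str:
    """Choose a destination directory following the documented filing rules."""
    # Try the OCR'ed text first.
    folder = pick_folder(ocr_text, folders)
    # Assumed equivalent of the -n option: fall back to the PDF filename.
    if folder is None and match_filename:
        folder = pick_folder(filename, folders)
    # No keyword match anywhere: use default_folder as given in the config.
    if folder is None:
        return default_folder
    return f"{target_folder}/{folder}"
```

With the example configuration from the description, text containing "AMERICAN EXPRESS" would resolve to ``docs/filed/finances``, and a file named ``receipt_2013_12_2.pdf`` with no matching text would reach ``docs/filed/receipts`` only when the filename fallback is enabled.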
