author CoprDistGit <infra@openeuler.org> 2023-05-18 03:46:40 +0000
committer CoprDistGit <infra@openeuler.org> 2023-05-18 03:46:40 +0000
commit 8788b77b00dd7e4dd14444f3db59bd1b3c1a106a
tree fa86bfb8cd0a3812daf718203e0f3ba9e15d4f46 /python-pypdfocr.spec
parent 53fcf37f0733545402bf7ea3a626bfa610517be7
automatic import of python-pypdfocr
Diffstat (limited to 'python-pypdfocr.spec')
-rw-r--r-- python-pypdfocr.spec 753
1 files changed, 753 insertions, 0 deletions
diff --git a/python-pypdfocr.spec b/python-pypdfocr.spec
new file mode 100644
index 0000000..a4416bb
--- /dev/null
+++ b/python-pypdfocr.spec
@@ -0,0 +1,753 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pypdfocr
+Version: 0.9.1
+Release: 1
+Summary: Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
+License: ASL 2.0
+URL: https://pypi.org/project/pypdfocr/
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c3/23/1bf42cb12af63d498fcd425882815c21efef37800514dbad9fa28918df5e/pypdfocr-0.9.1.tar.gz
+BuildArch: noarch
+
+
+%description
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+- Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+ software from Google), generating a searchable PDF
+- Optionally, watch a folder for incoming scanned PDFs and
+ automatically run OCR on them
+- Optionally, file the scanned PDFs into directories based on simple
+ keyword matching that you specify
+- Evernote auto-upload and filing based on keyword search
+- Email status when it files your PDF
+More links:
+- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+- `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr filename.pdf
+ --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+ pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr -w watch_directory
+ --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+ pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the file, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+ pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+ pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list:
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract. Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+ pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system. If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+ git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+- ReportLab (PDF generation library)
+ http://www.reportlab.com/opensource/
+- Watchdog (Cross-platform filesystem events monitoring)
+ https://pypi.python.org/pypi/watchdog
+- PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+ pip install Pillow
+ pip install reportlab
+ pip install watchdog
+ pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+- GhostScript http://www.ghostscript.com/
+- ImageMagick http://www.imagemagick.org/
+- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path. Note that the
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this,
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+ brew install tesseract
+ brew install ghostscript
+ brew install poppler
+ brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+ cd /usr/local/share/tessdata
+ cp eng.traineddata osd.traineddata
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step,
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%package -n python3-pypdfocr
+Summary: Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
+Provides: python-pypdfocr
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pypdfocr
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+- Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+ software from Google), generating a searchable PDF
+- Optionally, watch a folder for incoming scanned PDFs and
+ automatically run OCR on them
+- Optionally, file the scanned PDFs into directories based on simple
+ keyword matching that you specify
+- Evernote auto-upload and filing based on keyword search
+- Email status when it files your PDF
+More links:
+- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+- `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr filename.pdf
+ --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+ pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr -w watch_directory
+ --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+ pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the file, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+ pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+ pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list:
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract. Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+ pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system. If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+ git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+- ReportLab (PDF generation library)
+ http://www.reportlab.com/opensource/
+- Watchdog (Cross-platform filesystem events monitoring)
+ https://pypi.python.org/pypi/watchdog
+- PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+ pip install Pillow
+ pip install reportlab
+ pip install watchdog
+ pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+- GhostScript http://www.ghostscript.com/
+- ImageMagick http://www.imagemagick.org/
+- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path. Note that the
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this,
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+ brew install tesseract
+ brew install ghostscript
+ brew install poppler
+ brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+ cd /usr/local/share/tessdata
+ cp eng.traineddata osd.traineddata
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step,
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%package help
+Summary: Development documents and examples for pypdfocr
+Provides: python3-pypdfocr-doc
+%description help
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+- Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+ software from Google), generating a searchable PDF
+- Optionally, watch a folder for incoming scanned PDFs and
+ automatically run OCR on them
+- Optionally, file the scanned PDFs into directories based on simple
+ keyword matching that you specify
+- Evernote auto-upload and filing based on keyword search
+- Email status when it files your PDF
+More links:
+- `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+- `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+- `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr filename.pdf
+ --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+ pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+ pypdfocr -w watch_directory
+ --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+ pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the file, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+ pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+ pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+ pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list:
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract. Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+ pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system. If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+ git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+- Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+- ReportLab (PDF generation library)
+ http://www.reportlab.com/opensource/
+- Watchdog (Cross-platform filesystem events monitoring)
+ https://pypi.python.org/pypi/watchdog
+- PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+ pip install Pillow
+ pip install reportlab
+ pip install watchdog
+ pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+- GhostScript http://www.ghostscript.com/
+- ImageMagick http://www.imagemagick.org/
+- Poppler http://poppler.freedesktop.org/ (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path. Note that the
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this,
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+ brew install tesseract
+ brew install ghostscript
+ brew install poppler
+ brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+ cd /usr/local/share/tessdata
+ cp eng.traineddata osd.traineddata
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``. If you don't do this step,
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%prep
+%autosetup -n pypdfocr-0.9.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pypdfocr -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.1-1
+- Package Spec generated
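
The filing rules in the package description (case-insensitive keyword match on the OCR'ed text, optional fallback to the filename with -n, a default folder when nothing matches, and an underscore plus number on naming conflicts) can be illustrated with a small Python sketch. This is not pypdfocr's actual code: the names `CONFIG`, `pick_folder`, and `unique_name` are hypothetical, and the configuration is written as a plain dict rather than loaded from config.yaml.

```python
import os

# Hypothetical in-memory equivalent of the example config.yaml.
CONFIG = {
    "target_folder": "docs/filed",
    "default_folder": "docs/filed/manual_sort",
    "folders": {
        "finances": ["american express", "chase card", "internal revenue service"],
        "travel": ["boarding pass", "airlines", "expedia", "orbitz"],
        "receipts": ["receipt"],
    },
}


def pick_folder(text, filename, config, match_filename=False):
    """Return the directory a document would be filed into.

    The OCR'ed text is searched first; the filename is only consulted
    when match_filename is True (mirroring the -n option).
    """
    haystacks = [text.lower()]
    if match_filename:
        haystacks.append(os.path.basename(filename).lower())
    for haystack in haystacks:
        for folder, keywords in config["folders"].items():
            if any(keyword.lower() in haystack for keyword in keywords):
                return os.path.join(config["target_folder"], folder)
    # No keyword matched anywhere: use the default folder.
    return config["default_folder"]


def unique_name(path, exists=os.path.exists):
    """Append _1, _2, ... before the extension until the name is free."""
    base, ext = os.path.splitext(path)
    candidate, counter = path, 0
    while exists(candidate):
        counter += 1
        candidate = f"{base}_{counter}{ext}"
    return candidate
```

For example, `pick_folder("Your AMERICAN EXPRESS statement", "scan.pdf", CONFIG)` resolves to the `finances` sub-directory of `docs/filed`, while a file named `receipt_2013_12_2.pdf` with no matching text reaches `receipts` only when `match_filename=True`. The `exists` parameter is an assumption added so the conflict check can be exercised without touching the disk.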