automatic import of python-pypdfocr

author: CoprDistGit <infra@openeuler.org> 2023-05-18 03:46:40 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-18 03:46:40 +0000
commit: 8788b77b00dd7e4dd14444f3db59bd1b3c1a106a (patch)
tree: fa86bfb8cd0a3812daf718203e0f3ba9e15d4f46
parent: 53fcf37f0733545402bf7ea3a626bfa610517be7 (diff)
3 files changed, 755 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..9e06ad0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/pypdfocr-0.9.1.tar.gz
diff --git a/python-pypdfocr.spec b/python-pypdfocr.spec
new file mode 100644
index 0000000..a4416bb
--- /dev/null
+++ b/python-pypdfocr.spec
@@ -0,0 +1,753 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-pypdfocr
+Version:	0.9.1
+Release:	1
+Summary:	Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
+License:	ASL 2.0
+URL:		https://pypi.org/project/pypdfocr/
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/c3/23/1bf42cb12af63d498fcd425882815c21efef37800514dbad9fa28918df5e/pypdfocr-0.9.1.tar.gz
+BuildArch:	noarch
+
+
+%description
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+-  Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+   software from Google), generating a searchable PDF
+-  Optionally, watch a folder for incoming scanned PDFs and
+   automatically run OCR on them
+-  Optionally, file the scanned PDFs into directories based on simple
+   keyword matching that you specify
+-  Evernote auto-upload and filing based on keyword search
+-  Email status when it files your PDF
+More links:
+-  `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+-  `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+-  `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr filename.pdf
+    --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+    pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr -w watch_directory
+    --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+    pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the PDF, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+    pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+    pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list: 
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract.  Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+    pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system.  If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__.
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+    git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+-  Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+-  ReportLab (PDF generation library)
+   http://www.reportlab.com/opensource/
+-  Watchdog (Cross-platform filesystem events monitoring)
+   https://pypi.python.org/pypi/watchdog
+-  PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+    pip install Pillow
+    pip install reportlab
+    pip install watchdog
+    pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+-  Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+-  GhostScript http://www.ghostscript.com/
+-  ImageMagick http://www.imagemagick.org/
+-  Poppler http://poppler.freedesktop.org/  (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path.   Note that the 
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, 
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+    brew install tesseract
+    brew install ghostscript
+    brew install poppler
+    brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+    cd /usr/local/share/tessdata 
+    cp eng.traineddata osd.traineddata 
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``.  If you don't do this step, 
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%package -n python3-pypdfocr
+Summary:	Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
+Provides:	python-pypdfocr
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-pypdfocr
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+-  Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+   software from Google), generating a searchable PDF
+-  Optionally, watch a folder for incoming scanned PDFs and
+   automatically run OCR on them
+-  Optionally, file the scanned PDFs into directories based on simple
+   keyword matching that you specify
+-  Evernote auto-upload and filing based on keyword search
+-  Email status when it files your PDF
+More links:
+-  `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+-  `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+-  `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr filename.pdf
+    --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+    pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr -w watch_directory
+    --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+    pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the PDF, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+    pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+    pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list: 
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract.  Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+    pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system.  If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__.
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+    git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+-  Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+-  ReportLab (PDF generation library)
+   http://www.reportlab.com/opensource/
+-  Watchdog (Cross-platform filesystem events monitoring)
+   https://pypi.python.org/pypi/watchdog
+-  PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+    pip install Pillow
+    pip install reportlab
+    pip install watchdog
+    pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+-  Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+-  GhostScript http://www.ghostscript.com/
+-  ImageMagick http://www.imagemagick.org/
+-  Poppler http://poppler.freedesktop.org/  (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path.   Note that the 
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, 
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+    brew install tesseract
+    brew install ghostscript
+    brew install poppler
+    brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+    cd /usr/local/share/tessdata 
+    cp eng.traineddata osd.traineddata 
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``.  If you don't do this step, 
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%package help
+Summary:	Development documents and examples for pypdfocr
+Provides:	python3-pypdfocr-doc
+%description help
+|image0| |image1| |image2| |passing| |quality| |Coverage Status|
+This program will help manage your scanned PDFs by doing the following:
+-  Take a scanned PDF file and run OCR on it (using the Tesseract OCR
+   software from Google), generating a searchable PDF
+-  Optionally, watch a folder for incoming scanned PDFs and
+   automatically run OCR on them
+-  Optionally, file the scanned PDFs into directories based on simple
+   keyword matching that you specify
+-  Evernote auto-upload and filing based on keyword search
+-  Email status when it files your PDF
+More links:
+-  `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
+-  `Documentation @ gitpages <http://virantha.github.com/pypdfocr/html>`__
+-  `Source @ github <https://www.github.com/virantha/pypdfocr>`__
+Usage:
+######
+Single conversion:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr filename.pdf
+    --> filename_ocr.pdf will be generated
+If you have a language pack installed, then you can specify it with the
+``-l`` option:
+    pypdfocr -l spa filename.pdf
+Folder monitoring:
+~~~~~~~~~~~~~~~~~~
+    pypdfocr -w watch_directory
+    --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
+Automatic filing:
+~~~~~~~~~~~~~~~~~
+To automatically move the OCR'ed pdf to a directory based on a keyword,
+use the -f option and specify a configuration file (described below):
+    pypdfocr filename.pdf -f -c config.yaml
+You can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -f -c config.yaml
+Filing based on filename match:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If no keywords match the contents of the PDF, you can optionally
+allow it to fall back to trying to find keyword matches with the PDF
+filename using the -n option. For example, you may have receipts always
+named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
+this to a folder called 'receipts'. Assuming you have a keyword
+``receipt`` matching to folder ``receipts`` in your configuration file
+as described below, you can run the following and have this filed even
+if the content of the pdf does not contain the text 'receipt':
+    pypdfocr filename.pdf -f -c config.yaml -n
+Configuration file for automatic PDF filing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config.yaml file above is a simple folder to keyword matching text
+file. It determines where your OCR'ed PDFs (and optionally, the original
+scanned PDF) are placed after processing. An example is given below:
+    target_folder: "docs/filed"
+    default_folder: "docs/filed/manual_sort"
+    original_move_folder: "docs/originals"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+The ``target_folder`` is the root of your filing cabinet. Any PDF moving
+will happen in sub-directories under this directory.
+The ``folders`` section defines your filing directories and the keywords
+associated with them. In this example, we have three filing directories
+(finances, travel, receipts), and some associated keywords for each
+filing directory. For example, if your OCR'ed PDF contains the phrase
+"american express" (in any upper/lower case), it will be filed into
+``docs/filed/finances``.
+The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+no keyword match.
+The ``original_move_folder`` is optional (you can comment it out with
+``#`` in front of that line), but if specified, the original scanned PDF
+is moved into this directory after OCR is done. Otherwise, if this field
+is not present or commented out, your original PDF will stay where it
+was found.
+If there is any naming conflict during filing, the program will add an
+underscore followed by a number to each filename, in order to avoid
+overwriting files that may already be present.
+Evernote upload:
+~~~~~~~~~~~~~~~~
+Evernote authentication token
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To enable Evernote support, you will need to `get a developer token for
+your Evernote
+account <https://www.evernote.com/api/DeveloperToken.action>`__. You
+should note that this script will never delete or modify existing notes
+in your account, and limits itself to creating new Notebooks and Notes.
+Once you get that token, copy and paste it into your configuration
+file as shown below.
+Evernote filing usage
+^^^^^^^^^^^^^^^^^^^^^
+To automatically upload the OCR'ed pdf to a folder based on a keyword,
+use the ``-e`` option instead of the ``-f`` auto filing option.
+    pypdfocr filename.pdf -e -c config.yaml
+Similarly, you can also do this in folder monitoring mode:
+    pypdfocr -w watch_directory -e -c config.yaml
+Evernote filing configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The config file shown above only needs to change slightly. The folders
+section is completely unchanged, but note that ``target_folder`` is the
+name of your "Notebook stack" in Evernote, and the ``default_folder``
+should just be the default Evernote upload notebook name.
+    target_folder: "evernote_stack"
+    default_folder: "default"
+    original_move_folder: "docs/originals"
+    evernote_developer_token: "YOUR_TOKEN"
+    folders:
+        finances:
+            - american express
+            - chase card
+            - internal revenue service
+        travel:
+            - boarding pass
+            - airlines
+            - expedia
+            - orbitz
+        receipts:
+            - receipt
+Auto email
+~~~~~~~~~~
+You can have PyPDFOCR email you every time it converts a file and files
+it. You need to first specify the following lines in the configuration
+file and then use the ``-m`` option when invoking ``pypdfocr``:
+    mail_smtp_server: "smtp.gmail.com:587"
+    mail_smtp_login: "virantha@gmail.com"
+    mail_smtp_password: "PASSWORD"
+    mail_from_addr: "virantha@gmail.com"
+    mail_to_list: 
+        - "virantha@gmail.com"
+        - "person2@gmail.com"
+Advanced options
+################
+Fine-tuning Tesseract/Ghostscript/others
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can specify Tesseract and Ghostscript executable locations manually, as
+well as the number of concurrent processes allowed during preprocessing and
+tesseract.  Use the following in your configuration file:
+    tesseract:
+        binary: "/usr/bin/tesseract"
+        threads: 8
+    ghostscript:
+        binary: "/usr/local/bin/gs"
+    preprocess:
+        threads: 8
+Handling disk time-outs
+~~~~~~~~~~~~~~~~~~~~~~~
+If you need to increase the time interval (default 3 seconds) between new
+document scans when pypdfocr is watching a directory, you can specify the following
+option in the configuration file:
+    watch:
+        scan_interval: 6
+Installation
+############
+Using pip
+~~~~~~~~~
+PyPDFOCR is available in PyPI, so you can just run:
+    pip install pypdfocr
+Please note that some of the 3rd-party libraries required by PyPDFOCR will
+require some build tools, especially on a default Ubuntu system.  If you run
+into any issues using pip install, you may want to install the
+following packages on Ubuntu and try again:
+- gcc
+- libjpeg-dev
+- zlib-bin
+- zlib1g-dev
+- python-dev
+For those on **Windows**, because it's such a pain to get all the PIL
+and PDF dependencies installed, I've gone ahead and made an executable
+called
+`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__.
+You still need to install Tesseract, GhostScript, etc. as detailed below in
+the external dependencies list.
+Manual install
+~~~~~~~~~~~~~~
+Clone the source directly from github (you need to have git installed):
+    git clone https://github.com/virantha/pypdfocr.git
+Then, install the following third-party python libraries:
+-  Pillow (Python Imaging Library) https://pillow.readthedocs.org/en/3.1.x/
+-  ReportLab (PDF generation library)
+   http://www.reportlab.com/opensource/
+-  Watchdog (Cross-platform filesystem events monitoring)
+   https://pypi.python.org/pypi/watchdog
+-  PyPDF2 (Pure python pdf library)
+These can all be installed via pip:
+    pip install Pillow
+    pip install reportlab
+    pip install watchdog
+    pip install pypdf2
+You will also need to install the external dependencies listed below.
+External Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+PyPDFOCR relies on the following (free) programs being installed and in
+the path:
+-  Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+-  GhostScript http://www.ghostscript.com/
+-  ImageMagick http://www.imagemagick.org/
+-  Poppler http://poppler.freedesktop.org/  (`Windows <http://sourceforge.net/projects/poppler-win32/>`__)
+Poppler is only required if you want pypdfocr to figure out the original PDF resolution
+automatically; just make sure you have ``pdfimages`` in your path.   Note that the 
+`xpdf <http://www.foolabs.com/xpdf/download.html>`__ provided ``pdfimages`` does not work for this, 
+because it does not support the ``-list`` option to list the table of images in a PDF file.
+On Mac OS X, you can install these using homebrew:
+    brew install tesseract
+    brew install ghostscript
+    brew install poppler
+    brew install imagemagick
+On Windows, please use the installers provided on their download pages.
+\*\* Important \*\* Tesseract version 3.02.02 or newer required
+(apparently 3.02.01-6 and possibly others do not work due to a hocr
+output format change that I'm not planning to address). On Ubuntu, you
+may need to compile and install it manually by following `these
+instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
+Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
+then you need to find your tessdata directory and do the following:
+    cd /usr/local/share/tessdata 
+    cp eng.traineddata osd.traineddata 
+``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
+for whatever language you want to scan in as ``osd.traineddata``.  If you don't do this step, 
+then any landscape document will produce garbage.
+Disclaimer
+##########
+While test coverage is at 84% right now, Sphinx docs generation is at an
+early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
+
+%prep
+%autosetup -n pypdfocr-0.9.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pypdfocr -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.1-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..7c70fc8
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+23d7deb772e6fa9aa89fef257efd68a0  pypdfocr-0.9.1.tar.gz
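
The keyword-based filing rules described in the spec's %description can be illustrated with a short sketch. This Python is not taken from pypdfocr itself: the function name `pick_folder`, its arguments, and the `CONFIG` dictionary are hypothetical. It only assumes the documented behavior, namely case-insensitive keyword matching against the OCR'ed text, an optional fallback to the PDF filename (the `-n` option), and `default_folder` when nothing matches.

```python
import posixpath

# Hypothetical stand-in for the parsed config.yaml shown in the description.
CONFIG = {
    "target_folder": "docs/filed",
    "default_folder": "docs/filed/manual_sort",
    "folders": {
        "finances": ["american express", "chase card", "internal revenue service"],
        "travel": ["boarding pass", "airlines", "expedia", "orbitz"],
        "receipts": ["receipt"],
    },
}


def pick_folder(config, text, filename=None):
    """Return the directory an OCR'ed PDF would be filed into (illustrative only)."""
    # Search the OCR'ed text first; the filename is only a fallback.
    haystacks = [text.lower()]
    if filename is not None:  # mimics the documented -n fallback
        haystacks.append(posixpath.basename(filename).lower())
    for haystack in haystacks:
        for folder, keywords in config["folders"].items():
            if any(keyword.lower() in haystack for keyword in keywords):
                # Filed PDFs land in a sub-directory of target_folder.
                return posixpath.join(config["target_folder"], folder)
    # No keyword matched anywhere: use the default folder.
    return config["default_folder"]
```

Passing the filename only when the fallback is wanted keeps the content match and the filename match as separate, ordered steps, which is how the description presents them.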