author | CoprDistGit <infra@openeuler.org> | 2023-03-09 14:56:19 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-03-09 14:56:19 +0000 |
commit | 72e2236f1b973972942c30ffd7b2848322ae97e4 (patch) | |
tree | 6b0d9b020f1e772855c4a502616e68898123f9af | |
parent | 399066ff07f558388dc59169329211eef5c830ec (diff) |
automatic import of python-pdfminer
-rw-r--r-- | .gitignore | 1
-rw-r--r-- | python-pdfminer.spec | 393
-rw-r--r-- | sources | 1
3 files changed, 395 insertions, 0 deletions
@@ -0,0 +1 @@
+/pdfminer-20191125.tar.gz
diff --git a/python-pdfminer.spec b/python-pdfminer.spec
new file mode 100644
index 0000000..69b1887
--- /dev/null
+++ b/python-pdfminer.spec
@@ -0,0 +1,393 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pdfminer
+Version: 20191125
+Release: 1
+Summary: PDF parser and analyzer
+License: MIT
+URL: http://github.com/euske/pdfminer
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/71/a3/155c5cde5f9c0b1069043b2946a93f54a41fd72cc19c6c100f6f2f5bdc15/pdfminer-20191125.tar.gz
+BuildArch: noarch
+
+
+%description
+# PDFMiner
+
+PDFMiner is a text extraction tool for PDF documents.
+
+[](https://travis-ci.org/euske/pdfminer)
+[](https://pypi.org/project/pdfminer/)
+
+**Warning**: Starting from version 20191010, PDFMiner supports **Python 3 only**.
+For Python 2 support, check out
+<a href="https://github.com/pdfminer/pdfminer.six">pdfminer.six</a>.
+
+## Features:
+
+ * Pure Python (3.6 or above).
+ * Supports PDF-1.7. (well, almost)
+ * Obtains the exact location of text as well as other layout information (fonts, etc.).
+ * Performs automatic layout analysis.
+ * Can convert PDF into other formats (HTML/XML).
+ * Can extract an outline (TOC).
+ * Can extract tagged contents.
+ * Supports basic encryption (RC4 and AES).
+ * Supports various font types (Type1, TrueType, Type3, and CID).
+ * Supports CJK languages and vertical writing scripts.
+ * Has an extensible PDF parser that can be used for other purposes.
+
+
+## How to Use:
+
+ 1. `> pip install pdfminer`
+ 1. `> pdf2txt.py samples/simple1.pdf`
+
+
+## Command Line Syntax:
+
+### pdf2txt.py
+
+pdf2txt.py extracts all the texts that are rendered programmatically.
+It also extracts the corresponding locations, font names, font sizes,
+writing direction (horizontal or vertical) for each text segment. It
+does not recognize text in images. A password needs to be provided for
+restricted PDF documents.
+
+ > pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
+ [-O output_dir] [-c encoding] [-s scale] [-R rotation]
+ [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
+ [-S] [-C] [-n] [-A] [-V]
+ [-M char_margin] [-L line_margin] [-W word_margin]
+ [-F boxes_flow] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-o output` : Output file name.
+ * `-t text|html|xml|tag` : Output type. (default: automatically inferred from the output file name.)
+ * `-O output_dir` : Output directory for extracted images.
+ * `-c encoding` : Output encoding. (default: utf-8)
+ * `-s scale` : Output scale.
+ * `-R rotation` : Rotates the page in degree.
+ * `-Y normal|loose|exact` : Specifies the layout mode. (only for HTML output.)
+ * `-p pagenos` : Processes certain pages only.
+ * `-m maxpages` : Limits the number of maximum pages to process.
+ * `-S` : Strips control characters.
+ * `-C` : Disables resource caching.
+ * `-n` : Disables layout analysis.
+ * `-A` : Applies layout analysis for all texts including figures.
+ * `-V` : Automatically detects vertical writing.
+ * `-M char_margin` : Speficies the char margin.
+ * `-W word_margin` : Speficies the word margin.
+ * `-L line_margin` : Speficies the line margin.
+ * `-F boxes_flow` : Speficies the box flow ratio.
+ * `-d` : Turns on Debug output.
+
+### dumppdf.py
+
+dumppdf.py is used for debugging PDFs.
+It dumps all the internal contents in pseudo-XML format.
+
+ > dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
+ [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-a` : Extracts all objects.
+ * `-p pageid` : Extracts a Page object.
+ * `-i objid` : Extracts a certain object.
+ * `-o output` : Output file name.
+ * `-r` : Raw mode. Dumps the raw compressed/encoded streams.
+ * `-b` : Binary mode. Dumps the uncompressed/decoded streams.
+ * `-t` : Text mode. Dumps the streams in text format.
+ * `-T` : Tagged mode. Dumps the tagged contents.
+ * `-O output_dir` : Output directory for extracted streams.
+
+## TODO
+
+ * Replace STRICT variable with something better.
+ * Improve the debugging functions.
+ * Use logging module instead of sys.stderr.
+ * Proper test cases.
+ * PEP-8 and PEP-257 conformance.
+ * Better documentation.
+ * Crypto stream filter support.
+
+
+## Related Projects
+
+ * <a href="http://pybrary.net/pyPdf/">pyPdf</a>
+ * <a href="http://www.foolabs.com/xpdf/">xpdf</a>
+ * <a href="http://pdfbox.apache.org/">pdfbox</a>
+ * <a href="http://mupdf.com/">mupdf</a>
+
+%package -n python3-pdfminer
+Summary: PDF parser and analyzer
+Provides: python-pdfminer
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pdfminer
+# PDFMiner
+
+PDFMiner is a text extraction tool for PDF documents.
+
+[](https://travis-ci.org/euske/pdfminer)
+[](https://pypi.org/project/pdfminer/)
+
+**Warning**: Starting from version 20191010, PDFMiner supports **Python 3 only**.
+For Python 2 support, check out
+<a href="https://github.com/pdfminer/pdfminer.six">pdfminer.six</a>.
+
+## Features:
+
+ * Pure Python (3.6 or above).
+ * Supports PDF-1.7. (well, almost)
+ * Obtains the exact location of text as well as other layout information (fonts, etc.).
+ * Performs automatic layout analysis.
+ * Can convert PDF into other formats (HTML/XML).
+ * Can extract an outline (TOC).
+ * Can extract tagged contents.
+ * Supports basic encryption (RC4 and AES).
+ * Supports various font types (Type1, TrueType, Type3, and CID).
+ * Supports CJK languages and vertical writing scripts.
+ * Has an extensible PDF parser that can be used for other purposes.
+
+
+## How to Use:
+
+ 1. `> pip install pdfminer`
+ 1. `> pdf2txt.py samples/simple1.pdf`
+
+
+## Command Line Syntax:
+
+### pdf2txt.py
+
+pdf2txt.py extracts all the texts that are rendered programmatically.
+It also extracts the corresponding locations, font names, font sizes,
+writing direction (horizontal or vertical) for each text segment. It
+does not recognize text in images. A password needs to be provided for
+restricted PDF documents.
+
+ > pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
+ [-O output_dir] [-c encoding] [-s scale] [-R rotation]
+ [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
+ [-S] [-C] [-n] [-A] [-V]
+ [-M char_margin] [-L line_margin] [-W word_margin]
+ [-F boxes_flow] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-o output` : Output file name.
+ * `-t text|html|xml|tag` : Output type. (default: automatically inferred from the output file name.)
+ * `-O output_dir` : Output directory for extracted images.
+ * `-c encoding` : Output encoding. (default: utf-8)
+ * `-s scale` : Output scale.
+ * `-R rotation` : Rotates the page in degree.
+ * `-Y normal|loose|exact` : Specifies the layout mode. (only for HTML output.)
+ * `-p pagenos` : Processes certain pages only.
+ * `-m maxpages` : Limits the number of maximum pages to process.
+ * `-S` : Strips control characters.
+ * `-C` : Disables resource caching.
+ * `-n` : Disables layout analysis.
+ * `-A` : Applies layout analysis for all texts including figures.
+ * `-V` : Automatically detects vertical writing.
+ * `-M char_margin` : Speficies the char margin.
+ * `-W word_margin` : Speficies the word margin.
+ * `-L line_margin` : Speficies the line margin.
+ * `-F boxes_flow` : Speficies the box flow ratio.
+ * `-d` : Turns on Debug output.
+
+### dumppdf.py
+
+dumppdf.py is used for debugging PDFs.
+It dumps all the internal contents in pseudo-XML format.
+
+ > dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
+ [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-a` : Extracts all objects.
+ * `-p pageid` : Extracts a Page object.
+ * `-i objid` : Extracts a certain object.
+ * `-o output` : Output file name.
+ * `-r` : Raw mode. Dumps the raw compressed/encoded streams.
+ * `-b` : Binary mode. Dumps the uncompressed/decoded streams.
+ * `-t` : Text mode. Dumps the streams in text format.
+ * `-T` : Tagged mode. Dumps the tagged contents.
+ * `-O output_dir` : Output directory for extracted streams.
+
+## TODO
+
+ * Replace STRICT variable with something better.
+ * Improve the debugging functions.
+ * Use logging module instead of sys.stderr.
+ * Proper test cases.
+ * PEP-8 and PEP-257 conformance.
+ * Better documentation.
+ * Crypto stream filter support.
+
+
+## Related Projects
+
+ * <a href="http://pybrary.net/pyPdf/">pyPdf</a>
+ * <a href="http://www.foolabs.com/xpdf/">xpdf</a>
+ * <a href="http://pdfbox.apache.org/">pdfbox</a>
+ * <a href="http://mupdf.com/">mupdf</a>
+
+%package help
+Summary: Development documents and examples for pdfminer
+Provides: python3-pdfminer-doc
+%description help
+# PDFMiner
+
+PDFMiner is a text extraction tool for PDF documents.
+
+[](https://travis-ci.org/euske/pdfminer)
+[](https://pypi.org/project/pdfminer/)
+
+**Warning**: Starting from version 20191010, PDFMiner supports **Python 3 only**.
+For Python 2 support, check out
+<a href="https://github.com/pdfminer/pdfminer.six">pdfminer.six</a>.
+
+## Features:
+
+ * Pure Python (3.6 or above).
+ * Supports PDF-1.7. (well, almost)
+ * Obtains the exact location of text as well as other layout information (fonts, etc.).
+ * Performs automatic layout analysis.
+ * Can convert PDF into other formats (HTML/XML).
+ * Can extract an outline (TOC).
+ * Can extract tagged contents.
+ * Supports basic encryption (RC4 and AES).
+ * Supports various font types (Type1, TrueType, Type3, and CID).
+ * Supports CJK languages and vertical writing scripts.
+ * Has an extensible PDF parser that can be used for other purposes.
+
+
+## How to Use:
+
+ 1. `> pip install pdfminer`
+ 1. `> pdf2txt.py samples/simple1.pdf`
+
+
+## Command Line Syntax:
+
+### pdf2txt.py
+
+pdf2txt.py extracts all the texts that are rendered programmatically.
+It also extracts the corresponding locations, font names, font sizes,
+writing direction (horizontal or vertical) for each text segment. It
+does not recognize text in images. A password needs to be provided for
+restricted PDF documents.
+
+ > pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
+ [-O output_dir] [-c encoding] [-s scale] [-R rotation]
+ [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
+ [-S] [-C] [-n] [-A] [-V]
+ [-M char_margin] [-L line_margin] [-W word_margin]
+ [-F boxes_flow] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-o output` : Output file name.
+ * `-t text|html|xml|tag` : Output type. (default: automatically inferred from the output file name.)
+ * `-O output_dir` : Output directory for extracted images.
+ * `-c encoding` : Output encoding. (default: utf-8)
+ * `-s scale` : Output scale.
+ * `-R rotation` : Rotates the page in degree.
+ * `-Y normal|loose|exact` : Specifies the layout mode. (only for HTML output.)
+ * `-p pagenos` : Processes certain pages only.
+ * `-m maxpages` : Limits the number of maximum pages to process.
+ * `-S` : Strips control characters.
+ * `-C` : Disables resource caching.
+ * `-n` : Disables layout analysis.
+ * `-A` : Applies layout analysis for all texts including figures.
+ * `-V` : Automatically detects vertical writing.
+ * `-M char_margin` : Speficies the char margin.
+ * `-W word_margin` : Speficies the word margin.
+ * `-L line_margin` : Speficies the line margin.
+ * `-F boxes_flow` : Speficies the box flow ratio.
+ * `-d` : Turns on Debug output.
+
+### dumppdf.py
+
+dumppdf.py is used for debugging PDFs.
+It dumps all the internal contents in pseudo-XML format.
+
+ > dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
+ [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
+ input.pdf ...
+
+ * `-P password` : PDF password.
+ * `-a` : Extracts all objects.
+ * `-p pageid` : Extracts a Page object.
+ * `-i objid` : Extracts a certain object.
+ * `-o output` : Output file name.
+ * `-r` : Raw mode. Dumps the raw compressed/encoded streams.
+ * `-b` : Binary mode. Dumps the uncompressed/decoded streams.
+ * `-t` : Text mode. Dumps the streams in text format.
+ * `-T` : Tagged mode. Dumps the tagged contents.
+ * `-O output_dir` : Output directory for extracted streams.
+
+## TODO
+
+ * Replace STRICT variable with something better.
+ * Improve the debugging functions.
+ * Use logging module instead of sys.stderr.
+ * Proper test cases.
+ * PEP-8 and PEP-257 conformance.
+ * Better documentation.
+ * Crypto stream filter support.
+
+
+## Related Projects
+
+ * <a href="http://pybrary.net/pyPdf/">pyPdf</a>
+ * <a href="http://www.foolabs.com/xpdf/">xpdf</a>
+ * <a href="http://pdfbox.apache.org/">pdfbox</a>
+ * <a href="http://mupdf.com/">mupdf</a>
+
+%prep
+%autosetup -n pdfminer-20191125
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pdfminer -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu Mar 09 2023 Python_Bot <Python_Bot@openeuler.org> - 20191125-1
+- Package Spec generated
@@ -0,0 +1 @@
+822eb51838a944154027b8ca42d439e3 pdfminer-20191125.tar.gz
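Note on the `%install` section of the spec: the `find ... -printf "/%h/%f\n"` calls, run from inside `%{buildroot}`, turn each installed regular file into an absolute install path and collect those paths in `filelist.lst` and `doclist.lst`, which `%files -f` then consumes. The Python sketch below mimics that step for illustration only; it is not part of the commit. The names `list_files`, `FILE_DIRS` and `DOC_DIRS` are hypothetical, and the result is sorted for reproducibility, whereas `find` emits entries in directory order. The `.gz` suffix on man pages mirrors the spec, presumably because man pages are compressed later in the build.

```python
import os
import tempfile
from pathlib import Path

# Directory names taken from the %install section of the spec.
FILE_DIRS = ("usr/lib", "usr/lib64", "usr/bin", "usr/sbin")
DOC_DIRS = ("usr/share/man",)


def list_files(buildroot, subdirs, suffix=""):
    """Return absolute install paths of regular files below the given subdirs.

    Hypothetical helper: approximates `find <subdir> -type f` with a printf
    format of "/%h/%f" run from inside the build root.
    """
    root = Path(buildroot)
    entries = []
    for sub in subdirs:
        base = root / sub
        if not base.is_dir():
            continue  # the spec guards every find with `if [ -d ... ]`
        for dirpath, _dirnames, filenames in os.walk(base):
            for name in filenames:
                full = Path(dirpath) / name
                # `-type f` matches regular files only, so skip symlinks.
                if full.is_symlink() or not full.is_file():
                    continue
                entries.append("/" + full.relative_to(root).as_posix() + suffix)
    # Sorted for a stable result; this ordering is an addition of the sketch.
    return sorted(entries)


if __name__ == "__main__":
    # Build a throwaway build root with one script and one man page.
    with tempfile.TemporaryDirectory() as buildroot:
        for rel in ("usr/bin/pdf2txt.py", "usr/share/man/man1/pdf2txt.1"):
            target = Path(buildroot) / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        print("\n".join(list_files(buildroot, FILE_DIRS)))
        print("\n".join(list_files(buildroot, DOC_DIRS, suffix=".gz")))
```

Writing the two returned lists to `filelist.lst` and `doclist.lst`, one path per line, would give files in the shape that `%files -f` expects.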