%global _empty_manifest_terminate_build 0
Name:		python-pdf2docx
Version:	0.5.6
Release:	1
Summary:	Open source Python library converting pdf to docx.
License:	GPL v3
URL:		https://github.com/dothinking/pdf2docx
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/05/8f/e50f748e72b812b6ff1af534f2c618c79ad58f0e50d09369665be36086a7/pdf2docx-0.5.6.tar.gz
BuildArch:	noarch

Requires:	python3-PyMuPDF
Requires:	python3-docx
Requires:	python3-fonttools
Requires:	python3-numpy
Requires:	python3-opencv-python
Requires:	python3-fire

%description
English | [中文](README_CN.md)

# pdf2docx 

![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
![license](https://img.shields.io/pypi/l/pdf2docx.svg)
![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)

- Extract data from PDF with `PyMuPDF`, e.g. text, images and drawings 
- Parse layout with rule, e.g. sections, paragraphs, images and tables
- Generate docx with `python-docx`

## Features

- Parse and re-create page layout
    - page margin
    - section and column (1 or 2 columns only)
    - page header and footer [TODO]

- Parse and re-create paragraph
    - OCR text [TODO]
    - text in horizontal/vertical direction: from left to right, from bottom to top
    - font style, e.g. font name, size, weight, italic and color
    - text format, e.g. highlight, underline, strike-through
    - list style [TODO]
    - external hyper link
    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
    
- Parse and re-create image
	- in-line image
    - image in Gray/RGB/CMYK mode
    - transparent image
    - floating image, i.e. picture behind text

- Parse and re-create table
    - border style, e.g. width, color
    - shading style, i.e. background color
    - merged cells
    - vertical direction cell
    - table with partly hidden borders
    - nested tables

- Parsing pages with multi-processing

*It can also be used as a tool to extract table contents since both table content and format/style is parsed.*

## Limitations

- Text-based PDF file
- Left to right language
- Normal reading direction, no word transformation / rotation
- Rule-based method can't 100% convert the PDF layout


## Documentation

- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
    - [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
    - [Extract table](https://dothinking.github.io/pdf2docx/quickstart.table.html)
    - [Command Line Interface](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
    - [Graphic User Interface](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
- [Technical Documentation (In Chinese)](https://dothinking.github.io/pdf2docx/techdoc.html)
- [API Documentation](https://dothinking.github.io/pdf2docx/modules.html)

## Sample

![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)


%package -n python3-pdf2docx
Summary:	Open source Python library converting pdf to docx.
Provides:	python-pdf2docx
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-pdf2docx
English | [中文](README_CN.md)

# pdf2docx 

![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
![license](https://img.shields.io/pypi/l/pdf2docx.svg)
![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)

- Extract data from PDF with `PyMuPDF`, e.g. text, images and drawings 
- Parse layout with rule, e.g. sections, paragraphs, images and tables
- Generate docx with `python-docx`

## Features

- Parse and re-create page layout
    - page margin
    - section and column (1 or 2 columns only)
    - page header and footer [TODO]

- Parse and re-create paragraph
    - OCR text [TODO]
    - text in horizontal/vertical direction: from left to right, from bottom to top
    - font style, e.g. font name, size, weight, italic and color
    - text format, e.g. highlight, underline, strike-through
    - list style [TODO]
    - external hyper link
    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
    
- Parse and re-create image
	- in-line image
    - image in Gray/RGB/CMYK mode
    - transparent image
    - floating image, i.e. picture behind text

- Parse and re-create table
    - border style, e.g. width, color
    - shading style, i.e. background color
    - merged cells
    - vertical direction cell
    - table with partly hidden borders
    - nested tables

- Parsing pages with multi-processing

*It can also be used as a tool to extract table contents since both table content and format/style is parsed.*

## Limitations

- Text-based PDF file
- Left to right language
- Normal reading direction, no word transformation / rotation
- Rule-based method can't 100% convert the PDF layout


## Documentation

- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
    - [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
    - [Extract table](https://dothinking.github.io/pdf2docx/quickstart.table.html)
    - [Command Line Interface](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
    - [Graphic User Interface](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
- [Technical Documentation (In Chinese)](https://dothinking.github.io/pdf2docx/techdoc.html)
- [API Documentation](https://dothinking.github.io/pdf2docx/modules.html)

## Sample

![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)


%package help
Summary:	Development documents and examples for pdf2docx
Provides:	python3-pdf2docx-doc
%description help
English | [中文](README_CN.md)

# pdf2docx 

![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
![license](https://img.shields.io/pypi/l/pdf2docx.svg)
![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)

- Extract data from PDF with `PyMuPDF`, e.g. text, images and drawings 
- Parse layout with rule, e.g. sections, paragraphs, images and tables
- Generate docx with `python-docx`

## Features

- Parse and re-create page layout
    - page margin
    - section and column (1 or 2 columns only)
    - page header and footer [TODO]

- Parse and re-create paragraph
    - OCR text [TODO]
    - text in horizontal/vertical direction: from left to right, from bottom to top
    - font style, e.g. font name, size, weight, italic and color
    - text format, e.g. highlight, underline, strike-through
    - list style [TODO]
    - external hyper link
    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
    
- Parse and re-create image
	- in-line image
    - image in Gray/RGB/CMYK mode
    - transparent image
    - floating image, i.e. picture behind text

- Parse and re-create table
    - border style, e.g. width, color
    - shading style, i.e. background color
    - merged cells
    - vertical direction cell
    - table with partly hidden borders
    - nested tables

- Parsing pages with multi-processing

*It can also be used as a tool to extract table contents since both table content and format/style is parsed.*

## Limitations

- Text-based PDF file
- Left to right language
- Normal reading direction, no word transformation / rotation
- Rule-based method can't 100% convert the PDF layout


## Documentation

- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
    - [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
    - [Extract table](https://dothinking.github.io/pdf2docx/quickstart.table.html)
    - [Command Line Interface](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
    - [Graphic User Interface](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
- [Technical Documentation (In Chinese)](https://dothinking.github.io/pdf2docx/techdoc.html)
- [API Documentation](https://dothinking.github.io/pdf2docx/modules.html)

## Sample

![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)


%prep
%autosetup -n pdf2docx-0.5.6

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-pdf2docx -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.6-1
- Package Spec generated