%global _empty_manifest_terminate_build 0
Name:		python-PAGETools
Version:	0.5.0
Release:	1
Summary:	Toolset to perform various operations on PAGE XML datasets
License:	MIT
URL:		https://github.com/uniwue-zpd/PAGETools
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/43/33/726ddedd2e7c74335fb3d1cf17187e198707fe140b92c5625e095fb4c9c9/PAGETools-0.5.0.tar.gz
BuildArch:	noarch

Requires:	python3-opencv-python
Requires:	python3-lxml
Requires:	python3-numpy
Requires:	python3-click
Requires:	python3-flake8
Requires:	python3-deskew
Requires:	python3-regex
Requires:	python3-pytest
Requires:	python3-importlib-resources

%description
Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run the following inside the project directory:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```
## Usage
### Transformations 
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...
  Extract elements as image (optionally with text) files.
Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```
##### Examples
Only extract `TextLine` elements:
```
pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library). Inclusion of an element type always trumps exclusion of the same type, regardless of the order of the options in the call.
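The precedence rule can be sketched in a few lines of Python. This is an illustration only, not the `pagetools` implementation: the function name `resolve_element_types`, the shortened `ALL_TYPES` list, and the assumption that every type is selected when neither option is given are hypothetical.

```python
# Hypothetical sketch of "include beats exclude"; not taken from pagetools.
ALL_TYPES = ["TextRegion", "ImageRegion", "TableRegion", "TextLine"]  # shortened list


def resolve_element_types(include=(), exclude=(), all_types=ALL_TYPES):
    """Return the element types to extract, in the order of all_types."""
    # Assumption: without any option, every element type is selected.
    selected = set(all_types)
    # Exclusion is applied first because it has the lowest priority.
    if "*" in exclude:
        selected.clear()
    else:
        selected -= set(exclude)
    # Inclusion is applied last, so it overrides any exclusion.
    if "*" in include:
        selected = set(all_types)
    else:
        selected |= set(include) & set(all_types)
    return [t for t in all_types if t in selected]


print(resolve_element_types(include=["TextLine"], exclude=["*"]))
```

Because inclusion is applied last in this sketch, swapping the two options on the command line would give the same result.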
#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]
  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page
Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file must share its base name (the part before the first dot) with its ground truth file:
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```
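The naming convention can be checked before running `line2page`. The following is a minimal sketch, assuming the shared base name is everything before the first dot, as in the two examples above; the helper name `text_file_names` is hypothetical and not part of `pagetools`.

```python
from pathlib import PurePosixPath


def text_file_names(image_name):
    """Map a line image name to its expected .gt.txt and .pred.txt names."""
    # Assumption: the shared base name is everything before the first dot.
    base = PurePosixPath(image_name).name.split(".", 1)[0]
    return base + ".gt.txt", base + ".pred.txt"


for image in ("foo.nrm.png", "bar.bin.png"):
    print(image, "->", text_file_names(image))
```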
#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...
  Regularize the text content of PAGE XML files using custom rulesets.
Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```
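The four values accepted by `-nu` are the standard Unicode normalization forms. The sketch below uses Python's standard `unicodedata` module to show how they differ; whether `pagetools` uses this module internally is an assumption. It also suggests why rules and text should share one form: a rule written with a precomposed character would not match the same character in decomposed form.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
ligature = "\ufb01"      # the "fi" ligature

# NFD splits the accented letter into a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))

# NFC recombines them, so both spellings compare equal again.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Only the compatibility forms (NFKC, NFKD) replace the ligature by "fi".
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature))
```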
#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
  Change index on TextEquiv elements.
Options:
  -s, --safe / -us, --unsafe  Creates backups of original files before
                              overwriting.
  --help                      Show this message and exit.
```
### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...
  Retrieves codec of PAGE XML files.
Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```
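As a rough illustration of what such an analysis computes, the sketch below counts the characters of all line-level `TextEquiv` entries with a given index. It assumes that "codec" means the set of distinct characters, uses only the Python standard library, and works on a minimal hand-written fragment that is not schema-valid; the function name `line_codec` is hypothetical and this is not the `pagetools` implementation.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Namespace of the 2019 PAGE XML schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Minimal fragment for illustration; required elements such as Coords are omitted.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r0">
      <TextLine id="l0">
        <TextEquiv index="0"><Unicode>abbb</Unicode></TextEquiv>
        <TextEquiv index="1"><Unicode>abbc</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_codec(xml_text, index=None):
    """Count characters in TextLine-level TextEquiv elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for equiv in root.iterfind(".//pc:TextLine/pc:TextEquiv", NS):
        # Skip entries with a different index when a filter is given.
        if index is not None and equiv.get("index") != str(index):
            continue
        unicode_el = equiv.find("pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            counts.update(unicode_el.text)
    return counts


codec = line_codec(SAMPLE, index=0)
print(sorted(codec))          # distinct characters
print(codec.most_common(1))   # most frequent character with its count
```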
#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...
  Returns the amount of text equiv elements in certain elements for certain
  indices.
Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%package -n python3-PAGETools
Summary:	Toolset to perform various operations on PAGE XML datasets
Provides:	python-PAGETools
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-PAGETools
Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run the following inside the project directory:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```
## Usage
### Transformations 
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...
  Extract elements as image (optionally with text) files.
Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```
##### Examples
Only extract `TextLine` elements:
```
pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library). Inclusion of an element type always trumps exclusion of the same type, regardless of the order of the options in the call.
#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]
  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page
Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file must share its base name (the part before the first dot) with its ground truth file:
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```
#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...
  Regularize the text content of PAGE XML files using custom rulesets.
Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```
#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
  Change index on TextEquiv elements.
Options:
  -s, --safe / -us, --unsafe  Creates backups of original files before
                              overwriting.
  --help                      Show this message and exit.
```
### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...
  Retrieves codec of PAGE XML files.
Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```
#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...
  Returns the amount of text equiv elements in certain elements for certain
  indices.
Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%package help
Summary:	Development documents and examples for PAGETools
Provides:	python3-PAGETools-doc
%description help
Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run the following inside the project directory:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```
## Usage
### Transformations 
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...
  Extract elements as image (optionally with text) files.
Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```
##### Examples
Only extract `TextLine` elements:
```
pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library). Inclusion of an element type always trumps exclusion of the same type, regardless of the order of the options in the call.
#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]
  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page
Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file must share its base name (the part before the first dot) with its ground truth file:
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```
#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...
  Regularize the text content of PAGE XML files using custom rulesets.
Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```
#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
  Change index on TextEquiv elements.
Options:
  -s, --safe / -us, --unsafe  Creates backups of original files before
                              overwriting.
  --help                      Show this message and exit.
```
### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...
  Retrieves codec of PAGE XML files.
Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```
#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...
  Returns the amount of text equiv elements in certain elements for certain
  indices.
Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%prep
%autosetup -n PAGETools-0.5.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-PAGETools -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue May 30 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.0-1
- Package Spec generated