%global _empty_manifest_terminate_build 0
Name:           python-PAGETools
Version:        0.5.0
Release:        1
Summary:        Toolset to perform various operations on PAGE XML datasets
License:        MIT License
URL:            https://github.com/uniwuezpd/PAGETools
Source0:        https://mirrors.aliyun.com/pypi/web/packages/43/33/726ddedd2e7c74335fb3d1cf17187e198707fe140b92c5625e095fb4c9c9/PAGETools-0.5.0.tar.gz
BuildArch:      noarch

Requires:       python3-opencv-python
Requires:       python3-lxml
Requires:       python3-numpy
Requires:       python3-click
Requires:       python3-flake8
Requires:       python3-deskew
Requires:       python3-regex
Requires:       python3-pytest
Requires:       python3-importlib-resources

%description
[![Python package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml)
[![Upload Python Package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)

Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the [Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).

## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run inside the project directory
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```

## Usage
### Transformations
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...

  Extract elements as image (optionally with text) files.

Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```

##### Examples
Only extract `TextLine` elements:
```
pagetools extract /*.xml -ie -o --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library).
Inclusion of a given element type always trumps exclusion of the same type, regardless of the order of the arguments in the call.

#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]

  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page

Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file has to have the same base name as its ground truth file.
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```

#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...

  Regularize the text content of PAGE XML files using custom rulesets.

Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET

  Change index on TextEquiv elements.

Options:
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...

  Retrieves codec of PAGE XML files.

Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```

#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...

  Returns the amount of text equiv elements in certain elements for certain
  indices.

Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%package -n python3-PAGETools
Summary:        Toolset to perform various operations on PAGE XML datasets
Provides:       python-PAGETools
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-PAGETools
[![Python package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml)
[![Upload Python Package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)

Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the [Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).

## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run inside the project directory
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```

## Usage
### Transformations
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...

  Extract elements as image (optionally with text) files.

Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files. Must be in the
                                  same directory as corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```

##### Examples
Only extract `TextLine` elements:
```
pagetools extract /*.xml -ie -o --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library).
Inclusion of a given element type always trumps exclusion of the same type, regardless of the order of the arguments in the call.

#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]

  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page

Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file has to have the same base name as its ground truth file.
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```

#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...

  Regularize the text content of PAGE XML files using custom rulesets.

Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET

  Change index on TextEquiv elements.

Options:
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...

  Retrieves codec of PAGE XML files.

Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```

#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...

  Returns the amount of text equiv elements in certain elements for certain
  indices.

Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%package help
Summary:        Development documents and examples for PAGETools
Provides:       python3-PAGETools-doc

%description help
[![Python package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml)
[![Upload Python Package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)

Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the [Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).

## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run inside the project directory
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```

## Usage
### Transformations
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...

  Extract elements as image (optionally with text) files.

Options:
  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to extract (highest
                                  priority).
  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
                                  PAGE XML element types to exclude from
                                  extraction (lowest priority).
  --no-text                       Suppresses text extraction.
  -ie, --image-extension TEXT     Extension of image files.
                                  Must be in the same directory as
                                  corresponding XML file.
  -o, --output TEXT               Path where generated files will get saved.
  -e, --enumerate-output          Enumerates output file names instead of
                                  using original names.
  -z, --zip-output                Add generated output to zip archive.
  -bg, --background-color INTEGER...
                                  RGB color code used to fill up background.
                                  Used when padding and / or deskewing.
  --background-mode [median|mean|dominant]
                                  Color calc mode to fill up background
                                  (overwrites -bg / --background-color).
  -p, --padding INTEGER...        Padding in pixels around the line image
                                  cutout (top, bottom, left, right).
  -ad, --auto-deskew              Automatically deskew extracted line images
                                  (Experimental!).
  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
                                  line images.
  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
                                  ground truth.
  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
                                  predicted text.
  --help                          Show this message and exit.
```

##### Examples
Only extract `TextLine` elements:
```
pagetools extract /*.xml -ie -o --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from, e.g., the same arguments in `rsync` (due to limitations of the `click` library).
Inclusion of a given element type always trumps exclusion of the same type, regardless of the order of the arguments in the call.

#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]

  Links line images and corresponding texts in a page and creates a combined
  image and XML-File of each page

Options:
  -c, --creator TEXT              Creator tag for PAGE XML
  -s, --source-folder TEXT        Path to images and GT  [required]
  -i, --image-folder TEXT         Path to images
  -gt, --gt-folder TEXT           Path to GT
  -d, --dest-folder TEXT          Path to merge objects
  -e, --ext TEXT                  Image extension
  -p, --pred BOOLEAN              Set flag to also store .pred.txt
  -l, --lines INTEGER RANGE       Lines per page
  -ls, --line-spacing INTEGER RANGE
                                  Spacing between lines in pixel
  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
                                  CRITICAL=50
  --threads INTEGER RANGE         Thread count to be used
  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
  --help                          Show this message and exit.
```
Please note that each image file has to have the same base name as its ground truth file.
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```

#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...

  Regularize the text content of PAGE XML files using custom rulesets.

Options:
  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Removes specified default ruleset.
  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
                                  Adds specified default ruleset. Overrides
                                  all other default options.
  -nd, --no-default               Disables all default rulesets.
  -r, --rules PATH                File(s) which contains serialized ruleset.
  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode for both rules and PAGE
                                  XML tests.
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET

  Change index on TextEquiv elements.

Options:
  -s, --safe / -us, --unsafe      Creates backups of original files before
                                  overwriting.
  --help                          Show this message and exit.
```

### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...

  Retrieves codec of PAGE XML files.

Options:
  -l, --level [region|line|word|glyph]
  -idx, --index INTEGER           Considers only text from TextEquiv elements
                                  with a certain index.
  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
                                  by default.
  -o, --output TEXT               File to which results are written.
  -rw, --remove-whitespace
  -of, --output-format [json|csv|txt]
                                  Available result formats.
  -freq, --frequencies            Outputs character frequencies.
  --text-output-newline           Inserts new line after every character in
                                  txt output. Only applies when frequencies
                                  aren't output.
  --verbose / --silent            Choose between verbose or silent output.
  --help                          Show this message and exit.
```

#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...

  Returns the amount of text equiv elements in certain elements for certain
  indices.

Options:
  -e, --element [TextRegion|TextLine|Word]
  -i, --index TEXT                [required]
  -so, --stats-out TEXT           Output directory for detailed stats csv
                                  file.
  --help                          Show this message and exit.
```

%prep
%autosetup -n PAGETools-0.5.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-PAGETools -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Thu Jun 08 2023 Python_Bot - 0.5.0-1
- Package Spec generated