| author | CoprDistGit <infra@openeuler.org> | 2023-05-18 04:03:45 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-18 04:03:45 +0000 |
| commit | caba3ba66ef9c27d87b7fe8b01da1d5b698814eb (patch) | |
| tree | 463fe70df18eb24d1da3bd443c8210d71cfcc570 /python-pagetools.spec | |
| parent | 1bb7d8cbfcbb143f34c1dbb786264eaff4b44d30 (diff) | |
automatic import of python-pagetools
Diffstat (limited to 'python-pagetools.spec')
| -rw-r--r-- | python-pagetools.spec | 537 |
1 files changed, 537 insertions, 0 deletions
diff --git a/python-pagetools.spec b/python-pagetools.spec
new file mode 100644
index 0000000..08b8e36
--- /dev/null
+++ b/python-pagetools.spec
@@ -0,0 +1,537 @@
+%global _empty_manifest_terminate_build 0
+Name: python-PAGETools
+Version: 0.5.0
+Release: 1
+Summary: Toolset to perform various operations on PAGE XML datasets
+License: MIT License
+URL: https://github.com/uniwuezpd/PAGETools
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/43/33/726ddedd2e7c74335fb3d1cf17187e198707fe140b92c5625e095fb4c9c9/PAGETools-0.5.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-opencv-python
+Requires: python3-lxml
+Requires: python3-numpy
+Requires: python3-click
+Requires: python3-flake8
+Requires: python3-deskew
+Requires: python3-regex
+Requires: python3-pytest
+Requires: python3-importlib-resources
+
+%description
+[](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml) [](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)
+Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
+[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
+## Installing
+### Installation using pip
+The suggested method is to install `pagetools` into a virtual environment using pip:
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install pagetools
+```
+To install the package from source, clone this repository and run inside the project directory
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install .
+```
+## Usage
+### Transformations
+#### Extraction
+```
+Usage: pagetools extract [OPTIONS] XMLS...
+  Extract elements as image (optionally with text) files.
+Options:
+  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to extract (highest
+                                  priority).
+  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to exclude from
+                                  extraction (lowest priority).
+  --no-text                       Suppresses text extraction.
+  -ie, --image-extension TEXT     Extension of image files. Must be in the
+                                  same directory as corresponding XML file.
+  -o, --output TEXT               Path where generated files will get saved.
+  -e, --enumerate-output          Enumerates output file names instead of
+                                  using original names.
+  -z, --zip-output                Add generated output to zip archive.
+  -bg, --background-color INTEGER...
+                                  RGB color code used to fill up background.
+                                  Used when padding and / or deskewing.
+  --background-mode [median|mean|dominant]
+                                  Color calc mode to fill up background
+                                  (overwrites -bg / --background-color).
+  -p, --padding INTEGER...        Padding in pixels around the line image
+                                  cutout (top, bottom, left, right).
+  -ad, --auto-deskew              Automatically deskew extracted line images
+                                  (Experimental!).
+  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
+                                  line images.
+  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
+                                  ground truth.
+  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
+                                  predicted text.
+  --help                          Show this message and exit.
+```
+##### Examples
+Only extract `TextLine` elements:
+```
+pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
+```
+Pay in mind that --include / --exclude currently work different from e.g. the same arguments in `rsync` (due to limitations with the `click` library). Inclusion of certain element types always trumps exclusion of the same type, regardless of the order in the call.
+#### line2page
+Merges line images with corresponding text-files in page-images and page-xml
+```
+Usage: pagetools line2page [OPTIONS]
+  Links line images and corresponding texts in a page and creates a combined
+  image and XML-File of each page
+Options:
+  -c, --creator TEXT              Creator tag for PAGE XML
+  -s, --source-folder TEXT        Path to images and GT  [required]
+  -i, --image-folder TEXT         Path to images
+  -gt, --gt-folder TEXT           Path to GT
+  -d, --dest-folder TEXT          Path to merge objects
+  -e, --ext TEXT                  Image extension
+  -p, --pred BOOLEAN              Set flag to also store .pred.txt
+  -l, --lines INTEGER RANGE       Lines per page
+  -ls, --line-spacing INTEGER RANGE
+                                  Spacing between lines in pixel
+  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
+  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
+                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
+                                  CRITICAL=50
+  --threads INTEGER RANGE         Thread count to be used
+  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
+  --help                          Show this message and exit.
+```
+Please note that each image file has to have the same name as its Ground Truth file.
+```
+foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
+bar.bin.png -> bar.gt.txt (& bar.pred.txt)
+```
+#### Regularization
+```
+Usage: pagetools regularize [OPTIONS] XMLS...
+  Regularize the text content of PAGE XML files using custom rulesets.
+Options:
+  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Removes specified default ruleset.
+  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Adds specified default ruleset. Overrides
+                                  all other default options.
+  -nd, --no-default               Disables all default rulesets.
+  -r, --rules PATH                File(s) which contains serialized ruleset.
+  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
+                                  Normalize unicode for both rules and PAGE
+                                  XML tests.
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+#### Change index
+```
+Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
+  Change index on TextEquiv elements.
+Options:
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+### Analytics
+#### Get Codec
+```
+Usage: pagetools get-codec [OPTIONS] FILES...
+  Retrieves codec of PAGE XML files.
+Options:
+  -l, --level [region|line|word|glyph]
+  -idx, --index INTEGER           Considers only text from TextEquiv elements
+                                  with a certain index.
+  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
+                                  by default.
+  -o, --output TEXT               File to which results are written.
+  -rw, --remove-whitespace
+  -of, --output-format [json|csv|txt]
+                                  Available result formats.
+  -freq, --frequencies            Outputs character frequencies.
+  --text-output-newline           Inserts new line after every character in
+                                  txt output. Only applies when frequencies
+                                  aren't output.
+  --verbose / --silent            Choose between verbose or silent output.
+  --help                          Show this message and exit.
+```
+### Get text count
+```
+Usage: pagetools get-text-count [OPTIONS] FILES...
+  Returns the amount of text equiv elements in certain elements for certain
+  indices.
+Options:
+  -e, --element [TextRegion|TextLine|Word]
+  -i, --index TEXT                [required]
+  -so, --stats-out TEXT           Output directory for detailed stats csv
+                                  file.
+  --help                          Show this message and exit.
+```
+
+%package -n python3-PAGETools
+Summary: Toolset to perform various operations on PAGE XML datasets
+Provides: python-PAGETools
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-PAGETools
+[](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml) [](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)
+Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
+[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
+## Installing
+### Installation using pip
+The suggested method is to install `pagetools` into a virtual environment using pip:
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install pagetools
+```
+To install the package from source, clone this repository and run inside the project directory
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install .
+```
+## Usage
+### Transformations
+#### Extraction
+```
+Usage: pagetools extract [OPTIONS] XMLS...
+  Extract elements as image (optionally with text) files.
+Options:
+  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to extract (highest
+                                  priority).
+  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to exclude from
+                                  extraction (lowest priority).
+  --no-text                       Suppresses text extraction.
+  -ie, --image-extension TEXT     Extension of image files. Must be in the
+                                  same directory as corresponding XML file.
+  -o, --output TEXT               Path where generated files will get saved.
+  -e, --enumerate-output          Enumerates output file names instead of
+                                  using original names.
+  -z, --zip-output                Add generated output to zip archive.
+  -bg, --background-color INTEGER...
+                                  RGB color code used to fill up background.
+                                  Used when padding and / or deskewing.
+  --background-mode [median|mean|dominant]
+                                  Color calc mode to fill up background
+                                  (overwrites -bg / --background-color).
+  -p, --padding INTEGER...        Padding in pixels around the line image
+                                  cutout (top, bottom, left, right).
+  -ad, --auto-deskew              Automatically deskew extracted line images
+                                  (Experimental!).
+  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
+                                  line images.
+  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
+                                  ground truth.
+  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
+                                  predicted text.
+  --help                          Show this message and exit.
+```
+##### Examples
+Only extract `TextLine` elements:
+```
+pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
+```
+Pay in mind that --include / --exclude currently work different from e.g. the same arguments in `rsync` (due to limitations with the `click` library). Inclusion of certain element types always trumps exclusion of the same type, regardless of the order in the call.
+#### line2page
+Merges line images with corresponding text-files in page-images and page-xml
+```
+Usage: pagetools line2page [OPTIONS]
+  Links line images and corresponding texts in a page and creates a combined
+  image and XML-File of each page
+Options:
+  -c, --creator TEXT              Creator tag for PAGE XML
+  -s, --source-folder TEXT        Path to images and GT  [required]
+  -i, --image-folder TEXT         Path to images
+  -gt, --gt-folder TEXT           Path to GT
+  -d, --dest-folder TEXT          Path to merge objects
+  -e, --ext TEXT                  Image extension
+  -p, --pred BOOLEAN              Set flag to also store .pred.txt
+  -l, --lines INTEGER RANGE       Lines per page
+  -ls, --line-spacing INTEGER RANGE
+                                  Spacing between lines in pixel
+  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
+  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
+                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
+                                  CRITICAL=50
+  --threads INTEGER RANGE         Thread count to be used
+  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
+  --help                          Show this message and exit.
+```
+Please note that each image file has to have the same name as its Ground Truth file.
+```
+foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
+bar.bin.png -> bar.gt.txt (& bar.pred.txt)
+```
+#### Regularization
+```
+Usage: pagetools regularize [OPTIONS] XMLS...
+  Regularize the text content of PAGE XML files using custom rulesets.
+Options:
+  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Removes specified default ruleset.
+  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Adds specified default ruleset. Overrides
+                                  all other default options.
+  -nd, --no-default               Disables all default rulesets.
+  -r, --rules PATH                File(s) which contains serialized ruleset.
+  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
+                                  Normalize unicode for both rules and PAGE
+                                  XML tests.
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+#### Change index
+```
+Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
+  Change index on TextEquiv elements.
+Options:
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+### Analytics
+#### Get Codec
+```
+Usage: pagetools get-codec [OPTIONS] FILES...
+  Retrieves codec of PAGE XML files.
+Options:
+  -l, --level [region|line|word|glyph]
+  -idx, --index INTEGER           Considers only text from TextEquiv elements
+                                  with a certain index.
+  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
+                                  by default.
+  -o, --output TEXT               File to which results are written.
+  -rw, --remove-whitespace
+  -of, --output-format [json|csv|txt]
+                                  Available result formats.
+  -freq, --frequencies            Outputs character frequencies.
+  --text-output-newline           Inserts new line after every character in
+                                  txt output. Only applies when frequencies
+                                  aren't output.
+  --verbose / --silent            Choose between verbose or silent output.
+  --help                          Show this message and exit.
+```
+### Get text count
+```
+Usage: pagetools get-text-count [OPTIONS] FILES...
+  Returns the amount of text equiv elements in certain elements for certain
+  indices.
+Options:
+  -e, --element [TextRegion|TextLine|Word]
+  -i, --index TEXT                [required]
+  -so, --stats-out TEXT           Output directory for detailed stats csv
+                                  file.
+  --help                          Show this message and exit.
+```
+
+%package help
+Summary: Development documents and examples for PAGETools
+Provides: python3-PAGETools-doc
+%description help
+[](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml) [](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)
+Small collection of [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) related Python scripts used at the
+[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
+## Installing
+### Installation using pip
+The suggested method is to install `pagetools` into a virtual environment using pip:
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install pagetools
+```
+To install the package from source, clone this repository and run inside the project directory
+```bash
+python -m venv VENV_NAME
+source VENV_NAME/bin/activate
+pip install .
+```
+## Usage
+### Transformations
+#### Extraction
+```
+Usage: pagetools extract [OPTIONS] XMLS...
+  Extract elements as image (optionally with text) files.
+Options:
+  --include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to extract (highest
+                                  priority).
+  --exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
+                                  PAGE XML element types to exclude from
+                                  extraction (lowest priority).
+  --no-text                       Suppresses text extraction.
+  -ie, --image-extension TEXT     Extension of image files. Must be in the
+                                  same directory as corresponding XML file.
+  -o, --output TEXT               Path where generated files will get saved.
+  -e, --enumerate-output          Enumerates output file names instead of
+                                  using original names.
+  -z, --zip-output                Add generated output to zip archive.
+  -bg, --background-color INTEGER...
+                                  RGB color code used to fill up background.
+                                  Used when padding and / or deskewing.
+  --background-mode [median|mean|dominant]
+                                  Color calc mode to fill up background
+                                  (overwrites -bg / --background-color).
+  -p, --padding INTEGER...        Padding in pixels around the line image
+                                  cutout (top, bottom, left, right).
+  -ad, --auto-deskew              Automatically deskew extracted line images
+                                  (Experimental!).
+  -d, --deskew FLOAT              Angle for manual clockwise rotation of the
+                                  line images.
+  -gt, --gt-index INTEGER         Index of the TextEquiv elements containing
+                                  ground truth.
+  -pred, --pred-index INTEGER     Index of the TextEquiv elements containing
+                                  predicted text.
+  --help                          Show this message and exit.
+```
+##### Examples
+Only extract `TextLine` elements:
+```
+pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
+```
+Pay in mind that --include / --exclude currently work different from e.g. the same arguments in `rsync` (due to limitations with the `click` library). Inclusion of certain element types always trumps exclusion of the same type, regardless of the order in the call.
+#### line2page
+Merges line images with corresponding text-files in page-images and page-xml
+```
+Usage: pagetools line2page [OPTIONS]
+  Links line images and corresponding texts in a page and creates a combined
+  image and XML-File of each page
+Options:
+  -c, --creator TEXT              Creator tag for PAGE XML
+  -s, --source-folder TEXT        Path to images and GT  [required]
+  -i, --image-folder TEXT         Path to images
+  -gt, --gt-folder TEXT           Path to GT
+  -d, --dest-folder TEXT          Path to merge objects
+  -e, --ext TEXT                  Image extension
+  -p, --pred BOOLEAN              Set flag to also store .pred.txt
+  -l, --lines INTEGER RANGE       Lines per page
+  -ls, --line-spacing INTEGER RANGE
+                                  Spacing between lines in pixel
+  -b, --border INTEGER RANGE...   Border in pixel: top bottom left right
+  --debug [10|20|30|40|50]        Sets the level of feedback to receive:
+                                  DEBUG=10, INFO=20, WARNING=30, ERROR=40,
+                                  CRITICAL=50
+  --threads INTEGER RANGE         Thread count to be used
+  --xml-schema [17|19]            Sets the year of the xml-Schema to be used
+  --help                          Show this message and exit.
+```
+Please note that each image file has to have the same name as its Ground Truth file.
+```
+foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
+bar.bin.png -> bar.gt.txt (& bar.pred.txt)
+```
+#### Regularization
+```
+Usage: pagetools regularize [OPTIONS] XMLS...
+  Regularize the text content of PAGE XML files using custom rulesets.
+Options:
+  --remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Removes specified default ruleset.
+  --add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
+                                  Adds specified default ruleset. Overrides
+                                  all other default options.
+  -nd, --no-default               Disables all default rulesets.
+  -r, --rules PATH                File(s) which contains serialized ruleset.
+  -nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
+                                  Normalize unicode for both rules and PAGE
+                                  XML tests.
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+#### Change index
+```
+Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
+  Change index on TextEquiv elements.
+Options:
+  -s, --safe / -us, --unsafe      Creates backups of original files before
+                                  overwriting.
+  --help                          Show this message and exit.
+```
+### Analytics
+#### Get Codec
+```
+Usage: pagetools get-codec [OPTIONS] FILES...
+  Retrieves codec of PAGE XML files.
+Options:
+  -l, --level [region|line|word|glyph]
+  -idx, --index INTEGER           Considers only text from TextEquiv elements
+                                  with a certain index.
+  -mc, --most-common INTEGER      Only prints n most common entries. Shows all
+                                  by default.
+  -o, --output TEXT               File to which results are written.
+  -rw, --remove-whitespace
+  -of, --output-format [json|csv|txt]
+                                  Available result formats.
+  -freq, --frequencies            Outputs character frequencies.
+  --text-output-newline           Inserts new line after every character in
+                                  txt output. Only applies when frequencies
+                                  aren't output.
+  --verbose / --silent            Choose between verbose or silent output.
+  --help                          Show this message and exit.
+```
+### Get text count
+```
+Usage: pagetools get-text-count [OPTIONS] FILES...
+  Returns the amount of text equiv elements in certain elements for certain
+  indices.
+Options:
+  -e, --element [TextRegion|TextLine|Word]
+  -i, --index TEXT                [required]
+  -so, --stats-out TEXT           Output directory for detailed stats csv
+                                  file.
+  --help                          Show this message and exit.
+```
+
+%prep
+%autosetup -n PAGETools-0.5.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-PAGETools -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.0-1
+- Package Spec generated
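
Note on the `%install` scriptlet in this diff: it builds `filelist.lst` and `doclist.lst` by walking the buildroot with `find <dir> -type f -printf "/%h/%f\n"`, and those lists are then passed to `%files -f`. The Python sketch below mirrors that logic for illustration only. It is not part of the commit; the function names (`build_filelist`, `build_doclist`, `demo_lists`) and the sample file paths are hypothetical, and the output is sorted purely to make it deterministic, which `find` does not guarantee.

```python
import tempfile
from pathlib import Path

# Directories the spec scans for packaged files, in the same order.
PAYLOAD_DIRS = ("usr/lib", "usr/lib64", "usr/bin", "usr/sbin")


def _regular_files(buildroot, subdir):
    """Return '/usr/...' style paths of regular files below buildroot/subdir."""
    root = buildroot / subdir
    if not root.is_dir():  # mirrors the spec's `if [ -d ... ]` guard
        return []
    return [
        "/" + path.relative_to(buildroot).as_posix()
        for path in sorted(root.rglob("*"))  # sorted only for stable output
        if path.is_file() and not path.is_symlink()  # like `find -type f`
    ]


def build_filelist(buildroot):
    """Equivalent of the lines the scriptlet appends to filelist.lst."""
    entries = []
    for subdir in PAYLOAD_DIRS:
        entries.extend(_regular_files(buildroot, subdir))
    return entries


def build_doclist(buildroot):
    """Equivalent of doclist.lst: man pages with '.gz' appended to each name."""
    return [path + ".gz" for path in _regular_files(buildroot, "usr/share/man")]


def demo_lists():
    """Create a tiny, made-up buildroot and return (filelist, doclist)."""
    sample_files = (
        "usr/bin/pagetools",
        "usr/lib/python3/site-packages/pagetools/cli.py",
        "usr/share/man/man1/pagetools.1",
    )
    with tempfile.TemporaryDirectory() as tmp:
        buildroot = Path(tmp)
        for rel in sample_files:
            target = buildroot / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("")
        return build_filelist(buildroot), build_doclist(buildroot)


if __name__ == "__main__":
    files, docs = demo_lists()
    print("\n".join(files))
    print("\n".join(docs))
```

The `.gz` suffix in the doc list presumably anticipates rpmbuild compressing man pages after `%install`, so the listed names match what ends up in the package.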
