%global _empty_manifest_terminate_build 0
Name:           python-opustools-pkg
Version:        0.0.52
Release:        1
Summary:        Tools to read OPUS
License:        MIT License
URL:            https://github.com/Helsinki-NLP/OpusTools
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/14/25/843d07bee632c269820bbe7e0a0843c9e2af8ad7ebd441970ca2869432ca/opustools_pkg-0.0.52.tar.gz
BuildArch:      noarch

%description
## opus_read

### Usage

```
usage: opus_read [-h] -d corpus_name -s langid -t langid [-r version]
                 [-p {raw,xml,parsed}] [-m M] [-S S] [-T T] [-a attribute]
                 [-tr TR] [-ln] [-w file_name [file_name ...]]
                 [-wm {normal,moses,tmx,links}] [-pn] [-f] [-rd path_to_dir]
                 [-af path_to_file] [-sz path_to_zip] [-tz path_to_zip]
                 [-cm delimiter] [-pa] [-sa attribute [attribute ...]]
                 [-ta attribute [attribute ...]] [-ca delimiter]
                 [--src_cld2 lang_id score] [--trg_cld2 lang_id score]
                 [--src_langid lang_id score] [--trg_langid lang_id score]
                 [-id file_name] [-q] [-dl DOWNLOAD_DIR] [-pi] [-v]
```

arguments:

```
  -h, --help            show this help message and exit
  -d corpus_name, --directory corpus_name
                        Corpus name
  -s langid, --source langid
                        Source language
  -t langid, --target langid
                        Target language
  -r version, --release version
                        Release (default=latest)
  -p {raw,xml,parsed}, --preprocess {raw,xml,parsed}
                        Preprocess-type (raw, xml or parsed, default=xml)
  -m MAXIMUM, --maximum MAXIMUM
                        Maximum number of alignments
  -S SRC_RANGE, --src_range SRC_RANGE
                        Number of source sentences in alignments (range is
                        allowed, e.g. -S 1-2)
  -T TGT_RANGE, --tgt_range TGT_RANGE
                        Number of target sentences in alignments (range is
                        allowed, e.g. -T 1-2)
  -a attribute, --attribute attribute
                        Set attribute for filtering
  -tr THRESHOLD, --threshold THRESHOLD
                        Set threshold for an attribute
  -ln, --leave_non_alignments_out
                        Leave non-alignments out
  -w file_name [file_name ...], --write file_name [file_name ...]
                        Write to file. To print moses format in separate
                        files, enter two file names. Otherwise enter one file
                        name.
  -wm {normal,moses,tmx,links}, --write_mode {normal,moses,tmx,links}
                        Set write mode
  -pn, --print_file_names
                        Print file names when using moses format
  -f, --fast            Fast parsing. Faster than normal parsing, if you print
                        a small part of the whole corpus, but requires the
                        sentence ids in alignment files to be in sequence.
  -rd path_to_dir, --root_directory path_to_dir
                        Change root directory (default=/proj/nlpl/data/OPUS)
  -af path_to_file, --alignment_file path_to_file
                        Use given alignment file
  -sz path_to_zip, --source_zip path_to_zip
                        Use given source zip file
  -tz path_to_zip, --target_zip path_to_zip
                        Use given target zip file
  -cm delimiter, --change_moses_delimiter delimiter
                        Change moses delimiter (default=tab)
  -pa, --print_annotations
                        Print annotations, if they exist
  -sa attribute [attribute ...], --source_annotations attribute [attribute ...]
                        Set source sentence annotation attributes to be
                        printed, e.g. -sa pos lem. To print all available
                        attributes use -sa all_attrs (default=pos lem)
  -ta attribute [attribute ...], --target_annotations attribute [attribute ...]
                        Set target sentence annotation attributes to be
                        printed, e.g. -ta pos lem. To print all available
                        attributes use -ta all_attrs (default=pos lem)
  -ca delimiter, --change_annotation_delimiter delimiter
                        Change annotation delimiter (default=|)
  --src_cld2 lang_id score
                        Filter source sentences by their cld2 language id
                        labels and confidence score, e.g. en 0.9
  --trg_cld2 lang_id score
                        Filter target sentences by their cld2 language id
                        labels and confidence score, e.g. en 0.9
  --src_langid lang_id score
                        Filter source sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  --trg_langid lang_id score
                        Filter target sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  -id file_name, --write_ids file_name
                        Write sentence ids to a file.
  -q, --suppress_prompts
                        Download necessary files without prompting "(y/n)"
  -dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                        Set download directory (default=current directory)
  -pi, --preserve_inline_tags
                        Preserve inline tags within sentences
  -v, --verbose         Print progress messages when writing results to files
```

### Description

`opus_read` is a script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. Linked XML files are specified in the "toDoc" and "fromDoc" attributes of each `<linkGrp>` element.

Several parameters can be set to filter the alignments and to print only certain types of alignments.

`opus_read` can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.

`opus_read` reads the alignments from zip files. Starting up the script might take some time if the zip files are large (for example OpenSubtitles in OPUS).

`opus_read` uses `ExhaustiveSentenceParser` by default. This means that each time a `<linkGrp>` tag is found, the corresponding source and target documents are read through and each sentence is stored in a hashmap with the sentence id as the key. This allows the reader to read alignment files that have sentence ids in non-sequential order. Each time a `<linkGrp>` tag is found, the script pauses printing for a second to read through the source and target documents. The duration of the pause depends on the size of the source and target documents. Using the "-f" flag allows the usage of `SentenceParser`, which is faster than `ExhaustiveSentenceParser` in cases where only a small part of a corpus is read. `SentenceParser` does not store the sentences in a hashmap. Rather, when it finds a `<link>` tag, it iterates through a sentence file until a sentence id is matched with the sentence id found in the `<link>` tag.
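The forward-only lookup that `SentenceParser` performs can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`make_forward_reader`, `find`, and the toy sentence ids are hypothetical), not the parser's actual code:

```python
# Hypothetical sketch; not the actual SentenceParser code.
def make_forward_reader(sentences):
    """Return a lookup function that scans the sentence stream forward only."""
    stream = iter(sentences)

    def find(sentence_id):
        # Consume sentences until the requested id appears. Sentences that
        # were skipped are gone for good, because the iterator never rewinds.
        for current_id, text in stream:
            if current_id == sentence_id:
                return text
        return None

    return find


# Toy document: ids and texts are made up for illustration.
document = [("s1", "Hello ."), ("s2", "How are you ?"), ("s3", "Bye .")]

# Ids requested in document order are all found.
find = make_forward_reader(document)
in_order = [find("s1"), find("s3")]

# Ids requested out of order: the scan has already passed "s1".
find = make_forward_reader(document)
out_of_order = [find("s3"), find("s1")]

print(in_order)
print(out_of_order)
```

In the second run the request for `s1` returns `None`, because the stream has already moved past that sentence.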
`SentenceParser` can't go backwards, which means that if the ids are not in sequential order in the alignment file, the parser will not find alignment pairs after the sentence id sequence breaks. `SentenceParser` is less reliable than `ExhaustiveSentenceParser`, but using the "-f" flag is beneficial when the whole corpus does not need to be scanned, in other words, when using the "-m" flag.

**Examples:**

Read sentence alignment in XCES align format. Necessary files will be downloaded if they are not found locally:

`opus_read --directory RF --source en --target sv`

Read sentences with a specific preprocessing type (default is xml, which is tokenized text):

`opus_read --directory RF --source en --target sv --preprocess raw`

Leave non-alignments (pairs with no sentences on one side) out:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --preprocess raw \
    --leave_non_alignments_out
```

Print first 10 alignment pairs:

`opus_read --directory RF --source en --target sv -m 10`

Print XCES align format of all 1:1 sentence alignments:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_range 1 \
    --tgt_range 1 \
    --write_mode links
```

Print alignments with alignment certainty greater than 1.1:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1
```

Write results to file:

`opus_read --directory RF --source en --target sv --write result.txt`

Write with different output format:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write result.tmx \
    --write_mode tmx
```

Write moses format to one file:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.txt \
    --write_mode moses
```

or to two files:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.en en-sv.sv \
    --write_mode moses
```

Read sentences using your alignment file.
First create an alignment file, for example:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1 \
    --write_mode links \
    --write en-sv.links
```

Then use the created alignment file:

`opus_read --directory RF --source en --target sv --alignment_file en-sv.links`

Annotations can be printed with `--print_annotations` if they are included in the sentence files. To print all annotation attributes, specify this with `--source_annotations` and `--target_annotations` flags:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --print_annotations \
    --source_annotations all_attrs \
    --target_annotations all_attrs
```

Sentences can be filtered by their language id labels and confidence score. First, the language ids need to be added to the sentence files with `opus_langid`. If you have run the previous examples, you should have `RF_latest_xml_en.zip` and `RF_latest_xml_sv.zip` in your current working directory. Apply `opus_langid` to these files:

```
opus_langid --file_path RF_latest_xml_en.zip
opus_langid --file_path RF_latest_xml_sv.zip
```

Now you can filter by language ids.
This example uses both cld2 and langid.py language detection confidence scores:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_cld2 en 0.99 \
    --trg_cld2 sv 0.99 \
    --src_langid en 1 \
    --trg_langid sv 1
```

**You can also import the module to your Python script:**

In `your_script.py`, first import the package:

`import opustools_pkg`

Initialize OpusRead:

```
opus_reader = opustools_pkg.OpusRead(
    directory='Books',
    source='en',
    target='fi')
opus_reader.printPairs()
```

and then run:

%package -n python3-opustools-pkg
Summary:        Tools to read OPUS
Provides:       python-opustools-pkg
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-opustools-pkg
## opus_read

### Usage

```
usage: opus_read [-h] -d corpus_name -s langid -t langid [-r version]
                 [-p {raw,xml,parsed}] [-m M] [-S S] [-T T] [-a attribute]
                 [-tr TR] [-ln] [-w file_name [file_name ...]]
                 [-wm {normal,moses,tmx,links}] [-pn] [-f] [-rd path_to_dir]
                 [-af path_to_file] [-sz path_to_zip] [-tz path_to_zip]
                 [-cm delimiter] [-pa] [-sa attribute [attribute ...]]
                 [-ta attribute [attribute ...]] [-ca delimiter]
                 [--src_cld2 lang_id score] [--trg_cld2 lang_id score]
                 [--src_langid lang_id score] [--trg_langid lang_id score]
                 [-id file_name] [-q] [-dl DOWNLOAD_DIR] [-pi] [-v]
```

arguments:

```
  -h, --help            show this help message and exit
  -d corpus_name, --directory corpus_name
                        Corpus name
  -s langid, --source langid
                        Source language
  -t langid, --target langid
                        Target language
  -r version, --release version
                        Release (default=latest)
  -p {raw,xml,parsed}, --preprocess {raw,xml,parsed}
                        Preprocess-type (raw, xml or parsed, default=xml)
  -m MAXIMUM, --maximum MAXIMUM
                        Maximum number of alignments
  -S SRC_RANGE, --src_range SRC_RANGE
                        Number of source sentences in alignments (range is
                        allowed, e.g. -S 1-2)
  -T TGT_RANGE, --tgt_range TGT_RANGE
                        Number of target sentences in alignments (range is
                        allowed, e.g.
                        -T 1-2)
  -a attribute, --attribute attribute
                        Set attribute for filtering
  -tr THRESHOLD, --threshold THRESHOLD
                        Set threshold for an attribute
  -ln, --leave_non_alignments_out
                        Leave non-alignments out
  -w file_name [file_name ...], --write file_name [file_name ...]
                        Write to file. To print moses format in separate
                        files, enter two file names. Otherwise enter one file
                        name.
  -wm {normal,moses,tmx,links}, --write_mode {normal,moses,tmx,links}
                        Set write mode
  -pn, --print_file_names
                        Print file names when using moses format
  -f, --fast            Fast parsing. Faster than normal parsing, if you print
                        a small part of the whole corpus, but requires the
                        sentence ids in alignment files to be in sequence.
  -rd path_to_dir, --root_directory path_to_dir
                        Change root directory (default=/proj/nlpl/data/OPUS)
  -af path_to_file, --alignment_file path_to_file
                        Use given alignment file
  -sz path_to_zip, --source_zip path_to_zip
                        Use given source zip file
  -tz path_to_zip, --target_zip path_to_zip
                        Use given target zip file
  -cm delimiter, --change_moses_delimiter delimiter
                        Change moses delimiter (default=tab)
  -pa, --print_annotations
                        Print annotations, if they exist
  -sa attribute [attribute ...], --source_annotations attribute [attribute ...]
                        Set source sentence annotation attributes to be
                        printed, e.g. -sa pos lem. To print all available
                        attributes use -sa all_attrs (default=pos lem)
  -ta attribute [attribute ...], --target_annotations attribute [attribute ...]
                        Set target sentence annotation attributes to be
                        printed, e.g. -ta pos lem. To print all available
                        attributes use -ta all_attrs (default=pos lem)
  -ca delimiter, --change_annotation_delimiter delimiter
                        Change annotation delimiter (default=|)
  --src_cld2 lang_id score
                        Filter source sentences by their cld2 language id
                        labels and confidence score, e.g. en 0.9
  --trg_cld2 lang_id score
                        Filter target sentences by their cld2 language id
                        labels and confidence score, e.g.
                        en 0.9
  --src_langid lang_id score
                        Filter source sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  --trg_langid lang_id score
                        Filter target sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  -id file_name, --write_ids file_name
                        Write sentence ids to a file.
  -q, --suppress_prompts
                        Download necessary files without prompting "(y/n)"
  -dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                        Set download directory (default=current directory)
  -pi, --preserve_inline_tags
                        Preserve inline tags within sentences
  -v, --verbose         Print progress messages when writing results to files
```

### Description

`opus_read` is a script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. Linked XML files are specified in the "toDoc" and "fromDoc" attributes of each `<linkGrp>` element.

Several parameters can be set to filter the alignments and to print only certain types of alignments.

`opus_read` can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.

`opus_read` reads the alignments from zip files. Starting up the script might take some time if the zip files are large (for example OpenSubtitles in OPUS).

`opus_read` uses `ExhaustiveSentenceParser` by default. This means that each time a `<linkGrp>` tag is found, the corresponding source and target documents are read through and each sentence is stored in a hashmap with the sentence id as the key. This allows the reader to read alignment files that have sentence ids in non-sequential order. Each time a `<linkGrp>` tag is found, the script pauses printing for a second to read through the source and target documents. The duration of the pause depends on the size of the source and target documents.
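The hashmap approach described above can be sketched in a few lines of Python. This is only an illustration under assumed names (`index_sentences`, `resolve_link`, and the toy sentence ids are hypothetical), not the actual `ExhaustiveSentenceParser` implementation:

```python
# Hypothetical sketch; not the actual ExhaustiveSentenceParser code.
def index_sentences(sentences):
    """Store every (sentence_id, text) pair in a dict keyed by sentence id."""
    index = {}
    for sentence_id, text in sentences:
        index[sentence_id] = text
    return index


def resolve_link(index, sentence_ids):
    """Look up each id of one alignment link; unknown ids are skipped."""
    return [index[sid] for sid in sentence_ids if sid in index]


# Toy document: ids and texts are made up for illustration.
document = [("s1", "Hello ."), ("s2", "How are you ?"), ("s3", "Bye .")]

# Reading the whole document up front is the cost paid once per document.
index = index_sentences(document)

# Ids may arrive in any order because every lookup goes through the dict.
out_of_order = resolve_link(index, ["s3", "s1"])
print(out_of_order)
```

The up-front pass over the whole document is what causes the pause mentioned above; in exchange, the order of ids in the alignment file no longer matters.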
Using the "-f" flag allows the usage of `SentenceParser`, which is faster than `ExhaustiveSentenceParser` in cases where only a small part of a corpus is read. `SentenceParser` does not store the sentences in a hashmap. Rather, when it finds a `<link>` tag, it iterates through a sentence file until a sentence id is matched with the sentence id found in the `<link>` tag. `SentenceParser` can't go backwards, which means that if the ids are not in sequential order in the alignment file, the parser will not find alignment pairs after the sentence id sequence breaks. `SentenceParser` is less reliable than `ExhaustiveSentenceParser`, but using the "-f" flag is beneficial when the whole corpus does not need to be scanned, in other words, when using the "-m" flag.

**Examples:**

Read sentence alignment in XCES align format. Necessary files will be downloaded if they are not found locally:

`opus_read --directory RF --source en --target sv`

Read sentences with a specific preprocessing type (default is xml, which is tokenized text):

`opus_read --directory RF --source en --target sv --preprocess raw`

Leave non-alignments (pairs with no sentences on one side) out:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --preprocess raw \
    --leave_non_alignments_out
```

Print first 10 alignment pairs:

`opus_read --directory RF --source en --target sv -m 10`

Print XCES align format of all 1:1 sentence alignments:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_range 1 \
    --tgt_range 1 \
    --write_mode links
```

Print alignments with alignment certainty greater than 1.1:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1
```

Write results to file:

`opus_read --directory RF --source en --target sv --write result.txt`

Write with different output format:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write result.tmx \
    --write_mode tmx
```

Write moses format to one file:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.txt \
    --write_mode moses
```

or to two files:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.en en-sv.sv \
    --write_mode moses
```

Read sentences using your alignment file. First create an alignment file, for example:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1 \
    --write_mode links \
    --write en-sv.links
```

Then use the created alignment file:

`opus_read --directory RF --source en --target sv --alignment_file en-sv.links`

Annotations can be printed with `--print_annotations` if they are included in the sentence files. To print all annotation attributes, specify this with `--source_annotations` and `--target_annotations` flags:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --print_annotations \
    --source_annotations all_attrs \
    --target_annotations all_attrs
```

Sentences can be filtered by their language id labels and confidence score. First, the language ids need to be added to the sentence files with `opus_langid`. If you have run the previous examples, you should have `RF_latest_xml_en.zip` and `RF_latest_xml_sv.zip` in your current working directory. Apply `opus_langid` to these files:

```
opus_langid --file_path RF_latest_xml_en.zip
opus_langid --file_path RF_latest_xml_sv.zip
```

Now you can filter by language ids.
This example uses both cld2 and langid.py language detection confidence scores:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_cld2 en 0.99 \
    --trg_cld2 sv 0.99 \
    --src_langid en 1 \
    --trg_langid sv 1
```

**You can also import the module to your Python script:**

In `your_script.py`, first import the package:

`import opustools_pkg`

Initialize OpusRead:

```
opus_reader = opustools_pkg.OpusRead(
    directory='Books',
    source='en',
    target='fi')
opus_reader.printPairs()
```

and then run:

%package help
Summary:        Development documents and examples for opustools-pkg
Provides:       python3-opustools-pkg-doc

%description help
## opus_read

### Usage

```
usage: opus_read [-h] -d corpus_name -s langid -t langid [-r version]
                 [-p {raw,xml,parsed}] [-m M] [-S S] [-T T] [-a attribute]
                 [-tr TR] [-ln] [-w file_name [file_name ...]]
                 [-wm {normal,moses,tmx,links}] [-pn] [-f] [-rd path_to_dir]
                 [-af path_to_file] [-sz path_to_zip] [-tz path_to_zip]
                 [-cm delimiter] [-pa] [-sa attribute [attribute ...]]
                 [-ta attribute [attribute ...]] [-ca delimiter]
                 [--src_cld2 lang_id score] [--trg_cld2 lang_id score]
                 [--src_langid lang_id score] [--trg_langid lang_id score]
                 [-id file_name] [-q] [-dl DOWNLOAD_DIR] [-pi] [-v]
```

arguments:

```
  -h, --help            show this help message and exit
  -d corpus_name, --directory corpus_name
                        Corpus name
  -s langid, --source langid
                        Source language
  -t langid, --target langid
                        Target language
  -r version, --release version
                        Release (default=latest)
  -p {raw,xml,parsed}, --preprocess {raw,xml,parsed}
                        Preprocess-type (raw, xml or parsed, default=xml)
  -m MAXIMUM, --maximum MAXIMUM
                        Maximum number of alignments
  -S SRC_RANGE, --src_range SRC_RANGE
                        Number of source sentences in alignments (range is
                        allowed, e.g. -S 1-2)
  -T TGT_RANGE, --tgt_range TGT_RANGE
                        Number of target sentences in alignments (range is
                        allowed, e.g.
                        -T 1-2)
  -a attribute, --attribute attribute
                        Set attribute for filtering
  -tr THRESHOLD, --threshold THRESHOLD
                        Set threshold for an attribute
  -ln, --leave_non_alignments_out
                        Leave non-alignments out
  -w file_name [file_name ...], --write file_name [file_name ...]
                        Write to file. To print moses format in separate
                        files, enter two file names. Otherwise enter one file
                        name.
  -wm {normal,moses,tmx,links}, --write_mode {normal,moses,tmx,links}
                        Set write mode
  -pn, --print_file_names
                        Print file names when using moses format
  -f, --fast            Fast parsing. Faster than normal parsing, if you print
                        a small part of the whole corpus, but requires the
                        sentence ids in alignment files to be in sequence.
  -rd path_to_dir, --root_directory path_to_dir
                        Change root directory (default=/proj/nlpl/data/OPUS)
  -af path_to_file, --alignment_file path_to_file
                        Use given alignment file
  -sz path_to_zip, --source_zip path_to_zip
                        Use given source zip file
  -tz path_to_zip, --target_zip path_to_zip
                        Use given target zip file
  -cm delimiter, --change_moses_delimiter delimiter
                        Change moses delimiter (default=tab)
  -pa, --print_annotations
                        Print annotations, if they exist
  -sa attribute [attribute ...], --source_annotations attribute [attribute ...]
                        Set source sentence annotation attributes to be
                        printed, e.g. -sa pos lem. To print all available
                        attributes use -sa all_attrs (default=pos lem)
  -ta attribute [attribute ...], --target_annotations attribute [attribute ...]
                        Set target sentence annotation attributes to be
                        printed, e.g. -ta pos lem. To print all available
                        attributes use -ta all_attrs (default=pos lem)
  -ca delimiter, --change_annotation_delimiter delimiter
                        Change annotation delimiter (default=|)
  --src_cld2 lang_id score
                        Filter source sentences by their cld2 language id
                        labels and confidence score, e.g. en 0.9
  --trg_cld2 lang_id score
                        Filter target sentences by their cld2 language id
                        labels and confidence score, e.g.
                        en 0.9
  --src_langid lang_id score
                        Filter source sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  --trg_langid lang_id score
                        Filter target sentences by their langid.py language
                        id labels and confidence score, e.g. en 0.9
  -id file_name, --write_ids file_name
                        Write sentence ids to a file.
  -q, --suppress_prompts
                        Download necessary files without prompting "(y/n)"
  -dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                        Set download directory (default=current directory)
  -pi, --preserve_inline_tags
                        Preserve inline tags within sentences
  -v, --verbose         Print progress messages when writing results to files
```

### Description

`opus_read` is a script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. Linked XML files are specified in the "toDoc" and "fromDoc" attributes of each `<linkGrp>` element.

Several parameters can be set to filter the alignments and to print only certain types of alignments.

`opus_read` can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.

`opus_read` reads the alignments from zip files. Starting up the script might take some time if the zip files are large (for example OpenSubtitles in OPUS).

`opus_read` uses `ExhaustiveSentenceParser` by default. This means that each time a `<linkGrp>` tag is found, the corresponding source and target documents are read through and each sentence is stored in a hashmap with the sentence id as the key. This allows the reader to read alignment files that have sentence ids in non-sequential order. Each time a `<linkGrp>` tag is found, the script pauses printing for a second to read through the source and target documents. The duration of the pause depends on the size of the source and target documents.
Using the "-f" flag allows the usage of `SentenceParser`, which is faster than `ExhaustiveSentenceParser` in cases where only a small part of a corpus is read. `SentenceParser` does not store the sentences in a hashmap. Rather, when it finds a `<link>` tag, it iterates through a sentence file until a sentence id is matched with the sentence id found in the `<link>` tag. `SentenceParser` can't go backwards, which means that if the ids are not in sequential order in the alignment file, the parser will not find alignment pairs after the sentence id sequence breaks. `SentenceParser` is less reliable than `ExhaustiveSentenceParser`, but using the "-f" flag is beneficial when the whole corpus does not need to be scanned, in other words, when using the "-m" flag.

**Examples:**

Read sentence alignment in XCES align format. Necessary files will be downloaded if they are not found locally:

`opus_read --directory RF --source en --target sv`

Read sentences with a specific preprocessing type (default is xml, which is tokenized text):

`opus_read --directory RF --source en --target sv --preprocess raw`

Leave non-alignments (pairs with no sentences on one side) out:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --preprocess raw \
    --leave_non_alignments_out
```

Print first 10 alignment pairs:

`opus_read --directory RF --source en --target sv -m 10`

Print XCES align format of all 1:1 sentence alignments:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_range 1 \
    --tgt_range 1 \
    --write_mode links
```

Print alignments with alignment certainty greater than 1.1:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1
```

Write results to file:

`opus_read --directory RF --source en --target sv --write result.txt`

Write with different output format:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write result.tmx \
    --write_mode tmx
```

Write moses format to one file:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.txt \
    --write_mode moses
```

or to two files:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.en en-sv.sv \
    --write_mode moses
```

Read sentences using your alignment file. First create an alignment file, for example:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1 \
    --write_mode links \
    --write en-sv.links
```

Then use the created alignment file:

`opus_read --directory RF --source en --target sv --alignment_file en-sv.links`

Annotations can be printed with `--print_annotations` if they are included in the sentence files. To print all annotation attributes, specify this with `--source_annotations` and `--target_annotations` flags:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --print_annotations \
    --source_annotations all_attrs \
    --target_annotations all_attrs
```

Sentences can be filtered by their language id labels and confidence score. First, the language ids need to be added to the sentence files with `opus_langid`. If you have run the previous examples, you should have `RF_latest_xml_en.zip` and `RF_latest_xml_sv.zip` in your current working directory. Apply `opus_langid` to these files:

```
opus_langid --file_path RF_latest_xml_en.zip
opus_langid --file_path RF_latest_xml_sv.zip
```

Now you can filter by language ids.
This example uses both cld2 and langid.py language detection confidence scores:

```
opus_read --directory RF \
    --source en \
    --target sv \
    --src_cld2 en 0.99 \
    --trg_cld2 sv 0.99 \
    --src_langid en 1 \
    --trg_langid sv 1
```

**You can also import the module to your Python script:**

In `your_script.py`, first import the package:

`import opustools_pkg`

Initialize OpusRead:

```
opus_reader = opustools_pkg.OpusRead(
    directory='Books',
    source='en',
    target='fi')
opus_reader.printPairs()
```

and then run:

%prep
%autosetup -n opustools-pkg-0.0.52

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-opustools-pkg -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Wed May 31 2023 Python_Bot - 0.0.52-1
- Package Spec generated