 .gitignore         |   1
 python-pycld2.spec | 396
 sources            |   1
 3 files changed, 398 insertions, 0 deletions
@@ -0,0 +1 @@
+/pycld2-0.41.tar.gz
diff --git a/python-pycld2.spec b/python-pycld2.spec
new file mode 100644
index 0000000..5710ffd
--- /dev/null
+++ b/python-pycld2.spec
@@ -0,0 +1,396 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pycld2
+Version: 0.41
+Release: 1
+Summary: Python bindings around Google Chromium's embedded compact language detection library (CLD2)
+License: Apache2
+URL: https://github.com/aboSamoor/pycld2
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/21/d2/8b0def84a53c88d0eb27c67b05269fbd16ad68df8c78849e7b5d65e6aec3/pycld2-0.41.tar.gz
+BuildArch: noarch
+
+
+%description
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detector 2 (CLD2).
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+  originally created by Mike McCandless, which has been modified post-fork.
+  These bindings, among other changes, make support for over 165 languages
+  the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+    "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+    fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is treated as HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `cld.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `cld.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to the details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector and `bytesLength` is its length. Note that there is some added CPU cost if this is `True` (approx. a 2x performance hit). |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%package -n python3-pycld2
+Summary: Python bindings around Google Chromium's embedded compact language detection library (CLD2)
+Provides: python-pycld2
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pycld2
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detector 2 (CLD2).
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+  originally created by Mike McCandless, which has been modified post-fork.
+  These bindings, among other changes, make support for over 165 languages
+  the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+    "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+    fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
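Since the only constraint the README states on `bytes` input is that it be valid UTF-8, that check can be performed up front. A minimal standalone sketch using only the standard library (pycld2 itself is not needed for this part; the helper name is my own):

```python
# Check that a bytes payload is valid UTF-8 before handing it to detect().
# This mirrors the requirement described above: a failed decode here is
# exactly the case where cld2.detect() would raise pycld2.error.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_bytes = "¼ cup of flour".encode("utf-8")    # b'\xc2\xbc cup of flour'
latin1_bytes = "¼ cup of flour".encode("latin-1")  # b'\xbc cup of flour'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False -- 0xBC is not a valid UTF-8 lead byte
```

Validating first lets a caller skip or re-encode bad inputs instead of wrapping every `detect()` call in a `try`/`except`.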
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is treated as HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `cld.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `cld.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to the details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector and `bytesLength` is its length. Note that there is some added CPU cost if this is `True` (approx. a 2x performance hit). |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%package help
+Summary: Development documents and examples for pycld2
+Provides: python3-pycld2-doc
+%description help
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detector 2 (CLD2).
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+  originally created by Mike McCandless, which has been modified post-fork.
+  These bindings, among other changes, make support for over 165 languages
+  the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+    "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+    fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is treated as HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `cld.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `cld.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to the details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector and `bytesLength` is its length. Note that there is some added CPU cost if this is `True` (approx. a 2x performance hit). |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%prep
+%autosetup -n pycld2-0.41
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pycld2 -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.41-1
+- Package Spec generated
@@ -0,0 +1 @@
+cff427c2cfa50bb9d11055a03a291042 pycld2-0.41.tar.gz
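The `%install` scriptlet above builds the `%files` manifests by walking the buildroot with `find ... -printf "/%h/%f\n"`: every regular file under `usr/lib` (etc.) is listed as the absolute path it will have on the installed system. A rough Python sketch of what that invocation produces, run against a throwaway directory tree (the buildroot layout here is invented for illustration):

```python
# Emulate: cd $buildroot && find usr/lib -type f -printf "/%h/%f\n"
# i.e. list every file under the buildroot's usr/lib, re-rooted at "/".
import os
import tempfile

buildroot = tempfile.mkdtemp()
pkg_dir = os.path.join(buildroot, "usr/lib/python3.11/site-packages/pycld2")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()

filelist = []
for dirpath, _, filenames in os.walk(os.path.join(buildroot, "usr/lib")):
    for name in filenames:
        full = os.path.join(dirpath, name)
        # Strip the buildroot prefix so the entry is the install-time path,
        # exactly what `%files -n python3-pycld2 -f filelist.lst` expects.
        filelist.append("/" + os.path.relpath(full, buildroot))

print(filelist)
# ['/usr/lib/python3.11/site-packages/pycld2/__init__.py']
```

Generating the manifest this way is what lets the auto-generated spec ship a `%files` section without hard-coding any module paths.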
