-rw-r--r-- .gitignore 1
-rw-r--r-- python-pycld2.spec 396
-rw-r--r-- sources 1
3 files changed, 398 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..97106f5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/pycld2-0.41.tar.gz
diff --git a/python-pycld2.spec b/python-pycld2.spec
new file mode 100644
index 0000000..5710ffd
--- /dev/null
+++ b/python-pycld2.spec
@@ -0,0 +1,396 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pycld2
+Version: 0.41
+Release: 1
+Summary: Python bindings around Google Chromium's embedded compact language detection library (CLD2)
+License: Apache2
+URL: https://github.com/aboSamoor/pycld2
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/21/d2/8b0def84a53c88d0eb27c67b05269fbd16ad68df8c78849e7b5d65e6aec3/pycld2-0.41.tar.gz
+BuildArch: noarch
+
+
+%description
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detect 2 (CLD2).
+
+[![Downloads](https://img.shields.io/pypi/dm/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Latest version](https://img.shields.io/pypi/v/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Supported Python versions](https://img.shields.io/pypi/pyversions/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Development Status](https://img.shields.io/pypi/status/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Download format](https://img.shields.io/pypi/format/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Build status](https://travis-ci.org/aboSamoor/pycld2.png?branch=master)](https://travis-ci.org/aboSamoor/pycld2)
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+ originally created by Mike McCandless, which has been modified post-fork.
+ These bindings, among other changes, make the support of over 165 languages
+ the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+ "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+ fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `pycld2.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `pycld2.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector, `bytesLength` is the length of the vector. Note that there is some added CPU cost if this is `True`. (Approx. 2x performance hit.) |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these names or codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%package -n python3-pycld2
+Summary: Python bindings around Google Chromium's embedded compact language detection library (CLD2)
+Provides: python-pycld2
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pycld2
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detect 2 (CLD2).
+
+[![Downloads](https://img.shields.io/pypi/dm/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Latest version](https://img.shields.io/pypi/v/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Supported Python versions](https://img.shields.io/pypi/pyversions/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Development Status](https://img.shields.io/pypi/status/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Download format](https://img.shields.io/pypi/format/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Build status](https://travis-ci.org/aboSamoor/pycld2.png?branch=master)](https://travis-ci.org/aboSamoor/pycld2)
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+ originally created by Mike McCandless, which has been modified post-fork.
+ These bindings, among other changes, make the support of over 165 languages
+ the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+ "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+ fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `pycld2.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `pycld2.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector, `bytesLength` is the length of the vector. Note that there is some added CPU cost if this is `True`. (Approx. 2x performance hit.) |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these names or codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%package help
+Summary: Development documents and examples for pycld2
+Provides: python3-pycld2-doc
+%description help
+# PYCLD2 - Python Bindings to CLD2
+
+Python bindings for the Compact Language Detect 2 (CLD2).
+
+[![Downloads](https://img.shields.io/pypi/dm/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Latest version](https://img.shields.io/pypi/v/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Supported Python versions](https://img.shields.io/pypi/pyversions/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Development Status](https://img.shields.io/pypi/status/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Download format](https://img.shields.io/pypi/format/pycld2.svg)](https://pypi.python.org/pypi/pycld2)
+[![Build status](https://travis-ci.org/aboSamoor/pycld2.png?branch=master)](https://travis-ci.org/aboSamoor/pycld2)
+
+This package contains forks of:
+
+- The [`cld2` C++ library](https://github.com/CLD2Owners/cld2), developed by Dick Sites
+- The [`chromium-compact-language-detector` C++ extension module](https://github.com/mikemccand/chromium-compact-language-detector),
+ originally created by Mike McCandless, which has been modified post-fork.
+ These bindings, among other changes, make the support of over 165 languages
+ the default.
+
+The goal of this project is to consolidate the upstream library with its bindings, so the user can `pip install` one package instead of two.
+
+The LICENSE is the same as Chromium's LICENSE and is included in the
+LICENSE file for reference.
+
+## Installing
+
+```bash
+$ python -m pip install -U pycld2
+```
+
+## Example
+
+```python
+import pycld2 as cld2
+
+isReliable, textBytesFound, details = cld2.detect(
+ "а неправильный формат идентификатора дн назад"
+)
+
+print(isReliable)
+# True
+details[0]
+# ('RUSSIAN', 'ru', 98, 404.0)
+
+fr_en_Latn = """\
+France is the largest country in Western Europe and the third-largest in Europe as a whole.
+A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
+et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
+Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
+dans le menu ci-dessus.
+Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
+The quick brown fox jumped over the lazy dog."""
+
+isReliable, textBytesFound, details, vectors = cld2.detect(
+ fr_en_Latn, returnVectors=True
+)
+print(vectors)
+# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))
+```
+
+## API
+
+This package exports one function, `detect()`. See `help(detect)` for the full docstring.
+
+The first parameter (`utf8Bytes`) is the text for which you want to detect language.
+
+`utf8Bytes` may be either:
+
+- `str` (example: `"¼ cup of flour"`)
+- `bytes` that have been encoded using UTF-8 (example: `"¼ cup of flour".encode("utf-8")`)
+
+Bytes that are *not* UTF-8 encoded will raise a `pycld2.error`. For example, passing
+`b"\xbc cup of flour"` (which is `"¼ cup of flour".encode("latin-1")`) will raise.
+
+All other parameters are optional:
+
+| Parameter | Type/Default | Use |
+| --------- | ------------ | --- |
+| `utf8Bytes` | `str` or `bytes`\* | The text to detect language for. |
+| `isPlainText` | `bool`, default `False` | If `False`, then the input is HTML and CLD will skip HTML tags, expand HTML entities, detect HTML `<lang ...>` tags, etc. |
+| `hintTopLevelDomain` | `str` | E.g., `'id'` boosts Indonesian. |
+| `hintLanguage` | `str` | E.g., `'ITALIAN'` or `'it'` boosts Italian; see `pycld2.LANGUAGES` for all known languages. |
+| `hintLanguageHTTPHeaders` | `str` | E.g., `'mi,en'` boosts Maori and English. |
+| `hintEncoding` | `str` | E.g., `'SJS'` boosts Japanese; see `pycld2.ENCODINGS` for all known encodings. |
+| `returnVectors` | `bool`, default `False` | If `True`, then the vectors indicating which language was detected in which byte range are returned in addition to details. The vectors are a sequence of `(bytesOffset, bytesLength, languageName, languageCode)`, in order. `bytesOffset` is the start of the vector, `bytesLength` is the length of the vector. Note that there is some added CPU cost if this is `True`. (Approx. 2x performance hit.) |
+| `debugScoreAsQuads` | `bool`, default `False` | Normally, several languages are detected solely by their Unicode script. Combined with appropriate lookup tables, this flag forces them instead to be detected via quadgrams. This can be a useful refinement when looking for meaningful text in these languages, instead of just character sets. The default tables do not support this use. |
+| `debugHTML` | `bool`, default `False` | For each detection call, write an HTML file to stderr, showing the text chunks and their detected languages. See `cld2/docs/InterpretingCLD2UnitTestOutput.pdf` to interpret this output. |
+| `debugCR` | `bool`, default `False` | In that HTML file, force a new line for each chunk. |
+| `debugVerbose` | `bool`, default `False` | In that HTML file, show every lookup entry. |
+| `debugQuiet` | `bool`, default `False` | In that HTML file, suppress most of the output detail. |
+| `debugEcho` | `bool`, default `False` | Echo every input buffer to stderr. |
+| `bestEffort` | `bool`, default `False` | If `True`, then allow low-quality results for short text, rather than forcing the result to `"UNKNOWN_LANGUAGE"`. This may be of use for those desiring approximate results on short input text, but there is no claim that these results are very good. |
+
+<sup>\*If `bytes`, must be UTF-8 encoded bytes.</sup>
+
+## Constants
+
+This package exports these global constants:
+
+| Constant | Description |
+| -------- | ----------- |
+| `pycld2.ENCODINGS` | list of the encoding names CLD recognizes (if you provide `hintEncoding`, it must be one of these names). |
+| `pycld2.LANGUAGES` | list of languages and their codes (if you provide `hintLanguage`, it must be one of these names or codes). |
+| `pycld2.EXTERNAL_LANGUAGES` | list of external languages and their codes. |
+| `pycld2.DETECTED_LANGUAGES` | list of all detectable languages. |
+
+## What About CLD3?
+
+Python bindings for [CLD3](https://github.com/google/cld3/) are available as a separate project, [`pycld3`](https://github.com/bsolomon1124/pycld3).
+
+%prep
+%autosetup -n pycld2-0.41
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pycld2 -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.41-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..f2eb7f6
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+cff427c2cfa50bb9d11055a03a291042 pycld2-0.41.tar.gz
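
The `sources` entry above pairs a digest with the tarball name. As a rough sketch of how a reviewer might check a downloaded tarball against it, the Python below assumes the 32-hex-digit value is an MD5 digest and that each line has the form `<digest> <filename>`; neither is stated in the diff. The helper names `parse_sources_line` and `md5_matches` are hypothetical, not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def parse_sources_line(line: str) -> tuple[str, str]:
    """Split a '<digest> <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def md5_matches(tarball: Path, expected_digest: str) -> bool:
    """Return True if the file's MD5 hex digest equals expected_digest."""
    md5 = hashlib.md5()
    with tarball.open("rb") as fh:
        # Read in 1 MiB chunks so large tarballs are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_digest


if __name__ == "__main__":
    # The line added to `sources` in this commit.
    digest, filename = parse_sources_line(
        "cff427c2cfa50bb9d11055a03a291042 pycld2-0.41.tar.gz"
    )
    path = Path(filename)
    if path.exists():
        print("OK" if md5_matches(path, digest) else "MISMATCH")
    else:
        print(f"{filename} not found; download Source0 first")
```

Run it from the directory that holds the tarball fetched from `Source0`. If the build system actually uses a different hash algorithm, swap `hashlib.md5` for the matching constructor.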