automatic import of python-urltitle

author: CoprDistGit <infra@openeuler.org> 2023-05-17 03:17:58 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-17 03:17:58 +0000
commit: aeea78ebe511cf1e37abc435d3ec6ba3078f046c (patch)
tree: b09180553cd6ed68e62ea280df90083bec5424e1
parent: 83861aeb65318b9386f5f703f9fe706518b58de0 (diff)
3 files changed, 456 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..dead43c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/urltitle-0.3.8.tar.gz
diff --git a/python-urltitle.spec b/python-urltitle.spec
new file mode 100644
index 0000000..16da37b
--- /dev/null
+++ b/python-urltitle.spec
@@ -0,0 +1,454 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-urltitle
+Version:	0.3.8
+Release:	1
+Summary:	Get page title for URL
+License:	GNU Affero General Public License v3
+URL:		https://github.com/impredicative/urltitle/
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/9d/7a/945fc47db0920f90c09e17df435d87a4d6d20afedbdd97768a5caf347a0c/urltitle-0.3.8.tar.gz
+BuildArch:	noarch
+
+Requires:	python3-beautifulsoup4
+Requires:	python3-cachetools
+Requires:	python3-humanize
+Requires:	python3-pikepdf
+
+%description
+# urltitle
+**urltitle** uses Python ≥3.7 to return the page title or header-based description for a given URL.
+Its intended primary use is the inclusion of the returned value in conversations.
+As a disclaimer, note that the returned title is not guaranteed to be accurate due to many possible factors.
+
+[![cicd badge](https://github.com/impredicative/urltitle/workflows/cicd/badge.svg?branch=master)](https://github.com/impredicative/urltitle/actions?query=workflow%3Acicd+branch%3Amaster)
+
+## Features
+* An in-memory cache is used with a default entry expiration of a week. The cache size and time are customizable.
+* Only approximately the fraction of an HTML page required to return a title is read, up to a customizable maximum of 1 MiB.
+* A fallback to `og:title` and `twitter:title` is used if the `title` tag is unavailable.
+* A PDF title metadata extractor is used for PDF files of up to a customizable maximum size of 8 MiB.
+* Up to three attempts are made for resiliency unless there is an unrecoverable error, e.g. 400, 401, or 404.
+* A guess of `https` and otherwise `http` is made for a URL with a missing scheme, e.g. git-scm.com/downloads.
+* SSL verification for https sites can optionally be disabled.
+* A fallback to Google web cache is used if an HTML page presents a Distil captcha.
+  It is also used for a PDF which is too large or doesn't have title metadata.
+* Diagnostic logging can be optionally enabled for the logger named `urltitle` at the desired level.
+* Some site-specific customizations are configurable:
+  - Regular expression based URL and title substitutions
+  - Use of Google web cache
+  - User-Agent
+  - Additional headers
+  - CSS title selector
+  - Use of `og:title` or `twitter:title` over `title` tag
+  - Initial read size
+
+## Links
+* Code: https://github.com/impredicative/urltitle/
+* Release: https://pypi.org/project/urltitle/
+
+## Usage
+### Installation
+Python ≥3.7 is required due to a reference 
+to [`SSLCertVerificationError`](https://docs.python.org/3/library/ssl.html#ssl.SSLCertVerificationError).
+
+To install the package, run:
+
+    pip install urltitle
+
+### Examples
+```python
+from urltitle import URLTitleReader
+
+reader = URLTitleReader(verify_ssl=True)
+
+# Titles for HTML content
+reader.title('https://www.cnn.com/2019/02/11/health/insect-decline-study-intl/index.html')
+"Insect numbers in precipitous decline could have 'catastrophic' consequences, warns study - CNN"
+
+reader.title('https://www.youtube.com/watch?v=53YvP6gdD7U')
+'Deep Learning State of the Art (2019) - MIT - YouTube'
+
+# Titles for URLs with a missing scheme
+reader.title('www.reuters.com/article/us-usa-military-army/army-calls-base-housing-hazards-unconscionable-details-steps-to-protect-families-idUSKCN1Q4275')
+"Army calls base housing hazards 'unconscionable,' details steps to protect families | Reuters"
+
+reader.title('reddit.com/r/FoodNerds/comments/arb6qj')
+'Paternal high-fat diet transgenerationally impacts hepatic immunometabolism. - PubMed - NCBI : FoodNerds'
+
+reader.title('neverssl.com')
+'NeverSSL - helping you get online'
+
+# Titles for non-ASCII URLs
+reader.title('https://en.wikipedia.org/wiki/Amanattō')
+'Amanattō - Wikipedia'
+
+reader.title('https://fr.wikipedia.org/wiki/Wikipédia:Accueil_principal')
+"Wikipédia, l'encyclopédie libre"
+
+# Titles for PDFs having title metadata
+reader.title('https://www.diabetes.org.br/publico/images/pdf/artificial-sweeteners-induce-glucose-intolerance-by-altering-the-gut-microbiota.pdf')
+'Artificial sweeteners induce glucose intolerance by altering the gut microbiota'
+
+reader.title('https://www.omicsonline.org/open-access/detection-of-glyphosate-in-malformed-piglets-2161-0525.1000230.pdf')
+'Detection of Glyphosate in Malformed Piglets'
+
+# Titles for other content showing Content-Type and Content-Length as available:
+reader.title('https://www.sciencedaily.com/images/2019/02/190213142720_1_540x360.jpg')
+'(image/jpeg) (54K)'
+
+reader.title('https://kdnuggets.com/rss')
+'(application/rss+xml; charset=UTF-8)'
+
+reader.title('https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso')
+'(application/octet-stream) (2G)'
+
+# Titles for substituted URLs as per configuration:
+reader.title('https://arxiv.org/pdf/1902.04704.pdf')
+'[1902.04704] Neural network models and deep learning - a primer for biologists'
+
+reader.title('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2495396/pdf/postmedj00315-0056.pdf')
+"Features of a successful therapeutic fast of 382 days' duration"
+
+reader.title('https://pdfs.semanticscholar.org/1d76/d4561b594b5c5b5250edb43122d85db07262.pdf')
+'Nutrition and health. The issue is not food, nor nutrients, so much as processing. - Semantic Scholar'
+```
+
+### Exceptions
+An error is expected to raise the `urltitle.URLTitleError` exception.
+
+### Customizations
+For any site-specific customizations, update (but ideally not replace) 
+`urltitle.config.NETLOC_OVERRIDES` with the relevant sites using the preexisting entries in it as examples. 
+Refer to [`overrides.py`](urltitle/config/overrides.py).
+The site of a URL is as defined and returned by the `URLTitleReader().netloc(url)` method in
+[`urltitle.py`](urltitle/urltitle.py).
+
+The following examples show various URLs and their corresponding sites for the purpose of entering site-specific
+customizations:
+
+| URL | Site |
+| --- | ---- |
+| `https://www.google.com/search?q=asdf` | `google.com` |
+| `https://google.com/search?q=hjkl` | `google.com` |
+| `google.com/search?q=qwer` | `google.com` |
+| `google.com` | `google.com` |
+| `GOOGLE.COM` | `google.com` |
+| `gOogLE.com` | `google.com` |
+| `https://drive.google.com/drive/my-drive` | `drive.google.com` |
+| `https://help.github.com/en/` | `help.github.com` |
+| `https://github.com/pytorch/pytorch` | `github.com` |
+| `https://www.amazon.com/gp/product/B01F8POA7U` | `amazon.com` |
+| `https://rise.cs.berkeley.edu/blog/` | `rise.cs.berkeley.edu` |
+| `https://www.swansonvitamins.com/web-specials` | `swansonvitamins.com` |
+
+
+
+%package -n python3-urltitle
+Summary:	Get page title for URL
+Provides:	python-urltitle
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-urltitle
+# urltitle
+**urltitle** uses Python ≥3.7 to return the page title or header-based description for a given URL.
+Its intended primary use is the inclusion of the returned value in conversations.
+As a disclaimer, note that the returned title is not guaranteed to be accurate due to many possible factors.
+
+[![cicd badge](https://github.com/impredicative/urltitle/workflows/cicd/badge.svg?branch=master)](https://github.com/impredicative/urltitle/actions?query=workflow%3Acicd+branch%3Amaster)
+
+## Features
+* An in-memory cache is used with a default entry expiration of a week. The cache size and time are customizable.
+* Only approximately the fraction of an HTML page required to return a title is read, up to a customizable maximum of 1 MiB.
+* A fallback to `og:title` and `twitter:title` is used if the `title` tag is unavailable.
+* A PDF title metadata extractor is used for PDF files of up to a customizable maximum size of 8 MiB.
+* Up to three attempts are made for resiliency unless there is an unrecoverable error, e.g. 400, 401, or 404.
+* A guess of `https` and otherwise `http` is made for a URL with a missing scheme, e.g. git-scm.com/downloads.
+* SSL verification for https sites can optionally be disabled.
+* A fallback to Google web cache is used if an HTML page presents a Distil captcha.
+  It is also used for a PDF which is too large or doesn't have title metadata.
+* Diagnostic logging can be optionally enabled for the logger named `urltitle` at the desired level.
+* Some site-specific customizations are configurable:
+  - Regular expression based URL and title substitutions
+  - Use of Google web cache
+  - User-Agent
+  - Additional headers
+  - CSS title selector
+  - Use of `og:title` or `twitter:title` over `title` tag
+  - Initial read size
+
+## Links
+* Code: https://github.com/impredicative/urltitle/
+* Release: https://pypi.org/project/urltitle/
+
+## Usage
+### Installation
+Python ≥3.7 is required due to a reference 
+to [`SSLCertVerificationError`](https://docs.python.org/3/library/ssl.html#ssl.SSLCertVerificationError).
+
+To install the package, run:
+
+    pip install urltitle
+
+### Examples
+```python
+from urltitle import URLTitleReader
+
+reader = URLTitleReader(verify_ssl=True)
+
+# Titles for HTML content
+reader.title('https://www.cnn.com/2019/02/11/health/insect-decline-study-intl/index.html')
+"Insect numbers in precipitous decline could have 'catastrophic' consequences, warns study - CNN"
+
+reader.title('https://www.youtube.com/watch?v=53YvP6gdD7U')
+'Deep Learning State of the Art (2019) - MIT - YouTube'
+
+# Titles for URLs with a missing scheme
+reader.title('www.reuters.com/article/us-usa-military-army/army-calls-base-housing-hazards-unconscionable-details-steps-to-protect-families-idUSKCN1Q4275')
+"Army calls base housing hazards 'unconscionable,' details steps to protect families | Reuters"
+
+reader.title('reddit.com/r/FoodNerds/comments/arb6qj')
+'Paternal high-fat diet transgenerationally impacts hepatic immunometabolism. - PubMed - NCBI : FoodNerds'
+
+reader.title('neverssl.com')
+'NeverSSL - helping you get online'
+
+# Titles for non-ASCII URLs
+reader.title('https://en.wikipedia.org/wiki/Amanattō')
+'Amanattō - Wikipedia'
+
+reader.title('https://fr.wikipedia.org/wiki/Wikipédia:Accueil_principal')
+"Wikipédia, l'encyclopédie libre"
+
+# Titles for PDFs having title metadata
+reader.title('https://www.diabetes.org.br/publico/images/pdf/artificial-sweeteners-induce-glucose-intolerance-by-altering-the-gut-microbiota.pdf')
+'Artificial sweeteners induce glucose intolerance by altering the gut microbiota'
+
+reader.title('https://www.omicsonline.org/open-access/detection-of-glyphosate-in-malformed-piglets-2161-0525.1000230.pdf')
+'Detection of Glyphosate in Malformed Piglets'
+
+# Titles for other content showing Content-Type and Content-Length as available:
+reader.title('https://www.sciencedaily.com/images/2019/02/190213142720_1_540x360.jpg')
+'(image/jpeg) (54K)'
+
+reader.title('https://kdnuggets.com/rss')
+'(application/rss+xml; charset=UTF-8)'
+
+reader.title('https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso')
+'(application/octet-stream) (2G)'
+
+# Titles for substituted URLs as per configuration:
+reader.title('https://arxiv.org/pdf/1902.04704.pdf')
+'[1902.04704] Neural network models and deep learning - a primer for biologists'
+
+reader.title('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2495396/pdf/postmedj00315-0056.pdf')
+"Features of a successful therapeutic fast of 382 days' duration"
+
+reader.title('https://pdfs.semanticscholar.org/1d76/d4561b594b5c5b5250edb43122d85db07262.pdf')
+'Nutrition and health. The issue is not food, nor nutrients, so much as processing. - Semantic Scholar'
+```
+
+### Exceptions
+An error is expected to raise the `urltitle.URLTitleError` exception.
+
+### Customizations
+For any site-specific customizations, update (but ideally not replace) 
+`urltitle.config.NETLOC_OVERRIDES` with the relevant sites using the preexisting entries in it as examples. 
+Refer to [`overrides.py`](urltitle/config/overrides.py).
+The site of a URL is as defined and returned by the `URLTitleReader().netloc(url)` method in
+[`urltitle.py`](urltitle/urltitle.py).
+
+The following examples show various URLs and their corresponding sites for the purpose of entering site-specific
+customizations:
+
+| URL | Site |
+| --- | ---- |
+| `https://www.google.com/search?q=asdf` | `google.com` |
+| `https://google.com/search?q=hjkl` | `google.com` |
+| `google.com/search?q=qwer` | `google.com` |
+| `google.com` | `google.com` |
+| `GOOGLE.COM` | `google.com` |
+| `gOogLE.com` | `google.com` |
+| `https://drive.google.com/drive/my-drive` | `drive.google.com` |
+| `https://help.github.com/en/` | `help.github.com` |
+| `https://github.com/pytorch/pytorch` | `github.com` |
+| `https://www.amazon.com/gp/product/B01F8POA7U` | `amazon.com` |
+| `https://rise.cs.berkeley.edu/blog/` | `rise.cs.berkeley.edu` |
+| `https://www.swansonvitamins.com/web-specials` | `swansonvitamins.com` |
+
+
+
+%package help
+Summary:	Development documents and examples for urltitle
+Provides:	python3-urltitle-doc
+%description help
+# urltitle
+**urltitle** uses Python ≥3.7 to return the page title or header-based description for a given URL.
+Its intended primary use is the inclusion of the returned value in conversations.
+As a disclaimer, note that the returned title is not guaranteed to be accurate due to many possible factors.
+
+[![cicd badge](https://github.com/impredicative/urltitle/workflows/cicd/badge.svg?branch=master)](https://github.com/impredicative/urltitle/actions?query=workflow%3Acicd+branch%3Amaster)
+
+## Features
+* An in-memory cache is used with a default entry expiration of a week. The cache size and time are customizable.
+* Only approximately the fraction of an HTML page required to return a title is read, up to a customizable maximum of 1 MiB.
+* A fallback to `og:title` and `twitter:title` is used if the `title` tag is unavailable.
+* A PDF title metadata extractor is used for PDF files of up to a customizable maximum size of 8 MiB.
+* Up to three attempts are made for resiliency unless there is an unrecoverable error, e.g. 400, 401, or 404.
+* A guess of `https` and otherwise `http` is made for a URL with a missing scheme, e.g. git-scm.com/downloads.
+* SSL verification for https sites can optionally be disabled.
+* A fallback to Google web cache is used if an HTML page presents a Distil captcha.
+  It is also used for a PDF which is too large or doesn't have title metadata.
+* Diagnostic logging can be optionally enabled for the logger named `urltitle` at the desired level.
+* Some site-specific customizations are configurable:
+  - Regular expression based URL and title substitutions
+  - Use of Google web cache
+  - User-Agent
+  - Additional headers
+  - CSS title selector
+  - Use of `og:title` or `twitter:title` over `title` tag
+  - Initial read size
+
+## Links
+* Code: https://github.com/impredicative/urltitle/
+* Release: https://pypi.org/project/urltitle/
+
+## Usage
+### Installation
+Python ≥3.7 is required due to a reference 
+to [`SSLCertVerificationError`](https://docs.python.org/3/library/ssl.html#ssl.SSLCertVerificationError).
+
+To install the package, run:
+
+    pip install urltitle
+
+### Examples
+```python
+from urltitle import URLTitleReader
+
+reader = URLTitleReader(verify_ssl=True)
+
+# Titles for HTML content
+reader.title('https://www.cnn.com/2019/02/11/health/insect-decline-study-intl/index.html')
+"Insect numbers in precipitous decline could have 'catastrophic' consequences, warns study - CNN"
+
+reader.title('https://www.youtube.com/watch?v=53YvP6gdD7U')
+'Deep Learning State of the Art (2019) - MIT - YouTube'
+
+# Titles for URLs with a missing scheme
+reader.title('www.reuters.com/article/us-usa-military-army/army-calls-base-housing-hazards-unconscionable-details-steps-to-protect-families-idUSKCN1Q4275')
+"Army calls base housing hazards 'unconscionable,' details steps to protect families | Reuters"
+
+reader.title('reddit.com/r/FoodNerds/comments/arb6qj')
+'Paternal high-fat diet transgenerationally impacts hepatic immunometabolism. - PubMed - NCBI : FoodNerds'
+
+reader.title('neverssl.com')
+'NeverSSL - helping you get online'
+
+# Titles for non-ASCII URLs
+reader.title('https://en.wikipedia.org/wiki/Amanattō')
+'Amanattō - Wikipedia'
+
+reader.title('https://fr.wikipedia.org/wiki/Wikipédia:Accueil_principal')
+"Wikipédia, l'encyclopédie libre"
+
+# Titles for PDFs having title metadata
+reader.title('https://www.diabetes.org.br/publico/images/pdf/artificial-sweeteners-induce-glucose-intolerance-by-altering-the-gut-microbiota.pdf')
+'Artificial sweeteners induce glucose intolerance by altering the gut microbiota'
+
+reader.title('https://www.omicsonline.org/open-access/detection-of-glyphosate-in-malformed-piglets-2161-0525.1000230.pdf')
+'Detection of Glyphosate in Malformed Piglets'
+
+# Titles for other content showing Content-Type and Content-Length as available:
+reader.title('https://www.sciencedaily.com/images/2019/02/190213142720_1_540x360.jpg')
+'(image/jpeg) (54K)'
+
+reader.title('https://kdnuggets.com/rss')
+'(application/rss+xml; charset=UTF-8)'
+
+reader.title('https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso')
+'(application/octet-stream) (2G)'
+
+# Titles for substituted URLs as per configuration:
+reader.title('https://arxiv.org/pdf/1902.04704.pdf')
+'[1902.04704] Neural network models and deep learning - a primer for biologists'
+
+reader.title('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2495396/pdf/postmedj00315-0056.pdf')
+"Features of a successful therapeutic fast of 382 days' duration"
+
+reader.title('https://pdfs.semanticscholar.org/1d76/d4561b594b5c5b5250edb43122d85db07262.pdf')
+'Nutrition and health. The issue is not food, nor nutrients, so much as processing. - Semantic Scholar'
+```
+
+### Exceptions
+An error is expected to raise the `urltitle.URLTitleError` exception.
+
+### Customizations
+For any site-specific customizations, update (but ideally not replace) 
+`urltitle.config.NETLOC_OVERRIDES` with the relevant sites using the preexisting entries in it as examples. 
+Refer to [`overrides.py`](urltitle/config/overrides.py).
+The site of a URL is as defined and returned by the `URLTitleReader().netloc(url)` method in
+[`urltitle.py`](urltitle/urltitle.py).
+
+The following examples show various URLs and their corresponding sites for the purpose of entering site-specific
+customizations:
+
+| URL | Site |
+| --- | ---- |
+| `https://www.google.com/search?q=asdf` | `google.com` |
+| `https://google.com/search?q=hjkl` | `google.com` |
+| `google.com/search?q=qwer` | `google.com` |
+| `google.com` | `google.com` |
+| `GOOGLE.COM` | `google.com` |
+| `gOogLE.com` | `google.com` |
+| `https://drive.google.com/drive/my-drive` | `drive.google.com` |
+| `https://help.github.com/en/` | `help.github.com` |
+| `https://github.com/pytorch/pytorch` | `github.com` |
+| `https://www.amazon.com/gp/product/B01F8POA7U` | `amazon.com` |
+| `https://rise.cs.berkeley.edu/blog/` | `rise.cs.berkeley.edu` |
+| `https://www.swansonvitamins.com/web-specials` | `swansonvitamins.com` |
+
+
+
+%prep
+%autosetup -n urltitle-0.3.8
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-urltitle -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 17 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.8-1
+- Package Spec generated
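Note on the URL-to-site table in the Customizations section of the description above: the mapping it shows can be approximated with the standard library alone. The sketch below is not the package's `URLTitleReader().netloc(url)` implementation, which this diff does not include. `site_of` is a hypothetical helper, and it assumes the rules visible in the table: a missing scheme is tolerated, the host is lowercased, and a leading `www.` is dropped. Ports and user-info are not handled.

```python
from urllib.parse import urlparse


def site_of(url: str) -> str:
    """Hypothetical helper approximating the URL -> site mapping in the table."""
    # Without a scheme, urlparse would put the host into the path, so add one.
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).netloc.lower()
    # The table maps www.google.com and google.com to the same site.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(site_of("https://www.google.com/search?q=asdf"))
print(site_of("GOOGLE.COM"))
print(site_of("https://drive.google.com/drive/my-drive"))
```

The value returned for a URL is what one would presumably use as the key when adding an entry to `urltitle.config.NETLOC_OVERRIDES`, but that should be checked against the real `netloc` method.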
diff --git a/sources b/sources
new file mode 100644
index 0000000..d52f35d
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+3d34eda5fe1a297da6703eafbdd13698  urltitle-0.3.8.tar.gz
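Note on the `sources` entry above: it pairs a digest with the tarball name. A minimal sketch for checking a downloaded tarball against such an entry follows. It assumes the 32-hex-digit value is an MD5 digest (the length is consistent with MD5, but the diff does not say so), and `parse_sources_line` and `md5_matches` are hypothetical helper names, not part of any packaging tool.

```python
import hashlib
from typing import Tuple


def parse_sources_line(line: str) -> Tuple[str, str]:
    """Split a '<digest>  <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def md5_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's MD5 digest (assumed algorithm) equals expected_hex."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_hex.lower()


digest, filename = parse_sources_line(
    "3d34eda5fe1a297da6703eafbdd13698  urltitle-0.3.8.tar.gz"
)
print(digest, filename)
```

With the tarball present in the working directory, `md5_matches(filename, digest)` would report whether the download matches the recorded digest.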