%global _empty_manifest_terminate_build 0
Name:           python-scrapy-requests
Version:        0.2.0
Release:        1
Summary:        Scrapy with requests-html
License:        MIT
URL:            https://github.com/rafyzg/scrapy-requests
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/33/10/76fc04b22ad261867080471d9d18ff45ff6acd41e051f71664b7deda68a1/scrapy-requests-0.2.0.tar.gz
BuildArch:      noarch

Requires:       python3-scrapy
Requires:       python3-requests-html

%description
# scrapy-requests

![PyPI](https://img.shields.io/pypi/v/scrapy-requests)
[![Build Status](https://travis-ci.org/rafyzg/scrapy-requests.svg?branch=main)](https://travis-ci.org/rafyzg/scrapy-requests)
![Codecov](https://img.shields.io/codecov/c/github/rafyzg/scrapy-requests)

Scrapy middleware to handle JavaScript pages asynchronously using requests-html.

requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. It is intuitive and simple to use. [Check out its documentation.](https://github.com/psf/requests-html "requests_html repo")

## Requirements

- Python >= 3.6
- Scrapy >= 2.0
- requests-html

## Installation

```
pip install scrapy-requests
```

## Configuration

Make Twisted use the asyncio event loop, and add RequestsMiddleware to the downloader middlewares.

#### settings.py

```python
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800
}
```

## Usage

Use scrapy_requests.HtmlRequest instead of scrapy.Request:

```python
from scrapy_requests import HtmlRequest

yield HtmlRequest(url=url, callback=self.parse)
```

The request will be handled by requests_html, and an additional meta variable `page` containing the HTML object will be added to it.

```python
def parse(self, response):
    page = response.request.meta['page']
    page.html.render()
```

## Additional settings

If you would like the page to be rendered by pyppeteer, pass `True` to the `render` keyword parameter.
```python
yield HtmlRequest(url=url, callback=self.parse, render=True)
```

You can choose more specific behaviour for the HTML object. For example, you can set a sleep timer before the page loads and a JS script to execute on load:

```python
script = "document.body.querySelector('.btn').click();"

yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
```

You can pass default settings to the requests-html session, specifying headers, proxies, auth settings, etc. You do this by adding an additional variable to `settings.py`:

```python
DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # Verify SSL certificates
    'mock_browser': True,  # Mock browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],
}
```

## Notes

Please star this repo if you found it useful.
Feel free to contribute and propose issues & additional features.
License is MIT.

%package -n python3-scrapy-requests
Summary:        Scrapy with requests-html
Provides:       python-scrapy-requests
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-scrapy-requests
Scrapy middleware to handle JavaScript pages asynchronously using requests-html. requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. See the base package description for full usage details.
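Taken together, the reactor, middleware, and session settings described above amount to a `settings.py` along these lines. This is a sketch: the middleware priority of 800 comes from the README, while the exact set of session defaults you enable is up to your project.

```python
# settings.py -- sketch combining the scrapy-requests configuration
# shown in the README; values are illustrative, not mandatory.

# Run Twisted on the asyncio reactor so requests-html/pyppeteer can be awaited.
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# Route downloads through the scrapy-requests middleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800,
}

# Optional defaults passed to the underlying requests-html session.
DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,       # skip SSL certificate verification
    'mock_browser': True,  # send a browser-like user-agent
}
```

With this in place, spiders only need to yield `HtmlRequest` objects; no per-request session setup is required.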
%package help
Summary:        Development documents and examples for scrapy-requests
Provides:       python3-scrapy-requests-doc

%description help
Development documents and examples for scrapy-requests, a Scrapy middleware that handles JavaScript pages asynchronously using requests-html. See the base package description for full usage details.

%prep
%autosetup -n scrapy-requests-0.2.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-scrapy-requests -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Wed May 31 2023 Python_Bot - 0.2.0-1
- Package Spec generated