Diffstat (limited to 'python-gerapy-pyppeteer.spec')
-rw-r--r-- | python-gerapy-pyppeteer.spec | 1361 |
1 file changed, 1361 insertions, 0 deletions
diff --git a/python-gerapy-pyppeteer.spec b/python-gerapy-pyppeteer.spec
new file mode 100644
index 0000000..44c508d
--- /dev/null
+++ b/python-gerapy-pyppeteer.spec
@@ -0,0 +1,1361 @@
+%global _empty_manifest_terminate_build 0
+Name: python-gerapy-pyppeteer
+Version: 0.2.4
+Release: 1
+Summary: Pyppeteer Components for Scrapy & Gerapy
+License: MIT
+URL: https://github.com/Gerapy/GerapyPyppeteer
+Source0: https://mirrors.aliyun.com/pypi/web/packages/09/c5/54df6684821697ec1bb5bef633852d7475b2c8a923991c19301e047fd40b/gerapy-pyppeteer-0.2.4.tar.gz
+BuildArch: noarch
+
+Requires: python3-scrapy
+Requires: python3-pyppeteer
+
+%description
+
+# Gerapy Pyppeteer
+
+This package adds Pyppeteer support to Scrapy; it is also a module of
+[Gerapy](https://github.com/Gerapy/Gerapy).
+
+## Installation
+
+```shell script
+pip3 install gerapy-pyppeteer
+```
+
+## Usage
+
+You can use `PyppeteerRequest` to specify a request that should be rendered with Pyppeteer.
+
+For example:
+
+```python
+yield PyppeteerRequest(detail_url, callback=self.parse_detail)
+```
+
+You also need to enable `PyppeteerMiddleware` in `DOWNLOADER_MIDDLEWARES`:
+
+```python
+DOWNLOADER_MIDDLEWARES = {
+    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
+}
+```
+
+Congratulations, you have finished all of the required configuration.
+
+If you run the spider again, Pyppeteer will start up and render every
+web page that you requested through a PyppeteerRequest.
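+
+Putting the two pieces together, a minimal spider might look like the sketch
+below. The spider name, start URL, and CSS selector are illustrative
+(borrowed from the example output further down), not part of the library:
+
+```python
+import scrapy
+from gerapy_pyppeteer import PyppeteerRequest
+
+
+class BookSpider(scrapy.Spider):
+    name = 'book'
+
+    def start_requests(self):
+        # Render the listing page with Pyppeteer instead of the plain downloader
+        yield PyppeteerRequest('https://dynamic5.scrape.center/page/1',
+                               callback=self.parse,
+                               wait_for='.item .name')
+
+    def parse(self, response):
+        # The response body now contains the JavaScript-rendered HTML
+        for name in response.css('.item .name::text').getall():
+            yield {'name': name}
+```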
+
+## Settings
+
+GerapyPyppeteer provides some optional settings.
+
+### Concurrency
+
+You can use Scrapy's own setting to control the concurrency of Pyppeteer,
+for example:
+
+```python
+CONCURRENT_REQUESTS = 3
+```
+
+### Pretend as Real Browser
+
+Some websites detect WebDriver or headless mode; GerapyPyppeteer can make
+Chromium pretend to be a real browser by injecting scripts. This is enabled
+by default.
+
+You can disable it to speed things up if the website does not detect WebDriver:
+
+```python
+GERAPY_PYPPETEER_PRETEND = False
+```
+
+You can also use the `pretend` attribute of `PyppeteerRequest` to override
+this configuration, as shown below.
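+
+For instance, a sketch of disabling pretending for a single request while
+keeping the global default (`url` is a placeholder):
+
+```python
+# Per-request override of GERAPY_PYPPETEER_PRETEND
+yield PyppeteerRequest(url, callback=self.parse_detail, pretend=False)
+```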
+
+### Logging Level
+
+By default, Pyppeteer logs all debug messages, so GerapyPyppeteer sets the
+logging level of Pyppeteer to WARNING.
+
+If you want to see more logs from Pyppeteer, you can change this setting:
+
+```python
+import logging
+GERAPY_PYPPETEER_LOGGING_LEVEL = logging.DEBUG
+```
+
+### Download Timeout
+
+Pyppeteer may take some time to render the required web page. You can change
+this setting; the default is `30` seconds:
+
+```python
+# pyppeteer timeout
+GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
+```
+
+### Headless
+
+By default, Pyppeteer runs in headless mode. You can change it to `False` as
+needed; the default is `True`:
+
+```python
+GERAPY_PYPPETEER_HEADLESS = False
+```
+
+### Window Size
+
+You can also set the width and height of the Pyppeteer window:
+
+```python
+GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
+GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
+```
+
+The defaults are 1400 and 700.
+
+### Proxy
+
+You can set a proxy via the config below:
+
+```python
+GERAPY_PYPPETEER_PROXY = 'http://tps254.kdlapi.com:15818'
+GERAPY_PYPPETEER_PROXY_CREDENTIAL = {
+    'username': 'xxx',
+    'password': 'xxxx'
+}
+```
+
+### Pyppeteer Args
+
+You can also change the args of Pyppeteer, such as `dumpio`, `devtools`, etc.
+
+Optional settings and their default values:
+
+```python
+GERAPY_PYPPETEER_DUMPIO = False
+GERAPY_PYPPETEER_DEVTOOLS = False
+GERAPY_PYPPETEER_EXECUTABLE_PATH = None
+GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
+GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
+GERAPY_PYPPETEER_MUTE_AUDIO = True
+GERAPY_PYPPETEER_NO_SANDBOX = True
+GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
+GERAPY_PYPPETEER_DISABLE_GPU = True
+```
+
+### Disable Loading of Specific Resource Types
+
+You can disable the loading of specific resource types to decrease the page
+loading time. You can configure the disabled resource types using
+`GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`:
+
+```python
+GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = []
+```
+
+For example, if you want to disable the loading of CSS and JavaScript, you
+can set it as below:
+
+```python
+GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
+```
+
+All of the optional resource types:
+
+- document: the original HTML document
+- stylesheet: CSS files
+- script: JavaScript files
+- image: images
+- media: media files such as audio or video
+- font: font files
+- texttrack: text track files
+- xhr: Ajax requests
+- fetch: Fetch requests
+- eventsource: EventSource
+- websocket: WebSocket
+- manifest: manifest files
+- other: other files
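+
+The same switch is also available per request through the
+`ignore_resource_types` argument of `PyppeteerRequest` (documented below);
+a sketch, with `url` as a placeholder:
+
+```python
+# Skip images and fonts for this request only, overriding
+# GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES
+yield PyppeteerRequest(url, callback=self.parse_detail,
+                       ignore_resource_types=['image', 'font'])
+```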
+
+### Screenshot
+
+You can get a screenshot of the loaded page by passing the `screenshot`
+argument to `PyppeteerRequest` as a dict:
+
+- `type` (str): Specify screenshot type, can be either `jpeg` or `png`. Defaults to `png`.
+- `quality` (int): The quality of the image, between 0-100. Not applicable to `png` images.
+- `fullPage` (bool): When true, take a screenshot of the full scrollable page. Defaults to `False`.
+- `clip` (dict): An object which specifies the clipping region of the page. It should have the following fields:
+  - `x` (int): x-coordinate of the top-left corner of the clip area.
+  - `y` (int): y-coordinate of the top-left corner of the clip area.
+  - `width` (int): width of the clipping area.
+  - `height` (int): height of the clipping area.
+- `omitBackground` (bool): Hide the default white background and allow capturing a screenshot with transparency.
+- `encoding` (str): The encoding of the image, can be either `base64` or `binary`. Defaults to `binary`. If `binary`, it will return a `BytesIO` object.
+
+For example:
+
+```python
+yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
+    'type': 'png',
+    'fullPage': True
+})
+```
+
+Then you can get the screenshot result from `response.meta['screenshot']`.
+
+The simplest option is to save it to a file:
+
+```python
+def parse_index(self, response):
+    with open('screenshot.png', 'wb') as f:
+        f.write(response.meta['screenshot'].getbuffer())
+```
+
+If you want to enable screenshots for all requests, you can configure this
+with `GERAPY_PYPPETEER_SCREENSHOT`.
+
+For example:
+
+```python
+GERAPY_PYPPETEER_SCREENSHOT = {
+    'type': 'png',
+    'fullPage': True
+}
+```
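+
+As another sketch built from the documented fields above, `clip` can restrict
+the capture to a region of the page:
+
+```python
+# Capture only the top-left 800x600 region as a JPEG
+yield PyppeteerRequest(start_url, callback=self.parse_index, screenshot={
+    'type': 'jpeg',
+    'quality': 80,
+    'clip': {'x': 0, 'y': 0, 'width': 800, 'height': 600}
+})
+```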
+
+## PyppeteerRequest
+
+`PyppeteerRequest` provides arguments which can override the global settings above.
+
+- url: request URL
+- callback: callback
+- wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2",
+  see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
+- wait_for: wait for some element to load; also supports a dict
+- script: script to execute
+- actions: a function defined for execution on the Page object
+- proxy: proxy to use for this request, like `http://x.x.x.x:x`
+- proxy_credential: the proxy credential, like `{'username': 'xxxx', 'password': 'xxxx'}`
+- sleep: time to sleep after the page has loaded, overrides `GERAPY_PYPPETEER_SLEEP`
+- timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
+- ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
+- pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
+- screenshot: screenshot options, see
+  https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot,
+  overrides `GERAPY_PYPPETEER_SCREENSHOT`
+
+For example, you can configure a PyppeteerRequest as:
+
+```python
+from gerapy_pyppeteer import PyppeteerRequest
+
+def parse(self, response):
+    yield PyppeteerRequest(url,
+                           callback=self.parse_detail,
+                           wait_until='domcontentloaded',
+                           wait_for='title',
+                           script='() => { return {name: "Germey"} }',
+                           sleep=2)
+```
+
+Then Pyppeteer will:
+
+- wait for the document to load
+- wait for the title element to load
+- execute the given `script`
+- sleep for 2s
+- return the rendered web page content in the response body
+- return the script execution result, available in `response.meta['script_result']`
+
+For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:
+
+```python
+js = '''async () => {
+    await new Promise(resolve => setTimeout(resolve, 10000));
+    return {
+        'name': 'Germey'
+    }
+}
+'''
+yield PyppeteerRequest(url, callback=self.parse, script=js)
+```
+
+Then you can get the script result from `response.meta['script_result']`; the result is `{'name': 'Germey'}`.
+
+If you find the JavaScript awkward to write, you can use the `actions`
+argument to define a Python function that operates on the Page object,
+for example:
+
+```python
+from pyppeteer.page import Page
+
+async def execute_actions(page: Page):
+    await page.evaluate('() => { document.title = "Hello World"; }')
+    return 1
+
+yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
+```
+
+Then you can get the actions result from `response.meta['actions_result']`; the result is `1`.
+
+You can also define proxy and proxy_credential for each Request, for example:
+
+```python
+yield PyppeteerRequest(
+    self.base_url,
+    callback=self.parse_index,
+    priority=10,
+    proxy='http://tps254.kdlapi.com:15818',
+    proxy_credential={
+        'username': 'xxxx',
+        'password': 'xxxx'
+    })
+```
+
+`proxy` and `proxy_credential` will override the settings `GERAPY_PYPPETEER_PROXY` and `GERAPY_PYPPETEER_PROXY_CREDENTIAL`.
+
+## Example
+
+For more detail, please see [example](./example).
+
+You can also run it directly with Docker:
+
+```
+docker run germey/gerapy-pyppeteer-example
+```
+
+Outputs:
+
+```shell script
+2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
+2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
+2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
+2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings:
+{'BOT_NAME': 'example',
+ 'CONCURRENT_REQUESTS': 3,
+ 'NEWSPIDER_MODULE': 'example.spiders',
+ 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
+ 'SPIDER_MODULES': ['example.spiders']}
+2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
+2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions:
+['scrapy.extensions.corestats.CoreStats',
+ 'scrapy.extensions.telnet.TelnetConsole',
+ 'scrapy.extensions.memusage.MemoryUsage',
+ 'scrapy.extensions.logstats.LogStats']
+2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
+['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
+ 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
+ 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
+ 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
+ 'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware',
+ 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
+ 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
+ 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
+ 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
+ 'scrapy.downloadermiddlewares.stats.DownloaderStats']
+2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
+['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
+ 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
+ 'scrapy.spidermiddlewares.referer.RefererMiddleware',
+ 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
+ 'scrapy.spidermiddlewares.depth.DepthMiddleware']
+2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
+[]
+2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
+2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
+2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
+2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
+2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/1>
+2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
+2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
+2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/page/1> (referer: None)
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26898909>
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26861389>
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26855315>
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
+2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
+2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
+2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
+2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
+2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
+2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26861389> (referer: https://dynamic5.scrape.center/page/1)
+2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/2>
+2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
+{'name': '壁穴ヘブンホール',
+ 'score': '5.6',
+ 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
+2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
+2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
+2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26855315> (referer: https://dynamic5.scrape.center/page/1)
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/27047626>
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
+2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
+{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
+2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
+2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
+2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
+...
+```
+
+## Troubleshooting
+
+### Pyppeteer does not start properly
+
+If the Chromium download is too slow, Pyppeteer cannot be used normally.
+
+Here are two solutions:
+
+#### Solution 1 (Recommended)
+
+Modify the driver download source at `pyppeteer/chromium_downloader.py` line 22:
+
+```python
+# Default:
+DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com'
+# Modified:
+DEFAULT_DOWNLOAD_HOST = 'http://npm.taobao.org/mirror'
+```
+
+#### Solution 2
+
+Modify the driver executable path at `pyppeteer/chromium_downloader.py` line 45:
+
+```python
+# Default:
+chromiumExecutable = {
+    'linux': DOWNLOADS_FOLDER / REVISION / 'chrome-linux' / 'chrome',
+    'mac': (DOWNLOADS_FOLDER / REVISION / 'chrome-mac' / 'Chromium.app' /
+            'Contents' / 'MacOS' / 'Chromium'),
+    'win32': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
+    'win64': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
+}
+```
+
+Find the entry for your operating system and point it at your own Chrome or
+Chromium executable.
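+
+Alternatively, if you already have a Chrome or Chromium installation, you can
+point GerapyPyppeteer at it via the `GERAPY_PYPPETEER_EXECUTABLE_PATH` setting
+listed above; the path below is only an example:
+
+```python
+# Use a locally installed browser instead of the downloaded Chromium
+GERAPY_PYPPETEER_EXECUTABLE_PATH = '/usr/bin/chromium-browser'
+```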
+
+%package -n python3-gerapy-pyppeteer
+Summary: Pyppeteer Components for Scrapy & Gerapy
+Provides: python-gerapy-pyppeteer
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-gerapy-pyppeteer
+Pyppeteer components for Scrapy and Gerapy. See the base package description
+for full usage documentation.
+
+%package help
+Summary: Development documents and examples for gerapy-pyppeteer
+Provides: python3-gerapy-pyppeteer-doc
+%description help
+Development documents and examples for gerapy-pyppeteer. See the base package
+description for full usage documentation.
+
+%prep
+%autosetup -n gerapy-pyppeteer-0.2.4
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-gerapy-pyppeteer -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2.4-1
+- Package Spec generated