%global _empty_manifest_terminate_build 0
Name: python-gerapy-pyppeteer
Version: 0.2.4
Release: 1
Summary: Pyppeteer Components for Scrapy & Gerapy
License: MIT
URL: https://github.com/Gerapy/GerapyPyppeteer
Source0: https://mirrors.aliyun.com/pypi/web/packages/09/c5/54df6684821697ec1bb5bef633852d7475b2c8a923991c19301e047fd40b/gerapy-pyppeteer-0.2.4.tar.gz
BuildArch: noarch

Requires: python3-scrapy
Requires: python3-pyppeteer

%description
# Gerapy Pyppeteer

This package adds Pyppeteer support to Scrapy; it is also a module of [Gerapy](https://github.com/Gerapy/Gerapy).

## Installation

```shell script
pip3 install gerapy-pyppeteer
```

## Usage

You can use `PyppeteerRequest` to specify a request that should be rendered with Pyppeteer. For example:

```python
yield PyppeteerRequest(detail_url, callback=self.parse_detail)
```

You also need to enable `PyppeteerMiddleware` in `DOWNLOADER_MIDDLEWARES`:

```python
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
```

Congratulations, you've finished all of the required configuration. If you run the spider again, Pyppeteer will be used to render every web page you request via `PyppeteerRequest`.
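Putting the two pieces together, a minimal end-to-end spider might look like the sketch below. The spider name and target URL are hypothetical placeholders; only the `PyppeteerRequest` import and the middleware path come from this package.

```python
from scrapy import Spider
from gerapy_pyppeteer import PyppeteerRequest


class ExampleSpider(Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = 'example'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
        }
    }

    def start_requests(self):
        # Ask Pyppeteer (instead of the default downloader) to fetch and render the page.
        yield PyppeteerRequest('https://example.com', callback=self.parse)

    def parse(self, response):
        # The response body is the HTML as rendered by Chromium.
        self.logger.info('page title: %s', response.css('title::text').get())
```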
## Settings

GerapyPyppeteer provides some optional settings.

### Concurrency

You can directly use Scrapy's settings to control the concurrency of Pyppeteer, for example:

```python
CONCURRENT_REQUESTS = 3
```

### Pretend as Real Browser

Some websites detect WebDriver or headless mode. GerapyPyppeteer can make Chromium pretend to be a real browser by injecting scripts. This is enabled by default. You can disable it to speed things up if the target website does not perform such detection:

```python
GERAPY_PYPPETEER_PRETEND = False
```

You can also use the `pretend` attribute of `PyppeteerRequest` to override this configuration.

### Logging Level

By default, Pyppeteer logs all of its debug messages, so GerapyPyppeteer sets the logging level of Pyppeteer to `WARNING`. If you want to see more logs from Pyppeteer, you can change this setting:

```python
import logging
GERAPY_PYPPETEER_LOGGING_LEVEL = logging.DEBUG
```

### Download Timeout

Pyppeteer may take some time to render the required web page; you can change the timeout with this setting, which defaults to `30` seconds:

```python
# pyppeteer timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
```

### Headless

By default, Pyppeteer runs in headless mode; you can change this to `False` as needed. The default is `True`:

```python
GERAPY_PYPPETEER_HEADLESS = False
```

### Window Size

You can also set the width and height of the Pyppeteer window:

```python
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
```

The defaults are 1400 and 700.

### Proxy

You can set a proxy with the config below:

```python
GERAPY_PYPPETEER_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PYPPETEER_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx'
}
```

### Pyppeteer Args

You can also change the launch args of Pyppeteer, such as `dumpio`, `devtools`, etc. Optional settings and their default values:

```python
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
```

### Disable Loading of Specific Resource Types

You can disable the loading of specific resource types to decrease page load time. Configure the disabled resource types using `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = []
```

For example, to disable the loading of CSS and JavaScript, set:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
```

All of the supported resource types:

- document: the original HTML document
- stylesheet: CSS files
- script: JavaScript files
- image: images
- media: media files such as audio or video
- font: font files
- texttrack: text track files
- xhr: Ajax requests
- fetch: Fetch requests
- eventsource: Event Source
- websocket: WebSocket
- manifest: manifest files
- other: other files

### Screenshot

You can take a screenshot of the loaded page by passing a `screenshot` dict to `PyppeteerRequest`. It supports the following keys:

- `type` (str): screenshot type, either `jpeg` or `png`. Defaults to `png`.
- `quality` (int): quality of the image, between 0 and 100. Not applicable to `png` images.
- `fullPage` (bool): when true, take a screenshot of the full scrollable page. Defaults to `False`.
- `clip` (dict): an object which specifies the clipping region of the page. It should have the following fields:
  - `x` (int): x-coordinate of the top-left corner of the clip area.
  - `y` (int): y-coordinate of the top-left corner of the clip area.
  - `width` (int): width of the clipping area.
  - `height` (int): height of the clipping area.
- `omitBackground` (bool): hide the default white background, allowing screenshots with transparency.
- `encoding` (str): encoding of the image, either `base64` or `binary`. Defaults to `binary`. If binary, a `BytesIO` object is returned.

For example:

```python
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
    'type': 'png',
    'fullPage': True
})
```

Then you can get the screenshot result from `response.meta['screenshot']`. The simplest option is to save it to a file:

```python
def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())
```

If you want to enable screenshots for all requests, configure this globally via `GERAPY_PYPPETEER_SCREENSHOT`. For example:

```python
GERAPY_PYPPETEER_SCREENSHOT = {
    'type': 'png',
    'fullPage': True
}
```
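The example above uses the default `binary` encoding, so the meta value is a `BytesIO` object. If you request `encoding: 'base64'` instead, the meta value is presumably a base64-encoded string; under that assumption, a minimal sketch for saving it:

```python
import base64

def parse_index(self, response):
    # Assumes the request carried screenshot={'encoding': 'base64', ...},
    # so response.meta['screenshot'] is assumed to hold a base64 string.
    with open('screenshot.png', 'wb') as f:
        f.write(base64.b64decode(response.meta['screenshot']))
```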
## PyppeteerRequest

`PyppeteerRequest` provides arguments that can override the global settings above:

- url: request URL
- callback: callback
- wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2"; see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
- wait_for: wait for some element to load; also supports a dict
- script: JavaScript to execute
- actions: a Python function executed with the Page object
- proxy: proxy to use for this request, like `http://x.x.x.x:x`, overrides `GERAPY_PYPPETEER_PROXY`
- proxy_credential: the proxy credential, like `{'username': 'xxxx', 'password': 'xxxx'}`, overrides `GERAPY_PYPPETEER_PROXY_CREDENTIAL`
- sleep: time to sleep after the page is loaded, overrides `GERAPY_PYPPETEER_SLEEP`
- timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
- ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
- pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
- screenshot: screenshot options, see https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot, overrides `GERAPY_PYPPETEER_SCREENSHOT`

For example, you can configure a PyppeteerRequest as:

```python
from gerapy_pyppeteer import PyppeteerRequest

def parse(self, response):
    yield PyppeteerRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)
```

Then Pyppeteer will:

- wait for the document to load (`domcontentloaded`)
- wait for the `title` element to appear
- execute the passed `script`
- sleep for 2 seconds
- return the rendered web page content as the response body
- return the script result via `response.meta['script_result']`

For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:

```python
js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PyppeteerRequest(url, callback=self.parse, script=js)
```

Then you can get the script result from `response.meta['script_result']`; here the result is `{'name': 'Germey'}`.

If you find the JavaScript awkward to write, you can use the `actions` argument to define a Python function that drives the Page object, for example:

```python
from pyppeteer.page import Page

async def execute_actions(page: Page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1

yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
```

Then you can get the actions result from `response.meta['actions_result']`; here the result is `1`.

You can also define `proxy` and `proxy_credential` for each request, for example:

```python
yield PyppeteerRequest(
    self.base_url,
    callback=self.parse_index,
    priority=10,
    proxy='http://tps254.kdlapi.com:15818',
    proxy_credential={
        'username': 'xxxx',
        'password': 'xxxx'
    })
```

`proxy` and `proxy_credential` will override the settings `GERAPY_PYPPETEER_PROXY` and `GERAPY_PYPPETEER_PROXY_CREDENTIAL`.
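Since the rendered HTML, the `script` result, and the `actions` result travel through different channels, a callback that reads all of them can be handy. A minimal sketch; the CSS selector and the yielded item fields are illustrative:

```python
def parse_detail(self, response):
    # The rendered HTML is available through the usual Scrapy response API.
    title = response.css('title::text').get()
    # Script and actions results ride along in response.meta; they are only
    # present if the request supplied the corresponding arguments.
    yield {
        'title': title,
        'script_result': response.meta.get('script_result'),
        'actions_result': response.meta.get('actions_result'),
    }
```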
## Example

For more detail, please see [example](./example).

You can also run the example directly with Docker:

```
docker run germey/gerapy-pyppeteer-example
```

Outputs:

```shell script
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'example', 'CONCURRENT_REQUESTS': 3, 'NEWSPIDER_MODULE': 'example.spiders', 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504], 'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール', 'score': '5.6', 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
...
```

## Troubleshooting

### Pyppeteer does not start properly

If the Chromium download is too slow, Pyppeteer cannot be used normally. Here are two solutions:

#### Solution 1 (Recommended)

Modify the Chromium download source in `pyppeteer/chromium_downloader.py` at line 22:

```python
# Default:
DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com'
# Modified:
DEFAULT_DOWNLOAD_HOST = 'http://npm.taobao.org/mirror'
```

#### Solution 2

Modify the Chromium executable path in `pyppeteer/chromium_downloader.py` at line 45:

```python
# Default:
chromiumExecutable = {
    'linux': DOWNLOADS_FOLDER / REVISION / 'chrome-linux' / 'chrome',
    'mac': (DOWNLOADS_FOLDER / REVISION / 'chrome-mac' / 'Chromium.app' /
            'Contents' / 'MacOS' / 'Chromium'),
    'win32': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
    'win64': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
}
```

Find the entry for your operating system and point it at an existing Chrome or Chromium executable.

%package -n python3-gerapy-pyppeteer
Summary: Pyppeteer Components for Scrapy & Gerapy
Provides: python-gerapy-pyppeteer
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-gerapy-pyppeteer
# Gerapy Pyppeteer

This package adds Pyppeteer support to Scrapy; it is also a module of [Gerapy](https://github.com/Gerapy/Gerapy).

## Installation

```shell script
pip3 install gerapy-pyppeteer
```

## Usage

You can use `PyppeteerRequest` to specify a request that should be rendered with Pyppeteer. For example:

```python
yield PyppeteerRequest(detail_url, callback=self.parse_detail)
```

You also need to enable `PyppeteerMiddleware` in `DOWNLOADER_MIDDLEWARES`:

```python
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
```

Congratulations, you've finished all of the required configuration. If you run the spider again, Pyppeteer will be used to render every web page you request via `PyppeteerRequest`.
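Putting the two pieces together, a minimal end-to-end spider might look like the sketch below. The spider name and target URL are hypothetical placeholders; only the `PyppeteerRequest` import and the middleware path come from this package.

```python
from scrapy import Spider
from gerapy_pyppeteer import PyppeteerRequest


class ExampleSpider(Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = 'example'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
        }
    }

    def start_requests(self):
        # Ask Pyppeteer (instead of the default downloader) to fetch and render the page.
        yield PyppeteerRequest('https://example.com', callback=self.parse)

    def parse(self, response):
        # The response body is the HTML as rendered by Chromium.
        self.logger.info('page title: %s', response.css('title::text').get())
```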
## Settings

GerapyPyppeteer provides some optional settings.

### Concurrency

You can directly use Scrapy's settings to control the concurrency of Pyppeteer, for example:

```python
CONCURRENT_REQUESTS = 3
```

### Pretend as Real Browser

Some websites detect WebDriver or headless mode. GerapyPyppeteer can make Chromium pretend to be a real browser by injecting scripts. This is enabled by default. You can disable it to speed things up if the target website does not perform such detection:

```python
GERAPY_PYPPETEER_PRETEND = False
```

You can also use the `pretend` attribute of `PyppeteerRequest` to override this configuration.

### Logging Level

By default, Pyppeteer logs all of its debug messages, so GerapyPyppeteer sets the logging level of Pyppeteer to `WARNING`. If you want to see more logs from Pyppeteer, you can change this setting:

```python
import logging
GERAPY_PYPPETEER_LOGGING_LEVEL = logging.DEBUG
```

### Download Timeout

Pyppeteer may take some time to render the required web page; you can change the timeout with this setting, which defaults to `30` seconds:

```python
# pyppeteer timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
```

### Headless

By default, Pyppeteer runs in headless mode; you can change this to `False` as needed. The default is `True`:

```python
GERAPY_PYPPETEER_HEADLESS = False
```

### Window Size

You can also set the width and height of the Pyppeteer window:

```python
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
```

The defaults are 1400 and 700.

### Proxy

You can set a proxy with the config below:

```python
GERAPY_PYPPETEER_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PYPPETEER_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx'
}
```

### Pyppeteer Args

You can also change the launch args of Pyppeteer, such as `dumpio`, `devtools`, etc. Optional settings and their default values:

```python
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
```

### Disable Loading of Specific Resource Types

You can disable the loading of specific resource types to decrease page load time. Configure the disabled resource types using `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = []
```

For example, to disable the loading of CSS and JavaScript, set:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
```

All of the supported resource types:

- document: the original HTML document
- stylesheet: CSS files
- script: JavaScript files
- image: images
- media: media files such as audio or video
- font: font files
- texttrack: text track files
- xhr: Ajax requests
- fetch: Fetch requests
- eventsource: Event Source
- websocket: WebSocket
- manifest: manifest files
- other: other files

### Screenshot

You can take a screenshot of the loaded page by passing a `screenshot` dict to `PyppeteerRequest`. It supports the following keys:

- `type` (str): screenshot type, either `jpeg` or `png`. Defaults to `png`.
- `quality` (int): quality of the image, between 0 and 100. Not applicable to `png` images.
- `fullPage` (bool): when true, take a screenshot of the full scrollable page. Defaults to `False`.
- `clip` (dict): an object which specifies the clipping region of the page. It should have the following fields:
  - `x` (int): x-coordinate of the top-left corner of the clip area.
  - `y` (int): y-coordinate of the top-left corner of the clip area.
  - `width` (int): width of the clipping area.
  - `height` (int): height of the clipping area.
- `omitBackground` (bool): hide the default white background, allowing screenshots with transparency.
- `encoding` (str): encoding of the image, either `base64` or `binary`. Defaults to `binary`. If binary, a `BytesIO` object is returned.

For example:

```python
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
    'type': 'png',
    'fullPage': True
})
```

Then you can get the screenshot result from `response.meta['screenshot']`. The simplest option is to save it to a file:

```python
def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())
```

If you want to enable screenshots for all requests, configure this globally via `GERAPY_PYPPETEER_SCREENSHOT`. For example:

```python
GERAPY_PYPPETEER_SCREENSHOT = {
    'type': 'png',
    'fullPage': True
}
```
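The example above uses the default `binary` encoding, so the meta value is a `BytesIO` object. If you request `encoding: 'base64'` instead, the meta value is presumably a base64-encoded string; under that assumption, a minimal sketch for saving it:

```python
import base64

def parse_index(self, response):
    # Assumes the request carried screenshot={'encoding': 'base64', ...},
    # so response.meta['screenshot'] is assumed to hold a base64 string.
    with open('screenshot.png', 'wb') as f:
        f.write(base64.b64decode(response.meta['screenshot']))
```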
## PyppeteerRequest

`PyppeteerRequest` provides arguments that can override the global settings above:

- url: request URL
- callback: callback
- wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2"; see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
- wait_for: wait for some element to load; also supports a dict
- script: JavaScript to execute
- actions: a Python function executed with the Page object
- proxy: proxy to use for this request, like `http://x.x.x.x:x`, overrides `GERAPY_PYPPETEER_PROXY`
- proxy_credential: the proxy credential, like `{'username': 'xxxx', 'password': 'xxxx'}`, overrides `GERAPY_PYPPETEER_PROXY_CREDENTIAL`
- sleep: time to sleep after the page is loaded, overrides `GERAPY_PYPPETEER_SLEEP`
- timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
- ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
- pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
- screenshot: screenshot options, see https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot, overrides `GERAPY_PYPPETEER_SCREENSHOT`

For example, you can configure a PyppeteerRequest as:

```python
from gerapy_pyppeteer import PyppeteerRequest

def parse(self, response):
    yield PyppeteerRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)
```

Then Pyppeteer will:

- wait for the document to load (`domcontentloaded`)
- wait for the `title` element to appear
- execute the passed `script`
- sleep for 2 seconds
- return the rendered web page content as the response body
- return the script result via `response.meta['script_result']`

For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:

```python
js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PyppeteerRequest(url, callback=self.parse, script=js)
```

Then you can get the script result from `response.meta['script_result']`; here the result is `{'name': 'Germey'}`.

If you find the JavaScript awkward to write, you can use the `actions` argument to define a Python function that drives the Page object, for example:

```python
from pyppeteer.page import Page

async def execute_actions(page: Page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1

yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
```

Then you can get the actions result from `response.meta['actions_result']`; here the result is `1`.

You can also define `proxy` and `proxy_credential` for each request, for example:

```python
yield PyppeteerRequest(
    self.base_url,
    callback=self.parse_index,
    priority=10,
    proxy='http://tps254.kdlapi.com:15818',
    proxy_credential={
        'username': 'xxxx',
        'password': 'xxxx'
    })
```

`proxy` and `proxy_credential` will override the settings `GERAPY_PYPPETEER_PROXY` and `GERAPY_PYPPETEER_PROXY_CREDENTIAL`.
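Since the rendered HTML, the `script` result, and the `actions` result travel through different channels, a callback that reads all of them can be handy. A minimal sketch; the CSS selector and the yielded item fields are illustrative:

```python
def parse_detail(self, response):
    # The rendered HTML is available through the usual Scrapy response API.
    title = response.css('title::text').get()
    # Script and actions results ride along in response.meta; they are only
    # present if the request supplied the corresponding arguments.
    yield {
        'title': title,
        'script_result': response.meta.get('script_result'),
        'actions_result': response.meta.get('actions_result'),
    }
```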
## Example

For more detail, please see [example](./example).

You can also run the example directly with Docker:

```
docker run germey/gerapy-pyppeteer-example
```

Outputs:

```shell script
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'example', 'CONCURRENT_REQUESTS': 3, 'NEWSPIDER_MODULE': 'example.spiders', 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504], 'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール', 'score': '5.6', 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
...
```

## Troubleshooting

### Pyppeteer does not start properly

If the Chromium download is too slow, Pyppeteer cannot be used normally. Here are two solutions:

#### Solution 1 (Recommended)

Modify the Chromium download source in `pyppeteer/chromium_downloader.py` at line 22:

```python
# Default:
DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com'
# Modified:
DEFAULT_DOWNLOAD_HOST = 'http://npm.taobao.org/mirror'
```

#### Solution 2

Modify the Chromium executable path in `pyppeteer/chromium_downloader.py` at line 45:

```python
# Default:
chromiumExecutable = {
    'linux': DOWNLOADS_FOLDER / REVISION / 'chrome-linux' / 'chrome',
    'mac': (DOWNLOADS_FOLDER / REVISION / 'chrome-mac' / 'Chromium.app' /
            'Contents' / 'MacOS' / 'Chromium'),
    'win32': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
    'win64': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
}
```

Find the entry for your operating system and point it at an existing Chrome or Chromium executable.

%package help
Summary: Development documents and examples for gerapy-pyppeteer
Provides: python3-gerapy-pyppeteer-doc

%description help
# Gerapy Pyppeteer

This package adds Pyppeteer support to Scrapy; it is also a module of [Gerapy](https://github.com/Gerapy/Gerapy).

## Installation

```shell script
pip3 install gerapy-pyppeteer
```

## Usage

You can use `PyppeteerRequest` to specify a request that should be rendered with Pyppeteer. For example:

```python
yield PyppeteerRequest(detail_url, callback=self.parse_detail)
```

You also need to enable `PyppeteerMiddleware` in `DOWNLOADER_MIDDLEWARES`:

```python
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
```

Congratulations, you've finished all of the required configuration. If you run the spider again, Pyppeteer will be used to render every web page you request via `PyppeteerRequest`.
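Putting the two pieces together, a minimal end-to-end spider might look like the sketch below. The spider name and target URL are hypothetical placeholders; only the `PyppeteerRequest` import and the middleware path come from this package.

```python
from scrapy import Spider
from gerapy_pyppeteer import PyppeteerRequest


class ExampleSpider(Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = 'example'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
        }
    }

    def start_requests(self):
        # Ask Pyppeteer (instead of the default downloader) to fetch and render the page.
        yield PyppeteerRequest('https://example.com', callback=self.parse)

    def parse(self, response):
        # The response body is the HTML as rendered by Chromium.
        self.logger.info('page title: %s', response.css('title::text').get())
```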
## Settings

GerapyPyppeteer provides some optional settings.

### Concurrency

You can directly use Scrapy's settings to control the concurrency of Pyppeteer, for example:

```python
CONCURRENT_REQUESTS = 3
```

### Pretend as Real Browser

Some websites detect WebDriver or headless mode. GerapyPyppeteer can make Chromium pretend to be a real browser by injecting scripts. This is enabled by default. You can disable it to speed things up if the target website does not perform such detection:

```python
GERAPY_PYPPETEER_PRETEND = False
```

You can also use the `pretend` attribute of `PyppeteerRequest` to override this configuration.

### Logging Level

By default, Pyppeteer logs all of its debug messages, so GerapyPyppeteer sets the logging level of Pyppeteer to `WARNING`. If you want to see more logs from Pyppeteer, you can change this setting:

```python
import logging
GERAPY_PYPPETEER_LOGGING_LEVEL = logging.DEBUG
```

### Download Timeout

Pyppeteer may take some time to render the required web page; you can change the timeout with this setting, which defaults to `30` seconds:

```python
# pyppeteer timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
```

### Headless

By default, Pyppeteer runs in headless mode; you can change this to `False` as needed. The default is `True`:

```python
GERAPY_PYPPETEER_HEADLESS = False
```

### Window Size

You can also set the width and height of the Pyppeteer window:

```python
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
```

The defaults are 1400 and 700.

### Proxy

You can set a proxy with the config below:

```python
GERAPY_PYPPETEER_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PYPPETEER_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx'
}
```

### Pyppeteer Args

You can also change the launch args of Pyppeteer, such as `dumpio`, `devtools`, etc. Optional settings and their default values:

```python
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
```

### Disable Loading of Specific Resource Types

You can disable the loading of specific resource types to decrease page load time. Configure the disabled resource types using `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = []
```

For example, to disable the loading of CSS and JavaScript, set:

```python
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
```

All of the supported resource types:

- document: the original HTML document
- stylesheet: CSS files
- script: JavaScript files
- image: images
- media: media files such as audio or video
- font: font files
- texttrack: text track files
- xhr: Ajax requests
- fetch: Fetch requests
- eventsource: Event Source
- websocket: WebSocket
- manifest: manifest files
- other: other files

### Screenshot

You can take a screenshot of the loaded page by passing a `screenshot` dict to `PyppeteerRequest`. It supports the following keys:

- `type` (str): screenshot type, either `jpeg` or `png`. Defaults to `png`.
- `quality` (int): quality of the image, between 0 and 100. Not applicable to `png` images.
- `fullPage` (bool): when true, take a screenshot of the full scrollable page. Defaults to `False`.
- `clip` (dict): an object which specifies the clipping region of the page. It should have the following fields:
  - `x` (int): x-coordinate of the top-left corner of the clip area.
  - `y` (int): y-coordinate of the top-left corner of the clip area.
  - `width` (int): width of the clipping area.
  - `height` (int): height of the clipping area.
- `omitBackground` (bool): hide the default white background, allowing screenshots with transparency.
- `encoding` (str): encoding of the image, either `base64` or `binary`. Defaults to `binary`. If binary, a `BytesIO` object is returned.

For example:

```python
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
    'type': 'png',
    'fullPage': True
})
```

Then you can get the screenshot result from `response.meta['screenshot']`. The simplest option is to save it to a file:

```python
def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())
```

If you want to enable screenshots for all requests, configure this globally via `GERAPY_PYPPETEER_SCREENSHOT`. For example:

```python
GERAPY_PYPPETEER_SCREENSHOT = {
    'type': 'png',
    'fullPage': True
}
```
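The example above uses the default `binary` encoding, so the meta value is a `BytesIO` object. If you request `encoding: 'base64'` instead, the meta value is presumably a base64-encoded string; under that assumption, a minimal sketch for saving it:

```python
import base64

def parse_index(self, response):
    # Assumes the request carried screenshot={'encoding': 'base64', ...},
    # so response.meta['screenshot'] is assumed to hold a base64 string.
    with open('screenshot.png', 'wb') as f:
        f.write(base64.b64decode(response.meta['screenshot']))
```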
## PyppeteerRequest

`PyppeteerRequest` provides arguments that can override the global settings above:

- url: request URL
- callback: callback
- wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2"; see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
- wait_for: wait for some element to load; also supports a dict
- script: JavaScript to execute
- actions: a Python function executed with the Page object
- proxy: proxy to use for this request, like `http://x.x.x.x:x`, overrides `GERAPY_PYPPETEER_PROXY`
- proxy_credential: the proxy credential, like `{'username': 'xxxx', 'password': 'xxxx'}`, overrides `GERAPY_PYPPETEER_PROXY_CREDENTIAL`
- sleep: time to sleep after the page is loaded, overrides `GERAPY_PYPPETEER_SLEEP`
- timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
- ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
- pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
- screenshot: screenshot options, see https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot, overrides `GERAPY_PYPPETEER_SCREENSHOT`

For example, you can configure a PyppeteerRequest as:

```python
from gerapy_pyppeteer import PyppeteerRequest

def parse(self, response):
    yield PyppeteerRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)
```

Then Pyppeteer will:

- wait for the document to load (`domcontentloaded`)
- wait for the `title` element to appear
- execute the passed `script`
- sleep for 2 seconds
- return the rendered web page content as the response body
- return the script result via `response.meta['script_result']`

For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:

```python
js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PyppeteerRequest(url, callback=self.parse, script=js)
```

Then you can get the script result from `response.meta['script_result']`; here the result is `{'name': 'Germey'}`.

If you find the JavaScript awkward to write, you can use the `actions` argument to define a Python function that drives the Page object, for example:

```python
from pyppeteer.page import Page

async def execute_actions(page: Page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1

yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
```

Then you can get the actions result from `response.meta['actions_result']`; here the result is `1`.

You can also define `proxy` and `proxy_credential` for each request, for example:

```python
yield PyppeteerRequest(
    self.base_url,
    callback=self.parse_index,
    priority=10,
    proxy='http://tps254.kdlapi.com:15818',
    proxy_credential={
        'username': 'xxxx',
        'password': 'xxxx'
    })
```

`proxy` and `proxy_credential` will override the settings `GERAPY_PYPPETEER_PROXY` and `GERAPY_PYPPETEER_PROXY_CREDENTIAL`.
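Since the rendered HTML, the `script` result, and the `actions` result travel through different channels, a callback that reads all of them can be handy. A minimal sketch; the CSS selector and the yielded item fields are illustrative:

```python
def parse_detail(self, response):
    # The rendered HTML is available through the usual Scrapy response API.
    title = response.css('title::text').get()
    # Script and actions results ride along in response.meta; they are only
    # present if the request supplied the corresponding arguments.
    yield {
        'title': title,
        'script_result': response.meta.get('script_result'),
        'actions_result': response.meta.get('actions_result'),
    }
```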
## Example

For more detail, please see [example](./example).

You can also run the example directly with Docker:

```
docker run germey/gerapy-pyppeteer-example
```

Outputs:

```shell script
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'example', 'CONCURRENT_REQUESTS': 3, 'NEWSPIDER_MODULE': 'example.spiders', 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504], 'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール', 'score': '5.6', 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
...
```

## Troubleshooting

### Pyppeteer does not start properly

If the Chromium download is too slow, Pyppeteer cannot be used normally. Here are two solutions:

#### Solution 1 (Recommended)

Modify the Chromium download source in `pyppeteer/chromium_downloader.py` at line 22:

```python
# Default:
DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com'
# Modified:
DEFAULT_DOWNLOAD_HOST = 'http://npm.taobao.org/mirror'
```

#### Solution 2

Modify the Chromium executable path in `pyppeteer/chromium_downloader.py` at line 45:

```python
# Default:
chromiumExecutable = {
    'linux': DOWNLOADS_FOLDER / REVISION / 'chrome-linux' / 'chrome',
    'mac': (DOWNLOADS_FOLDER / REVISION / 'chrome-mac' / 'Chromium.app' /
            'Contents' / 'MacOS' / 'Chromium'),
    'win32': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
    'win64': DOWNLOADS_FOLDER / REVISION / windowsArchive / 'chrome.exe',
}
```

Find the entry for your operating system and point it at an existing Chrome or Chromium executable.

%prep
%autosetup -n gerapy-pyppeteer-0.2.4

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-gerapy-pyppeteer -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot - 0.2.4-1
- Package Spec generated