%global _empty_manifest_terminate_build 0
Name: python-chompjs
Version: 1.1.9
Release: 1
Summary: Parsing JavaScript objects into Python dictionaries
License: MIT License
URL: https://github.com/Nykakin/chompjs
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/24/72/22660aba976ba8e31aee62be5e69666df895286c535635660d4b925029fc/chompjs-1.1.9.tar.gz
BuildArch: noarch
%description
# Usage
`chompjs` can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.
```python
>>> import chompjs
>>> chompjs.parse_js_object('{"my_data": "test"}')
{u'my_data': u'test'}
```
Think of it as a more powerful `json.loads`. For example, it can handle JSON objects containing embedded methods by storing their code in a string:
```python
>>> import chompjs
>>> js = """
... var myObj = {
... myMethod: function(params) {
... // ...
... },
... myValue: 100
... }
... """
>>> chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
```
An example usage with `scrapy`:
```python
import chompjs
import scrapy
class MySpider(scrapy.Spider):
# ...
def parse(self, response):
script_css = 'script:contains("__NEXT_DATA__")::text'
script_pattern = r'__NEXT_DATA__ = (.*);'
# warning: for some pages you need to pass replace_entities=True
# into re_first to have JSON escaped properly
script_text = response.css(script_css).re_first(script_pattern)
try:
json_data = chompjs.parse_js_object(script_text)
except ValueError:
self.log('Failed to extract data from {}'.format(response.url))
return
# work on json_data
```
If the input string is not yet escaped and contains a lot of `\\` characters, then `unicode_escape=True` argument might help to sanitize it:
```python
>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{u'a': 12}
```
`jsonlines=True` can be used to parse JSON Lines:
```python
>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True)
[[1, 2], [2, 3], [3, 4]]
```
By default `chompjs` tries to start with first `{` or `[` character it founds, omitting the rest:
```python
>>> chompjs.parse_js_object('
...
...
')
[1, 2, 3]
```
`json_params` argument can be used to pass options to underlying `json_loads`, such as `strict` or `object_hook`:
```python
>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
[Decimal('23.2')]
```
# Rationale
In web scraping data often is not present directly inside HTML, but instead provided as an embedded JavaScript object that is later used to initialize the page, for example:
```html
...
...
...
```
Standard library function `json.loads` is usually sufficient to extract this data:
```python
>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}
```
The problem is that not all valid JavaScript objects are also valid JSONs. For example all those strings are valid JavaScript objects but not valid JSONs:
* `"{'a': 'b'}"` is not a valid JSON because it uses `'` character to quote
* `'{a: "b"}'`is not a valid JSON because property name is not quoted at all
* `'{"a": [1, 2, 3,]}'` is not a valid JSON because there is an extra `,` character at the end of the array
* `'{"a": .99}'` is not a valid JSON because float value lacks a leading 0
As a result, `json.loads` fail to extract any of those:
```
>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
```
`chompjs` library was designed to bypass this limitation, and it allows to scrape such JavaScript objects into proper Python dictionaries:
```
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{'a': 'b'}")
{u'a': u'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{u'a': u'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{u'a': [1, 2, 3]}
```
Internally `chompjs` use a parser written in C to iterate over raw string, fixing its issues along the way. The final result is then passed down to standard library's `json.loads`, ensuring a high speed as compared to full-blown JavaScript parsers such as `demjson`.
```
>>> import json
>>> import _chompjs
>>>
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{u'a': 1}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}
```
# Installation
From PIP:
```bash
$ python3 -m venv venv
$ . venv/bin/activate
# pip install chompjs
```
From sources:
```bash
$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install
```
To run unittests
```
$ python -m unittest
```
%package -n python3-chompjs
Summary: Parsing JavaScript objects into Python dictionaries
Provides: python-chompjs
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-chompjs
# Usage
`chompjs` can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.
```python
>>> import chompjs
>>> chompjs.parse_js_object('{"my_data": "test"}')
{u'my_data': u'test'}
```
Think of it as a more powerful `json.loads`. For example, it can handle JSON objects containing embedded methods by storing their code in a string:
```python
>>> import chompjs
>>> js = """
... var myObj = {
... myMethod: function(params) {
... // ...
... },
... myValue: 100
... }
... """
>>> chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
```
An example usage with `scrapy`:
```python
import chompjs
import scrapy
class MySpider(scrapy.Spider):
# ...
def parse(self, response):
script_css = 'script:contains("__NEXT_DATA__")::text'
script_pattern = r'__NEXT_DATA__ = (.*);'
# warning: for some pages you need to pass replace_entities=True
# into re_first to have JSON escaped properly
script_text = response.css(script_css).re_first(script_pattern)
try:
json_data = chompjs.parse_js_object(script_text)
except ValueError:
self.log('Failed to extract data from {}'.format(response.url))
return
# work on json_data
```
If the input string is not yet escaped and contains a lot of `\\` characters, then `unicode_escape=True` argument might help to sanitize it:
```python
>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{u'a': 12}
```
`jsonlines=True` can be used to parse JSON Lines:
```python
>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True)
[[1, 2], [2, 3], [3, 4]]
```
By default `chompjs` tries to start with first `{` or `[` character it founds, omitting the rest:
```python
>>> chompjs.parse_js_object('...
...
')
[1, 2, 3]
```
`json_params` argument can be used to pass options to underlying `json_loads`, such as `strict` or `object_hook`:
```python
>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
[Decimal('23.2')]
```
# Rationale
In web scraping data often is not present directly inside HTML, but instead provided as an embedded JavaScript object that is later used to initialize the page, for example:
```html
...
...
...
```
Standard library function `json.loads` is usually sufficient to extract this data:
```python
>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}
```
The problem is that not all valid JavaScript objects are also valid JSONs. For example all those strings are valid JavaScript objects but not valid JSONs:
* `"{'a': 'b'}"` is not a valid JSON because it uses `'` character to quote
* `'{a: "b"}'`is not a valid JSON because property name is not quoted at all
* `'{"a": [1, 2, 3,]}'` is not a valid JSON because there is an extra `,` character at the end of the array
* `'{"a": .99}'` is not a valid JSON because float value lacks a leading 0
As a result, `json.loads` fail to extract any of those:
```
>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
```
`chompjs` library was designed to bypass this limitation, and it allows to scrape such JavaScript objects into proper Python dictionaries:
```
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{'a': 'b'}")
{u'a': u'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{u'a': u'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{u'a': [1, 2, 3]}
```
Internally `chompjs` use a parser written in C to iterate over raw string, fixing its issues along the way. The final result is then passed down to standard library's `json.loads`, ensuring a high speed as compared to full-blown JavaScript parsers such as `demjson`.
```
>>> import json
>>> import _chompjs
>>>
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{u'a': 1}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}
```
# Installation
From PIP:
```bash
$ python3 -m venv venv
$ . venv/bin/activate
# pip install chompjs
```
From sources:
```bash
$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install
```
To run unittests
```
$ python -m unittest
```
%package help
Summary: Development documents and examples for chompjs
Provides: python3-chompjs-doc
%description help
# Usage
`chompjs` can be used in web scrapping for turning JavaScript objects embedded in pages into valid Python dictionaries.
```python
>>> import chompjs
>>> chompjs.parse_js_object('{"my_data": "test"}')
{u'my_data': u'test'}
```
Think of it as a more powerful `json.loads`. For example, it can handle JSON objects containing embedded methods by storing their code in a string:
```python
>>> import chompjs
>>> js = """
... var myObj = {
... myMethod: function(params) {
... // ...
... },
... myValue: 100
... }
... """
>>> chompjs.parse_js_object(js, json_params={'strict': False})
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
```
An example usage with `scrapy`:
```python
import chompjs
import scrapy
class MySpider(scrapy.Spider):
# ...
def parse(self, response):
script_css = 'script:contains("__NEXT_DATA__")::text'
script_pattern = r'__NEXT_DATA__ = (.*);'
# warning: for some pages you need to pass replace_entities=True
# into re_first to have JSON escaped properly
script_text = response.css(script_css).re_first(script_pattern)
try:
json_data = chompjs.parse_js_object(script_text)
except ValueError:
self.log('Failed to extract data from {}'.format(response.url))
return
# work on json_data
```
If the input string is not yet escaped and contains a lot of `\\` characters, then `unicode_escape=True` argument might help to sanitize it:
```python
>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{u'a': 12}
```
`jsonlines=True` can be used to parse JSON Lines:
```python
>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True)
[[1, 2], [2, 3], [3, 4]]
```
By default `chompjs` tries to start with first `{` or `[` character it founds, omitting the rest:
```python
>>> chompjs.parse_js_object('...
...
')
[1, 2, 3]
```
`json_params` argument can be used to pass options to underlying `json_loads`, such as `strict` or `object_hook`:
```python
>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
[Decimal('23.2')]
```
# Rationale
In web scraping data often is not present directly inside HTML, but instead provided as an embedded JavaScript object that is later used to initialize the page, for example:
```html
...
...
...
```
Standard library function `json.loads` is usually sufficient to extract this data:
```python
>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}
```
The problem is that not all valid JavaScript objects are also valid JSONs. For example all those strings are valid JavaScript objects but not valid JSONs:
* `"{'a': 'b'}"` is not a valid JSON because it uses `'` character to quote
* `'{a: "b"}'`is not a valid JSON because property name is not quoted at all
* `'{"a": [1, 2, 3,]}'` is not a valid JSON because there is an extra `,` character at the end of the array
* `'{"a": .99}'` is not a valid JSON because float value lacks a leading 0
As a result, `json.loads` fail to extract any of those:
```
>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
```
`chompjs` library was designed to bypass this limitation, and it allows to scrape such JavaScript objects into proper Python dictionaries:
```
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{'a': 'b'}")
{u'a': u'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{u'a': u'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{u'a': [1, 2, 3]}
```
Internally `chompjs` use a parser written in C to iterate over raw string, fixing its issues along the way. The final result is then passed down to standard library's `json.loads`, ensuring a high speed as compared to full-blown JavaScript parsers such as `demjson`.
```
>>> import json
>>> import _chompjs
>>>
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{u'a': 1}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}
```
# Installation
From PIP:
```bash
$ python3 -m venv venv
$ . venv/bin/activate
# pip install chompjs
```
From sources:
```bash
$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install
```
To run unittests
```
$ python -m unittest
```
%prep
%autosetup -n chompjs-1.1.9
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-chompjs -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot - 1.1.9-1
- Package Spec generated