Diffstat (limited to 'python-tabulator.spec')
-rw-r--r--   python-tabulator.spec   5248
1 file changed, 5248 insertions, 0 deletions
diff --git a/python-tabulator.spec b/python-tabulator.spec
new file mode 100644
index 0000000..2879202
--- /dev/null
+++ b/python-tabulator.spec
@@ -0,0 +1,5248 @@
+%global _empty_manifest_terminate_build 0
+Name: python-tabulator
+Version: 1.53.5
+Release: 1
+Summary: Consistent interface for stream reading and writing tabular data (csv/xls/json/etc)
+License: MIT
+URL: https://github.com/frictionlessdata/tabulator-py
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/10/ab/f8d380340b180ebf35d9b2849282f422bc2a48927cceb4a93df9e6075582/tabulator-1.53.5.tar.gz
+BuildArch: noarch
+
+Requires: python3-six
+Requires: python3-click
+Requires: python3-requests
+Requires: python3-chardet
+Requires: python3-boto3
+Requires: python3-unicodecsv
+Requires: python3-ijson
+Requires: python3-jsonlines
+Requires: python3-sqlalchemy
+Requires: python3-linear-tsv
+Requires: python3-xlrd
+Requires: python3-openpyxl
+Requires: python3-cchardet
+Requires: python3-datapackage
+Requires: python3-mock
+Requires: python3-pylama
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-moto[server]
+Requires: python3-pyquery
+Requires: python3-ezodf
+Requires: python3-lxml
+
+%description
+# tabulator-py
+
+[![Travis](https://img.shields.io/travis/frictionlessdata/tabulator-py/master.svg)](https://travis-ci.org/frictionlessdata/tabulator-py)
+[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/tabulator-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/tabulator-py?branch=master)
+[![PyPi](https://img.shields.io/pypi/v/tabulator.svg)](https://pypi.python.org/pypi/tabulator)
+[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/tabulator-py)
+[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
+
+A library for reading and writing tabular data (csv/xls/json/etc).
+
+> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework is a logical continuation of `tabulator`, extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `tabulator` to Frictionless Framework.
+> - we continue to bug-fix `tabulator@1.x` in this [repository](https://github.com/frictionlessdata/tabulator-py), and it remains available on [PyPi](https://pypi.org/project/tabulator/) as before
+> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable
+> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
+
+## Features
+
+- **Supports most common tabular formats**: CSV, XLS, ODS, JSON, Google Sheets, SQL, and others. See complete list [below](#supported-file-formats).
+- **Loads local and remote data**: Supports HTTP, FTP and S3.
+- **Low memory usage**: Only the current row is kept in memory, so you can
+  process large datasets.
+- **Supports compressed files**: Using ZIP or GZIP algorithms.
+- **Extensible**: You can add support for custom file formats and loaders (e.g.
+ FTP).
+
+## Contents
+
+<!--TOC-->
+
+ - [Getting started](#getting-started)
+ - [Installation](#installation)
+ - [Running on CLI](#running-on-cli)
+ - [Running on Python](#running-on-python)
+ - [Documentation](#documentation)
+ - [Working with Stream](#working-with-stream)
+ - [Supported schemes](#supported-schemes)
+ - [Supported file formats](#supported-file-formats)
+ - [Custom file sources and formats](#custom-file-sources-and-formats)
+ - [API Reference](#api-reference)
+ - [`cli`](#cli)
+ - [`Stream`](#stream)
+ - [`Loader`](#loader)
+ - [`Parser`](#parser)
+ - [`Writer`](#writer)
+ - [`validate`](#validate)
+ - [`TabulatorException`](#tabulatorexception)
+ - [`SourceError`](#sourceerror)
+ - [`SchemeError`](#schemeerror)
+ - [`FormatError`](#formaterror)
+ - [`EncodingError`](#encodingerror)
+ - [`CompressionError`](#compressionerror)
+ - [`IOError`](#ioerror)
+ - [`LoadingError`](#loadingerror)
+ - [`HTTPError`](#httperror)
+ - [Contributing](#contributing)
+ - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting started
+
+### Installation
+
+```bash
+$ pip install tabulator
+```
+
+### Running on CLI
+
+Tabulator ships with a simple CLI called `tabulator` to read tabular data. For
+example:
+
+```bash
+$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
+id,name
+1,english
+2,中国人
+```
+
+You can see all supported options by running `tabulator --help`.
+
+### Running on Python
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+    stream.headers # [header1, header2, ..]
+    for row in stream:
+        print(row) # [value1, value2, ..]
+```
+
+You can find other examples in the [examples][examples-dir] directory.
+
+## Documentation
+
+In the following sections, we'll walk through some usage examples of
+this library. All examples were tested with Python 3.6, but should
+run fine with Python 3.3+.
+
+### Working with Stream
+
+The `Stream` class represents a tabular stream. It takes the file path as the
+`source` argument. For example:
+
+```
+<scheme>://path/to/file.<format>
+```
+
+It uses this path to determine the file format (e.g. CSV or XLS) and scheme
+(e.g. HTTP or PostgreSQL). It also supports format extraction from URLs like `http://example.com?format=csv`. If necessary, you can also define these explicitly.
+
+Let's try it out. First, we create a `Stream` object passing the path to a CSV file.
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+```
+
+At this point, the file hasn't been read yet. Let's open the stream so we can
+read the contents.
+
+```python
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+```
+
+This will open the underlying data stream, read a small sample to detect the
+file encoding, and prepare the data to be read. We catch
+`tabulator.TabulatorException` here, in case something goes wrong.
+
+We can now read the file contents. To iterate over each row, we do:
+
+```python
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+```
+
+The `stream.iter()` method will return each row data as a list of values. If
+you prefer, you could call `stream.iter(keyed=True)` instead, which returns a
+dictionary with the column names as keys. Either way, this method keeps only a
+single row in memory at a time. This means it can handle large files
+without consuming too much memory.
+
+If you want to read the entire file, use `stream.read()`. It accepts the same
+arguments as `stream.iter()`, but returns all rows at once.
+
+```python
+stream.reset()
+rows = stream.read()
+```
+
+Notice that we called `stream.reset()` before reading the rows. This is because
+internally, tabulator only keeps a pointer to its current location in the file.
+If we didn't reset this pointer, we would read starting from where we stopped.
+For example, if we ran `stream.read()` again, we would get an empty list, as
+the internal file pointer is at the end of the file (because we've already read
+it all). Depending on the file location, it might be necessary to download the
+file again to rewind (e.g. when the file was loaded from the web).
+
+After we're done, close the stream with:
+
+```python
+stream.close()
+```
+
+The entire example looks like:
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+
+stream.reset() # Rewind internal file pointer
+rows = stream.read()
+
+stream.close()
+```
+
+It could be rewritten to use Python's context manager interface as:
+
+```python
+import tabulator
+
+try:
+    with tabulator.Stream('data.csv') as stream:
+        for row in stream.iter():
+            print(row)
+
+        stream.reset()
+        rows = stream.read()
+except tabulator.TabulatorException as e:
+    pass
+```
+
+This is the preferred way, as Python closes the stream automatically, even if some exception was thrown along the way.
+
+The full API documentation is available as docstrings in the [Stream source code][stream.py].
+
+#### Headers
+
+By default, tabulator considers that all file rows are values (i.e. there is no
+header).
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.headers # None
+ stream.read() # [['name', 'age'], ['Alex', 21]]
+```
+
+If you have a header row, you can use the `headers` argument with its row
+number (starting from 1).
+
+```python
+# Integer
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+You can also pass a list of strings to define the headers explicitly:
+
+```python
+with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+Tabulator also supports multiline headers for the `xls` and `xlsx` formats.
+
+```python
+with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
+ stream.headers # ['header from row 1-3']
+ stream.read() # [['value1', 'value2', 'value3']]
+```
+
+#### Encoding
+
+You can specify the file encoding (e.g. `utf-8` and `latin1`) via the `encoding`
+argument.
+
+```python
+with Stream(source, encoding='latin1') as stream:
+ stream.read()
+```
+
+If this argument isn't set, Tabulator will try to infer it from the data. If you
+get a `UnicodeDecodeError` while loading a file, try setting the encoding to
+`utf-8`.
+
+#### Compression (Python3-only)
+
+Tabulator supports both ZIP and GZIP compression methods. By default it'll infer from the file name:
+
+```python
+with Stream('http://example.com/data.csv.zip') as stream:
+ stream.read()
+```
+
+You can also set it explicitly:
+
+```python
+with Stream('data.csv.ext', compression='gz') as stream:
+ stream.read()
+```
+
+**Options**
+
+- **filename**: filename in zip file to process (default is first file)
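+
+For instance, a minimal sketch of picking a specific file out of a multi-file ZIP archive (the archive name and the inner path `inner/data.csv` are illustrative):
+
+```python
+from tabulator import Stream
+
+# Read one member of a ZIP archive instead of the first file
+with Stream('archive.zip', format='csv', filename='inner/data.csv') as stream:
+    stream.read()
+```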
+
+#### Allow html
+
+The `Stream` class raises `tabulator.exceptions.FormatError` if it detects HTML
+contents. This helps avoid the relatively common mistake of trying to load a
+CSV file inside an HTML page, for example on GitHub.
+
+You can disable this behaviour using the `allow_html` option:
+
+```python
+with Stream(source_with_html, allow_html=True) as stream:
+ stream.read() # no exception on open
+```
+
+#### Sample size
+
+To detect the file's headers, and run other checks like validating that the file
+doesn't contain HTML, Tabulator reads a sample of rows on the `stream.open()`
+method. This data is available via the `stream.sample` property. The number of
+rows used can be defined via the `sample_size` parameter (defaults to 100).
+
+```python
+with Stream(two_rows_source, sample_size=1) as stream:
+ stream.sample # only first row
+ stream.read() # first and second rows
+```
+
+You can disable this by setting `sample_size` to zero. This way, no data will be
+read on `stream.open()`.
+
+#### Bytes sample size
+
+Tabulator needs to read a part of the file to infer its encoding. The
+`bytes_sample_size` argument controls how many bytes will be read for this
+detection (defaults to 10000).
+
+```python
+source = 'data/special/latin1.csv'
+with Stream(source) as stream:
+ stream.encoding # 'iso8859-2'
+```
+
+You can disable this by setting `bytes_sample_size` to zero, in which case it'll
+use the machine locale's default encoding.
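+
+As a minimal sketch, the argument is passed straight to the `Stream` constructor (the larger value here is illustrative):
+
+```python
+from tabulator import Stream
+
+# Use a bigger byte sample for encoding detection
+with Stream('data.csv', bytes_sample_size=100000) as stream:
+    stream.encoding
+```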
+
+#### Ignore blank headers
+
+When `True`, tabulator will ignore columns that have blank headers (defaults to
+`False`).
+
+```python
+# Default behaviour
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1) as stream:
+ stream.headers # ['header1', '', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}
+
+# Ignoring columns with blank headers
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
+ stream.headers # ['header1', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}
+```
+
+#### Ignore listed/not-listed headers
+
+This option is similar to `ignore_blank_headers`. It removes arbitrary columns from the data based on the corresponding column names:
+
+```python
+# Ignore listed headers (omit columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header1', 'header3']
+ assert stream.read(keyed=True) == [
+ {'header1': 'value1', 'header3': 'value3'},
+ ]
+
+# Ignore NOT listed headers (pick columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_not_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header2']
+ assert stream.read(keyed=True) == [
+ {'header2': 'value2'},
+ ]
+```
+
+#### Force strings
+
+When `True`, all rows' values will be converted to strings (defaults to
+`False`). `None` values will be converted to empty strings.
+
+```python
+import datetime
+
+# Default behaviour
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]) as stream:
+    stream.read() # [['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]]
+
+# Forcing rows' values to strings
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]], force_strings=True) as stream:
+    stream.read() # [['string', '1', '2017-12-01 17:00:00']]
+```
+
+#### Force parse
+
+When `True`, don't raise an exception when parsing a malformed row, but simply
+return an empty row. Otherwise, tabulator raises
+`tabulator.exceptions.SourceError` when a row can't be parsed. Defaults to `False`.
+
+```python
+# Default behaviour
+with Stream([[1], 'bad', [3]]) as stream:
+ stream.read() # raises tabulator.exceptions.SourceError
+
+# With force_parse
+with Stream([[1], 'bad', [3]], force_parse=True) as stream:
+ stream.read() # [[1], [], [3]]
+```
+
+#### Skip rows
+
+List of row numbers and/or strings to skip.
+If it's a string, all rows that begin with it will be skipped (e.g. '#' and '//').
+If it's the empty string, all rows that begin with an empty column will be skipped.
+
+```python
+source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
+with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
+ stream.read() # [['Mike', 4]]
+```
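+
+As a minimal sketch, the empty-string case (assuming an inline source) could look like:
+
+```python
+from tabulator import Stream
+
+# Rows whose first cell is empty are skipped when '' is listed
+source = [['John', 1], ['', 2], ['Mike', 4]]
+with Stream(source, skip_rows=['']) as stream:
+    stream.read() # [['John', 1], ['Mike', 4]]
+```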
+
+If the `headers` parameter is also set to an integer, the first non-skipped row will be used as the headers.
+
+```python
+source = [['#comment'], ['name', 'order'], ['John', 1], ['Alex', 2]]
+with Stream(source, headers=1, skip_rows=['#']) as stream:
+    stream.headers # ['name', 'order']
+    stream.read() # [['John', 1], ['Alex', 2]]
+```
+
+#### Post parse
+
+List of functions that can filter or transform rows after they are parsed. These
+functions receive the `extended_rows` containing the row's number, headers
+list, and the row values list. They then process the rows, and yield or discard
+them, modified or not.
+
+```python
+def skip_odd_rows(extended_rows):
+    for row_number, headers, row in extended_rows:
+        if not row_number % 2:
+            yield (row_number, headers, row)
+
+def multiply_by_two(extended_rows):
+    for row_number, headers, row in extended_rows:
+        doubled_row = list(map(lambda value: value * 2, row))
+        yield (row_number, headers, doubled_row)
+
+rows = [
+ [1],
+ [2],
+ [3],
+ [4],
+]
+with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
+ stream.read() # [[4], [8]]
+```
+
+These functions are applied in order, as a simple data pipeline. In the example
+above, `multiply_by_two` just sees the rows yielded by `skip_odd_rows`.
+
+#### Keyed and extended rows
+
+The methods `stream.iter()` and `stream.read()` accept the `keyed` and
+`extended` flag arguments to modify how the rows are returned.
+
+By default, every row is returned as a list of its cell values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+ stream.read() # [['Alex', 21]]
+```
+
+With `keyed=True`, the rows are returned as dictionaries, mapping the column names to their values in the row:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+ stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]
+```
+
+And with `extended=True`, the rows are returned as a tuple of `(row_number,
+headers, row)`, where `row_number` is the current row number (starting from 1),
+`headers` is a list with the header names, and `row` is a list with the row
+values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.read(extended=True) # [(2, ['name', 'age'], ['Alex', 21])]
+```
+
+### Supported schemes
+
+#### s3
+
+It loads data from AWS S3. For private files you should provide credentials supported by the `boto3` library, for example, corresponding environment variables. Read more about [configuring `boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html).
+
+```python
+stream = Stream('s3://bucket/data.csv')
+```
+
+**Options**
+
+- **s3\_endpoint\_url** - the endpoint URL to use. By default it's `https://s3.amazonaws.com`. For complex use cases, e.g. when `goodtables` runs on a data package, this option can also be provided as the environment variable `S3_ENDPOINT_URL`.
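+
+A minimal sketch of pointing the `s3` scheme at a custom endpoint, e.g. a local MinIO server (the bucket name and URL are illustrative):
+
+```python
+from tabulator import Stream
+
+# Read from an S3-compatible service instead of AWS
+stream = Stream('s3://bucket/data.csv', s3_endpoint_url='http://localhost:9000')
+```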
+
+#### file
+
+The default scheme, a file in the local filesystem.
+
+```python
+stream = Stream('data.csv')
+```
+
+#### http/https/ftp/ftps
+
+> In Python 2, `tabulator` can't stream remote data sources because of a limitation in the underlying libraries. The whole data source will be loaded into memory. In Python 3 there is no such problem and remote files are streamed.
+
+```python
+stream = Stream('https://example.com/data.csv')
+```
+
+**Options**
+
+- **http\_session** - a `requests.Session` object. Read more in the [requests docs][requests-session].
+- **http\_stream** - Enables or disables HTTP streaming, when possible (enabled by default). Disable it if you'd like to preload the whole file into memory.
+- **http\_timeout** - This timeout will be used for a `requests` session construction.
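+
+For example, a minimal sketch of passing a pre-configured `requests` session (the custom header is illustrative):
+
+```python
+import requests
+from tabulator import Stream
+
+# Reuse a session with a custom User-Agent for all HTTP requests
+session = requests.Session()
+session.headers.update({'User-Agent': 'tabulator-example'})
+
+with Stream('https://example.com/data.csv', http_session=session) as stream:
+    stream.read()
+```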
+
+#### stream
+
+The source is a file-like Python object.
+
+
+```python
+with open('data.csv') as fp:
+ stream = Stream(fp)
+```
+
+#### text
+
+The source is a string containing the tabular data. Both `scheme` and `format`
+must be set explicitly, as it's not possible to infer them.
+
+```python
+stream = Stream(
+ 'name,age\nJohn, 21\n',
+ scheme='text',
+ format='csv'
+)
+```
+
+### Supported file formats
+
+In this section, we'll describe the supported file formats, and their respective
+configuration options and operations. Some formats only support read operations,
+while others support both reading and writing.
+
+#### csv (read & write)
+
+```python
+stream = Stream('data.csv', delimiter=',')
+```
+
+**Options**
+
+It supports all options from the Python CSV library. Check [their
+documentation][pydoc-csv] for more information.
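+
+A minimal sketch passing a couple of csv dialect options through to Python's `csv` module (the values are illustrative):
+
+```python
+from tabulator import Stream
+
+# Semicolon-delimited file with single-quoted fields
+stream = Stream('data.csv', delimiter=';', quotechar="'")
+```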
+
+#### xls/xlsx (read & write)
+
+> Tabulator is unable to stream `xls` files, so the entire file is loaded in
+> memory. Streaming is supported for `xlsx` files.
+
+```python
+stream = Stream('data.xls', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1).
+- **workbook_cache**: An empty dictionary to handle workbook caching. If provided, `tabulator` will fill the dictionary with `source: tmpfile_path` pairs for remote workbooks. Each workbook will be downloaded only once and all the temporary files will be deleted on process exit. Defaults to None.
+- **fill_merged_cells**: if `True` it will unmerge and fill all merged cells by
+ a visible value. With this option enabled the parser can't stream data and
+ load the whole document into memory.
+- **preserve_formatting**: if `True` it will try to preserve the text formatting of numeric and temporal cells, returning them as strings that match how they look in a spreadsheet (EXPERIMENTAL)
+- **adjust_floating_point_error**: if `True` it will correct the Excel behaviour regarding floating point numbers
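+
+As a minimal sketch, the `workbook_cache` dictionary is created by the caller and shared between reads (the URL is illustrative):
+
+```python
+from tabulator import Stream
+
+# Download the remote workbook once and reuse it for both sheets
+workbook_cache = {}
+source = 'https://example.com/data.xlsx'
+for sheet in [1, 2]:
+    with Stream(source, sheet=sheet, workbook_cache=workbook_cache) as stream:
+        stream.read()
+```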
+
+#### ods (read only)
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `ods` extra: `$ pip install tabulator[ods]`
+
+Source should be a valid Open Office document.
+
+```python
+stream = Stream('data.ods', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1)
+
+#### gsheet (read only)
+
+A publicly-accessible Google Spreadsheet.
+
+```python
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>?usp=sharing')
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>edit#gid=<gid>')
+```
+
+#### sql (read & write)
+
+Any database URL supported by [sqlalchemy][sqlalchemy].
+
+```python
+stream = Stream('postgresql://name:pass@host:5432/database', table='data')
+```
+
+**Options**
+
+- **table (required)**: Database table name
+- **order_by**: SQL expression for row ordering (e.g. `name DESC`)
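+
+A minimal sketch using the `order_by` option (the SQLite URL and column name are illustrative):
+
+```python
+from tabulator import Stream
+
+# Read rows from the 'data' table ordered by the 'name' column
+stream = Stream('sqlite:///data.db', table='data', order_by='name')
+```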
+
+#### Data Package (read only)
+
+> This format is not included in the package by default. You can enable it by
+> installing tabulator using `pip install tabulator[datapackage]`.
+
+A [Tabular Data Package][tdp].
+
+```python
+stream = Stream('datapackage.json', resource=1)
+```
+
+**Options**
+
+- **resource**: Resource name or index (starting from 0)
+
+#### inline (read only)
+
+Either a list of lists, or a list of dicts mapping the column names to their
+respective values.
+
+```python
+stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
+stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])
+```
+
+#### json (read & write)
+
+JSON document containing a list of lists, or a list of dicts mapping the column
+names to their respective values (see the `inline` format for an example).
+
+```python
+stream = Stream('data.json', property='key1.key2')
+```
+
+**Options**
+
+- **property**: JSON Path to the property containing the tabular data. For example, considering the JSON `{"response": {"data": [...]}}`, the `property` should be set to `response.data`.
+- **keyed** (write): Save as array of arrays (default) or as array of dicts (keyed).
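+
+For writing, a minimal sketch of the `keyed` option (the target path is illustrative):
+
+```python
+from tabulator import Stream
+
+# Save as an array of dicts instead of an array of arrays
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.save('data.json', keyed=True)
+```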
+
+#### ndjson (read only)
+
+```python
+stream = Stream('data.ndjson')
+```
+
+#### tsv (read only)
+
+```python
+stream = Stream('data.tsv')
+```
+
+#### html (read only)
+
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `html` extra: `$ pip install tabulator[html]`
+
+An HTML table element residing inside an HTML document.
+
+Supports simple tables (no merged cells) with any legal combination of the td, th, tbody & thead elements.
+
+Usually `format='html'` needs to be specified explicitly, as web URLs don't always use the `.html` extension.
+
+```python
+stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1', raw_html=True)
+```
+
+**Options**
+
+- **selector**: CSS selector for specifying which `table` element to extract. By default it's `table`, which takes the first `table` element in the document. If empty, will assume the entire page is the table to be extracted (useful with some Excel formats).
+
+- **raw_html**: False (default) to extract the textual contents of each cell. True to return the inner html without modification.
+
+### Custom file sources and formats
+
+Tabulator is written with extensibility in mind, allowing you to add support for
+new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There
+are three components that allow this:
+
+* Loaders
+ * Loads a stream from some location (e.g. ssh)
+* Parsers
+ * Parses a stream of tabular data in some format (e.g. xls)
+* Writers
+ * Writes tabular data to some destination (e.g. MongoDB)
+
+In this section, we'll see how to write custom classes to extend any of these components.
+
+#### Custom loaders
+
+You can add support for a new scheme (e.g. ssh) by creating a custom loader.
+Custom loaders are implemented by inheriting from the `Loader` class, and
+implementing its methods. This loader can then be used by `Stream` to load data
+by passing it via the `custom_loaders={'scheme': CustomLoader}` argument.
+
+The skeleton of a custom loader looks like:
+
+```python
+from tabulator import Stream, Loader
+
+class CustomLoader(Loader):
+    options = []
+
+    def __init__(self, bytes_sample_size, **options):
+        pass
+
+    def load(self, source, mode='t', encoding=None):
+        pass  # load logic: return a file-like object for the source
+
+with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
+    stream.read()
+```
+
+You can see examples of how the loaders are implemented by looking in the
+`tabulator.loaders` module.
+
+#### Custom parsers
+
+You can add support for a new file format by creating a custom parser. Similarly
+to custom loaders, custom parsers are implemented by inheriting from the
+`Parser` class, and implementing its methods. This parser can then be used by
+`Stream` to parse data by passing it via the `custom_parsers={'format':
+CustomParser}` argument.
+
+The skeleton of a custom parser looks like:
+
+```python
+from tabulator import Stream, Parser
+
+class CustomParser(Parser):
+    options = []
+
+    def __init__(self, loader, force_parse, **options):
+        self.__loader = loader
+
+    def open(self, source, encoding=None):
+        pass  # open logic
+
+    def close(self):
+        pass  # close logic
+
+    def reset(self):
+        pass  # reset logic
+
+    @property
+    def closed(self):
+        return False
+
+    @property
+    def extended_rows(self):
+        # extended rows logic: yield (row_number, headers, row) tuples
+        return iter([])
+
+with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
+    stream.read()
+```
+
+You can see examples of how parsers are implemented by looking in the
+`tabulator.parsers` module.
+
+#### Custom writers
+
+You can add support to write files in a specific format by creating a custom
+writer. The custom writers are implemented by inheriting from the base `Writer`
+class, and implementing its methods. This writer can then be used by `Stream` to
+write data via the `custom_writers={'format': CustomWriter}` argument.
+
+The skeleton of a custom writer looks like:
+
+```python
+from tabulator import Stream, Writer
+
+class CustomWriter(Writer):
+    options = []
+
+    def __init__(self, **options):
+        pass
+
+    def write(self, source, target, headers=None, encoding=None):
+        pass  # write logic
+
+with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
+    stream.save(target)
+```
+
+You can see examples of how writers are implemented by looking in the
+`tabulator.writers` module.
+
+## API Reference
+
+### `cli`
+```python
+cli(source, limit, **options)
+```
+Command-line interface
+
+```
+Usage: tabulator [OPTIONS] SOURCE
+
+Options:
+ --headers INTEGER
+ --scheme TEXT
+ --format TEXT
+ --encoding TEXT
+ --limit INTEGER
+ --sheet TEXT/INTEGER (excel)
+ --fill-merged-cells BOOLEAN (excel)
+ --preserve-formatting BOOLEAN (excel)
+ --adjust-floating-point-error BOOLEAN (excel)
+ --table TEXT (sql)
+ --order_by TEXT (sql)
+ --resource TEXT/INTEGER (datapackage)
+ --property TEXT (json)
+ --keyed BOOLEAN (json)
+ --version Show the version and exit.
+ --help Show this message and exit.
+```
+
+
+### `Stream`
+```python
+Stream(self,
+ source,
+ headers=None,
+ scheme=None,
+ format=None,
+ encoding=None,
+ compression=None,
+ allow_html=False,
+ sample_size=100,
+ bytes_sample_size=10000,
+ ignore_blank_headers=False,
+ ignore_listed_headers=None,
+ ignore_not_listed_headers=None,
+ multiline_headers_joiner=' ',
+ multiline_headers_duplicates=False,
+ force_strings=False,
+ force_parse=False,
+ pick_rows=None,
+ skip_rows=None,
+ pick_fields=None,
+ skip_fields=None,
+ pick_columns=None,
+ skip_columns=None,
+ post_parse=[],
+ custom_loaders={},
+ custom_parsers={},
+ custom_writers={},
+ **options)
+```
+Stream of tabular data.
+
+This is the main `tabulator` class. It loads a data source, and allows you
+to stream its parsed contents.
+
+__Arguments__
+
+
+ source (str):
+ Path to file as ``<scheme>://path/to/file.<format>``.
+ If not explicitly set, the scheme (file, http, ...) and
+ format (csv, xls, ...) are inferred from the source string.
+
+ headers (Union[int, List[int], List[str]], optional):
+ Either a row
+ number or list of row numbers (in case of multi-line headers) to be
+ considered as headers (rows start counting at 1), or the actual
+        headers defined as a list of strings. If not set, all rows will be
+ treated as containing values.
+
+ scheme (str, optional):
+ Scheme for loading the file (file, http, ...).
+ If not set, it'll be inferred from `source`.
+
+ format (str, optional):
+ File source's format (csv, xls, ...). If not
+        set, it'll be inferred from `source`.
+
+ encoding (str, optional):
+ Source encoding. If not set, it'll be inferred.
+
+ compression (str, optional):
+ Source file compression (zip, ...). If not set, it'll be inferred.
+
+ pick_rows (List[Union[int, str, dict]], optional):
+ The same as `skip_rows` but it's for picking rows instead of skipping.
+
+ skip_rows (List[Union[int, str, dict]], optional):
+ List of row numbers, strings and regex patterns as dicts to skip.
+        If a string, rows whose first cell begins with it will be skipped, e.g. '#' and '//'.
+ To skip only completely blank rows use `{'type': 'preset', 'value': 'blank'}`
+ To provide a regex pattern use `{'type': 'regex', 'value': '^#'}`
+ For example: `skip_rows=[1, '# comment', {'type': 'regex', 'value': '^# (regex|comment)'}]`
+
+ pick_fields (str[]):
+ When passed, ignores all columns with headers
+ that the given list DOES NOT include
+
+ skip_fields (str[]):
+ When passed, ignores all columns with headers
+ that the given list includes. If it contains an empty string it will skip
+ empty headers
+
+ sample_size (int, optional):
+ Controls the number of sample rows used to
+ infer properties from the data (headers, encoding, etc.). Set to
+ ``0`` to disable sampling, in which case nothing will be inferred
+ from the data. Defaults to ``config.DEFAULT_SAMPLE_SIZE``.
+
+ bytes_sample_size (int, optional):
+ Same as `sample_size`, but instead
+ of number of rows, controls number of bytes. Defaults to
+ ``config.DEFAULT_BYTES_SAMPLE_SIZE``.
+
+ allow_html (bool, optional):
+ Allow the file source to be an HTML page.
+ If False, raises ``exceptions.FormatError`` if the loaded file is
+ an HTML page. Defaults to False.
+
+ multiline_headers_joiner (str, optional):
+ When passed, it's used to join multiline headers
+ as `<passed-value>.join(header1_1, header1_2)`
+ Defaults to ' ' (space).
+
+ multiline_headers_duplicates (bool, optional):
+        By default tabulator will exclude a cell of a multiline header from joining
+        if it's exactly the same as the previously seen value in this field.
+        Enabling this option will force the inclusion of duplicates.
+ Defaults to False.
+
+ force_strings (bool, optional):
+ When True, casts all data to strings.
+ Defaults to False.
+
+ force_parse (bool, optional):
+ When True, don't raise exceptions when
+ parsing malformed rows, simply returning an empty value. Defaults
+ to False.
+
+ post_parse (List[function], optional):
+ List of generator functions that
+        receive a list of rows and headers, process them, and yield
+        them (or not). Useful to pre-process the data. Defaults to None.
+
+ custom_loaders (dict, optional):
+ Dictionary with keys as scheme names,
+ and values as their respective ``Loader`` class implementations.
+ Defaults to None.
+
+ custom_parsers (dict, optional):
+ Dictionary with keys as format names,
+ and values as their respective ``Parser`` class implementations.
+ Defaults to None.
+
+    custom_writers (dict, optional):
+ Dictionary with keys as writer format
+ names, and values as their respective ``Writer`` class
+ implementations. Defaults to None.
+
+ **options (Any, optional): Extra options passed to the loaders and parsers.
+
+
+
+#### `stream.closed`
+Returns True if the underlying stream is closed, False otherwise.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `stream.compression`
+Stream's compression ("no" if no compression)
+
+__Returns__
+
+`str`: compression
+
+
+
+#### `stream.encoding`
+Stream's encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `stream.format`
+Path's format
+
+__Returns__
+
+`str`: format
+
+
+
+#### `stream.fragment`
+Path's fragment
+
+__Returns__
+
+`str`: fragment
+
+
+
+#### `stream.hash`
+Returns the SHA256 hash of the read chunks if available
+
+__Returns__
+
+`str/None`: SHA256 hash
+
+
+
+#### `stream.headers`
+Headers
+
+__Returns__
+
+`str[]/None`: headers if available
+
+
+
+#### `stream.sample`
+Returns the stream's rows used as sample.
+
+These sample rows are used internally to infer characteristics of the
+source file (e.g. encoding, headers, ...).
+
+__Returns__
+
+`list[]`: sample
+
+
+
+#### `stream.scheme`
+Path's scheme
+
+__Returns__
+
+`str`: scheme
+
+
+
+#### `stream.size`
+Returns the BYTE count of the read chunks if available
+
+__Returns__
+
+`int/None`: BYTE count
+
+
+
+#### `stream.source`
+Source
+
+__Returns__
+
+`any`: stream source
+
+
+
+#### `stream.open`
+```python
+stream.open()
+```
+Opens the stream for reading.
+
+__Raises__
+
+- `exceptions.TabulatorException`: if an error occurs.
+
+
+
+#### `stream.close`
+```python
+stream.close()
+```
+Closes the stream.
+
+
+#### `stream.reset`
+```python
+stream.reset()
+```
+Resets the stream pointer to the beginning of the file.
+
+
+#### `stream.iter`
+```python
+stream.iter(keyed=False, extended=False)
+```
+Iterate over the rows.
+
+Each row is returned in a format that depends on the arguments `keyed`
+and `extended`. By default, each row is returned as a list of its
+values.
+
+__Arguments__
+- __keyed (bool, optional)__:
+ When True, each returned row will be a
+ `dict` mapping the header name to its value in the current row.
+ For example, `[{'name': 'J Smith', 'value': '10'}]`. Ignored if
+ ``extended`` is True. Defaults to False.
+- __extended (bool, optional)__:
+ When True, returns each row as a tuple
+ with row number (starts at 1), list of headers, and list of row
+ values. For example, `(1, ['name', 'value'], ['J Smith', '10'])`.
+ Defaults to False.
+
+__Raises__
+- `exceptions.TabulatorException`: If the stream is closed.
+
+__Returns__
+
+`Iterator[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The row itself. The format depends on the values of `keyed` and
+ `extended` arguments.
+
+
+
+#### `stream.read`
+```python
+stream.read(keyed=False, extended=False, limit=None)
+```
+Returns a list of rows.
+
+__Arguments__
+- __keyed (bool, optional)__: See :func:`Stream.iter`.
+- __extended (bool, optional)__: See :func:`Stream.iter`.
+- __limit (int, optional)__:
+ Number of rows to return. If None, returns all rows. Defaults to None.
+
+__Returns__
+
+`List[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The list of rows. The format depends on the values of `keyed`
+ and `extended` arguments.
+
+
+#### `stream.save`
+```python
+stream.save(target, format=None, encoding=None, **options)
+```
+Save stream to the local filesystem.
+
+__Arguments__
+- __target (str)__: Path where to save the stream.
+- __format (str, optional)__:
+ The format the stream will be saved as. If
+ None, detects from the ``target`` path. Defaults to None.
+- __encoding (str, optional)__:
+ Saved file encoding. Defaults to ``config.DEFAULT_ENCODING``.
+- __**options__: Extra options passed to the writer.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
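+
+A minimal usage sketch, converting a CSV file to JSON (the paths are illustrative):
+
+```python
+from tabulator import Stream
+
+# Read data.csv and write it back out as data.json
+with Stream('data.csv', headers=1) as stream:
+    stream.save('data.json')
+```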
+
+### `Loader`
+```python
+Loader(self, bytes_sample_size, **options)
+```
+Abstract class implemented by the data loaders
+
+The loaders inherit and implement this class' methods to add support for a
+new scheme (e.g. ssh).
+
+__Arguments__
+- __bytes_sample_size (int)__: Sample size in bytes
+- __**options (dict)__: Loader options
+
+
+
+#### `loader.options`
+
+
+#### `loader.load`
+```python
+loader.load(source, mode='t', encoding=None)
+```
+Load source file.
+
+__Arguments__
+- __source (str)__: Path to tabular source file.
+- __mode (str, optional)__:
+ Text stream mode, `t` (text) or `b` (binary). Defaults to `t`.
+- __encoding (str, optional)__:
+ Source encoding. Auto-detect by default.
+
+__Returns__
+
+`Union[TextIO, BinaryIO]`: I/O stream opened either as text or binary.
+
+
+### `Parser`
+```python
+Parser(self, loader, force_parse, **options)
+```
+Abstract class implemented by the data parsers.
+
+The parsers inherit and implement this class' methods to add support for a
+new file type.
+
+__Arguments__
+- __loader (tabulator.Loader)__: Loader instance to read the file.
+- __force_parse (bool)__:
+ When `True`, the parser yields an empty extended
+ row tuple `(row_number, None, [])` when there is an error parsing a
+ row. Otherwise, it stops the iteration by raising the exception
+ `tabulator.exceptions.SourceError`.
+- __**options (dict)__: Loader options
+
+
+
+#### `parser.closed`
+Flag telling if the parser is closed.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `parser.encoding`
+Encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `parser.extended_rows`
+Returns extended rows iterator.
+
+The extended rows are tuples containing `(row_number, headers, row)`,
+
+__Raises__
+- `SourceError`:
+ If `force_parse` is `False` and
+ a row can't be parsed, this exception will be raised.
+ Otherwise, an empty extended row is returned (i.e.
+ `(row_number, None, [])`).
+
+__Returns__
+
+`Iterator[Tuple[int, List[str], List[Any]]]`:
+    Extended rows containing `(row_number, headers, row)`, where
+    `headers` is a list of the header names (can be `None`), and
+    `row` is a list of row values.
+
+
+
+#### `parser.options`
+
+
+#### `parser.open`
+```python
+parser.open(source, encoding=None)
+```
+Opens the underlying file stream at the beginning of the file.
+
+The parser gets a byte or text stream from the `tabulator.Loader`
+instance and starts emitting items.
+
+__Arguments__
+- __source (str)__: Path to source table.
+- __encoding (str, optional)__: Source encoding. Auto-detect by default.
+
+__Returns__
+
+ None
+
+
+
+#### `parser.close`
+```python
+parser.close()
+```
+Closes underlying file stream.
+
+
+#### `parser.reset`
+```python
+parser.reset()
+```
+Resets underlying stream and current items list.
+
+After `reset()` is called, iterating over the items will start from the beginning.
+
+### `Writer`
+```python
+Writer(self, **options)
+```
+Abstract class implemented by the data writers.
+
+The writers inherit and implement this class' methods to add support for a
+new file destination.
+
+__Arguments__
+- __**options (dict)__: Writer options.
+
+
+
+#### `writer.options`
+
+
+#### `writer.write`
+```python
+writer.write(source, target, headers, encoding=None)
+```
+Writes source data to target.
+
+__Arguments__
+- __source (str)__: Source data.
+- __target (str)__: Write target.
+- __headers (List[str])__: List of header names.
+- __encoding (str, optional)__: Source file encoding.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
+
+
+### `validate`
+```python
+validate(source, scheme=None, format=None)
+```
+Check if tabulator is able to load the source.
+
+__Arguments__
+- __source (Union[str, IO])__: The source path or IO object.
+- __scheme (str, optional)__: The source scheme. Auto-detect by default.
+- __format (str, optional)__: The source file format. Auto-detect by default.
+
+__Raises__
+- `SchemeError`: The file scheme is not supported.
+- `FormatError`: The file format is not supported.
+
+__Returns__
+
+`bool`: Whether tabulator is able to load the source file.
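+
+A minimal usage sketch (the path is illustrative):
+
+```python
+from tabulator import validate, exceptions
+
+try:
+    valid = validate('data.csv')
+except exceptions.SchemeError:
+    pass  # the scheme is not supported
+except exceptions.FormatError:
+    pass  # the format is not supported
+```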
+
+
+### `TabulatorException`
+```python
+TabulatorException()
+```
+Base class for all tabulator exceptions.
+
+
+### `SourceError`
+```python
+SourceError()
+```
+The source file could not be parsed correctly.
+
+
+### `SchemeError`
+```python
+SchemeError()
+```
+The file scheme is not supported.
+
+
+### `FormatError`
+```python
+FormatError()
+```
+The file format is unsupported or invalid.
+
+
+### `EncodingError`
+```python
+EncodingError()
+```
+Encoding error
+
+
+### `CompressionError`
+```python
+CompressionError()
+```
+Compression error
+
+
+### `IOError`
+```python
+IOError()
+```
+Local loading error
+
+
+### `LoadingError`
+```python
+LoadingError()
+```
+Loading error
+
+
+### `HTTPError`
+```python
+HTTPError()
+```
+Remote loading error
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+To run tests without Internet:
+
+```bash
+$ pytest -m 'not remote'
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/tabulator-py/commits/master).
+
+#### v1.53
+
+- Add support for raw_html extraction in html parser (#341)
+
+#### v1.52
+
+- Published stream.dialect (works only for csv, for now)
+
+#### v1.51
+
+- Added experimental table discovery options
+
+#### v1.50
+
+- Ensure that headers is always a list of strings
+
+#### v1.49
+
+- Added `workbook_cache` argument for XLSX formats
+
+#### v1.48
+
+- Published `stream.hashing_algorithm` property
+
+#### v1.47
+
+- Added `hashing_algorithm` parameter
+
+#### v1.46
+
+- Fixed multiline headers merging
+- Introduced a `multiline_headers_duplicates` flag
+
+#### v1.45
+
+- HTML format: adds support for empty selector (#321)
+
+#### v1.44
+
+- Exposed `stream.compression`
+
+#### v1.43
+
+- Exposed `stream.source`
+
+#### v1.42
+
+- Exposed format option to the CLI
+
+#### v1.41
+
+- Implemented a `pick_rows` parameter (opposite to `skip_rows`)
+
+#### v1.40
+
+- Implemented `stream.save()` returning count of written rows
+
+#### v1.39
+
+- Implemented JSON writer (#311)
+
+#### v1.38
+
+- Use `chardet` for encoding detection by default. For `cchardet`: `pip install tabulator[cchardet]`. Due to a great number of problems caused by `cchardet` for non-Linux/Conda installations, we're returning to `chardet` by default.
+
+#### v1.37
+
+- Raise IOError/HTTPError even if a non-existent file has a bad format (#304)
+
+#### v1.36
+
+- Implemented `blank` preset for `skip_rows` (#302)
+
+#### v1.35
+
+- Added `skip/pick_columns` aliases (#293)
+
+#### v1.34
+
+- Added `multiline_headers_joiner` argument (#291)
+
+#### v1.33
+
+- Added support for regex patterns in `skip_rows` (#290)
+
+#### v1.32
+
+- Added ability to skip columns (#293)
+
+#### v1.31
+
+- Added `xlsx` writer
+- Added `html` reader
+
+#### v1.30
+
+- Added `adjust_floating_point_error` parameter to the `xlsx` parser
+
+#### v1.29
+
+- Implemented the `stream.size` and `stream.hash` properties
+
+#### v1.28
+
+- Added SQL writer
+
+#### v1.27
+
+- Added `http_timeout` argument for the `http/https` format
+
+#### v1.26
+
+- Added `stream.fragment` field showing e.g. Excel sheet's or DP resource's name
+
+#### v1.25
+
+- Added support for the `s3` file scheme (data loading from AWS S3)
+
+#### v1.24
+
+- Added support for compressed file-like objects
+
+#### v1.23
+
+- Added a setter for the `stream.headers` property
+
+#### v1.22
+
+- The `headers` parameter will now use the first not skipped row if the `skip_rows` parameter is provided and there are comments on the top of a data source (see #264)
+
+#### v1.21
+
+- Implemented experimental `preserve_formatting` for xlsx
+
+#### v1.20
+
+- Added support for specifying filename in zip source
+
+#### v1.19
+
+Updated behaviour:
+- For `ods` format the boolean, integer and datetime native types are detected now
+
+#### v1.18
+
+Updated behaviour:
+- For `xls` format the boolean, integer and datetime native types are detected now
+
+#### v1.17
+
+Updated behaviour:
+- Added support for Python 3.7
+
+#### v1.16
+
+New API added:
+- `skip_rows` support for an empty string to skip rows with an empty first column
+
+#### v1.15
+
+New API added:
+- Format will be extracted from URLs like `http://example.com?format=csv`
+
+#### v1.14
+
+Updated behaviour:
+- Now `xls` booleans will be parsed as booleans not integers
+
+#### v1.13
+
+New API added:
+- The `skip_rows` argument now supports negative numbers to skip rows starting from the end
+
+#### v1.12
+
+Updated behaviour:
+- Instead of raising an exception, a `UserWarning` warning will be emitted if an option isn't recognized.
+
+#### v1.11
+
+New API added:
+- Added `http_session` argument for the `http/https` format (it uses `requests` now)
+- Added support for multiline headers: `headers` argument accept ranges like `[1,3]`
+
+#### v1.10
+
+New API added:
+- Added support for compressed files i.e. `zip` and `gz` on Python3
+- The `Stream` constructor now accepts a `compression` argument
+- The `http/https` scheme now accepts a `http_stream` flag
+
+#### v1.9
+
+Improved behaviour:
+- The `headers` argument allows setting the order for keyed sources and cherry-picking values
+
+#### v1.8
+
+New API added:
+- Formats `XLS/XLSX/ODS` supports sheet names passed via the `sheet` argument
+- The `Stream` constructor accepts an `ignore_blank_headers` option
+
+#### v1.7
+
+Improved behaviour:
+- Rebased `datapackage` format on `datapackage@1` library
+
+#### v1.6
+
+New API added:
+- Argument `source` for the `Stream` constructor can be a `pathlib.Path`
+
+#### v1.5
+
+New API added:
+- Argument `bytes_sample_size` for the `Stream` constructor
+
+#### v1.4
+
+Improved behaviour:
+- Updated encoding name to a canonical form
+
+#### v1.3
+
+New API added:
+- `stream.scheme`
+- `stream.format`
+- `stream.encoding`
+
+Promoted provisional API to stable API:
+- `Loader` (custom loaders)
+- `Parser` (custom parsers)
+- `Writer` (custom writers)
+- `validate`
+
+#### v1.2
+
+Improved behaviour:
+- Autodetect common CSV delimiters
+
+#### v1.1
+
+New API added:
+- Added `fill_merged_cells` option to `xls/xlsx` formats
+
+#### v1.0
+
+New API added:
+- published `Loader/Parser/Writer` API
+- Added `Stream` argument `force_strings`
+- Added `Stream` argument `force_parse`
+- Added `Stream` argument `custom_writers`
+
+Deprecated API removal:
+- removed `topen` and `Table` - use `Stream` instead
+- removed `Stream` arguments `loader/parser_options` - use `**options` instead
+
+Provisional API changed:
+- Updated the `Loader/Parser/Writer` API - please use an updated version
+
+#### v0.15
+
+Provisional API added:
+- Unofficial support for `Stream` arguments `custom_loaders/parsers`
+
+
+[stream.py]: tabulator/stream.py
+[examples-dir]: examples "Examples"
+[requests-session]: https://docs.python-requests.org/en/master/user/advanced/#session-objects
+[pydoc-csv]: https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters "Python CSV options"
+[sqlalchemy]: https://www.sqlalchemy.org/
+[tdp]: https://frictionlessdata.io/specs/tabular-data-package/ "Tabular Data Package"
+[tabulator.exceptions]: tabulator/exceptions.py "Tabulator Exceptions"
+
+
+
+%package -n python3-tabulator
+Summary: Consistent interface for stream reading and writing tabular data (csv/xls/json/etc)
+Provides: python-tabulator
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-tabulator
+# tabulator-py
+
+[![Travis](https://img.shields.io/travis/frictionlessdata/tabulator-py/master.svg)](https://travis-ci.org/frictionlessdata/tabulator-py)
+[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/tabulator-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/tabulator-py?branch=master)
+[![PyPi](https://img.shields.io/pypi/v/tabulator.svg)](https://pypi.python.org/pypi/tabulator)
+[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/tabulator-py)
+[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
+
+A library for reading and writing tabular data (csv/xls/json/etc).
+
+> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework is a logical continuation of `tabulator`, extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `tabulator` to Frictionless Framework.
+> - we continue to bug-fix `tabulator@1.x` in this [repository](https://github.com/frictionlessdata/tabulator-py), and it remains available on [PyPi](https://pypi.org/project/tabulator/) as before
+> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable
+> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
+
+## Features
+
+- **Supports most common tabular formats**: CSV, XLS, ODS, JSON, Google Sheets, SQL, and others. See complete list [below](#supported-file-formats).
+- **Loads local and remote data**: Supports HTTP, FTP and S3.
+- **Low memory usage**: Only the current row is kept in memory, so you can
+  process large datasets.
+- **Supports compressed files**: Using ZIP or GZIP algorithms.
+- **Extensible**: You can add support for custom file formats and loaders (e.g.
+ FTP).
+
+## Contents
+
+<!--TOC-->
+
+ - [Getting started](#getting-started)
+ - [Installation](#installation)
+ - [Running on CLI](#running-on-cli)
+ - [Running on Python](#running-on-python)
+ - [Documentation](#documentation)
+ - [Working with Stream](#working-with-stream)
+ - [Supported schemes](#supported-schemes)
+ - [Supported file formats](#supported-file-formats)
+ - [Custom file sources and formats](#custom-file-sources-and-formats)
+ - [API Reference](#api-reference)
+ - [`cli`](#cli)
+ - [`Stream`](#stream)
+ - [`Loader`](#loader)
+ - [`Parser`](#parser)
+ - [`Writer`](#writer)
+ - [`validate`](#validate)
+ - [`TabulatorException`](#tabulatorexception)
+ - [`SourceError`](#sourceerror)
+ - [`SchemeError`](#schemeerror)
+ - [`FormatError`](#formaterror)
+ - [`EncodingError`](#encodingerror)
+ - [`CompressionError`](#compressionerror)
+ - [`IOError`](#ioerror)
+ - [`LoadingError`](#loadingerror)
+ - [`HTTPError`](#httperror)
+ - [Contributing](#contributing)
+ - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting started
+
+### Installation
+
+```bash
+$ pip install tabulator
+```
+
+### Running on CLI
+
+Tabulator ships with a simple CLI called `tabulator` to read tabular data. For
+example:
+
+```bash
+$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
+id,name
+1,english
+2,中国人
+```
+
+You can see all supported options by running `tabulator --help`.
+
+### Running on Python
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+    stream.headers # [header1, header2, ..]
+    for row in stream:
+        print(row) # [value1, value2, ..]
+```
+
+You can find other examples in the [examples][examples-dir] directory.
+
+## Documentation
+
+In the following sections, we'll walk through some usage examples of
+this library. All examples were tested with Python 3.6, but should
+run fine with Python 3.3+.
+
+### Working with Stream
+
+The `Stream` class represents a tabular stream. It takes the file path as the
+`source` argument. For example:
+
+```
+<scheme>://path/to/file.<format>
+```
+
+It uses this path to determine the file format (e.g. CSV or XLS) and scheme
+(e.g. HTTP or PostgreSQL). It also supports format extraction from URLs like `http://example.com?format=csv`. If necessary, you can also define these explicitly.
+
+Let's try it out. First, we create a `Stream` object passing the path to a CSV file.
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+```
+
+At this point, the file hasn't been read yet. Let's open the stream so we can
+read the contents.
+
+```python
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+```
+
+This will open the underlying data stream, read a small sample to detect the
+file encoding, and prepare the data to be read. We catch
+`tabulator.TabulatorException` here, in case something goes wrong.
+
+We can now read the file contents. To iterate over each row, we do:
+
+```python
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+```
+
+The `stream.iter()` method will return each row data as a list of values. If
+you prefer, you could call `stream.iter(keyed=True)` instead, which returns a
+dictionary with the column names as keys. Either way, this method keeps only a
+single row in memory at a time. This means it can handle large files
+without consuming too much memory.
+
+If you want to read the entire file, use `stream.read()`. It accepts the same
+arguments as `stream.iter()`, but returns all rows at once.
+
+```python
+stream.reset()
+rows = stream.read()
+```
+
+Notice that we called `stream.reset()` before reading the rows. This is because
+internally, tabulator only keeps a pointer to its current location in the file.
+If we didn't reset this pointer, we would read starting from where we stopped.
+For example, if we ran `stream.read()` again, we would get an empty list, as
+the internal file pointer is at the end of the file (because we've already read
+it all). Depending on the file location, it might be necessary to download the
+file again to rewind (e.g. when the file was loaded from the web).
+
+After we're done, close the stream with:
+
+```python
+stream.close()
+```
+
+The entire example looks like:
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+
+stream.reset() # Rewind internal file pointer
+rows = stream.read()
+
+stream.close()
+```
+
+It could be rewritten to use Python's context manager interface as:
+
+```python
+import tabulator
+
+try:
+    with tabulator.Stream('data.csv') as stream:
+        for row in stream.iter():
+            print(row)
+
+        stream.reset()
+        rows = stream.read()
+except tabulator.TabulatorException as e:
+    pass
+```
+
+This is the preferred way, as Python closes the stream automatically, even if some exception was thrown along the way.
+
+The full API documentation is available as docstrings in the [Stream source code][stream.py].
+
+#### Headers
+
+By default, tabulator considers that all file rows are values (i.e. there is no
+header).
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.headers # None
+ stream.read() # [['name', 'age'], ['Alex', 21]]
+```
+
+If you have a header row, you can use the `headers` argument with its row
+number (starting from 1).
+
+```python
+# Integer
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+You can also pass a list of strings to define the headers explicitly:
+
+```python
+with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+Tabulator also supports multiline headers for the `xls` and `xlsx` formats.
+
+```python
+with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
+ stream.headers # ['header from row 1-3']
+ stream.read() # [['value1', 'value2', 'value3']]
+```
+
+#### Encoding
+
+You can specify the file encoding (e.g. `utf-8` and `latin1`) via the `encoding`
+argument.
+
+```python
+with Stream(source, encoding='latin1') as stream:
+ stream.read()
+```
+
+If this argument isn't set, Tabulator will try to infer it from the data. If you
+get a `UnicodeDecodeError` while loading a file, try setting the encoding to
+`utf-8`.
+
+#### Compression (Python3-only)
+
+Tabulator supports both ZIP and GZIP compression methods. By default it'll infer from the file name:
+
+```python
+with Stream('http://example.com/data.csv.zip') as stream:
+ stream.read()
+```
+
+You can also set it explicitly:
+
+```python
+with Stream('data.csv.ext', compression='gz') as stream:
+ stream.read()
+```
+
+**Options**
+
+- **filename**: filename in zip file to process (default is first file)
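+
+For instance, a minimal sketch (the archive name and inner filename are hypothetical) selecting a specific file inside a ZIP archive:
+
+```python
+from tabulator import Stream
+
+# 'archive.zip' is assumed to contain 'data.csv' among other files
+with Stream('archive.zip', format='csv', filename='data.csv') as stream:
+    stream.read()
+```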
+
+#### Allow html
+
+The `Stream` class raises `tabulator.exceptions.FormatError` if it detects HTML
+contents. This helps avoid the relatively common mistake of trying to load a
+CSV file inside an HTML page, for example on GitHub.
+
+You can disable this behaviour using the `allow_html` option:
+
+```python
+with Stream(source_with_html, allow_html=True) as stream:
+ stream.read() # no exception on open
+```
+
+#### Sample size
+
+To detect the file's headers, and run other checks like validating that the file
+doesn't contain HTML, Tabulator reads a sample of rows on the `stream.open()`
+method. This data is available via the `stream.sample` property. The number of
+rows used can be defined via the `sample_size` parameter (defaults to 100).
+
+```python
+with Stream(two_rows_source, sample_size=1) as stream:
+ stream.sample # only first row
+ stream.read() # first and second rows
+```
+
+You can disable this by setting `sample_size` to zero. This way, no data will be
+read on `stream.open()`.
+
+#### Bytes sample size
+
+Tabulator needs to read a part of the file to infer its encoding. The
+`bytes_sample_size` argument controls how many bytes will be read for this
+detection (defaults to 10000).
+
+```python
+source = 'data/special/latin1.csv'
+with Stream(source) as stream:
+ stream.encoding # 'iso8859-2'
+```
+
+You can disable this by setting `bytes_sample_size` to zero, in which case it'll
+use the machine locale's default encoding.
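+
+As a minimal sketch (the source path is hypothetical), disabling the detection entirely:
+
+```python
+with Stream('data.csv', bytes_sample_size=0) as stream:
+    stream.read()  # no bytes are sampled; the locale's default encoding is used
+```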
+
+#### Ignore blank headers
+
+When `True`, tabulator will ignore columns that have blank headers (defaults to
+`False`).
+
+```python
+# Default behaviour
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1) as stream:
+ stream.headers # ['header1', '', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}
+
+# Ignoring columns with blank headers
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
+ stream.headers # ['header1', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}
+```
+
+#### Ignore listed/not-listed headers
+
+This option is similar to `ignore_blank_headers`. It removes arbitrary columns from the data based on the corresponding column names:
+
+```python
+# Ignore listed headers (omit columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header1', 'header3']
+ assert stream.read(keyed=True) == [
+ {'header1': 'value1', 'header3': 'value3'},
+ ]
+
+# Ignore NOT listed headers (pick columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_not_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header2']
+ assert stream.read(keyed=True) == [
+ {'header2': 'value2'},
+ ]
+```
+
+#### Force strings
+
+When `True`, all rows' values will be converted to strings (defaults to
+`False`). `None` values will be converted to empty strings.
+
+```python
+# Default behaviour
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]) as stream:
+    stream.read() # [['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]]
+
+# Forcing rows' values as strings
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]], force_strings=True) as stream:
+    stream.read() # [['string', '1', '2017-12-01 17:00:00']]
+```
+
+#### Force parse
+
+When `True`, don't raise an exception when parsing a malformed row, but simply
+return an empty row. Otherwise, tabulator raises
+`tabulator.exceptions.SourceError` when a row can't be parsed. Defaults to `False`.
+
+```python
+# Default behaviour
+with Stream([[1], 'bad', [3]]) as stream:
+ stream.read() # raises tabulator.exceptions.SourceError
+
+# With force_parse
+with Stream([[1], 'bad', [3]], force_parse=True) as stream:
+ stream.read() # [[1], [], [3]]
+```
+
+#### Skip rows
+
+List of row numbers and/or strings to skip.
+If it's a string, all rows that begin with it will be skipped (e.g. '#' and '//').
+If it's the empty string, all rows that begin with an empty column will be skipped.
+
+```python
+source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
+with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
+ stream.read() # [['Mike', 4]]
+```
+
+If the `headers` parameter is also set to be an integer, it will use the first non-skipped row as the headers.
+
+```python
+source = [['#comment'], ['name', 'order'], ['John', 1], ['Alex', 2]]
+with Stream(source, headers=1, skip_rows=['#']) as stream:
+    stream.headers # ['name', 'order']
+    stream.read() # [['John', 1], ['Alex', 2]]
+```
+
+#### Post parse
+
+List of functions that can filter or transform rows after they are parsed. These
+functions receive the `extended_rows` containing the row's number, headers
+list, and the row values list. They then process the rows, and yield or discard
+them, modified or not.
+
+```python
+def skip_odd_rows(extended_rows):
+ for row_number, headers, row in extended_rows:
+ if not row_number % 2:
+ yield (row_number, headers, row)
+
+def multiply_by_two(extended_rows):
+ for row_number, headers, row in extended_rows:
+ doubled_row = list(map(lambda value: value * 2, row))
+ yield (row_number, headers, doubled_row)
+
+rows = [
+ [1],
+ [2],
+ [3],
+ [4],
+]
+with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
+ stream.read() # [[4], [8]]
+```
+
+These functions are applied in order, as a simple data pipeline. In the example
+above, `multiply_by_two` just sees the rows yielded by `skip_odd_rows`.
+
+#### Keyed and extended rows
+
+The methods `stream.iter()` and `stream.read()` accept the `keyed` and
+`extended` flag arguments to modify how the rows are returned.
+
+By default, every row is returned as a list of its cell values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.read() # [['Alex', 21]]
+```
+
+With `keyed=True`, the rows are returned as dictionaries, mapping the column names to their values in the row:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]
+```
+
+And with `extended=True`, the rows are returned as tuples of `(row_number,
+headers, row)`, where `row_number` is the current row number (starting from 1),
+`headers` is a list with the header names, and `row` is a list with the row
+values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+    stream.read(extended=True) # [(1, ['name', 'age'], ['Alex', 21])]
+```
+
+### Supported schemes
+
+#### s3
+
+It loads data from AWS S3. For private files you should provide credentials supported by the `boto3` library, for example, corresponding environment variables. Read more about [configuring `boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html).
+
+```python
+stream = Stream('s3://bucket/data.csv')
+```
+
+**Options**
+
+- **s3\_endpoint\_url** - the endpoint URL to use. By default it's `https://s3.amazonaws.com`. For complex use cases (for example, when `goodtables` runs on a data package), this option can also be provided via the `S3_ENDPOINT_URL` environment variable.
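+
+As a sketch, assuming a hypothetical S3-compatible service such as a local MinIO instance:
+
+```python
+from tabulator import Stream
+
+# point the loader at the S3-compatible endpoint instead of AWS
+stream = Stream('s3://bucket/data.csv', s3_endpoint_url='http://localhost:9000')
+```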
+
+#### file
+
+The default scheme, a file in the local filesystem.
+
+```python
+stream = Stream('data.csv')
+```
+
+#### http/https/ftp/ftps
+
+> In Python 2, `tabulator` can't stream remote data sources because of a limitation in the underlying libraries. The whole data source will be loaded into memory. In Python 3 there is no such problem and remote files are streamed.
+
+```python
+stream = Stream('https://example.com/data.csv')
+```
+
+**Options**
+
+- **http\_session** - a `requests.Session` object. Read more in the [requests docs][requests-session].
+- **http\_stream** - Enables or disables HTTP streaming, when possible (enabled by default). Disable it if you'd like to preload the whole file into memory.
+- **http\_timeout** - This timeout will be used for a `requests` session construction.
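+
+For example, a minimal sketch passing a pre-configured `requests.Session` (the header value is only illustrative):
+
+```python
+import requests
+from tabulator import Stream
+
+session = requests.Session()
+session.headers.update({'User-Agent': 'tabulator-example'})
+
+with Stream('https://example.com/data.csv', http_session=session) as stream:
+    stream.read()
+```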
+
+#### stream
+
+The source is a file-like Python object.
+
+
+```python
+with open('data.csv') as fp:
+ stream = Stream(fp)
+```
+
+#### text
+
+The source is a string containing the tabular data. Both `scheme` and `format`
+must be set explicitly, as it's not possible to infer them.
+
+```python
+stream = Stream(
+ 'name,age\nJohn, 21\n',
+ scheme='text',
+ format='csv'
+)
+```
+
+### Supported file formats
+
+In this section, we'll describe the supported file formats, and their respective
+configuration options and operations. Some formats only support read operations,
+while others support both reading and writing.
+
+#### csv (read & write)
+
+```python
+stream = Stream('data.csv', delimiter=',')
+```
+
+**Options**
+
+It supports all options from the Python CSV library. Check [their
+documentation][pydoc-csv] for more information.
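+
+Since the format supports writing, here is a short sketch (paths are hypothetical) converting a semicolon-delimited file to a regular CSV via `stream.save()`:
+
+```python
+from tabulator import Stream
+
+with Stream('data-semicolon.csv', delimiter=';', headers=1) as stream:
+    stream.save('data-comma.csv')  # format detected from the target path
+```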
+
+#### xls/xlsx (read & write)
+
+> Tabulator is unable to stream `xls` files, so the entire file is loaded in
+> memory. Streaming is supported for `xlsx` files.
+
+```python
+stream = Stream('data.xls', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1).
+- **workbook_cache**: An empty dictionary to handle workbook caching. If provided, `tabulator` will fill the dictionary with `source: tmpfile_path` pairs for remote workbooks. Each workbook will be downloaded only once and all the temporary files will be deleted on process exit. Defaults to `None`.
+- **fill_merged_cells**: if `True` it will unmerge and fill all merged cells with
+  the visible value. With this option enabled the parser can't stream data and
+  loads the whole document into memory.
+- **preserve_formatting**: if `True` it will try to preserve the text formatting of numeric and temporal cells, returning them as strings that look the way they do in the spreadsheet (EXPERIMENTAL)
+- **adjust_floating_point_error**: if `True` it will correct the Excel behaviour regarding floating point numbers
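+
+As a sketch of `workbook_cache` (the URL is hypothetical), reusing one remote download across several reads:
+
+```python
+from tabulator import Stream
+
+workbook_cache = {}
+for _ in range(3):
+    with Stream('https://example.com/data.xlsx', workbook_cache=workbook_cache) as stream:
+        stream.read()  # the workbook is downloaded only on the first iteration
+```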
+
+#### ods (read only)
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `ods` extra: `$ pip install tabulator[ods]`
+
+The source should be a valid OpenDocument (ODS) spreadsheet.
+
+```python
+stream = Stream('data.ods', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1)
+
+#### gsheet (read only)
+
+A publicly-accessible Google Spreadsheet.
+
+```python
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>?usp=sharing')
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>edit#gid=<gid>')
+```
+
+#### sql (read & write)
+
+Any database URL supported by [sqlalchemy][sqlalchemy].
+
+```python
+stream = Stream('postgresql://name:pass@host:5432/database', table='data')
+```
+
+**Options**
+
+- **table (required)**: Database table name
+- **order_by**: SQL expression for row ordering (e.g. `name DESC`)
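+
+For instance, a minimal sketch (the SQLite URL and table name are hypothetical) reading rows in a fixed order:
+
+```python
+from tabulator import Stream
+
+with Stream('sqlite:///data.db', table='data', order_by='id') as stream:
+    stream.read()
+```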
+
+#### Data Package (read only)
+
+> This format is not included in the package by default. You can enable it by
+> installing tabulator with `pip install tabulator[datapackage]`.
+
+A [Tabular Data Package][tdp].
+
+```python
+stream = Stream('datapackage.json', resource=1)
+```
+
+**Options**
+
+- **resource**: Resource name or index (starting from 0)
+
+#### inline (read only)
+
+Either a list of lists, or a list of dicts mapping the column names to their
+respective values.
+
+```python
+stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
+stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])
+```
+
+#### json (read & write)
+
+JSON document containing a list of lists, or a list of dicts mapping the column
+names to their respective values (see the `inline` format for an example).
+
+```python
+stream = Stream('data.json', property='key1.key2')
+```
+
+**Options**
+
+- **property**: JSON Path to the property containing the tabular data. For example, considering the JSON `{"response": {"data": [...]}}`, the `property` should be set to `response.data`.
+- **keyed** (write): Save as array of arrays (default) or as array of dicts (keyed).
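+
+A short sketch of the `keyed` write option (the target path is hypothetical):
+
+```python
+from tabulator import Stream
+
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.save('output.json', keyed=True)  # saved as [{"name": "Alex", "age": 21}]
+```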
+
+#### ndjson (read only)
+
+```python
+stream = Stream('data.ndjson')
+```
+
+#### tsv (read only)
+
+```python
+stream = Stream('data.tsv')
+```
+
+#### html (read only)
+
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `html` extra: `$ pip install tabulator[html]`
+
+An HTML table element residing inside an HTML document.
+
+Supports simple tables (no merged cells) with any legal combination of the `td`, `th`, `tbody` and `thead` elements.
+
+Usually `format='html'` needs to be specified explicitly, as web URLs don't always use the `.html` extension.
+
+```python
+stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1', raw_html=True)
+```
+
+**Options**
+
+- **selector**: CSS selector for specifying which `table` element to extract. By default it's `table`, which takes the first `table` element in the document. If empty, will assume the entire page is the table to be extracted (useful with some Excel formats).
+
+- **raw_html**: False (default) to extract the textual contents of each cell. True to return the inner html without modification.
+
+### Custom file sources and formats
+
+Tabulator is written with extensibility in mind, allowing you to add support for
+new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There
+are three components that allow this:
+
+* Loaders
+ * Loads a stream from some location (e.g. ssh)
+* Parsers
+ * Parses a stream of tabular data in some format (e.g. xls)
+* Writers
+ * Writes tabular data to some destination (e.g. MongoDB)
+
+In this section, we'll see how to write custom classes to extend any of these components.
+
+#### Custom loaders
+
+You can add support for a new scheme (e.g. ssh) by creating a custom loader.
+Custom loaders are implemented by inheriting from the `Loader` class, and
+implementing its methods. This loader can then be used by `Stream` to load data
+by passing it via the `custom_loaders={'scheme': CustomLoader}` argument.
+
+The skeleton of a custom loader looks like:
+
+```python
+from tabulator import Loader
+
+class CustomLoader(Loader):
+ options = []
+
+ def __init__(self, bytes_sample_size, **options):
+ pass
+
+ def load(self, source, mode='t', encoding=None):
+        # load logic
+        pass
+
+with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
+ stream.read()
+```
+
+You can see examples of how the loaders are implemented by looking in the
+`tabulator.loaders` module.
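+
+As a concrete (if contrived) sketch, here is a loader for a hypothetical `env` scheme that reads CSV text from an environment variable:
+
+```python
+import io
+import os
+
+from tabulator import Loader, Stream
+
+class EnvLoader(Loader):
+    options = []
+
+    def __init__(self, bytes_sample_size, **options):
+        self.__bytes_sample_size = bytes_sample_size
+
+    def load(self, source, mode='t', encoding=None):
+        # 'env://VARIABLE' -> read the tabular text from os.environ['VARIABLE']
+        text = os.environ[source.replace('env://', '')]
+        if mode == 'b':
+            return io.BytesIO(text.encode(encoding or 'utf-8'))
+        return io.StringIO(text)
+
+os.environ['DATA'] = 'name,age\nAlex,21\n'
+with Stream('env://DATA', format='csv', custom_loaders={'env': EnvLoader}) as stream:
+    stream.read()  # [['name', 'age'], ['Alex', '21']]
+```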
+
+#### Custom parsers
+
+You can add support for a new file format by creating a custom parser. Similarly
+to custom loaders, custom parsers are implemented by inheriting from the
+`Parser` class, and implementing its methods. This parser can then be used by
+`Stream` to parse data by passing it via the `custom_parsers={'format':
+CustomParser}` argument.
+
+The skeleton of a custom parser looks like:
+
+```python
+from tabulator import Parser
+
+class CustomParser(Parser):
+ options = []
+
+ def __init__(self, loader, force_parse, **options):
+ self.__loader = loader
+
+    def open(self, source, encoding=None):
+        # open logic
+        pass
+
+    def close(self):
+        # close logic
+        pass
+
+    def reset(self):
+        # reset logic
+        pass
+
+    @property
+    def closed(self):
+        return False
+
+    @property
+    def extended_rows(self):
+        # extended rows logic
+        pass
+
+with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
+ stream.read()
+```
+
+You can see examples of how parsers are implemented by looking in the
+`tabulator.parsers` module.
+
+#### Custom writers
+
+You can add support to write files in a specific format by creating a custom
+writer. The custom writers are implemented by inheriting from the base `Writer`
+class, and implementing its methods. This writer can then be used by `Stream` to
+write data via the `custom_writers={'format': CustomWriter}` argument.
+
+The skeleton of a custom writer looks like:
+
+```python
+from tabulator import Writer
+
+class CustomWriter(Writer):
+ options = []
+
+ def __init__(self, **options):
+ pass
+
+ def write(self, source, target, headers=None, encoding=None):
+        # write logic
+        pass
+
+with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
+ stream.save(target)
+```
+
+You can see examples of how writers are implemented by looking in the
+`tabulator.writers` module.
+
+## API Reference
+
+### `cli`
+```python
+cli(source, limit, **options)
+```
+Command-line interface
+
+```
+Usage: tabulator [OPTIONS] SOURCE
+
+Options:
+ --headers INTEGER
+ --scheme TEXT
+ --format TEXT
+ --encoding TEXT
+ --limit INTEGER
+ --sheet TEXT/INTEGER (excel)
+ --fill-merged-cells BOOLEAN (excel)
+ --preserve-formatting BOOLEAN (excel)
+ --adjust-floating-point-error BOOLEAN (excel)
+ --table TEXT (sql)
+ --order_by TEXT (sql)
+ --resource TEXT/INTEGER (datapackage)
+ --property TEXT (json)
+ --keyed BOOLEAN (json)
+ --version Show the version and exit.
+ --help Show this message and exit.
+```
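+
+For instance, a hypothetical invocation combining a few of these flags:
+
+```bash
+$ tabulator data.csv --headers 1 --limit 10
+```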
+
+
+### `Stream`
+```python
+Stream(self,
+ source,
+ headers=None,
+ scheme=None,
+ format=None,
+ encoding=None,
+ compression=None,
+ allow_html=False,
+ sample_size=100,
+ bytes_sample_size=10000,
+ ignore_blank_headers=False,
+ ignore_listed_headers=None,
+ ignore_not_listed_headers=None,
+ multiline_headers_joiner=' ',
+ multiline_headers_duplicates=False,
+ force_strings=False,
+ force_parse=False,
+ pick_rows=None,
+ skip_rows=None,
+ pick_fields=None,
+ skip_fields=None,
+ pick_columns=None,
+ skip_columns=None,
+ post_parse=[],
+ custom_loaders={},
+ custom_parsers={},
+ custom_writers={},
+ **options)
+```
+Stream of tabular data.
+
+This is the main `tabulator` class. It loads a data source, and allows you
+to stream its parsed contents.
+
+__Arguments__
+
+
+ source (str):
+ Path to file as ``<scheme>://path/to/file.<format>``.
+ If not explicitly set, the scheme (file, http, ...) and
+ format (csv, xls, ...) are inferred from the source string.
+
+ headers (Union[int, List[int], List[str]], optional):
+ Either a row
+ number or list of row numbers (in case of multi-line headers) to be
+ considered as headers (rows start counting at 1), or the actual
+ headers defined a list of strings. If not set, all rows will be
+ treated as containing values.
+
+ scheme (str, optional):
+ Scheme for loading the file (file, http, ...).
+ If not set, it'll be inferred from `source`.
+
+ format (str, optional):
+ File source's format (csv, xls, ...). If not
+        set, it'll be inferred from `source`.
+
+ encoding (str, optional):
+ Source encoding. If not set, it'll be inferred.
+
+ compression (str, optional):
+ Source file compression (zip, ...). If not set, it'll be inferred.
+
+ pick_rows (List[Union[int, str, dict]], optional):
+ The same as `skip_rows` but it's for picking rows instead of skipping.
+
+ skip_rows (List[Union[int, str, dict]], optional):
+ List of row numbers, strings and regex patterns as dicts to skip.
+        If a string, it'll skip rows whose first cell begins with it, e.g. '#' and '//'.
+ To skip only completely blank rows use `{'type': 'preset', 'value': 'blank'}`
+ To provide a regex pattern use `{'type': 'regex', 'value': '^#'}`
+ For example: `skip_rows=[1, '# comment', {'type': 'regex', 'value': '^# (regex|comment)'}]`
+
+ pick_fields (str[]):
+ When passed, ignores all columns with headers
+ that the given list DOES NOT include
+
+ skip_fields (str[]):
+ When passed, ignores all columns with headers
+ that the given list includes. If it contains an empty string it will skip
+ empty headers
+
+ sample_size (int, optional):
+ Controls the number of sample rows used to
+ infer properties from the data (headers, encoding, etc.). Set to
+ ``0`` to disable sampling, in which case nothing will be inferred
+ from the data. Defaults to ``config.DEFAULT_SAMPLE_SIZE``.
+
+ bytes_sample_size (int, optional):
+ Same as `sample_size`, but instead
+ of number of rows, controls number of bytes. Defaults to
+ ``config.DEFAULT_BYTES_SAMPLE_SIZE``.
+
+ allow_html (bool, optional):
+ Allow the file source to be an HTML page.
+ If False, raises ``exceptions.FormatError`` if the loaded file is
+ an HTML page. Defaults to False.
+
+ multiline_headers_joiner (str, optional):
+ When passed, it's used to join multiline headers
+        as `<passed-value>.join(header1_1, header1_2)`.
+ Defaults to ' ' (space).
+
+ multiline_headers_duplicates (bool, optional):
+        By default tabulator will exclude a cell of a multiline header from joining
+        if it's exactly the same as the previously seen value in this field.
+        Enabling this option will force the inclusion of duplicates.
+        Defaults to False.
+
+ force_strings (bool, optional):
+ When True, casts all data to strings.
+ Defaults to False.
+
+ force_parse (bool, optional):
+ When True, don't raise exceptions when
+ parsing malformed rows, simply returning an empty value. Defaults
+ to False.
+
+ post_parse (List[function], optional):
+        List of generator functions that
+        receive a list of rows and headers, process them, and yield
+        them (or not). Useful to pre-process the data. Defaults to None.
+
+ custom_loaders (dict, optional):
+ Dictionary with keys as scheme names,
+ and values as their respective ``Loader`` class implementations.
+ Defaults to None.
+
+ custom_parsers (dict, optional):
+ Dictionary with keys as format names,
+ and values as their respective ``Parser`` class implementations.
+ Defaults to None.
+
+    custom_writers (dict, optional):
+ Dictionary with keys as writer format
+ names, and values as their respective ``Writer`` class
+ implementations. Defaults to None.
+
+ **options (Any, optional): Extra options passed to the loaders and parsers.
+
+
+
+#### `stream.closed`
+Returns True if the underlying stream is closed, False otherwise.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `stream.compression`
+Stream's compression ("no" if no compression)
+
+__Returns__
+
+`str`: compression
+
+
+
+#### `stream.encoding`
+Stream's encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `stream.format`
+Path's format
+
+__Returns__
+
+`str`: format
+
+
+
+#### `stream.fragment`
+Path's fragment
+
+__Returns__
+
+`str`: fragment
+
+
+
+#### `stream.hash`
+Returns the SHA256 hash of the read chunks if available
+
+__Returns__
+
+`str/None`: SHA256 hash
+
+
+
+#### `stream.headers`
+Headers
+
+__Returns__
+
+`str[]/None`: headers if available
+
+
+
+#### `stream.sample`
+Returns the stream's rows used as sample.
+
+These sample rows are used internally to infer characteristics of the
+source file (e.g. encoding, headers, ...).
+
+__Returns__
+
+`list[]`: sample
+
+
+
+#### `stream.scheme`
+Path's scheme
+
+__Returns__
+
+`str`: scheme
+
+
+
+#### `stream.size`
+Returns the BYTE count of the read chunks if available
+
+__Returns__
+
+`int/None`: BYTE count
+
+
+
+#### `stream.source`
+Source
+
+__Returns__
+
+`any`: stream source
+
+
+
+#### `stream.open`
+```python
+stream.open()
+```
+Opens the stream for reading.
+
+__Raises__
+
+    TabulatorException: if an error occurs opening the stream
+
+
+
+#### `stream.close`
+```python
+stream.close()
+```
+Closes the stream.
+
+
+#### `stream.reset`
+```python
+stream.reset()
+```
+Resets the stream pointer to the beginning of the file.
+
+
+#### `stream.iter`
+```python
+stream.iter(keyed=False, extended=False)
+```
+Iterate over the rows.
+
+Each row is returned in a format that depends on the arguments `keyed`
+and `extended`. By default, each row is returned as a list of its
+values.
+
+__Arguments__
+- __keyed (bool, optional)__:
+ When True, each returned row will be a
+ `dict` mapping the header name to its value in the current row.
+ For example, `[{'name': 'J Smith', 'value': '10'}]`. Ignored if
+ ``extended`` is True. Defaults to False.
+- __extended (bool, optional)__:
+ When True, returns each row as a tuple
+ with row number (starts at 1), list of headers, and list of row
+ values. For example, `(1, ['name', 'value'], ['J Smith', '10'])`.
+ Defaults to False.
+
+__Raises__
+- `exceptions.TabulatorException`: If the stream is closed.
+
+__Returns__
+
+`Iterator[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The row itself. The format depends on the values of `keyed` and
+ `extended` arguments.
+
+
+
+#### `stream.read`
+```python
+stream.read(keyed=False, extended=False, limit=None)
+```
+Returns a list of rows.
+
+__Arguments__
+- __keyed (bool, optional)__: See :func:`Stream.iter`.
+- __extended (bool, optional)__: See :func:`Stream.iter`.
+- __limit (int, optional)__:
+ Number of rows to return. If None, returns all rows. Defaults to None.
+
+__Returns__
+
+`List[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The list of rows. The format depends on the values of `keyed`
+ and `extended` arguments.
+
+
+#### `stream.save`
+```python
+stream.save(target, format=None, encoding=None, **options)
+```
+Save stream to the local filesystem.
+
+__Arguments__
+- __target (str)__: Path where to save the stream.
+- __format (str, optional)__:
+ The format the stream will be saved as. If
+ None, detects from the ``target`` path. Defaults to None.
+- __encoding (str, optional)__:
+ Saved file encoding. Defaults to ``config.DEFAULT_ENCODING``.
+- __**options__: Extra options passed to the writer.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
+
+### `Loader`
+```python
+Loader(self, bytes_sample_size, **options)
+```
+Abstract class implemented by the data loaders
+
+The loaders inherit and implement this class' methods to add support for a
+new scheme (e.g. ssh).
+
+__Arguments__
+- __bytes_sample_size (int)__: Sample size in bytes
+- __**options (dict)__: Loader options
+
+
+
+#### `loader.options`
+
+
+#### `loader.load`
+```python
+loader.load(source, mode='t', encoding=None)
+```
+Load source file.
+
+__Arguments__
+- __source (str)__: Path to tabular source file.
+- __mode (str, optional)__:
+ Text stream mode, `t` (text) or `b` (binary). Defaults to `t`.
+- __encoding (str, optional)__:
+ Source encoding. Auto-detect by default.
+
+__Returns__
+
+`Union[TextIO, BinaryIO]`: I/O stream opened either as text or binary.
+
+
+### `Parser`
+```python
+Parser(self, loader, force_parse, **options)
+```
+Abstract class implemented by the data parsers.
+
+The parsers inherit and implement this class' methods to add support for a
+new file type.
+
+__Arguments__
+- __loader (tabulator.Loader)__: Loader instance to read the file.
+- __force_parse (bool)__:
+ When `True`, the parser yields an empty extended
+ row tuple `(row_number, None, [])` when there is an error parsing a
+ row. Otherwise, it stops the iteration by raising the exception
+ `tabulator.exceptions.SourceError`.
+- __**options (dict)__: Parser options
+
+
+
+#### `parser.closed`
+Flag telling if the parser is closed.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `parser.encoding`
+Encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `parser.extended_rows`
+Returns extended rows iterator.
+
+The extended rows are tuples containing `(row_number, headers, row)`.
+
+__Raises__
+- `SourceError`:
+ If `force_parse` is `False` and
+ a row can't be parsed, this exception will be raised.
+ Otherwise, an empty extended row is returned (i.e.
+ `(row_number, None, [])`).
+
+__Returns__
+
+`Iterator[Tuple[int, List[str], List[Any]]]`:
+    Extended rows containing
+    `(row_number, headers, row)`, where `headers` is a list of the
+    header names (can be `None`), and `row` is a list of row
+    values.
+
+
+
+#### `parser.options`
+
+
+#### `parser.open`
+```python
+parser.open(source, encoding=None)
+```
+Open underlying file stream at the beginning of the file.
+
+The parser gets a byte or text stream from the `tabulator.Loader`
+instance and starts emitting items.
+
+__Arguments__
+- __source (str)__: Path to source table.
+- __encoding (str, optional)__: Source encoding. Auto-detect by default.
+
+__Returns__
+
+ None
+
+
+
+#### `parser.close`
+```python
+parser.close()
+```
+Closes underlying file stream.
+
+
+#### `parser.reset`
+```python
+parser.reset()
+```
+Resets underlying stream and current items list.
+
+After `reset()` is called, iterating over the items will start from the beginning.
+
+### `Writer`
+```python
+Writer(self, **options)
+```
+Abstract class implemented by the data writers.
+
+The writers inherit and implement this class' methods to add support for a
+new file destination.
+
+__Arguments__
+- __**options (dict)__: Writer options.
+
+
+
+#### `writer.options`
+
+
+#### `writer.write`
+```python
+writer.write(source, target, headers, encoding=None)
+```
+Writes source data to target.
+
+__Arguments__
+- __source (str)__: Source data.
+- __target (str)__: Write target.
+- __headers (List[str])__: List of header names.
+- __encoding (str, optional)__: Source file encoding.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
+
+
+### `validate`
+```python
+validate(source, scheme=None, format=None)
+```
+Check if tabulator is able to load the source.
+
+__Arguments__
+- __source (Union[str, IO])__: The source path or IO object.
+- __scheme (str, optional)__: The source scheme. Auto-detect by default.
+- __format (str, optional)__: The source file format. Auto-detect by default.
+
+__Raises__
+- `SchemeError`: The file scheme is not supported.
+- `FormatError`: The file format is not supported.
+
+__Returns__
+
+`bool`: Whether tabulator is able to load the source file.
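+
+A minimal usage sketch (the path is hypothetical):
+
+```python
+from tabulator import validate, exceptions
+
+try:
+    validate('data.csv')
+except exceptions.SchemeError:
+    pass  # the scheme is not supported
+except exceptions.FormatError:
+    pass  # the format is not supported
+```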
+
+
+### `TabulatorException`
+```python
+TabulatorException()
+```
+Base class for all tabulator exceptions.
+
+
+### `SourceError`
+```python
+SourceError()
+```
+The source file could not be parsed correctly.
+
+
+### `SchemeError`
+```python
+SchemeError()
+```
+The file scheme is not supported.
+
+
+### `FormatError`
+```python
+FormatError()
+```
+The file format is unsupported or invalid.
+
+
+### `EncodingError`
+```python
+EncodingError()
+```
+Encoding error.
+
+
+### `CompressionError`
+```python
+CompressionError()
+```
+Compression error.
+
+
+### `IOError`
+```python
+IOError()
+```
+Local loading error.
+
+
+### `LoadingError`
+```python
+LoadingError()
+```
+General loading error.
+
+
+### `HTTPError`
+```python
+HTTPError()
+```
+Remote loading error.
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+To run tests without Internet:
+
+```bash
+$ pytest -m 'not remote'
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/tabulator-py/commits/master).
+
+#### v1.53
+
+- Add support for raw_html extraction in html parser (#341)
+
+#### v1.52
+
+- Published stream.dialect (works only for csv, for now)
+
+#### v1.51
+
+- Added experimental table discovery options
+
+#### v1.50
+
+- Ensure that headers is always a list of strings
+
+#### v1.49
+
+- Added `workbook_cache` argument for XLSX formats
+
+#### v1.48
+
+- Published `stream.hashing_algorithm` property
+
+#### v1.47
+
+- Added `hashing_algorithm` parameter
+
+#### v1.46
+
+- Fixed multiline headers merging
+- Introduced a `multiline_headers_duplicates` flag
+
+#### v1.45
+
+- HTML format: adds support for empty selector (#321)
+
+#### v1.44
+
+- Exposed `stream.compression`
+
+#### v1.43
+
+- Exposed `stream.source`
+
+#### v1.42
+
+- Exposed format option to the CLI
+
+#### v1.41
+
+- Implemented a `pick_rows` parameter (opposite to `skip_rows`)
+
+#### v1.40
+
+- Implemented `stream.save()` returning count of written rows
+
+#### v1.39
+
+- Implemented JSON writer (#311)
+
+#### v1.38
+
+- Use `chardet` for encoding detection by default. For `cchardet`: `pip install tabulator[cchardet]`. Due to a great deal of problems caused by `cchardet` on non-Linux/Conda installations, we're returning to `chardet` by default.
+
+#### v1.37
+
+- Raise IOError/HTTPError even when a non-existent file has a bad format (#304)
+
+#### v1.36
+
+- Implemented `blank` preset for `skip_rows` (#302)
+
+#### v1.35
+
+- Added `skip/pick_columns` aliases (#293)
+
+#### v1.34
+
+- Added `multiline_headers_joiner` argument (#291)
+
+#### v1.33
+
+- Added support for regex patterns in `skip_rows` (#290)
+
+#### v1.32
+
+- Added ability to skip columns (#293)
+
+#### v1.31
+
+- Added `xlsx` writer
+- Added `html` reader
+
+#### v1.30
+
+- Added `adjust_floating_point_error` parameter to the `xlsx` parser
+
+#### v1.29
+
+- Implemented the `stream.size` and `stream.hash` properties
+
+#### v1.28
+
+- Added SQL writer
+
+#### v1.27
+
+- Added `http_timeout` argument for the `http/https` format
+
+#### v1.26
+
+- Added `stream.fragment` field showing e.g. Excel sheet's or DP resource's name
+
+#### v1.25
+
+- Added support for the `s3` file scheme (data loading from AWS S3)
+
+#### v1.24
+
+- Added support for compressed file-like objects
+
+#### v1.23
+
+- Added a setter for the `stream.headers` property
+
+#### v1.22
+
+- The `headers` parameter will now use the first non-skipped row if the `skip_rows` parameter is provided and there are comments at the top of a data source (see #264)
+
+#### v1.21
+
+- Implemented experimental `preserve_formatting` for xlsx
+
+#### v1.20
+
+- Added support for specifying filename in zip source
+
+#### v1.19
+
+Updated behaviour:
+- For `ods` format the boolean, integer and datetime native types are detected now
+
+#### v1.18
+
+Updated behaviour:
+- For `xls` format the boolean, integer and datetime native types are detected now
+
+#### v1.17
+
+Updated behaviour:
+- Added support for Python 3.7
+
+#### v1.16
+
+New API added:
+- `skip_rows` support for an empty string to skip rows with an empty first column
+
+#### v1.15
+
+New API added:
+- Format will be extracted from URLs like `http://example.com?format=csv`
+
+#### v1.14
+
+Updated behaviour:
+- Now `xls` booleans will be parsed as booleans not integers
+
+#### v1.13
+
+New API added:
+- The `skip_rows` argument now supports negative numbers to skip rows starting from the end
+
+#### v1.12
+
+Updated behaviour:
+- Instead of raising an exception, a `UserWarning` warning will be emitted if an option isn't recognized.
+
+#### v1.11
+
+New API added:
+- Added `http_session` argument for the `http/https` format (it uses `requests` now)
+- Added support for multiline headers: `headers` argument accept ranges like `[1,3]`
+
+#### v1.10
+
+New API added:
+- Added support for compressed files i.e. `zip` and `gz` on Python3
+- The `Stream` constructor now accepts a `compression` argument
+- The `http/https` scheme now accepts a `http_stream` flag
+
+#### v1.9
+
+Improved behaviour:
+- The `headers` argument allows setting the order for keyed sources and cherry-picking values
+
+#### v1.8
+
+New API added:
+- Formats `XLS/XLSX/ODS` supports sheet names passed via the `sheet` argument
+- The `Stream` constructor accepts an `ignore_blank_headers` option
+
+#### v1.7
+
+Improved behaviour:
+- Rebased `datapackage` format on `datapackage@1` library
+
+#### v1.6
+
+New API added:
+- Argument `source` for the `Stream` constructor can be a `pathlib.Path`
+
+#### v1.5
+
+New API added:
+- Argument `bytes_sample_size` for the `Stream` constructor
+
+#### v1.4
+
+Improved behaviour:
+- Updated encoding name to a canonical form
+
+#### v1.3
+
+New API added:
+- `stream.scheme`
+- `stream.format`
+- `stream.encoding`
+
+Promoted provisional API to stable API:
+- `Loader` (custom loaders)
+- `Parser` (custom parsers)
+- `Writer` (custom writers)
+- `validate`
+
+#### v1.2
+
+Improved behaviour:
+- Autodetect common CSV delimiters
+
+#### v1.1
+
+New API added:
+- Added `fill_merged_cells` option to `xls/xlsx` formats
+
+#### v1.0
+
+New API added:
+- published `Loader/Parser/Writer` API
+- Added `Stream` argument `force_strings`
+- Added `Stream` argument `force_parse`
+- Added `Stream` argument `custom_writers`
+
+Deprecated API removal:
+- removed `topen` and `Table` - use `Stream` instead
+- removed `Stream` arguments `loader/parser_options` - use `**options` instead
+
+Provisional API changed:
+- Updated the `Loader/Parser/Writer` API - please use an updated version
+
+#### v0.15
+
+Provisional API added:
+- Unofficial support for `Stream` arguments `custom_loaders/parsers`
+
+
+[stream.py]: tabulator/stream.py
+[examples-dir]: examples "Examples"
+[requests-session]: https://docs.python-requests.org/en/master/user/advanced/#session-objects
+[pydoc-csv]: https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters "Python CSV options"
+[sqlalchemy]: https://www.sqlalchemy.org/
+[tdp]: https://frictionlessdata.io/specs/tabular-data-package/ "Tabular Data Package"
+[tabulator.exceptions]: tabulator/exceptions.py "Tabulator Exceptions"
+
+
+
+%package help
+Summary: Development documents and examples for tabulator
+Provides: python3-tabulator-doc
+%description help
+# tabulator-py
+
+[![Travis](https://img.shields.io/travis/frictionlessdata/tabulator-py/master.svg)](https://travis-ci.org/frictionlessdata/tabulator-py)
+[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/tabulator-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/tabulator-py?branch=master)
+[![PyPi](https://img.shields.io/pypi/v/tabulator.svg)](https://pypi.python.org/pypi/tabulator)
+[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/tabulator-py)
+[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
+
+A library for reading and writing tabular data (csv/xls/json/etc).
+
+> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework is a logical continuation of `tabulator`, extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `tabulator` to Frictionless Framework.
+> - we continue to bug-fix `tabulator@1.x` in this [repository](https://github.com/frictionlessdata/tabulator-py), and it remains available on [PyPi](https://pypi.org/project/tabulator/) as before
+> - please note that `frictionless@3.x` version's API, we're working on at the moment, is not stable
+> - we will release `frictionless@4.x` by the end of 2020 to be the first SemVer/stable version
+
+## Features
+
+- **Supports most common tabular formats**: CSV, XLS, ODS, JSON, Google Sheets, SQL, and others. See complete list [below](#supported-file-formats).
+- **Loads local and remote data**: Supports HTTP, FTP and S3.
+- **Low memory usage**: Only the current row is kept in memory, so you can
+  process large datasets.
+- **Supports compressed files**: Using ZIP or GZIP algorithms.
+- **Extensible**: You can add support for custom file formats and loaders (e.g.
+ FTP).
+
+## Contents
+
+<!--TOC-->
+
+ - [Getting started](#getting-started)
+ - [Installation](#installation)
+ - [Running on CLI](#running-on-cli)
+ - [Running on Python](#running-on-python)
+ - [Documentation](#documentation)
+ - [Working with Stream](#working-with-stream)
+ - [Supported schemes](#supported-schemes)
+ - [Supported file formats](#supported-file-formats)
+ - [Custom file sources and formats](#custom-file-sources-and-formats)
+ - [API Reference](#api-reference)
+ - [`cli`](#cli)
+ - [`Stream`](#stream)
+ - [`Loader`](#loader)
+ - [`Parser`](#parser)
+ - [`Writer`](#writer)
+ - [`validate`](#validate)
+ - [`TabulatorException`](#tabulatorexception)
+ - [`SourceError`](#sourceerror)
+ - [`SchemeError`](#schemeerror)
+ - [`FormatError`](#formaterror)
+ - [`EncodingError`](#encodingerror)
+ - [`CompressionError`](#compressionerror)
+ - [`IOError`](#ioerror)
+ - [`LoadingError`](#loadingerror)
+ - [`HTTPError`](#httperror)
+ - [Contributing](#contributing)
+ - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting started
+
+### Installation
+
+```bash
+$ pip install tabulator
+```
+
+### Running on CLI
+
+Tabulator ships with a simple CLI called `tabulator` to read tabular data. For
+example:
+
+```bash
+$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
+id,name
+1,english
+2,中国人
+```
+
+You can see all supported options by running `tabulator --help`.
+
+### Running on Python
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+ stream.headers # [header1, header2, ..]
+ for row in stream:
+ print(row) # [value1, value2, ..]
+```
+
+You can find other examples in the [examples][examples-dir] directory.
+
+## Documentation
+
+In the following sections, we'll walk through some usage examples of
+this library. All examples were tested with Python 3.6, but should
+run fine with Python 3.3+.
+
+### Working with Stream
+
+The `Stream` class represents a tabular stream. It takes the file path as the
+`source` argument. For example:
+
+```
+<scheme>://path/to/file.<format>
+```
+
+It uses this path to determine the file format (e.g. CSV or XLS) and scheme
+(e.g. HTTP or postgresql). It also supports format extraction from URLs like `http://example.com?format=csv`. If necessary, you can also define these explicitly.
+
+Let's try it out. First, we create a `Stream` object passing the path to a CSV file.
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+```
+
+At this point, the file hasn't been read yet. Let's open the stream so we can
+read the contents.
+
+```python
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+```
+
+This will open the underlying data stream, read a small sample to detect the
+file encoding, and prepare the data to be read. We catch
+`tabulator.TabulatorException` here, in case something goes wrong.
+
+We can now read the file contents. To iterate over each row, we do:
+
+```python
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+```
+
+The `stream.iter()` method will return each row's data as a list of values. If
+you prefer, you could call `stream.iter(keyed=True)` instead, which returns a
+dictionary with the column names as keys. Either way, this method keeps only a
+single row in memory at a time. This means it can handle large files
+without consuming too much memory.
+
+If you want to read the entire file, use `stream.read()`. It accepts the same
+arguments as `stream.iter()`, but returns all rows at once.
+
+```python
+stream.reset()
+rows = stream.read()
+```
+
+Notice that we called `stream.reset()` before reading the rows. This is because
+internally, tabulator only keeps a pointer to its current location in the file.
+If we didn't reset this pointer, we would read starting from where we stopped.
+For example, if we ran `stream.read()` again, we would get an empty list, as
+the internal file pointer is at the end of the file (because we've already read
+it all). Depending on the file location, it might be necessary to download the
+file again to rewind (e.g. when the file was loaded from the web).
+
+After we're done, close the stream with:
+
+```python
+stream.close()
+```
+
+The entire example looks like:
+
+```python
+import tabulator
+
+stream = tabulator.Stream('data.csv')
+try:
+ stream.open()
+except tabulator.TabulatorException as e:
+ pass # Handle exception
+
+for row in stream.iter():
+ print(row) # [value1, value2, ...]
+
+stream.reset() # Rewind internal file pointer
+rows = stream.read()
+
+stream.close()
+```
+
+It could be rewritten to use Python's context manager interface as:
+
+```python
+import tabulator
+
+try:
+ with tabulator.Stream('data.csv') as stream:
+ for row in stream.iter():
+ print(row)
+
+ stream.reset()
+ rows = stream.read()
+except tabulator.TabulatorException as e:
+ pass
+```
+
+This is the preferred way, as Python closes the stream automatically, even if some exception was thrown along the way.
+
+The full API documentation is available as docstrings in the [Stream source code][stream.py].
+
+#### Headers
+
+By default, tabulator considers that all file rows are values (i.e. there is no
+header).
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.headers # None
+ stream.read() # [['name', 'age'], ['Alex', 21]]
+```
+
+If you have a header row, you can use the `headers` argument with its row
+number (starting from 1).
+
+```python
+# Integer
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+You can also pass a list of strings to define the headers explicitly:
+
+```python
+with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
+ stream.headers # ['name', 'age']
+ stream.read() # [['Alex', 21]]
+```
+
+Tabulator also supports multiline headers for the `xls` and `xlsx` formats.
+
+```python
+with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
+ stream.headers # ['header from row 1-3']
+ stream.read() # [['value1', 'value2', 'value3']]
+```
+
+#### Encoding
+
+You can specify the file encoding (e.g. `utf-8` and `latin1`) via the `encoding`
+argument.
+
+```python
+with Stream(source, encoding='latin1') as stream:
+ stream.read()
+```
+
+If this argument isn't set, Tabulator will try to infer it from the data. If you
+get a `UnicodeDecodeError` while loading a file, try setting the encoding to
+`utf-8`.
+
+#### Compression (Python3-only)
+
+Tabulator supports both ZIP and GZIP compression methods. By default it'll infer from the file name:
+
+```python
+with Stream('http://example.com/data.csv.zip') as stream:
+ stream.read()
+```
+
+You can also set it explicitly:
+
+```python
+with Stream('data.csv.ext', compression='gz') as stream:
+ stream.read()
+```
+
+**Options**
+
+- **filename**: filename in zip file to process (default is first file)
+
+#### Allow html
+
+The `Stream` class raises `tabulator.exceptions.FormatError` if it detects HTML
+contents. This helps avoid the relatively common mistake of trying to load a
+CSV file inside an HTML page, for example on GitHub.
+
+You can disable this behaviour using the `allow_html` option:
+
+```python
+with Stream(source_with_html, allow_html=True) as stream:
+ stream.read() # no exception on open
+```
+
+#### Sample size
+
+To detect the file's headers, and run other checks like validating that the file
+doesn't contain HTML, Tabulator reads a sample of rows on the `stream.open()`
+method. This data is available via the `stream.sample` property. The number of
+rows used can be defined via the `sample_size` parameter (defaults to 100).
+
+```python
+with Stream(two_rows_source, sample_size=1) as stream:
+ stream.sample # only first row
+ stream.read() # first and second rows
+```
+
+You can disable this by setting `sample_size` to zero. This way, no data will be
+read on `stream.open()`.
+
+#### Bytes sample size
+
+Tabulator needs to read a part of the file to infer its encoding. The
+`bytes_sample_size` argument controls how many bytes will be read for this
+detection (defaults to 10000).
+
+```python
+source = 'data/special/latin1.csv'
+with Stream(source) as stream:
+ stream.encoding # 'iso8859-2'
+```
+
+You can disable this by setting `bytes_sample_size` to zero, in which case it'll
+use the machine locale's default encoding.
+
+#### Ignore blank headers
+
+When `True`, tabulator will ignore columns that have blank headers (defaults to
+`False`).
+
+```python
+# Default behaviour
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1) as stream:
+ stream.headers # ['header1', '', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}
+
+# Ignoring columns with blank headers
+source = 'text://header1,,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
+ stream.headers # ['header1', 'header3']
+ stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}
+```
+
+#### Ignore listed/not-listed headers
+
+This option is similar to `ignore_blank_headers`. It removes arbitrary columns from the data based on the corresponding column names:
+
+```python
+# Ignore listed headers (omit columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header1', 'header3']
+ assert stream.read(keyed=True) == [
+ {'header1': 'value1', 'header3': 'value3'},
+ ]
+
+# Ignore NOT listed headers (pick columns)
+source = 'text://header1,header2,header3\nvalue1,value2,value3'
+with Stream(source, format='csv', headers=1, ignore_not_listed_headers=['header2']) as stream:
+ assert stream.headers == ['header2']
+ assert stream.read(keyed=True) == [
+ {'header2': 'value2'},
+ ]
+```
+
+#### Force strings
+
+When `True`, all rows' values will be converted to strings (defaults to
+`False`). `None` values will be converted to empty strings.
+
+```python
+# Default behaviour
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]) as stream:
+    stream.read() # [['string', 1, datetime.datetime(2017, 12, 1, 17, 0)]]
+
+# Forcing rows' values as strings
+with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]], force_strings=True) as stream:
+    stream.read() # [['string', '1', '2017-12-01 17:00:00']]
+```
+
+#### Force parse
+
+When `True`, don't raise an exception when parsing a malformed row, but simply
+return an empty row. Otherwise, tabulator raises
+`tabulator.exceptions.SourceError` when a row can't be parsed. Defaults to `False`.
+
+```python
+# Default behaviour
+with Stream([[1], 'bad', [3]]) as stream:
+ stream.read() # raises tabulator.exceptions.SourceError
+
+# With force_parse
+with Stream([[1], 'bad', [3]], force_parse=True) as stream:
+ stream.read() # [[1], [], [3]]
+```
+
+#### Skip rows
+
+List of row numbers and/or strings to skip.
+If it's a string, all rows that begin with it will be skipped (e.g. '#' and '//').
+If it's the empty string, all rows that begin with an empty column will be skipped.
+
+```python
+source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
+with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
+ stream.read() # [['Mike', 4]]
+```
+
+If the `headers` parameter is also set to be an integer, it will use the first non-skipped row as the headers.
+
+```python
+source = [['#comment'], ['name', 'order'], ['John', 1], ['Alex', 2]]
+with Stream(source, headers=1, skip_rows=['#']) as stream:
+    stream.headers # ['name', 'order']
+    stream.read() # [['John', 1], ['Alex', 2]]
+```
+
+#### Post parse
+
+List of functions that can filter or transform rows after they are parsed. These
+functions receive the `extended_rows` containing the row's number, headers
+list, and the row values list. They then process the rows, and yield or discard
+them, modified or not.
+
+```python
+def skip_odd_rows(extended_rows):
+ for row_number, headers, row in extended_rows:
+ if not row_number % 2:
+ yield (row_number, headers, row)
+
+def multiply_by_two(extended_rows):
+ for row_number, headers, row in extended_rows:
+ doubled_row = list(map(lambda value: value * 2, row))
+ yield (row_number, headers, doubled_row)
+
+rows = [
+ [1],
+ [2],
+ [3],
+ [4],
+]
+with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
+ stream.read() # [[4], [8]]
+```
+
+These functions are applied in order, as a simple data pipeline. In the example
+above, `multiply_by_two` just sees the rows yielded by `skip_odd_rows`.
+
+#### Keyed and extended rows
+
+The methods `stream.iter()` and `stream.read()` accept the `keyed` and
+`extended` flag arguments to modify how the rows are returned.
+
+By default, every row is returned as a list of its cell values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.read() # [['Alex', 21]]
+```
+
+With `keyed=True`, the rows are returned as dictionaries, mapping the column names to their values in the row:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+ stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]
+```
+
+And with `extended=True`, the rows are returned as tuples of `(row_number,
+headers, row)`, where `row_number` is the current row number (starting from 1),
+`headers` is a list with the header names, and `row` is a list with the row
+values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+    stream.read(extended=True) # [(1, ['name', 'age'], ['Alex', 21])]
+```
+
+### Supported schemes
+
+#### s3
+
+It loads data from AWS S3. For private files you should provide credentials supported by the `boto3` library, for example, corresponding environment variables. Read more about [configuring `boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html).
+
+```python
+stream = Stream('s3://bucket/data.csv')
+```
+
+**Options**
+
+- **s3\_endpoint\_url** - the endpoint URL to use. By default it's `https://s3.amazonaws.com`. For complex use cases (for example, when `goodtables` runs on a data package), this option can also be provided via the `S3_ENDPOINT_URL` environment variable.
+
+#### file
+
+The default scheme, a file in the local filesystem.
+
+```python
+stream = Stream('data.csv')
+```
+
+#### http/https/ftp/ftps
+
+> In Python 2, `tabulator` can't stream remote data sources because of a limitation in the underlying libraries. The whole data source will be loaded into memory. In Python 3 there is no such problem and remote files are streamed.
+
+```python
+stream = Stream('https://example.com/data.csv')
+```
+
+**Options**
+
+- **http\_session** - a `requests.Session` object. Read more in the [requests docs][requests-session].
+- **http\_stream** - Enables or disables HTTP streaming, when possible (enabled by default). Disable it if you'd like to preload the whole file into memory.
+- **http\_timeout** - This timeout will be used for a `requests` session construction.
+
+#### stream
+
+The source is a file-like Python object.
+
+
+```python
+with open('data.csv') as fp:
+ stream = Stream(fp)
+```
+
+#### text
+
+The source is a string containing the tabular data. Both `scheme` and `format`
+must be set explicitly, as it's not possible to infer them.
+
+```python
+stream = Stream(
+ 'name,age\nJohn, 21\n',
+ scheme='text',
+ format='csv'
+)
+```
+
+### Supported file formats
+
+In this section, we'll describe the supported file formats, and their respective
+configuration options and operations. Some formats only support read operations,
+while others support both reading and writing.
+
+#### csv (read & write)
+
+```python
+stream = Stream('data.csv', delimiter=',')
+```
+
+**Options**
+
+It supports all options from the Python CSV library. Check [their
+documentation][pydoc-csv] for more information.
+
+#### xls/xlsx (read & write)
+
+> Tabulator is unable to stream `xls` files, so the entire file is loaded in
+> memory. Streaming is supported for `xlsx` files.
+
+```python
+stream = Stream('data.xls', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1).
+- **workbook_cache**: An empty dictionary to handle workbook caching. If provided, `tabulator` will fill the dictionary with `source: tmpfile_path` pairs for remote workbooks. Each workbook will be downloaded only once and all the temporary files will be deleted on process exit. Defaults to `None`.
+- **fill_merged_cells**: if `True` it will unmerge and fill all merged cells with
+  the visible value. With this option enabled the parser can't stream data and
+  loads the whole document into memory.
+- **preserve_formatting**: if `True` it will try to preserve the text formatting of numeric and temporal cells, returning them as strings that look the way they do in the spreadsheet (EXPERIMENTAL)
+- **adjust_floating_point_error**: if `True` it will correct the Excel behaviour regarding floating point numbers
+
+#### ods (read only)
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `ods` extra: `$ pip install tabulator[ods]`
+
+The source should be a valid OpenDocument (ODS) spreadsheet.
+
+```python
+stream = Stream('data.ods', sheet=1)
+```
+
+**Options**
+
+- **sheet**: Sheet name or number (starting from 1)
+
+#### gsheet (read only)
+
+A publicly-accessible Google Spreadsheet.
+
+```python
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>?usp=sharing')
+stream = Stream('https://docs.google.com/spreadsheets/d/<id>/edit#gid=<gid>')
+```
+
+#### sql (read & write)
+
+Any database URL supported by [sqlalchemy][sqlalchemy].
+
+```python
+stream = Stream('postgresql://name:pass@host:5432/database', table='data')
+```
+
+**Options**
+
+- **table (required)**: Database table name
+- **order_by**: SQL expression for row ordering (e.g. `name DESC`)
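+
+Writing also goes through `stream.save()`; a sketch assuming a local SQLite database, with the `sql` format given explicitly and `table` passed through to the writer:
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+    stream.save('sqlite:///data.db', format='sql', table='data')
+```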
+
+#### Data Package (read only)
+
+> This format is not included in the package by default. You can enable it by
+> installing tabulator using `pip install tabulator[datapackage]`.
+
+A [Tabular Data Package][tdp].
+
+```python
+stream = Stream('datapackage.json', resource=1)
+```
+
+**Options**
+
+- **resource**: Resource name or index (starting from 0)
+
+#### inline (read only)
+
+Either a list of lists, or a list of dicts mapping the column names to their
+respective values.
+
+```python
+stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
+stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])
+```
+
+#### json (read & write)
+
+JSON document containing a list of lists, or a list of dicts mapping the column
+names to their respective values (see the `inline` format for an example).
+
+```python
+stream = Stream('data.json', property='key1.key2')
+```
+
+**Options**
+
+- **property**: JSON Path to the property containing the tabular data. For example, considering the JSON `{"response": {"data": [...]}}`, the `property` should be set to `response.data`.
+- **keyed** (write): Save as array of arrays (default) or as array of dicts (keyed).
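+
+For example, writing a keyed JSON document (a sketch; the format is inferred from the `.json` extension):
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+    stream.save('data.json', keyed=True)
+```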
+
+#### ndjson (read only)
+
+```python
+stream = Stream('data.ndjson')
+```
+
+#### tsv (read only)
+
+```python
+stream = Stream('data.tsv')
+```
+
+#### html (read only)
+
+
+> This format is not included in the package by default. To use it, install `tabulator` with the `html` extra: `$ pip install tabulator[html]`
+
+An HTML table element residing inside an HTML document.
+
+Supports simple tables (no merged cells) with any legal combination of the `td`, `th`, `tbody` and `thead` elements.
+
+Usually `format='html'` needs to be specified explicitly, as web URLs don't always use the `.html` extension.
+
+```python
+stream = Stream('http://example.com/some/page.aspx',
+                format='html', selector='.content .data table#id1', raw_html=True)
+```
+
+**Options**
+
+- **selector**: CSS selector for specifying which `table` element to extract. By default it's `table`, which takes the first `table` element in the document. If empty, the entire page is assumed to be the table to be extracted (useful with some Excel formats).
+
+- **raw_html**: False (default) to extract the textual contents of each cell. True to return the inner html without modification.
+
+### Custom file sources and formats
+
+Tabulator is written with extensibility in mind, allowing you to add support for
+new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There
+are three components that allow this:
+
+* Loaders
+ * Loads a stream from some location (e.g. ssh)
+* Parsers
+ * Parses a stream of tabular data in some format (e.g. xls)
+* Writers
+ * Writes tabular data to some destination (e.g. MongoDB)
+
+In this section, we'll see how to write custom classes to extend any of these components.
+
+#### Custom loaders
+
+You can add support for a new scheme (e.g. ssh) by creating a custom loader.
+Custom loaders are implemented by inheriting from the `Loader` class, and
+implementing its methods. This loader can then be used by `Stream` to load data
+by passing it via the `custom_loaders={'scheme': CustomLoader}` argument.
+
+The skeleton of a custom loader looks like:
+
+```python
+from tabulator import Loader, Stream
+
+class CustomLoader(Loader):
+    options = []
+
+    def __init__(self, bytes_sample_size, **options):
+        pass
+
+    def load(self, source, mode='t', encoding=None):
+        # load logic goes here
+        pass
+
+with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
+    stream.read()
+```
+
+You can see examples of how the loaders are implemented by looking in the
+`tabulator.loaders` module.
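+
+For illustration, here is a minimal, hypothetical loader that serves sources from an in-memory dictionary. It is a sketch following the skeleton above, not part of the library:
+
+```python
+import io
+from tabulator import Loader, Stream
+
+SOURCES = {'memory://data': 'name,age\nJohn,21\n'}
+
+class MemoryLoader(Loader):
+    options = []
+
+    def __init__(self, bytes_sample_size, **options):
+        pass
+
+    def load(self, source, mode='t', encoding=None):
+        # Look the source up in the in-memory registry
+        text = SOURCES[source]
+        if mode == 'b':
+            return io.BytesIO(text.encode(encoding or 'utf-8'))
+        return io.StringIO(text)
+
+# The format is given explicitly as there is no file extension to infer it from
+with Stream('memory://data', format='csv', headers=1,
+            custom_loaders={'memory': MemoryLoader}) as stream:
+    print(stream.read(keyed=True))  # e.g. [{'name': 'John', 'age': '21'}]
+```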
+
+#### Custom parsers
+
+You can add support for a new file format by creating a custom parser. Similarly
+to custom loaders, custom parsers are implemented by inheriting from the
+`Parser` class, and implementing its methods. This parser can then be used by
+`Stream` to parse data by passing it via the `custom_parsers={'format':
+CustomParser}` argument.
+
+The skeleton of a custom parser looks like:
+
+```python
+from tabulator import Parser, Stream
+
+class CustomParser(Parser):
+    options = []
+
+    def __init__(self, loader, force_parse, **options):
+        self.__loader = loader
+
+    def open(self, source, encoding=None):
+        # open logic goes here
+        pass
+
+    def close(self):
+        # close logic goes here
+        pass
+
+    def reset(self):
+        # reset logic goes here
+        pass
+
+    @property
+    def closed(self):
+        return False
+
+    @property
+    def extended_rows(self):
+        # extended rows logic goes here
+        pass
+
+with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
+    stream.read()
+```
+
+You can see examples of how parsers are implemented by looking in the
+`tabulator.parsers` module.
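+
+As an illustration, here is a minimal, hypothetical parser for a `kv` format of `key=value` lines. It is a sketch following the skeleton above and the caching pattern used by the built-in parsers, not part of the library:
+
+```python
+from tabulator import Parser, Stream
+
+class KVParser(Parser):
+    options = []
+
+    def __init__(self, loader, force_parse=False, **options):
+        self.__loader = loader
+        self.__chars = None
+        self.__extended_rows = None
+
+    def open(self, source, encoding=None):
+        self.close()
+        self.__chars = self.__loader.load(source, encoding=encoding)
+        self.reset()
+
+    def close(self):
+        if not self.closed:
+            self.__chars.close()
+
+    def reset(self):
+        self.__chars.seek(0)
+        self.__extended_rows = self.__iter_extended_rows()
+
+    @property
+    def closed(self):
+        return self.__chars is None or self.__chars.closed
+
+    @property
+    def extended_rows(self):
+        return self.__extended_rows
+
+    def __iter_extended_rows(self):
+        # One row per "key=value" line: two cells, no headers
+        for number, line in enumerate(self.__chars, start=1):
+            cells = [part.strip() for part in line.split('=', 1)]
+            yield (number, None, cells)
+
+with Stream('data.kv', custom_parsers={'kv': KVParser}) as stream:
+    print(stream.read())
+```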
+
+#### Custom writers
+
+You can add support to write files in a specific format by creating a custom
+writer. The custom writers are implemented by inheriting from the base `Writer`
+class, and implementing its methods. This writer can then be used by `Stream` to
+write data via the `custom_writers={'format': CustomWriter}` argument.
+
+The skeleton of a custom writer looks like:
+
+```python
+from tabulator import Stream, Writer
+
+class CustomWriter(Writer):
+    options = []
+
+    def __init__(self, **options):
+        pass
+
+    def write(self, source, target, headers=None, encoding=None):
+        # write logic goes here
+        pass
+
+with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
+    stream.save(target)
+```
+
+You can see examples of how writers are implemented by looking in the
+`tabulator.writers` module.
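+
+For illustration, here is a minimal, hypothetical tab-separated writer. It is a sketch assuming that `source` arrives as an iterable of rows, which is what `stream.save()` provides:
+
+```python
+from tabulator import Stream, Writer
+
+class TxtWriter(Writer):
+    options = []
+
+    def __init__(self, **options):
+        pass
+
+    def write(self, source, target, headers=None, encoding=None):
+        # Write a header line, then one tab-separated line per row
+        count = 0
+        with open(target, 'w', encoding=encoding or 'utf-8') as file:
+            if headers:
+                file.write('\t'.join(headers) + '\n')
+            for row in source:
+                file.write('\t'.join(str(value) for value in row) + '\n')
+                count += 1
+        return count
+
+with Stream('data.csv', headers=1, custom_writers={'txt': TxtWriter}) as stream:
+    stream.save('data.txt', format='txt')
+```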
+
+## API Reference
+
+### `cli`
+```python
+cli(source, limit, **options)
+```
+Command-line interface
+
+```
+Usage: tabulator [OPTIONS] SOURCE
+
+Options:
+ --headers INTEGER
+ --scheme TEXT
+ --format TEXT
+ --encoding TEXT
+ --limit INTEGER
+ --sheet TEXT/INTEGER (excel)
+ --fill-merged-cells BOOLEAN (excel)
+ --preserve-formatting BOOLEAN (excel)
+ --adjust-floating-point-error BOOLEAN (excel)
+ --table TEXT (sql)
+ --order_by TEXT (sql)
+ --resource TEXT/INTEGER (datapackage)
+ --property TEXT (json)
+ --keyed BOOLEAN (json)
+ --version Show the version and exit.
+ --help Show this message and exit.
+```
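+
+For example, to print the first two rows of a local file, treating the first row as headers (the file name is hypothetical):
+
+```bash
+$ tabulator data.csv --headers 1 --limit 2
+```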
+
+
+### `Stream`
+```python
+Stream(self,
+ source,
+ headers=None,
+ scheme=None,
+ format=None,
+ encoding=None,
+ compression=None,
+ allow_html=False,
+ sample_size=100,
+ bytes_sample_size=10000,
+ ignore_blank_headers=False,
+ ignore_listed_headers=None,
+ ignore_not_listed_headers=None,
+ multiline_headers_joiner=' ',
+ multiline_headers_duplicates=False,
+ force_strings=False,
+ force_parse=False,
+ pick_rows=None,
+ skip_rows=None,
+ pick_fields=None,
+ skip_fields=None,
+ pick_columns=None,
+ skip_columns=None,
+ post_parse=[],
+ custom_loaders={},
+ custom_parsers={},
+ custom_writers={},
+ **options)
+```
+Stream of tabular data.
+
+This is the main `tabulator` class. It loads a data source, and allows you
+to stream its parsed contents.
+
+__Arguments__
+
+
+ source (str):
+ Path to file as ``<scheme>://path/to/file.<format>``.
+ If not explicitly set, the scheme (file, http, ...) and
+ format (csv, xls, ...) are inferred from the source string.
+
+    headers (Union[int, List[int], List[str]], optional):
+        Either a row number or a list of row numbers (in case of multi-line
+        headers) to be considered as headers (rows start counting at 1),
+        or the actual headers defined as a list of strings. If not set,
+        all rows will be treated as containing values.
+
+ scheme (str, optional):
+ Scheme for loading the file (file, http, ...).
+ If not set, it'll be inferred from `source`.
+
+    format (str, optional):
+        File source's format (csv, xls, ...). If not
+        set, it'll be inferred from `source`.
+
+ encoding (str, optional):
+ Source encoding. If not set, it'll be inferred.
+
+ compression (str, optional):
+ Source file compression (zip, ...). If not set, it'll be inferred.
+
+    pick_rows (List[Union[int, str, dict]], optional):
+        The same as `skip_rows`, but for picking rows instead of skipping them.
+
+    skip_rows (List[Union[int, str, dict]], optional):
+        List of row numbers, strings, and regex patterns (as dicts) to skip.
+        If a string is given, rows whose first cell starts with it are skipped, e.g. '#' and '//'.
+        To skip only completely blank rows use `{'type': 'preset', 'value': 'blank'}`.
+        To provide a regex pattern use `{'type': 'regex', 'value': '^#'}`.
+        For example: `skip_rows=[1, '# comment', {'type': 'regex', 'value': '^# (regex|comment)'}]`
+
+    pick_fields (str[]):
+        When passed, ignores all columns with headers
+        that the given list DOES NOT include.
+
+    skip_fields (str[]):
+        When passed, ignores all columns with headers
+        that the given list includes. If it contains an empty string, it
+        will skip columns with empty headers.
+
+ sample_size (int, optional):
+ Controls the number of sample rows used to
+ infer properties from the data (headers, encoding, etc.). Set to
+ ``0`` to disable sampling, in which case nothing will be inferred
+ from the data. Defaults to ``config.DEFAULT_SAMPLE_SIZE``.
+
+ bytes_sample_size (int, optional):
+ Same as `sample_size`, but instead
+ of number of rows, controls number of bytes. Defaults to
+ ``config.DEFAULT_BYTES_SAMPLE_SIZE``.
+
+ allow_html (bool, optional):
+ Allow the file source to be an HTML page.
+ If False, raises ``exceptions.FormatError`` if the loaded file is
+ an HTML page. Defaults to False.
+
+    multiline_headers_joiner (str, optional):
+        When passed, it's used to join multiline headers
+        as `<passed-value>.join(header1_1, header1_2)`.
+        Defaults to ' ' (space).
+
+    multiline_headers_duplicates (bool, optional):
+        By default tabulator will exclude a cell of a multiline header from joining
+        if it's exactly the same as the previously seen value in this field.
+        Enabling this option forces the inclusion of duplicates.
+        Defaults to False.
+
+ force_strings (bool, optional):
+ When True, casts all data to strings.
+ Defaults to False.
+
+    force_parse (bool, optional):
+        When True, don't raise exceptions when parsing
+        malformed rows; simply return an empty row instead. Defaults
+        to False.
+
+    post_parse (List[function], optional):
+        List of generator functions that
+        receive the extended rows, process them, and yield
+        them (or not). Useful to pre-process the data. Defaults to None.
+
+ custom_loaders (dict, optional):
+ Dictionary with keys as scheme names,
+ and values as their respective ``Loader`` class implementations.
+ Defaults to None.
+
+ custom_parsers (dict, optional):
+ Dictionary with keys as format names,
+ and values as their respective ``Parser`` class implementations.
+ Defaults to None.
+
+    custom_writers (dict, optional):
+        Dictionary with keys as writer format
+        names, and values as their respective ``Writer`` class
+        implementations. Defaults to None.
+
+ **options (Any, optional): Extra options passed to the loaders and parsers.
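+
+As a usage sketch, a `post_parse` processor receives the extended-rows generator and yields (possibly modified) extended rows (the file name is hypothetical):
+
+```python
+from tabulator import Stream
+
+def uppercase_first_cell(extended_rows):
+    # Upper-case the first cell of every non-empty row
+    for number, headers, row in extended_rows:
+        if row:
+            row[0] = str(row[0]).upper()
+        yield (number, headers, row)
+
+with Stream('data.csv', headers=1, skip_rows=['#'],
+            post_parse=[uppercase_first_cell]) as stream:
+    print(stream.headers)
+    print(stream.read(limit=5))
+```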
+
+
+
+#### `stream.closed`
+Returns True if the underlying stream is closed, False otherwise.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `stream.compression`
+Stream's compression ("no" if no compression)
+
+__Returns__
+
+`str`: compression
+
+
+
+#### `stream.encoding`
+Stream's encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `stream.format`
+Path's format
+
+__Returns__
+
+`str`: format
+
+
+
+#### `stream.fragment`
+Path's fragment
+
+__Returns__
+
+`str`: fragment
+
+
+
+#### `stream.hash`
+Returns the SHA256 hash of the read chunks if available
+
+__Returns__
+
+`str/None`: SHA256 hash
+
+
+
+#### `stream.headers`
+Headers
+
+__Returns__
+
+`str[]/None`: headers if available
+
+
+
+#### `stream.sample`
+Returns the stream's rows used as sample.
+
+These sample rows are used internally to infer characteristics of the
+source file (e.g. encoding, headers, ...).
+
+__Returns__
+
+`list[]`: sample
+
+
+
+#### `stream.scheme`
+Path's scheme
+
+__Returns__
+
+`str`: scheme
+
+
+
+#### `stream.size`
+Returns the byte count of the read chunks if available
+
+__Returns__
+
+`int/None`: byte count
+
+
+
+#### `stream.source`
+Source
+
+__Returns__
+
+`any`: stream source
+
+
+
+#### `stream.open`
+```python
+stream.open()
+```
+Opens the stream for reading.
+
+__Raises__
+- `exceptions.TabulatorException`: If an error occurs while opening the stream.
+
+
+
+#### `stream.close`
+```python
+stream.close()
+```
+Closes the stream.
+
+
+#### `stream.reset`
+```python
+stream.reset()
+```
+Resets the stream pointer to the beginning of the file.
+
+
+#### `stream.iter`
+```python
+stream.iter(keyed=False, extended=False)
+```
+Iterate over the rows.
+
+Each row is returned in a format that depends on the arguments `keyed`
+and `extended`. By default, each row is returned as a list of its
+values.
+
+__Arguments__
+- __keyed (bool, optional)__:
+ When True, each returned row will be a
+ `dict` mapping the header name to its value in the current row.
+ For example, `[{'name': 'J Smith', 'value': '10'}]`. Ignored if
+ ``extended`` is True. Defaults to False.
+- __extended (bool, optional)__:
+ When True, returns each row as a tuple
+ with row number (starts at 1), list of headers, and list of row
+ values. For example, `(1, ['name', 'value'], ['J Smith', '10'])`.
+ Defaults to False.
+
+__Raises__
+- `exceptions.TabulatorException`: If the stream is closed.
+
+__Returns__
+
+`Iterator[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The row itself. The format depends on the values of `keyed` and
+ `extended` arguments.
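+
+A usage sketch (the file name is hypothetical):
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv', headers=1) as stream:
+    for row in stream.iter(keyed=True):
+        print(row)  # e.g. {'name': 'J Smith', 'value': '10'}
+```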
+
+
+
+#### `stream.read`
+```python
+stream.read(keyed=False, extended=False, limit=None)
+```
+Returns a list of rows.
+
+__Arguments__
+- __keyed (bool, optional)__: See :func:`Stream.iter`.
+- __extended (bool, optional)__: See :func:`Stream.iter`.
+- __limit (int, optional)__:
+ Number of rows to return. If None, returns all rows. Defaults to None.
+
+__Returns__
+
+`List[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]`:
+ The list of rows. The format depends on the values of `keyed`
+ and `extended` arguments.
+
+
+#### `stream.save`
+```python
+stream.save(target, format=None, encoding=None, **options)
+```
+Save stream to the local filesystem.
+
+__Arguments__
+- __target (str)__: Path where to save the stream.
+- __format (str, optional)__:
+ The format the stream will be saved as. If
+ None, detects from the ``target`` path. Defaults to None.
+- __encoding (str, optional)__:
+ Saved file encoding. Defaults to ``config.DEFAULT_ENCODING``.
+- __**options__: Extra options passed to the writer.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
+
+### `Loader`
+```python
+Loader(self, bytes_sample_size, **options)
+```
+Abstract class implemented by the data loaders
+
+The loaders inherit and implement this class' methods to add support for a
+new scheme (e.g. ssh).
+
+__Arguments__
+- __bytes_sample_size (int)__: Sample size in bytes
+- __**options (dict)__: Loader options
+
+
+
+#### `loader.options`
+
+
+#### `loader.load`
+```python
+loader.load(source, mode='t', encoding=None)
+```
+Load source file.
+
+__Arguments__
+- __source (str)__: Path to tabular source file.
+- __mode (str, optional)__:
+ Text stream mode, `t` (text) or `b` (binary). Defaults to `t`.
+- __encoding (str, optional)__:
+ Source encoding. Auto-detect by default.
+
+__Returns__
+
+`Union[TextIO, BinaryIO]`: I/O stream opened either as text or binary.
+
+
+### `Parser`
+```python
+Parser(self, loader, force_parse, **options)
+```
+Abstract class implemented by the data parsers.
+
+The parsers inherit and implement this class' methods to add support for a
+new file type.
+
+__Arguments__
+- __loader (tabulator.Loader)__: Loader instance to read the file.
+- __force_parse (bool)__:
+ When `True`, the parser yields an empty extended
+ row tuple `(row_number, None, [])` when there is an error parsing a
+ row. Otherwise, it stops the iteration by raising the exception
+ `tabulator.exceptions.SourceError`.
+- __**options (dict)__: Loader options
+
+
+
+#### `parser.closed`
+Flag telling if the parser is closed.
+
+__Returns__
+
+`bool`: whether closed
+
+
+
+#### `parser.encoding`
+Encoding
+
+__Returns__
+
+`str`: encoding
+
+
+
+#### `parser.extended_rows`
+Returns extended rows iterator.
+
+The extended rows are tuples containing `(row_number, headers, row)`.
+
+__Raises__
+- `SourceError`:
+ If `force_parse` is `False` and
+ a row can't be parsed, this exception will be raised.
+ Otherwise, an empty extended row is returned (i.e.
+ `(row_number, None, [])`).
+
+__Returns__
+
+`Iterator[Tuple[int, List[str], List[Any]]]`:
+    Extended rows containing `(row_number, headers, row)`, where
+    `headers` is a list of the header names (can be `None`), and
+    `row` is a list of row values.
+
+
+
+#### `parser.options`
+
+
+#### `parser.open`
+```python
+parser.open(source, encoding=None)
+```
+Opens the underlying file stream at the beginning of the file.
+
+The parser gets a byte or text stream from the `tabulator.Loader`
+instance and starts emitting items.
+
+__Arguments__
+- __source (str)__: Path to source table.
+- __encoding (str, optional)__: Source encoding. Auto-detect by default.
+
+__Returns__
+
+ None
+
+
+
+#### `parser.close`
+```python
+parser.close()
+```
+Closes underlying file stream.
+
+
+#### `parser.reset`
+```python
+parser.reset()
+```
+Resets underlying stream and current items list.
+
+After `reset()` is called, iterating over the items will start from the beginning.
+
+### `Writer`
+```python
+Writer(self, **options)
+```
+Abstract class implemented by the data writers.
+
+The writers inherit and implement this class' methods to add support for a
+new file destination.
+
+__Arguments__
+- __**options (dict)__: Writer options.
+
+
+
+#### `writer.options`
+
+
+#### `writer.write`
+```python
+writer.write(source, target, headers, encoding=None)
+```
+Writes source data to target.
+
+__Arguments__
+- __source (str)__: Source data.
+- __target (str)__: Write target.
+- __headers (List[str])__: List of header names.
+- __encoding (str, optional)__: Source file encoding.
+
+__Returns__
+
+`count (int?)`: Written rows count if available
+
+
+### `validate`
+```python
+validate(source, scheme=None, format=None)
+```
+Check if tabulator is able to load the source.
+
+__Arguments__
+- __source (Union[str, IO])__: The source path or IO object.
+- __scheme (str, optional)__: The source scheme. Auto-detect by default.
+- __format (str, optional)__: The source file format. Auto-detect by default.
+
+__Raises__
+- `SchemeError`: The file scheme is not supported.
+- `FormatError`: The file format is not supported.
+
+__Returns__
+
+`bool`: Whether tabulator is able to load the source file.
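+
+A usage sketch:
+
+```python
+from tabulator import validate, exceptions
+
+try:
+    validate('data.csv')
+    print('tabulator can load this source')
+except (exceptions.SchemeError, exceptions.FormatError) as error:
+    print('not supported: %s' % error)
+```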
+
+
+### `TabulatorException`
+```python
+TabulatorException()
+```
+Base class for all tabulator exceptions.
+
+
+### `SourceError`
+```python
+SourceError()
+```
+The source file could not be parsed correctly.
+
+
+### `SchemeError`
+```python
+SchemeError()
+```
+The file scheme is not supported.
+
+
+### `FormatError`
+```python
+FormatError()
+```
+The file format is unsupported or invalid.
+
+
+### `EncodingError`
+```python
+EncodingError()
+```
+Encoding error
+
+
+### `CompressionError`
+```python
+CompressionError()
+```
+Compression error
+
+
+### `IOError`
+```python
+IOError()
+```
+Local loading error
+
+
+### `LoadingError`
+```python
+LoadingError()
+```
+Loading error
+
+
+### `HTTPError`
+```python
+HTTPError()
+```
+Remote loading error
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+To run tests without Internet:
+
+```bash
+$ pytest -m 'not remote'
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/tabulator-py/commits/master).
+
+#### v1.53
+
+- Add support for raw_html extraction in html parser (#341)
+
+#### v1.52
+
+- Published stream.dialect (works only for csv, for now)
+
+#### v1.51
+
+- Added experimental table discovery options
+
+#### v1.50
+
+- Ensure that headers is always a list of strings
+
+#### v1.49
+
+- Added `workbook_cache` argument for XLSX formats
+
+#### v1.48
+
+- Published `stream.hashing_algorithm` property
+
+#### v1.47
+
+- Added `hashing_algorithm` parameter
+
+#### v1.46
+
+- Fixed multiline headers merging
+- Introduced a `multiline_headers_duplicates` flag
+
+#### v1.45
+
+- HTML format: adds support for empty selector (#321)
+
+#### v1.44
+
+- Exposed `stream.compression`
+
+#### v1.43
+
+- Exposed `stream.source`
+
+#### v1.42
+
+- Exposed format option to the CLI
+
+#### v1.41
+
+- Implemented a `pick_rows` parameter (opposite to `skip_rows`)
+
+#### v1.40
+
+- Implemented `stream.save()` returning count of written rows
+
+#### v1.39
+
+- Implemented JSON writer (#311)
+
+#### v1.38
+
+- Use `chardet` for encoding detection by default. For `cchardet`: `pip install tabulator[cchardet]`. Due to a great deal of problems caused by `cchardet` on non-Linux/Conda installations, we're returning to `chardet` by default.
+
+#### v1.37
+
+- Raise IOError/HTTPError even if a non-existent file has a bad format (#304)
+
+#### v1.36
+
+- Implemented `blank` preset for `skip_rows` (#302)
+
+#### v1.35
+
+- Added `skip/pick_columns` aliases (#293)
+
+#### v1.34
+
+- Added `multiline_headers_joiner` argument (#291)
+
+#### v1.33
+
+- Added support for regex patterns in `skip_rows` (#290)
+
+#### v1.32
+
+- Added ability to skip columns (#293)
+
+#### v1.31
+
+- Added `xlsx` writer
+- Added `html` reader
+
+#### v1.30
+
+- Added `adjust_floating_point_error` parameter to the `xlsx` parser
+
+#### v1.29
+
+- Implemented the `stream.size` and `stream.hash` properties
+
+#### v1.28
+
+- Added SQL writer
+
+#### v1.27
+
+- Added `http_timeout` argument for the `http/https` format
+
+#### v1.26
+
+- Added `stream.fragment` field showing e.g. Excel sheet's or DP resource's name
+
+#### v1.25
+
+- Added support for the `s3` file scheme (data loading from AWS S3)
+
+#### v1.24
+
+- Added support for compressed file-like objects
+
+#### v1.23
+
+- Added a setter for the `stream.headers` property
+
+#### v1.22
+
+- The `headers` parameter will now use the first non-skipped row if the `skip_rows` parameter is provided and there are comments at the top of a data source (see #264)
+
+#### v1.21
+
+- Implemented experimental `preserve_formatting` for xlsx
+
+#### v1.20
+
+- Added support for specifying filename in zip source
+
+#### v1.19
+
+Updated behaviour:
+- For `ods` format the boolean, integer and datetime native types are detected now
+
+#### v1.18
+
+Updated behaviour:
+- For `xls` format the boolean, integer and datetime native types are detected now
+
+#### v1.17
+
+Updated behaviour:
+- Added support for Python 3.7
+
+#### v1.16
+
+New API added:
+- `skip_rows` support for an empty string to skip rows with an empty first column
+
+#### v1.15
+
+New API added:
+- Format will be extracted from URLs like `http://example.com?format=csv`
+
+#### v1.14
+
+Updated behaviour:
+- Now `xls` booleans will be parsed as booleans not integers
+
+#### v1.13
+
+New API added:
+- The `skip_rows` argument now supports negative numbers to skip rows starting from the end
+
+#### v1.12
+
+Updated behaviour:
+- Instead of raising an exception, a `UserWarning` warning will be emitted if an option isn't recognized.
+
+#### v1.11
+
+New API added:
+- Added `http_session` argument for the `http/https` format (it uses `requests` now)
+- Added support for multiline headers: the `headers` argument accepts ranges like `[1,3]`
+
+#### v1.10
+
+New API added:
+- Added support for compressed files, i.e. `zip` and `gz`, on Python 3
+- The `Stream` constructor now accepts a `compression` argument
+- The `http/https` scheme now accepts a `http_stream` flag
+
+#### v1.9
+
+Improved behaviour:
+- The `headers` argument allows setting the order for keyed sources and cherry-picking values
+
+#### v1.8
+
+New API added:
+- Formats `XLS/XLSX/ODS` support sheet names passed via the `sheet` argument
+- The `Stream` constructor accepts an `ignore_blank_headers` option
+
+#### v1.7
+
+Improved behaviour:
+- Rebased `datapackage` format on `datapackage@1` library
+
+#### v1.6
+
+New API added:
+- Argument `source` for the `Stream` constructor can be a `pathlib.Path`
+
+#### v1.5
+
+New API added:
+- Argument `bytes_sample_size` for the `Stream` constructor
+
+#### v1.4
+
+Improved behaviour:
+- Updated encoding name to a canonical form
+
+#### v1.3
+
+New API added:
+- `stream.scheme`
+- `stream.format`
+- `stream.encoding`
+
+Promoted provisional API to stable API:
+- `Loader` (custom loaders)
+- `Parser` (custom parsers)
+- `Writer` (custom writers)
+- `validate`
+
+#### v1.2
+
+Improved behaviour:
+- Autodetect common CSV delimiters
+
+#### v1.1
+
+New API added:
+- Added `fill_merged_cells` option to `xls/xlsx` formats
+
+#### v1.0
+
+New API added:
+- published `Loader/Parser/Writer` API
+- Added `Stream` argument `force_strings`
+- Added `Stream` argument `force_parse`
+- Added `Stream` argument `custom_writers`
+
+Deprecated API removal:
+- removed `topen` and `Table` - use `Stream` instead
+- removed `Stream` arguments `loader/parser_options` - use `**options` instead
+
+Provisional API changed:
+- Updated the `Loader/Parser/Writer` API - please use an updated version
+
+#### v0.15
+
+Provisional API added:
+- Unofficial support for `Stream` arguments `custom_loaders/parsers`
+
+
+[stream.py]: tabulator/stream.py
+[examples-dir]: examples "Examples"
+[requests-session]: https://docs.python-requests.org/en/master/user/advanced/#session-objects
+[pydoc-csv]: https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters "Python CSV options"
+[sqlalchemy]: https://www.sqlalchemy.org/
+[tdp]: https://frictionlessdata.io/specs/tabular-data-package/ "Tabular Data Package"
+[tabulator.exceptions]: tabulator/exceptions.py "Tabulator Exceptions"
+
+
+
+%prep
+%autosetup -n tabulator-1.53.5
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-tabulator -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 1.53.5-1
+- Package Spec generated