%global _empty_manifest_terminate_build 0
Name: python-goodtables
Version: 2.5.4
Release: 1
Summary: Goodtables is a framework to inspect tabular data.
License: MIT
URL: https://github.com/frictionlessdata/goodtables
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/93/25/c691e85a93d0411ef50903a56ff26ffcc784f392e2765d2555548a0dd1a0/goodtables-2.5.4.tar.gz
BuildArch: noarch
Requires: python3-six
Requires: python3-click
Requires: python3-click-default-group
Requires: python3-requests
Requires: python3-simpleeval
Requires: python3-statistics
Requires: python3-tabulator
Requires: python3-tableschema
Requires: python3-datapackage
Requires: python3-mock
Requires: python3-pylama
Requires: python3-pytest
Requires: python3-pytest-cov
Requires: python3-pyyaml
Requires: python3-ezodf
Requires: python3-lxml
%description
# goodtables-py
[![Travis](https://img.shields.io/travis/frictionlessdata/goodtables-py/master.svg)](https://travis-ci.org/frictionlessdata/goodtables-py)
[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/goodtables-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/goodtables.svg)](https://pypi.python.org/pypi/goodtables)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/goodtables-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
Goodtables is a framework to validate tabular data. It can check the structure
of your data (e.g. all rows have the same number of columns), and its contents
(e.g. all dates are valid).
> **[Important Notice]** As of version 3, `goodtables` was renamed to `frictionless`. The framework received various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no actions are required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it remains available on [PyPi](https://pypi.org/project/goodtables/) as before
> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not yet stable
> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
## Features
* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
* **Parallelized validations for multi-table datasets**
* **Command line interface**
## Contents
- [Getting Started](#getting-started)
- [Installing](#installing)
- [Running on CLI](#running-on-cli)
- [Running on Python](#running-on-python)
- [Documentation](#documentation)
- [Report](#report)
- [Checks](#checks)
- [Presets](#presets)
- [Data Quality Errors](#data-quality-errors)
- [Frequently Asked Questions](#frequently-asked-questions)
- [API Reference](#api-reference)
- [`cli`](#cli)
- [`validate`](#validate)
- [`preset`](#preset)
- [`check`](#check)
- [`Error`](#error)
- [`spec`](#spec)
- [`GoodtablesException`](#goodtablesexception)
- [Contributing](#contributing)
- [Changelog](#changelog)
## Getting Started
> For faster validation of goodtables-compatible Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
### Installing
```
pip install goodtables
pip install goodtables[ods] # If you need LibreOffice's ODS file support
```
### Running on CLI
```
goodtables data.csv
```
Use `goodtables --help` to see the different options.
### Running on Python
```python
from goodtables import validate
report = validate('invalid.csv')
report['valid'] # false
report['table-count'] # 1
report['error-count'] # 3
report['tables'][0]['valid'] # false
report['tables'][0]['source'] # 'invalid.csv'
report['tables'][0]['errors'][0]['code'] # 'blank-header'
```
You can read a more in-depth explanation of using goodtables with Python in
the [developer documentation](#developer-documentation) section. Also check
the [examples](examples) folder for more examples.
## Documentation
Goodtables validates your tabular dataset to find structural and content
errors. Suppose you have a file named `invalid.csv`. Let's validate it:
```python
report = validate('invalid.csv')
```
We could also pass a remote URI instead of a local path. It supports CSV, XLS,
XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
library.
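For example, here is a minimal sketch of validating a remote file (the URL is hypothetical; for the default `table` preset, extra options such as `format` and `encoding` are forwarded to tabulator):
```python
from goodtables import validate

# Extra keyword arguments are forwarded to tabulator for the `table` preset
report = validate('https://example.com/data.csv', format='csv', encoding='utf-8')
print(report['valid'])
```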
### Report
> The validation report follows the JSON Schema defined in [goodtables/schemas/report.json][validation-jsonschema].
The output of `validate()` is a report dictionary. It includes information
about whether the data was valid, the error count, a list of table reports,
which individual checks failed, etc. A report looks like this:
```json
{
"time": 0.009,
"error-count": 1,
"warnings": [
"Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
],
"preset": "table",
"valid": false,
"tables": [
{
"errors": [
{
"row-number": null,
"message": "Header in column 3 is blank",
"row": null,
"column-number": 3,
"code": "blank-header"
}
],
"error-count": 1,
"headers": [
"id",
"name",
"",
"name"
],
"scheme": "file",
"row-count": 2,
"valid": false,
"encoding": "utf-8",
"time": 0.007,
"schema": null,
"format": "csv",
"source": "data/invalid"
}
],
"table-count": 1
}
```
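Since the report is a plain dictionary, you can walk it directly. A minimal sketch using the keys shown above:
```python
from goodtables import validate

report = validate('data/invalid.csv')
for table in report['tables']:
    print('Source:', table['source'])
    for error in table['errors']:
        # Each error carries a spec code, an optional row number, and a message
        print('[{}] row {}: {}'.format(
            error['code'], error['row-number'], error['message']))
```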
The errors are divided into the following categories:
- `source` - data can't be loaded or parsed
- `structure` - general tabular errors like duplicate headers
- `schema` - error of checks against [Table Schema](http://specs.frictionlessdata.io/table-schema/)
- `custom` - custom checks errors
### Checks
Checks are the main validation actors in goodtables. The list of enabled checks
can be changed using the `checks` and `skip_checks` arguments. Let's explore the
options with an example:
```python
report = validate('data.csv') # by default structure and schema (if available) checks
report = validate('data.csv', checks=['structure']) # only structure checks
report = validate('data.csv', checks=['schema']) # only schema (if available) checks
report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
```
By default, a dataset will be validated against all available Data Quality Spec
errors. Some checks can be unavailable for validation; for example, if the
schema isn't provided, only the `structure` checks will be done.
### Presets
Goodtables supports different formats of tabular datasets. They're called
presets. A tabular dataset is data that can be split into a list of data
tables, as in:
![Dataset](data/dataset.png)
We can change the preset using the `preset` argument for `validate()`. By
default, it'll be inferred from the source, falling back to `table`. To validate
a [data package][datapackage], we can do:
```python
report = validate('datapackage.json') # implicit preset
report = validate('datapackage.json', preset='datapackage') # explicit preset
```
This will validate all tabular resources in the datapackage.
It's also possible to validate a list of files using the "nested" preset. To do
so, the first argument to `validate()` should be a list of dictionaries, where
each key in the dictionary is named after a parameter on `validate()`. For example:
```python
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
```
This is similar to:
```python
report_data1 = validate('data1.csv')
report_data2 = validate('data2.csv')
```
The difference is that goodtables validates multiple tables in parallel, so
using the "nested" preset should run faster.
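Each dictionary in the nested list can also carry any other `validate()` option; the file and schema names below are hypothetical:
```python
report = validate(
    [
        {'source': 'data1.csv', 'schema': 'schema1.json'},
        {'source': 'data2.csv'},
    ],
    preset='nested',
)
```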
### Data Quality Errors
Base report errors are standardized and described in
[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
#### Source errors
The basic checks can't be disabled, as they deal with goodtables being able to read the files.
| check | description |
| --- | --- |
| io-error | Data reading error because of IO error. |
| http-error | Data reading error because of HTTP error. |
| source-error | Data reading error because of not supported or inconsistent contents. |
| scheme-error | Data reading error because of incorrect scheme. |
| format-error | Data reading error because of incorrect format. |
| encoding-error | Data reading error because of an encoding problem. |
#### Structure errors
These checks validate that the structure of the file is valid.
| check | description |
| --- | --- |
| blank-header | There is a blank header name. All cells in the header row must have a value. |
| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
| blank-row | Rows must have at least one non-blank cell. |
| duplicate-row | Rows can't be duplicated. |
| extra-value | A row has more columns than the header. |
| missing-value | A row has fewer columns than the header. |
#### Schema errors
These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.
If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
Lastly, if the order of the fields in the data differs from your schema's, enable the `order_fields` option.
| check | description |
| --- | --- |
| schema-error | Schema is not valid. |
| non-matching-header | The header's name in the schema is different from what's in the data. |
| extra-header | The data contains a header not defined in the schema. |
| missing-header | The data doesn't contain a header defined in the schema. |
| type-or-format-error | The value can’t be cast based on the schema type and format for this field. |
| required-constraint | This field is a required field, but it contains no value. |
| pattern-constraint | This field's value should conform to the defined pattern. |
| unique-constraint | This field is a unique field but it contains a value that has been used in another row. |
| enumerable-constraint | This field value should be equal to one of the values in the enumeration constraint. |
| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
| maximum-constraint | This field's value should be less than or equal to the constraint value. |
| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
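As a sketch of these options working together (file names hypothetical), the following validates data against a partial schema while tolerating a different column order:
```python
from goodtables import validate

report = validate('data.csv', schema='schema.json',
                  infer_fields=True, order_fields=True)
```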
#### Custom errors
| check | description |
| --- | --- |
| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package |
| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
##### blacklisted-value
Sometimes we have to check for values we don't want in our dataset. The check accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| blacklist | list of str | List of blacklisted values |
Consider the following CSV file:
```csv
id,name
1,John
2,bug
3,bad
5,Alex
```
Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
```python
from goodtables import validate
report = validate('data.csv', checks=[
{'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
])
# error on row 3 with code "blacklisted-value"
# error on row 4 with code "blacklisted-value"
```
##### deviated-value
This check helps to find outliers in a column containing positive numbers. It accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| average | str | Average type, either "mean", "median" or "mode" |
| interval | int | Values must be inside range `average ± standard deviation * interval` |
Consider the following CSV file:
```csv
temperature
1
-2
7
0
1
2
5
-4
100
8
3
```
We use `median` to get the average of the column values and allow an interval of 3 standard deviations. In our case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.
```python
report = validate('data.csv', checks=[
{'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
])
# error on row 10 with code "deviated-value"
```
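You can reproduce the interval with the standard library (this sketches the arithmetic only, not goodtables internals):
```python
import statistics

values = [1, -2, 7, 0, 1, 2, 5, -4, 100, 8, 3]
average = statistics.median(values)    # 2.0
deviation = statistics.stdev(values)   # ~29.73 (sample standard deviation)
print(average - 3 * deviation, average + 3 * deviation)  # approximately (-87.2, 91.2)
```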
##### foreign-key
> Relative paths are supported here, so this check MUST be used only with trusted data sources.
This check validates foreign keys within a data package. Consider the data package defined below:
```python
DESCRIPTOR = {
'resources': [
{
'name': 'cities',
'data': [
['id', 'name', 'next_id'],
[1, 'london', 2],
[2, 'paris', 3],
[3, 'rome', 4],
# [4, 'rio', None],
],
'schema': {
'fields': [
{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'},
{'name': 'next_id', 'type': 'integer'},
],
'foreignKeys': [
{
'fields': 'next_id',
'reference': {'resource': '', 'fields': 'id'},
},
{
'fields': 'id',
'reference': {'resource': 'people', 'fields': 'label'},
},
],
},
}, {
'name': 'people',
'data': [
['label', 'population'],
[1, 8],
[2, 2],
# [3, 3],
# [4, 6],
],
},
],
}
```
Running `goodtables` on it will raise a few `foreign-key` errors because we have commented out some rows in the data package's data:
```python
report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
print(report)
```
```
{'error-count': 2,
'preset': 'datapackage',
'table-count': 2,
'tables': [{'datapackage': '...',
'error-count': 2,
'errors': [{'code': 'foreign-key',
'message': 'Foreign key "[\'next_id\']" violation in '
'row 4',
'message-data': {'fields': ['next_id']},
'row-number': 4},
{'code': 'foreign-key',
'message': 'Foreign key "[\'id\']" violation in row 4',
'message-data': {'fields': ['id']},
'row-number': 4}],
'format': 'inline',
'headers': ['id', 'name', 'next_id'],
'resource-name': 'cities',
'row-count': 4,
'schema': 'table-schema',
'source': 'inline',
'time': 0.031,
'valid': False},
{'datapackage': '...',
'error-count': 0,
'errors': [],
'format': 'inline',
'headers': ['label', 'population'],
'resource-name': 'people',
'row-count': 3,
'source': 'inline',
'time': 0.038,
'valid': True}],
'time': 0.117,
'valid': False,
'warnings': []}
```
It experimentally supports external resource checks, for example, for `foreignKey` definitions like these:
```json
{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
{"package": "http:/example.com/datapackage.json", "resource": "people", "fields": "label"}
```
##### sequential-value
This check is for the pretty common case when a column should contain sequentially incrementing integers. It accepts the following option:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
Consider the following CSV file:
```csv
id,name
1,one
2,two
3,three
5,five
```
Let's check if the `id` column contains sequential integers:
```python
from goodtables import validate
report = validate('data.csv', checks=[
{'sequential-value': {'column': 'id'}},
])
# error on row 5 with code "sequential-value"
```
##### truncated-value
Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are some well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
Consider the following CSV file:
```csv
id,amount,comment
1,14000000,good
2,2147483647,bad
3,32767,bad
4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
```
To detect all probably-truncated values, we can use the `truncated-value` check:
```python
report = validate('data.csv', checks=[
'truncated-value',
])
# error on row 3 with code "truncated-value"
# error on row 4 with code "truncated-value"
# error on row 5 with code "truncated-value"
```
##### custom-constraint
With Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom constraint check, every row can be checked against a given limited Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following option:
| option | type | description |
| --- | --- | --- |
| constraint | str | Constraint definition (e.g. `col1 + col2 == col3`) |
Consider a CSV file like this:
```csv
id,name,salary,bonus
1,Alex,1000,200
2,Sam,2500,500
3,Ray,1350,500
4,John,5000,1000
```
Let's say our business rule is that a bonus must stay below a quarter of the salary:
```python
report = validate('data.csv', checks=[
{'custom-constraint': {'constraint': 'salary > bonus * 4'}},
])
# error on row 4 with code "custom-constraint"
```
### Frequently Asked Questions
#### How can I add a new custom check?
To create a custom check, use the `check` decorator. This way a builtin check can be overridden (use a spec error code like `duplicate-row`), or a check can be added for a custom error (use the `type`, `context` and `position` arguments):
```python
from goodtables import validate, check, Error

@check('custom-check', type='custom', context='body')
def custom_check(cells):
    errors = []
    for cell in cells:
        # Pass `message` as a keyword argument; the third positional
        # argument of Error is `row_number`, not `message`
        message = 'Custom error on column {column_number} and row {row_number}'
        error = Error('custom-error', cell=cell, message=message)
        errors.append(error)
    return errors

report = validate('data.csv', checks=['custom-check'])
```
Recommended steps:
- discuss the proposed check in an issue first
- select a name for the new check, like `possible-noise-text`
- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to a new check module
- add the new check module to the configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
- write the actual code for the new check
- write tests and a readme section for the new check
#### How can I add support for a new tabular file type?
To create a custom preset, use the `preset` decorator. This way a builtin preset can be overridden or a custom preset can be added:
```python
from tabulator import Stream
from tableschema import Schema
from goodtables import validate, preset
@preset('custom-preset')
def custom_preset(source, **options):
warnings = []
tables = []
for table in source:
try:
tables.append({
'source': str(source),
'stream': Stream(...),
'schema': Schema(...),
'extra': {...},
})
except Exception:
warnings.append('Warning message')
return warnings, tables
report = validate(source, preset='custom-preset')
```
For now, this documentation section is incomplete. Please see the builtin presets to learn more about the dataset extraction protocol.
## API Reference
### `cli`
```python
cli()
```
Command-line interface
```
Usage: cli.py [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
validate* Validate tabular files (default).
init Init data package from list of files.
```
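Typical invocations look like this (whether `init` prints the data package to stdout is an assumption; check `goodtables init --help`):
```
goodtables data.csv                        # validate a single table
goodtables datapackage.json                # validate a data package
goodtables init *.csv > datapackage.json   # infer a data package from files
```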
### `validate`
```python
validate(source, **options)
```
Validates a source file and returns a report.
__Arguments__
- __source (Union[str, Dict, List[Dict], IO])__:
The source to be validated.
It can be a local file path, URL, dict, list of dicts, or a
file-like object. If it's a list of dicts and the `preset` is
"nested", each of the dict key's will be used as if it was passed
as a keyword argument to this method.
The file can be a CSV, XLS, JSON, and any other format supported by
`tabulator`_.
- __checks (List[str])__:
List of checks names to be enabled. They can be
individual check names (e.g. `blank-headers`), or check types (e.g.
`structure`).
- __skip_checks (List[str])__:
List of checks names to be skipped. They can
be individual check names (e.g. `blank-headers`), or check types
(e.g. `structure`).
- __infer_schema (bool)__:
Infer schema if one wasn't passed as an argument.
- __infer_fields (bool)__:
Infer schema for columns not present in the received schema.
- __order_fields (bool)__:
Order source columns based on schema fields order.
This is useful when you don't want to validate that the data
columns' order is the same as the schema's.
- __error_limit (int)__:
Stop validation if the number of errors per table exceeds this value.
- __table_limit (int)__:
Maximum number of tables to validate.
- __row_limit (int)__:
Maximum number of rows to validate.
- __preset (str)__:
Dataset type could be `table` (default), `datapackage`,
`nested` or custom. Usually, the preset can be inferred from the
source, so you don't need to define it.
- __Any (Any)__:
Any additional arguments not defined here will be passed on,
depending on the chosen `preset`. If the `preset` is `table`, the
extra arguments will be passed on to `tabulator`_, if it is
`datapackage`, they will be passed on to the `datapackage`_
constructor.
__Raises__
- `GoodtablesException`: Raised on any non-tabular error.
__Returns__
`dict`: The validation report.
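A short usage sketch combining several of the documented options:
```python
from goodtables import validate

# Stop after 10 errors per table and look at the first 1000 rows only
report = validate('data.csv', checks=['structure'],
                  row_limit=1000, error_limit=10)
print(report['error-count'])
```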
### `preset`
```python
preset(name)
```
Register a custom preset (decorator)
__Example__
```python
@preset('custom-preset')
def custom_preset(source, **options):
# ...
```
__Arguments__
- __name (str)__: preset name
### `check`
```python
check(name, type=None, context=None, position=None)
```
Register a custom check (decorator)
__Example__
```python
@check('custom-check', type='custom', context='body')
def custom_check(cells):
# ...
```
__Arguments__
- __name (str)__: check name
- __type (str)__: has to be `custom`
- __context (str)__: has to be `head` or `body`
- __position (str)__: has to be `before:` or `after:`
### `Error`
```python
Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
```
Describes a validation check error
__Arguments__
- __code (str)__: The error code. Must be one in the spec.
- __cell (dict, optional)__: The cell where the error occurred.
- __row_number (int, optional)__: The row number where the error occurs.
- __message (str, optional)__:
The error message. Defaults to the message from the Data Quality Spec.
- __message_substitutions (dict, optional)__:
Dictionary with substitutions to be used when
generating the error message and description.
__Raises__
- `KeyError`: Raised if the error code isn't known.
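A minimal sketch of constructing an error by hand; passing an explicit `message` avoids relying on the spec's message substitutions:
```python
from goodtables import Error

# 'duplicate-header' is an error code from the Data Quality Spec
error = Error('duplicate-header', message='Header in column 2 is duplicated')
print(error.to_dict())  # `to_dict` is the serialization mentioned in the changelog (v2.1)
```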
### `spec`
A dictionary with the parsed Data Quality Spec; it maps error codes to their metadata (message, type, context, etc.).
### `GoodtablesException`
```python
GoodtablesException(self, /, *args, **kwargs)
```
Base goodtables exception
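A hedged sketch of defensive usage, assuming an unusable source makes `validate()` raise this exception:
```python
from goodtables import validate, GoodtablesException

try:
    report = validate(42)  # not a valid tabular source
except GoodtablesException as exception:
    print('Validation could not start:', exception)
```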
## Contributing
> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
The recommended way to get started is to create and activate a project virtual environment.
To install the package and development dependencies into the active environment:
```bash
$ make install
```
To run tests with linting and coverage:
```bash
$ make test
```
## Changelog
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
##### v2.5
- Added `check.check_headers_hook` to support headers check for body-contexted checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
##### v2.4
- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided it will be verified against actual values
##### v2.3
- Added a [foreign keys check](#foreign-key)
##### v2.2
- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
##### v2.1
- A new key added to the `error.to_dict` return: `message-data`
##### v2.0
Breaking changes:
- Checks method signature now only receives the current row's `cells` list
- Checks raise errors by returning an array of `Error` objects
- Cells have the row number in the `row-number` key
- Files with ZIP extension are presumed to be datapackages, so `goodtables mydatapackage.zip` works
- Improvements to goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
- New `goodtables init` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas.
Bug fixes:
- Fix bug with `truncated-values` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
##### v1.5
New API added:
- Validation `source` can now be a `pathlib.Path`
##### v1.4
Improved behaviour:
- rebased on Data Quality Spec v1
- rebased on Data Package Spec v1
- rebased on Table Schema Spec v1
- treat primary key as required/unique field
##### v1.3
New advanced checks added:
- `blacklisted-value`
- `custom-constraint`
- `deviated-value`
- `sequential-value`
- `truncated-value`
##### v1.2
New API added:
- `report.preset`
- `report.tables[].schema`
##### v1.1
New API added:
- `report.tables[].scheme`
- `report.tables[].format`
- `report.tables[].encoding`
##### v1.0
This version includes various big changes. A migration guide is under development and will be published here.
##### v0.6
First version of `goodtables`.
[tableschema]: https://specs.frictionlessdata.io/table-schema/
[tabulator]: https://github.com/frictionlessdata/tabulator-py/
[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
[semver]: https://semver.org/ "Semantic Versioning"
[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
%package -n python3-goodtables
Summary: Goodtables is a framework to inspect tabular data.
Provides: python-goodtables
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-goodtables
# goodtables-py
[![Travis](https://img.shields.io/travis/frictionlessdata/goodtables-py/master.svg)](https://travis-ci.org/frictionlessdata/goodtables-py)
[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/goodtables-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/goodtables.svg)](https://pypi.python.org/pypi/goodtables)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/goodtables-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
Goodtables is a framework to validate tabular data. It can check the structure
of your data (e.g. all rows have the same number of columns), and its contents
(e.g. all dates are valid).
> **[Important Notice]** As of version 3, `goodtables` was renamed to `frictionless`. The framework received various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no actions are required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it remains available on [PyPi](https://pypi.org/project/goodtables/) as before
> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not yet stable
> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
## Features
* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
* **Parallelized validations for multi-table datasets**
* **Command line interface**
## Contents
- [Getting Started](#getting-started)
- [Installing](#installing)
- [Running on CLI](#running-on-cli)
- [Running on Python](#running-on-python)
- [Documentation](#documentation)
- [Report](#report)
- [Checks](#checks)
- [Presets](#presets)
- [Data Quality Errors](#data-quality-errors)
- [Frequently Asked Questions](#frequently-asked-questions)
- [API Reference](#api-reference)
- [`cli`](#cli)
- [`validate`](#validate)
- [`preset`](#preset)
- [`check`](#check)
- [`Error`](#error)
- [`spec`](#spec)
- [`GoodtablesException`](#goodtablesexception)
- [Contributing](#contributing)
- [Changelog](#changelog)
## Getting Started
> For faster validation of goodtables-compatible Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
### Installing
```
pip install goodtables
pip install goodtables[ods] # If you need LibreOffice's ODS file support
```
### Running on CLI
```
goodtables data.csv
```
Use `goodtables --help` to see the different options.
### Running on Python
```python
from goodtables import validate
report = validate('invalid.csv')
report['valid'] # false
report['table-count'] # 1
report['error-count'] # 3
report['tables'][0]['valid'] # false
report['tables'][0]['source'] # 'invalid.csv'
report['tables'][0]['errors'][0]['code'] # 'blank-header'
```
You can read a more in-depth explanation of using goodtables with Python in
the [developer documentation](#developer-documentation) section. Also check
the [examples](examples) folder for more examples.
## Documentation
Goodtables validates your tabular dataset to find structural and content
errors. Suppose you have a file named `invalid.csv`. Let's validate it:
```python
report = validate('invalid.csv')
```
We could also pass a remote URI instead of a local path. It supports CSV, XLS,
XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
library.
### Report
> The validation report follows the JSON Schema defined in [goodtables/schemas/report.json][validation-jsonschema].
The output of `validate()` is a report dictionary. It includes information
about whether the data was valid, the error count, a list of table reports,
which individual checks failed, etc. A report looks like this:
```json
{
"time": 0.009,
"error-count": 1,
"warnings": [
"Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
],
"preset": "table",
"valid": false,
"tables": [
{
"errors": [
{
"row-number": null,
"message": "Header in column 3 is blank",
"row": null,
"column-number": 3,
"code": "blank-header"
}
],
"error-count": 1,
"headers": [
"id",
"name",
"",
"name"
],
"scheme": "file",
"row-count": 2,
"valid": false,
"encoding": "utf-8",
"time": 0.007,
"schema": null,
"format": "csv",
"source": "data/invalid"
}
],
"table-count": 1
}
```
The errors are divided into the following categories:
- `source` - data can't be loaded or parsed
- `structure` - general tabular errors like duplicate headers
- `schema` - error of checks against [Table Schema](http://specs.frictionlessdata.io/table-schema/)
- `custom` - custom checks errors
### Checks
Checks are the main validation actors in goodtables. The list of enabled checks
can be changed using the `checks` and `skip_checks` arguments. Let's explore the
options with an example:
```python
report = validate('data.csv') # by default structure and schema (if available) checks
report = validate('data.csv', checks=['structure']) # only structure checks
report = validate('data.csv', checks=['schema']) # only schema (if available) checks
report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
```
By default, a dataset will be validated against all available Data Quality Spec
errors. Some checks can be unavailable for validation; for example, if the
schema isn't provided, only the `structure` checks will be done.
### Presets
Goodtables supports different formats of tabular datasets. They're called
presets. A tabular dataset is data that can be split into a list of data
tables, as in:
![Dataset](data/dataset.png)
We can change the preset using the `preset` argument for `validate()`. By
default, it'll be inferred from the source, falling back to `table`. To validate
a [data package][datapackage], we can do:
```python
report = validate('datapackage.json') # implicit preset
report = validate('datapackage.json', preset='datapackage') # explicit preset
```
This will validate all tabular resources in the datapackage.
It's also possible to validate a list of files using the "nested" preset. To do
so, the first argument to `validate()` should be a list of dictionaries, where
each key in the dictionary is named after a parameter on `validate()`. For example:
```python
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
```
This is similar to:
```python
report_data1 = validate('data1.csv')
report_data2 = validate('data2.csv')
```
The difference is that goodtables validates multiple tables in parallel, so
using the "nested" preset should run faster.
### Data Quality Errors
Base report errors are standardized and described in
[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
#### Source errors
The basic checks can't be disabled, as they deal with goodtables being able to read the files.
| check | description |
| --- | --- |
| io-error | Data reading error because of IO error. |
| http-error | Data reading error because of HTTP error. |
| source-error | Data reading error because of not supported or inconsistent contents. |
| scheme-error | Data reading error because of incorrect scheme. |
| format-error | Data reading error because of incorrect format. |
| encoding-error | Data reading error because of an encoding problem. |
#### Structure errors
These checks validate that the structure of the file is valid.
| check | description |
| --- | --- |
| blank-header | There is a blank header name. All cells in the header row must have a value. |
| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
| blank-row | Rows must have at least one non-blank cell. |
| duplicate-row | Rows can't be duplicated. |
| extra-value | A row has more columns than the header. |
| missing-value | A row has fewer columns than the header. |
#### Schema errors
These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.
If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
Lastly, if the order of the fields in the data differs from your schema's, enable the `order_fields` option.
| check | description |
| --- | --- |
| schema-error | Schema is not valid. |
| non-matching-header | The header's name in the schema is different from what's in the data. |
| extra-header | The data contains a header not defined in the schema. |
| missing-header | The data doesn't contain a header defined in the schema. |
| type-or-format-error | The value can’t be cast based on the schema type and format for this field. |
| required-constraint | This field is a required field, but it contains no value. |
| pattern-constraint | This field's value should conform to the defined pattern. |
| unique-constraint | This field is a unique field but it contains a value that has been used in another row. |
| enumerable-constraint | This field value should be equal to one of the values in the enumeration constraint. |
| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
| maximum-constraint | This field's value should be less than or equal to the constraint value. |
| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
#### Custom errors
| check | description |
| --- | --- |
| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package |
| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
##### blacklisted-value
Sometimes we have to check for values we don't want in our dataset. The check accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| blacklist | list of str | List of blacklisted values |
Consider the following CSV file:
```csv
id,name
1,John
2,bug
3,bad
5,Alex
```
Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
```python
from goodtables import validate
report = validate('data.csv', checks=[
{'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
])
# error on row 3 with code "blacklisted-value"
# error on row 4 with code "blacklisted-value"
```
##### deviated-value
This check helps to find outliers in a column containing positive numbers. It accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| average | str | Average type, either "mean", "median" or "mode" |
| interval | int | Values must be inside range `average ± standard deviation * interval` |
Consider the following CSV file:
```csv
temperature
1
-2
7
0
1
2
5
-4
100
8
3
```
We use `median` to get the average of the column values and allow an interval of 3 standard deviations. In our case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.
```python
report = validate('data.csv', checks=[
{'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
])
# error on row 10 with code "deviated-value"
```
##### foreign-key
> Relative paths are supported here, so this check MUST be used only with trusted data sources.
This check validates foreign keys within a data package. Consider the data package defined below:
```python
DESCRIPTOR = {
'resources': [
{
'name': 'cities',
'data': [
['id', 'name', 'next_id'],
[1, 'london', 2],
[2, 'paris', 3],
[3, 'rome', 4],
# [4, 'rio', None],
],
'schema': {
'fields': [
{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'},
{'name': 'next_id', 'type': 'integer'},
],
'foreignKeys': [
{
'fields': 'next_id',
'reference': {'resource': '', 'fields': 'id'},
},
{
'fields': 'id',
'reference': {'resource': 'people', 'fields': 'label'},
},
],
},
}, {
'name': 'people',
'data': [
['label', 'population'],
[1, 8],
[2, 2],
# [3, 3],
# [4, 6],
],
},
],
}
```
Running `goodtables` on it will raise a few `foreign-key` errors because we have commented out some rows in the data package's data:
```python
report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
print(report)
```
```
{'error-count': 2,
'preset': 'datapackage',
'table-count': 2,
'tables': [{'datapackage': '...',
'error-count': 2,
'errors': [{'code': 'foreign-key',
'message': 'Foreign key "[\'next_id\']" violation in '
'row 4',
'message-data': {'fields': ['next_id']},
'row-number': 4},
{'code': 'foreign-key',
'message': 'Foreign key "[\'id\']" violation in row 4',
'message-data': {'fields': ['id']},
'row-number': 4}],
'format': 'inline',
'headers': ['id', 'name', 'next_id'],
'resource-name': 'cities',
'row-count': 4,
'schema': 'table-schema',
'source': 'inline',
'time': 0.031,
'valid': False},
{'datapackage': '...',
'error-count': 0,
'errors': [],
'format': 'inline',
'headers': ['label', 'population'],
'resource-name': 'people',
'row-count': 3,
'source': 'inline',
'time': 0.038,
'valid': True}],
'time': 0.117,
'valid': False,
'warnings': []}
```
It experimentally supports external resource checks, for example, for `foreignKey` definitions like these:
```json
{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
{"package": "http:/example.com/datapackage.json", "resource": "people", "fields": "label"}
```
##### sequential-value
This check is for the pretty common case when a column should contain sequentially incrementing integers. It accepts the following option:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
Consider the following CSV file:
```csv
id,name
1,one
2,two
3,three
5,five
```
Let's check if the `id` column contains sequential integers:
```python
from goodtables import validate
report = validate('data.csv', checks=[
{'sequential-value': {'column': 'id'}},
])
# error on row 5 with code "sequential-value"
```
##### truncated-value
Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are some well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
Consider the following CSV file:
```csv
id,amount,comment
1,14000000,good
2,2147483647,bad
3,32767,bad
4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
```
To detect all probably-truncated values, we can use the `truncated-value` check:
```python
report = validate('data.csv', checks=[
'truncated-value',
])
# error on row 3 with code "truncated-value"
# error on row 4 with code "truncated-value"
# error on row 5 with code "truncated-value"
```
##### custom-constraint
With Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom constraint check, every row can be checked against a given limited Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following option:
| option | type | description |
| --- | --- | --- |
| constraint | str | Constraint definition (e.g. `col1 + col2 == col3`) |
Consider a CSV file like this:
```csv
id,name,salary,bonus
1,Alex,1000,200
2,Sam,2500,500
3,Ray,1350,500
4,John,5000,1000
```
Let's say our business rule is that a bonus must stay below a quarter of the salary:
```python
report = validate('data.csv', checks=[
{'custom-constraint': {'constraint': 'salary > bonus * 4'}},
])
# error on row 4 with code "custom-constraint"
```
### Frequently Asked Questions
#### How can I add a new custom check?
To create a custom check, use the `check` decorator. This way a builtin check can be overridden (use a spec error code like `duplicate-row`), or a check can be added for a custom error (use the `type`, `context` and `position` arguments):
```python
from goodtables import validate, check, Error

@check('custom-check', type='custom', context='body')
def custom_check(cells):
    errors = []
    for cell in cells:
        # Pass `message` as a keyword argument; the third positional
        # argument of Error is `row_number`, not `message`
        message = 'Custom error on column {column_number} and row {row_number}'
        error = Error('custom-error', cell=cell, message=message)
        errors.append(error)
    return errors

report = validate('data.csv', checks=['custom-check'])
```
Recommended steps:
- discuss the proposed check in an issue first
- select a name for the new check, like `possible-noise-text`
- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to a new check module
- add the new check module to the configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
- write the actual code for the new check
- write tests and a readme section for the new check
#### How can I add support for a new tabular file type?
To create a custom preset, use the `preset` decorator. This way a builtin preset can be overridden or a custom preset can be added:
```python
from tabulator import Stream
from tableschema import Schema
from goodtables import validate, preset
@preset('custom-preset')
def custom_preset(source, **options):
warnings = []
tables = []
for table in source:
try:
tables.append({
'source': str(source),
'stream': Stream(...),
'schema': Schema(...),
'extra': {...},
})
except Exception:
warnings.append('Warning message')
return warnings, tables
report = validate(source, preset='custom-preset')
```
For now, this documentation section is incomplete. Please see the builtin presets to learn more about the dataset extraction protocol.
## API Reference
### `cli`
```python
cli()
```
Command-line interface
```
Usage: cli.py [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
validate* Validate tabular files (default).
init Init data package from list of files.
```
### `validate`
```python
validate(source, **options)
```
Validates a source file and returns a report.
__Arguments__
- __source (Union[str, Dict, List[Dict], IO])__:
The source to be validated.
It can be a local file path, URL, dict, list of dicts, or a
file-like object. If it's a list of dicts and the `preset` is
"nested", each of the dict key's will be used as if it was passed
as a keyword argument to this method.
The file can be a CSV, XLS, JSON, and any other format supported by
`tabulator`_.
- __checks (List[str])__:
List of checks names to be enabled. They can be
individual check names (e.g. `blank-headers`), or check types (e.g.
`structure`).
- __skip_checks (List[str])__:
List of checks names to be skipped. They can
be individual check names (e.g. `blank-headers`), or check types
(e.g. `structure`).
- __infer_schema (bool)__:
Infer schema if one wasn't passed as an argument.
- __infer_fields (bool)__:
Infer schema for columns not present in the received schema.
- __order_fields (bool)__:
Order source columns based on schema fields order.
This is useful when you don't want to validate that the data
columns' order is the same as the schema's.
- __error_limit (int)__:
Stop validation if the number of errors per table exceeds this value.
- __table_limit (int)__:
Maximum number of tables to validate.
- __row_limit (int)__:
Maximum number of rows to validate.
- __preset (str)__:
Dataset type could be `table` (default), `datapackage`,
`nested` or custom. Usually, the preset can be inferred from the
source, so you don't need to define it.
- __Any (Any)__:
Any additional arguments not defined here will be passed on,
depending on the chosen `preset`. If the `preset` is `table`, the
extra arguments will be passed on to `tabulator`_, if it is
`datapackage`, they will be passed on to the `datapackage`_
constructor.
__Raises__
- `GoodtablesException`: Raised on any non-tabular error.
__Returns__
`dict`: The validation report.
### `preset`
```python
preset(name)
```
Register a custom preset (decorator)
__Example__
```python
@preset('custom-preset')
def custom_preset(source, **options):
# ...
```
__Arguments__
- __name (str)__: preset name
### `check`
```python
check(name, type=None, context=None, position=None)
```
Register a custom check (decorator)
__Example__
```python
@check('custom-check', type='custom', context='body')
def custom_check(cells):
# ...
```
__Arguments__
- __name (str)__: check name
- __type (str)__: has to be `custom`
- __context (str)__: has to be `head` or `body`
- __position (str)__: has to be `before:` or `after:`
### `Error`
```python
Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
```
Describes a validation check error
__Arguments__
- __code (str)__: The error code. Must be one in the spec.
- __cell (dict, optional)__: The cell where the error occurred.
- __row_number (int, optional)__: The row number where the error occurs.
- __message (str, optional)__:
The error message. Defaults to the message from the Data Quality Spec.
- __message_substitutions (dict, optional)__:
Dictionary with substitutions to be used when
generating the error message and description.
__Raises__
- `KeyError`: Raised if the error code isn't known.
### `spec`
A dictionary with the parsed Data Quality Spec; it maps error codes to their metadata (message, type, context, etc.).
### `GoodtablesException`
```python
GoodtablesException(self, /, *args, **kwargs)
```
Base goodtables exception
## Contributing
> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
The recommended way to get started is to create and activate a project virtual environment.
To install the package and development dependencies into the active environment:
```bash
$ make install
```
To run tests with linting and coverage:
```bash
$ make test
```
## Changelog
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
##### v2.5
- Added `check.check_headers_hook` to support headers check for body-contexted checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
##### v2.4
- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided it will be verified against actual values
##### v2.3
- Added a [foreign keys check](#foreign-key)
##### v2.2
- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
##### v2.1
- A new key added to the `error.to_dict` return: `message-data`
##### v2.0
Breaking changes:
- Checks method signature now only receives the current row's `cells` list
- Checks raise errors by returning an array of `Error` objects
- Cells have the row number in the `row-number` key
- Files with ZIP extension are presumed to be datapackages, so `goodtables mydatapackage.zip` works
- Improvements to goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
- New `goodtables init` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas.
Bug fixes:
- Fix bug with `truncated-values` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
##### v1.5
New API added:
- Validation `source` can now be a `pathlib.Path`
##### v1.4
Improved behaviour:
- rebased on Data Quality Spec v1
- rebased on Data Package Spec v1
- rebased on Table Schema Spec v1
- treat primary key as required/unique field
##### v1.3
New advanced checks added:
- `blacklisted-value`
- `custom-constraint`
- `deviated-value`
- `sequential-value`
- `truncated-value`
##### v1.2
New API added:
- `report.preset`
- `report.tables[].schema`
##### v1.1
New API added:
- `report.tables[].scheme`
- `report.tables[].format`
- `report.tables[].encoding`
##### v1.0
This version includes various big changes. A migration guide is under development and will be published here.
##### v0.6
First version of `goodtables`.
[tableschema]: https://specs.frictionlessdata.io/table-schema/
[tabulator]: https://github.com/frictionlessdata/tabulator-py/
[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
[semver]: https://semver.org/ "Semantic Versioning"
[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
%package help
Summary: Development documents and examples for goodtables
Provides: python3-goodtables-doc
%description help
# goodtables-py
[![Travis](https://img.shields.io/travis/frictionlessdata/goodtables-py/master.svg)](https://travis-ci.org/frictionlessdata/goodtables-py)
[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/goodtables-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/goodtables.svg)](https://pypi.python.org/pypi/goodtables)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/goodtables-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
Goodtables is a framework to validate tabular data. It can check the structure
of your data (e.g. all rows have the same number of columns), and its contents
(e.g. all dates are valid).
> **[Important Notice]** As of version 3, `goodtables` was renamed to `frictionless`. The framework received various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no actions are required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it remains available on [PyPi](https://pypi.org/project/goodtables/) as before
> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not yet stable
> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
## Features
* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
* **Parallelized validations for multi-table datasets**
* **Command line interface**
## Contents
- [Getting Started](#getting-started)
- [Installing](#installing)
- [Running on CLI](#running-on-cli)
- [Running on Python](#running-on-python)
- [Documentation](#documentation)
- [Report](#report)
- [Checks](#checks)
- [Presets](#presets)
- [Data Quality Errors](#data-quality-errors)
- [Frequently Asked Questions](#frequently-asked-questions)
- [API Reference](#api-reference)
- [`cli`](#cli)
- [`validate`](#validate)
- [`preset`](#preset)
- [`check`](#check)
- [`Error`](#error)
- [`spec`](#spec)
- [`GoodtablesException`](#goodtablesexception)
- [Contributing](#contributing)
- [Changelog](#changelog)
## Getting Started
> For faster validation of goodtables-compatible Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
### Installing
```
pip install goodtables
pip install goodtables[ods] # If you need LibreOffice's ODS file support
```
### Running on CLI
```
goodtables data.csv
```
Use `goodtables --help` to see the different options.
### Running on Python
```python
from goodtables import validate
report = validate('invalid.csv')
report['valid'] # false
report['table-count'] # 1
report['error-count'] # 3
report['tables'][0]['valid'] # false
report['tables'][0]['source'] # 'invalid.csv'
report['tables'][0]['errors'][0]['code'] # 'blank-header'
```
You can read a more in-depth explanation of using goodtables with Python in
the [developer documentation](#developer-documentation) section. Also check
the [examples](examples) folder for more examples.
## Documentation
Goodtables validates your tabular dataset to find structural and content
errors. Suppose you have a file named `invalid.csv`. Let's validate it:
```python
report = validate('invalid.csv')
```
We could also pass a remote URI instead of a local path. It supports CSV, XLS,
XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
library.
### Report
> The validation report follows the JSON Schema defined on [goodtables/schemas/report.json][validation-jsonschema].
The output of the `validate()` method is a report dictionary. It includes
whether the data was valid, the error count, a list of table reports, which
individual checks failed, and so on. A report looks like this:
```json
{
  "time": 0.009,
  "error-count": 1,
  "warnings": [
    "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
  ],
  "preset": "table",
  "valid": false,
  "tables": [
    {
      "errors": [
        {
          "row-number": null,
          "message": "Header in column 3 is blank",
          "row": null,
          "column-number": 3,
          "code": "blank-header"
        }
      ],
      "error-count": 1,
      "headers": [
        "id",
        "name",
        "",
        "name"
      ],
      "scheme": "file",
      "row-count": 2,
      "valid": false,
      "encoding": "utf-8",
      "time": 0.007,
      "schema": null,
      "format": "csv",
      "source": "data/invalid"
    }
  ],
  "table-count": 1
}
```
Each error falls into one of the following categories (a sketch for tallying them programmatically follows the list):
- `source` - the data can't be loaded or parsed
- `structure` - general tabular errors like duplicate headers
- `schema` - errors from checks against a [Table Schema](http://specs.frictionlessdata.io/table-schema/)
- `custom` - errors from custom checks
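Here is a minimal sketch of such a tally; it assumes the `spec` object (see the [API Reference](#api-reference)) exposes an `errors` mapping keyed by error code, with a `type` field naming each error's category:
```python
from collections import Counter

from goodtables import validate, spec

report = validate('invalid.csv')

# Tally the report's errors by Data Quality Spec category
# ('source', 'structure', 'schema' or 'custom')
categories = Counter(
    spec['errors'][error['code']]['type']
    for table in report['tables']
    for error in table['errors']
)
print(categories)  # e.g. Counter({'structure': 1})
```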
### Checks
Checks are the main validation actors in goodtables. The list of enabled checks
can be changed using the `checks` and `skip_checks` arguments. Let's explore the
options with an example:
```python
report = validate('data.csv') # by default structure and schema (if available) checks
report = validate('data.csv', checks=['structure']) # only structure checks
report = validate('data.csv', checks=['schema']) # only schema (if available) checks
report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
```
By default a dataset is validated against all available Data Quality Spec
errors. Some checks may be unavailable; for example, if no schema is
provided, only the `structure` checks are run.
### Presets
Goodtables supports different formats of tabular datasets. They're called
presets. A tabular dataset is some data that can be split into a list of data
tables, as:
![Dataset](data/dataset.png)
We can change the preset using the `preset` argument for `validate()`. By
default, it'll be inferred from the source, falling back to `table`. To validate
a [data package][datapackage], we can do:
```python
report = validate('datapackage.json') # implicit preset
report = validate('datapackage.json', preset='datapackage') # explicit preset
```
This will validate all tabular resources in the datapackage.
It's also possible to validate a list of files using the "nested" preset. To do
so, the first argument to `validate()` should be a list of dictionaries, where
each key in the dictionary is named after a parameter on `validate()`. For example:
```python
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
```
This is similar to:
```python
report_data1 = validate('data1.csv')
report_data2 = validate('data2.csv')
```
The difference is that goodtables validates multiple tables in parallel, so
using the "nested" preset should run faster.
### Data Quality Errors
Base report errors are standardized and described in
[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
#### Source errors
The basic checks can't be disabled, as they deal with goodtables being able to read the files.
| check | description |
| --- | --- |
| io-error | Data reading error because of IO error. |
| http-error | Data reading error because of HTTP error. |
| source-error | Data reading error because of not supported or inconsistent contents. |
| scheme-error | Data reading error because of incorrect scheme. |
| format-error | Data reading error because of incorrect format. |
| encoding-error | Data reading error because of an encoding problem. |
#### Structure errors
These checks validate the structure of the file.
| check | description |
| --- | --- |
| blank-header | There is a blank header name. All cells in the header row must have a value. |
| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
| blank-row | Rows must have at least one non-blank cell. |
| duplicate-row | Rows can't be duplicated. |
| extra-value | A row has more columns than the header. |
| missing-value | A row has fewer columns than the header. |
#### Schema errors
These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it via the `infer_schema` option.
If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
Lastly, if the order of the fields in the data differs from your schema, enable the `order_fields` option. A short sketch combining these options follows the table below.
| check | description |
| --- | --- |
| schema-error | Schema is not valid. |
| non-matching-header | The header's name in the schema is different from what's in the data. |
| extra-header | The data contains a header not defined in the schema. |
| missing-header | The data doesn't contain a header defined in the schema. |
| type-or-format-error | The value can't be cast based on the schema type and format for this field. |
| required-constraint | This field is required, but it contains no value. |
| pattern-constraint | This field's value should conform to the defined pattern. |
| unique-constraint | This field is a unique field, but it contains a value that has been used in another row. |
| enumerable-constraint | This field's value should be equal to one of the values in the enumeration constraint. |
| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
| maximum-constraint | This field's value should be less than or equal to the constraint value. |
| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
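As mentioned above, a hedged sketch combining these options (the file names are hypothetical):
```python
from goodtables import validate

report = validate(
    'data.csv',
    schema='schema.json',  # a partial Table Schema covering some columns
    infer_fields=True,     # infer fields the schema doesn't cover
    order_fields=True,     # match schema fields to data columns by name
)
```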
#### Custom errors
| check | description |
| --- | --- |
| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package. |
| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
##### blacklisted-value
Sometimes we have to check for values we don't want in our dataset. It accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| blacklist | list of str | List of blacklisted values |
Consider the following CSV file:
```csv
id,name
1,John
2,bug
3,bad
5,Alex
```
Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
```python
from goodtables import validate
report = validate('data.csv', checks=[
    {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
])
# error on row 3 with code "blacklisted-value"
# error on row 4 with code "blacklisted-value"
```
##### deviated-value
This check helps find outliers in a column containing positive numbers. It accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
| average | str | Average type, either "mean", "median" or "mode" |
| interval | int | Values must be inside range `average ± standard deviation * interval` |
Consider the following CSV file:
```csv
temperature
1
-2
7
0
1
2
5
-4
100
8
3
```
We use `median` to get the average of the column values and allow an interval of 3 standard deviations. In our case the median is `2.0` and the standard deviation is `29.73`, so all valid values must fall inside the `[-87.19, 91.19]` interval.
```python
report = validate('data.csv', checks=[
    {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
])
# error on row 10 with code "deviated-value"
```
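The bounds can be reproduced with the standard library; this mirrors the description above, not goodtables' internal code:
```python
import statistics

values = [1, -2, 7, 0, 1, 2, 5, -4, 100, 8, 3]

average = statistics.median(values)   # 2.0
deviation = statistics.stdev(values)  # ~29.7 (sample standard deviation)

# Values outside average ± interval * deviation are flagged
lower = average - 3 * deviation  # ~-87.2
upper = average + 3 * deviation  # ~91.2
print([v for v in values if not lower <= v <= upper])  # [100]
```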
##### foreign-key
> Relative paths are supported here. This MUST be used only with trusted data sources.
This check validates foreign keys within a data package. Consider the data package defined below:
```python
DESCRIPTOR = {
    'resources': [
        {
            'name': 'cities',
            'data': [
                ['id', 'name', 'next_id'],
                [1, 'london', 2],
                [2, 'paris', 3],
                [3, 'rome', 4],
                # [4, 'rio', None],
            ],
            'schema': {
                'fields': [
                    {'name': 'id', 'type': 'integer'},
                    {'name': 'name', 'type': 'string'},
                    {'name': 'next_id', 'type': 'integer'},
                ],
                'foreignKeys': [
                    {
                        'fields': 'next_id',
                        'reference': {'resource': '', 'fields': 'id'},
                    },
                    {
                        'fields': 'id',
                        'reference': {'resource': 'people', 'fields': 'label'},
                    },
                ],
            },
        }, {
            'name': 'people',
            'data': [
                ['label', 'population'],
                [1, 8],
                [2, 2],
                # [3, 3],
                # [4, 6],
            ],
        },
    ],
}
```
Running `goodtables` on it raises a few `foreign-key` errors because we have commented out some rows in the data package's data:
```python
report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
print(report)
```
```
{'error-count': 2,
 'preset': 'datapackage',
 'table-count': 2,
 'tables': [{'datapackage': '...',
             'error-count': 2,
             'errors': [{'code': 'foreign-key',
                         'message': 'Foreign key "[\'next_id\']" violation in row 4',
                         'message-data': {'fields': ['next_id']},
                         'row-number': 4},
                        {'code': 'foreign-key',
                         'message': 'Foreign key "[\'id\']" violation in row 4',
                         'message-data': {'fields': ['id']},
                         'row-number': 4}],
             'format': 'inline',
             'headers': ['id', 'name', 'next_id'],
             'resource-name': 'cities',
             'row-count': 4,
             'schema': 'table-schema',
             'source': 'inline',
             'time': 0.031,
             'valid': False},
            {'datapackage': '...',
             'error-count': 0,
             'errors': [],
             'format': 'inline',
             'headers': ['label', 'population'],
             'resource-name': 'people',
             'row-count': 3,
             'source': 'inline',
             'time': 0.038,
             'valid': True}],
 'time': 0.117,
 'valid': False,
 'warnings': []}
```
It experimentally supports external resource checks, for example, with `foreignKey` definitions like these:
```json
{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
{"package": "http:/example.com/datapackage.json", "resource": "people", "fields": "label"}
```
##### sequential-value
This check covers the fairly common case in which a column should contain sequentially incrementing integers. It accepts the following options:
| option | type | description |
| --- | --- | --- |
| column | int/str | Column number or name |
Consider the following CSV file:
```csv
id,name
1,one
2,two
3,three
5,five
```
Let's check if the `id` column contains sequential integers:
```python
from goodtables import validate
report = validate('data.csv', checks=[
    {'sequential-value': {'column': 'id'}},
])
# error on row 5 with code "sequential-value"
```
##### truncated-value
Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are some well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
Consider the following CSV file:
```csv
id,amount,comment
1,14000000,good
2,2147483647,bad
3,32767,bad
4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
```
To detect all probably-truncated values we can use the `truncated-value` check:
```python
report = validate('data.csv', checks=[
    'truncated-value',
])
# error on row 3 with code "truncated-value"
# error on row 4 with code "truncated-value"
# error on row 5 with code "truncated-value"
```
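For intuition, here is a rough sketch of the kind of heuristics such a check relies on; it is illustrative only, not goodtables' actual implementation:
```python
# Integer type boundaries and suspiciously long strings are classic
# signs of silent truncation
INTEGER_CAPS = {32767, 2147483647, 9223372036854775807}  # int16/int32/int64 max

def looks_truncated(value, max_string_length=255):
    """Heuristically flag values that may have been cut off on saving."""
    if isinstance(value, int):
        return value in INTEGER_CAPS
    if isinstance(value, str):
        return len(value) >= max_string_length
    return False

print(looks_truncated(2147483647))  # True
print(looks_truncated('b' * 255))   # True
print(looks_truncated(14000000))    # False
```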
##### custom-constraint
With Table Schema we can create constraints for an individual field, but sometimes that's not enough. The custom constraint check validates every row against a given limited Python expression, in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following options:
| option | type | description |
| --- | --- | --- |
| constraint | str | Constraint definition (e.g. `col1 + col2 == col3`) |
Consider a CSV file like this:
```csv
id,name,salary,bonus
1,Alex,1000,200
2,Sam,2500,500
3,Ray,1350,500
4,John,5000,1000
```
Let's say our business rule is that a salary must be more than four times the bonus:
```python
report = validate('data.csv', checks=[
    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
])
# error on row 4 with code "custom-constraint"
```
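The expression is evaluated with the [simpleeval](https://github.com/danthedeckie/simpleeval) library linked above; here is a minimal sketch of the same idea against a single hypothetical row:
```python
from simpleeval import simple_eval

row = {'id': 3, 'name': 'Ray', 'salary': 1350, 'bonus': 500}

# Column names resolve to the row's values inside the expression
ok = simple_eval('salary > bonus * 4', names=row)
print(ok)  # False -- this row would fail the custom constraint
```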
### Frequently Asked Questions
#### How can I add a new custom check?
To create a custom check, use the `check` decorator. This way a built-in check can be overridden (use a spec error code like `duplicate-row`) or a check for a custom error can be added (use the `type`, `context` and `position` arguments):
```python
from goodtables import validate, check, Error

@check('custom-check', type='custom', context='body')
def custom_check(cells):
    # Return a list of Error objects; an empty list means the row is valid
    errors = []
    for cell in cells:
        message = 'Custom error on column {column_number} and row {row_number}'
        # Pass message as a keyword: Error's third positional argument is row_number
        error = Error(
            'custom-error',
            cell,
            message=message,
        )
        errors.append(error)
    return errors

report = validate('data.csv', checks=['custom-check'])
```
Recommended steps:
- let's discuss the proposed check first (e.g. in an issue)
- select name for a new check like `possible-noise-text`
- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to new check module
- add new check module to configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
- write actual code for the new check
- write tests and readme for the new check
#### How can I add support for a new tabular file type?
To create a custom preset, use the `preset` decorator. This way a built-in preset can be overridden or a custom preset can be added:
```python
from tabulator import Stream
from tableschema import Schema
from goodtables import validate, preset

@preset('custom-preset')
def custom_preset(source, **options):
    warnings = []
    tables = []
    for table in source:
        try:
            # Describe each data table with a stream, a schema and extra metadata
            tables.append({
                'source': str(source),
                'stream': Stream(...),
                'schema': Schema(...),
                'extra': {...},
            })
        except Exception:
            warnings.append('Warning message')
    return warnings, tables

report = validate(source, preset='custom-preset')
```
For now, this documentation section is incomplete. Please see the built-in presets to learn more about the dataset extraction protocol.
## API Reference
### `cli`
```python
cli()
```
Command-line interface
```
Usage: cli.py [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  validate*  Validate tabular files (default).
  init       Init data package from list of files.
```
### `validate`
```python
validate(source, **options)
```
Validates a source file and returns a report.
__Arguments__
- __source (Union[str, Dict, List[Dict], IO])__:
The source to be validated.
It can be a local file path, URL, dict, list of dicts, or a
file-like object. If it's a list of dicts and the `preset` is
"nested", each of the dict's keys will be used as if it were passed
as a keyword argument to this method.
The file can be a CSV, XLS, JSON, or any other format supported by
[tabulator][tabulator].
- __checks (List[str])__:
List of check names to be enabled. They can be
individual check names (e.g. `blank-header`), or check types (e.g.
`structure`).
- __skip_checks (List[str])__:
List of check names to be skipped. They can
be individual check names (e.g. `blank-header`), or check types
(e.g. `structure`).
- __infer_schema (bool)__:
Infer schema if one wasn't passed as an argument.
- __infer_fields (bool)__:
Infer schema for columns not present in the received schema.
- __order_fields (bool)__:
Order source columns based on schema fields order.
This is useful when you don't want to validate that the data
columns' order is the same as the schema's.
- __error_limit (int)__:
Stop validation if the number of errors per table exceeds this value.
- __table_limit (int)__:
Maximum number of tables to validate.
- __row_limit (int)__:
Maximum number of rows to validate.
- __preset (str)__:
Dataset type could be `table` (default), `datapackage`,
`nested` or custom. Usually, the preset can be inferred from the
source, so you don't need to define it.
- __Any (Any)__:
Any additional arguments not defined here will be passed on,
depending on the chosen `preset`. If the `preset` is `table`, the
extra arguments will be passed on to [tabulator][tabulator]; if it is
`datapackage`, they will be passed on to the [datapackage][datapackage]
constructor.
__Raises__
- `GoodtablesException`: Raised on any non-tabular error.
__Returns__
`dict`: The validation report.
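For reference, a hedged example combining several of the options above (the file name is hypothetical):
```python
from goodtables import validate

report = validate(
    'data.csv',
    checks=['structure', 'schema'],  # enable both check types
    skip_checks=['duplicate-row'],   # but skip one individual check
    infer_schema=True,               # infer a schema since none is passed
    error_limit=100,                 # stop a table after 100 errors
    row_limit=10000,                 # validate at most 10000 rows per table
)
print(report['valid'], report['error-count'])
```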
### `preset`
```python
preset(name)
```
Register a custom preset (decorator)
__Example__
```python
@preset('custom-preset')
def custom_preset(source, **options):
    # ...
```
__Arguments__
- __name (str)__: preset name
### `check`
```python
check(name, type=None, context=None, position=None)
```
Register a custom check (decorator)
__Example__
```python
@check('custom-check', type='custom', context='body')
def custom_check(cells):
    # ...
```
__Arguments__
- __name (str)__: check name
- __type (str)__: has to be `custom`
- __context (str)__: has to be `head` or `body`
- __position (str)__: has to be `before:` or `after:`
### `Error`
```python
Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
```
Describes a validation check error
__Arguments__
- __code (str)__: The error code. Must be one in the spec.
- __cell (dict, optional)__: The cell where the error occurred.
- __row_number (int, optional)__: The row number where the error occurs.
- __message (str, optional)__:
The error message. Defaults to the message from the Data Quality Spec.
- __message_substitutions (dict, optional)__:
Dictionary with substitutions to be used when
generating the error message and description.
__Raises__
- `KeyError`: Raised if the error code isn't known.
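A minimal usage sketch; per the [changelog](#changelog), `error.to_dict()` returns the error's fields including `message-data`:
```python
from goodtables import Error

error = Error('blank-row', row_number=3)  # 'blank-row' is a known spec code
print(error.to_dict()['code'])  # 'blank-row'
```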
### `spec`
The `spec` object is a plain `dict` exposing the Data Quality Spec contents, i.e. the error definitions (codes, messages, descriptions) goodtables validates against.
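A lookup sketch, assuming the Data Quality Spec layout with an `errors` mapping keyed by error code:
```python
from goodtables import spec

print(spec['errors']['blank-header']['message'])
# e.g. 'Header in column {column_number} is blank'
```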
### `GoodtablesException`
```python
GoodtablesException(self, /, *args, **kwargs)
```
Base goodtables exception
## Contributing
> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
The recommended way to get started is to create and activate a project virtual environment.
To install package and development dependencies into active environment:
```bash
$ make install
```
To run tests with linting and coverage:
```bash
$ make test
```
## Changelog
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
##### v2.5
- Added `check.check_headers_hook` to support headers check for body-contexted checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
##### v2.4
- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided it will be verified against actual values
##### v2.3
- Added a [foreign keys check](#foreign-key)
##### v2.2
- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
##### v2.1
- A new key added to the `error.to_dict` return: `message-data`
##### v2.0
Breaking changes:
- Checks method signature now only receives the current row's `cells` list
- Checks raise errors by returning an array of `Error` objects
- Cells have the row number in the `row-number` key
- Files with ZIP extension are presumed to be datapackages, so `goodtables mydatapackage.zip` works
- Improvements to goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
- New `goodtables init` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas.
Bug fixes:
- Fix bug with `truncated-values` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
##### v1.5
New API added:
- Validation `source` can now be a `pathlib.Path`
##### v1.4
Improved behaviour:
- rebased on Data Quality Spec v1
- rebased on Data Package Spec v1
- rebased on Table Schema Spec v1
- treat primary key as required/unique field
##### v1.3
New advanced checks added:
- `blacklisted-value`
- `custom-constraint`
- `deviated-value`
- `sequential-value`
- `truncated-value`
##### v1.2
New API added:
- `report.preset`
- `report.tables[].schema`
##### v1.1
New API added:
- `report.tables[].scheme`
- `report.tables[].format`
- `report.tables[].encoding`
##### v1.0
This version includes various big changes. A migration guide is under development and will be published here.
##### v0.6
First version of `goodtables`.
[tableschema]: https://specs.frictionlessdata.io/table-schema/
[tabulator]: https://github.com/frictionlessdata/tabulator-py/
[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
[semver]: https://semver.org/ "Semantic Versioning"
[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
%prep
%autosetup -n goodtables-2.5.4
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-goodtables -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Sun Apr 23 2023 Python_Bot - 2.5.4-1
- Package Spec generated