| author | CoprDistGit <infra@openeuler.org> | 2023-04-11 10:39:12 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-04-11 10:39:12 +0000 |
| commit | 76ab74b23ba8bb8b81752d1b9212af3cb9e74e1c (patch) | |
| tree | 3b33974e30785b8ff9b823c2b4480451f0ddb8cd /python-goodtables.spec | |
| parent | 94df2feeec22d57910c32d654bf5a8e9fe2a5b7e (diff) | |
automatic import of python-goodtables
Diffstat (limited to 'python-goodtables.spec')
| -rw-r--r-- | python-goodtables.spec | 2647 |
1 files changed, 2647 insertions, 0 deletions
diff --git a/python-goodtables.spec b/python-goodtables.spec
new file mode 100644
index 0000000..0613e20
--- /dev/null
+++ b/python-goodtables.spec
@@ -0,0 +1,2647 @@
+%global _empty_manifest_terminate_build 0
+Name: python-goodtables
+Version: 2.5.4
+Release: 1
+Summary: Goodtables is a framework to inspect tabular data.
+License: MIT
+URL: https://github.com/frictionlessdata/goodtables
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/93/25/c691e85a93d0411ef50903a56ff26ffcc784f392e2765d2555548a0dd1a0/goodtables-2.5.4.tar.gz
+BuildArch: noarch
+
+Requires: python3-six
+Requires: python3-click
+Requires: python3-click-default-group
+Requires: python3-requests
+Requires: python3-simpleeval
+Requires: python3-statistics
+Requires: python3-tabulator
+Requires: python3-tableschema
+Requires: python3-datapackage
+Requires: python3-mock
+Requires: python3-pylama
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-pyyaml
+Requires: python3-ezodf
+Requires: python3-lxml
+
+%description
+# goodtables-py
+
+[](https://travis-ci.org/frictionlessdata/goodtables-py)
+[](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
+[](https://pypi.python.org/pypi/goodtables)
+[](https://github.com/frictionlessdata/goodtables-py)
+[](https://gitter.im/frictionlessdata/chat)
+
+Goodtables is a framework to validate tabular data. It can check the structure
+of your data (e.g. all rows have the same number of columns) and its contents
+(e.g. all dates are valid).
+
+> **[Important Notice]** `goodtables` was renamed to `frictionless` as of version 3. The framework got various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
+> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it remains available on [PyPI](https://pypi.org/project/goodtables/) as before
+> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable
+> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
+
+## Features
+
+* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
+* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
+* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
+* **Parallelized validations for multi-table datasets**
+* **Command line interface**
+
+## Contents
+
+<!--TOC-->
+
+  - [Getting Started](#getting-started)
+    - [Installing](#installing)
+    - [Running on CLI](#running-on-cli)
+    - [Running on Python](#running-on-python)
+  - [Documentation](#documentation)
+    - [Report](#report)
+    - [Checks](#checks)
+    - [Presets](#presets)
+    - [Data Quality Errors](#data-quality-errors)
+    - [Frequently Asked Questions](#frequently-asked-questions)
+  - [API Reference](#api-reference)
+    - [`cli`](#cli)
+    - [`validate`](#validate)
+    - [`preset`](#preset)
+    - [`check`](#check)
+    - [`Error`](#error)
+    - [`spec`](#spec)
+    - [`GoodtablesException`](#goodtablesexception)
+  - [Contributing](#contributing)
+  - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting Started
+
+> For faster validation of goodtables-compatible Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
+
+### Installing
+
+```
+pip install goodtables
+pip install goodtables[ods] # If you need LibreOffice's ODS file support
+```
+
+### Running on CLI
+
+```
+goodtables data.csv
+```
+
+Use `goodtables --help` to see the different options.
+
+### Running on Python
+
+```python
+from goodtables import validate
+
+report = validate('invalid.csv')
+report['valid'] # false
+report['table-count'] # 1
+report['error-count'] # 3
+report['tables'][0]['valid'] # false
+report['tables'][0]['source'] # 'invalid.csv'
+report['tables'][0]['errors'][0]['code'] # 'blank-header'
+```
+
+You can read a more in-depth explanation of using goodtables with Python in
+the [developer documentation](#developer-documentation) section. Also check
+the [examples](examples) folder for other examples.
+
+## Documentation
+
+Goodtables validates your tabular dataset to find structural and content
+errors. Say you have a file named `invalid.csv`. Let's validate it:
+
+```python
+report = validate('invalid.csv')
+```
+
+We could also pass a remote URI instead of a local path. It supports CSV, XLS,
+XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
+library.
+
+### Report
+
+> The validation report follows the JSON Schema defined in [goodtables/schemas/report.json][validation-jsonschema].
+
+The output of the `validate()` method is a report dictionary. It includes
+information on whether the data was valid, the error count, a list of table
+reports, which individual checks failed, etc. A report looks like this:
+
+```json
+{
+  "time": 0.009,
+  "error-count": 1,
+  "warnings": [
+    "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
+  ],
+  "preset": "table",
+  "valid": false,
+  "tables": [
+    {
+      "errors": [
+        {
+          "row-number": null,
+          "message": "Header in column 3 is blank",
+          "row": null,
+          "column-number": 3,
+          "code": "blank-header"
+        }
+      ],
+      "error-count": 1,
+      "headers": [
+        "id",
+        "name",
+        "",
+        "name"
+      ],
+      "scheme": "file",
+      "row-count": 2,
+      "valid": false,
+      "encoding": "utf-8",
+      "time": 0.007,
+      "schema": null,
+      "format": "csv",
+      "source": "data/invalid"
+    }
+  ],
+  "table-count": 1
+}
+```
+
+The errors are divided into the following categories:
+
+- `source` - data can't be loaded or parsed
+- `structure` - general tabular errors like duplicate headers
+- `schema` - errors from checks against a [Table Schema](http://specs.frictionlessdata.io/table-schema/)
+- `custom` - custom checks errors
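+
+Since the report is a plain dictionary, it can be walked with ordinary Python. A minimal sketch, assuming a local `data/invalid.csv` like the one above:
+
+```python
+from goodtables import validate
+
+report = validate('data/invalid.csv')
+
+# One entry per table, each with its own list of error dictionaries
+for table in report['tables']:
+    print(table['source'], 'valid' if table['valid'] else 'invalid')
+    for error in table['errors']:
+        # 'row-number' is None for header-level errors such as 'blank-header'
+        print(error['code'], error['row-number'], error['message'])
+```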
+
+### Checks
+
+Checks are the main validation actors in goodtables. The list of enabled checks can
+be changed using the `checks` and `skip_checks` arguments. Let's explore the options
+with an example:
+
+```python
+report = validate('data.csv') # by default structure and schema (if available) checks
+report = validate('data.csv', checks=['structure']) # only structure checks
+report = validate('data.csv', checks=['schema']) # only schema (if available) checks
+report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
+report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
+```
+
+By default, a dataset is validated against all available Data Quality Spec
+errors. Some checks can be unavailable for validation. For example, if the
+schema isn't provided, only the `structure` checks will be done.
+
+### Presets
+
+Goodtables supports different formats of tabular datasets. They're called
+presets. A tabular dataset is some data that can be split into a list of data
+tables, as:
+
+
+
+We can change the preset using the `preset` argument for `validate()`. By
+default, it'll be inferred from the source, falling back to `table`. To validate
+a [data package][datapackage], we can do:
+
+```python
+report = validate('datapackage.json') # implicit preset
+report = validate('datapackage.json', preset='datapackage') # explicit preset
+```
+
+This will validate all tabular resources in the datapackage.
+
+It's also possible to validate a list of files using the "nested" preset. To do
+so, the first argument to `validate()` should be a list of dictionaries, where
+each key in the dictionary is named after a parameter on `validate()`. For example:
+
+```python
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
+```
+
+This is similar to:
+
+```python
+report_data1 = validate('data1.csv')
+report_data2 = validate('data2.csv')
+```
+
+The difference is that goodtables validates multiple tables in parallel, so
+using the "nested" preset should run faster.
+
+### Data Quality Errors
+
+Base report errors are standardized and described in the
+[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
+
+#### Source errors
+
+These basic checks can't be disabled, as they deal with goodtables being able to read the files.
+
+| check | description |
+| --- | --- |
+| io-error | Data reading error because of an IO error. |
+| http-error | Data reading error because of an HTTP error. |
+| source-error | Data reading error because of unsupported or inconsistent contents. |
+| scheme-error | Data reading error because of an incorrect scheme. |
+| format-error | Data reading error because of an incorrect format. |
+| encoding-error | Data reading error because of an encoding problem. |
+
+#### Structure errors
+
+These checks validate that the structure of the file is valid.
+
+| check | description |
+| --- | --- |
+| blank-header | There is a blank header name. All cells in the header row must have a value. |
+| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
+| blank-row | Rows must have at least one non-blank cell. |
+| duplicate-row | Rows can't be duplicated. |
+| extra-value | A row has more columns than the header. |
+| missing-value | A row has fewer columns than the header. |
+
+#### Schema errors
+
+These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.
+
+If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
+
+Lastly, if the order of the fields in the data is different from your schema's, enable the `order_fields` option.
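+
+A hedged sketch of how these options might be combined; here `schema.json` is a hypothetical Table Schema file, and we assume the `schema` option is forwarded to the underlying table preset:
+
+```python
+from goodtables import validate
+
+report = validate(
+    'data.csv',
+    schema='schema.json',  # hypothetical partial Table Schema (assumption)
+    infer_fields=True,     # infer the fields missing from schema.json
+    order_fields=True,     # don't require data columns in schema order
+)
+```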
+
+| check | description |
+| --- | --- |
+| schema-error | Schema is not valid. |
+| non-matching-header | The header's name in the schema is different from what's in the data. |
+| extra-header | The data contains a header not defined in the schema. |
+| missing-header | The data doesn't contain a header defined in the schema. |
+| type-or-format-error | The value can't be cast based on the schema type and format for this field. |
+| required-constraint | This field is a required field, but it contains no value. |
+| pattern-constraint | This field's value should conform to the defined pattern. |
+| unique-constraint | This field is a unique field, but it contains a value that has been used in another row. |
+| enumerable-constraint | This field's value should be equal to one of the values in the enumeration constraint. |
+| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
+| maximum-constraint | This field's value should be less than or equal to the constraint value. |
+| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
+| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
+
+#### Custom errors
+
+| check | description |
+| --- | --- |
+| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
+| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
+| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package. |
+| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
+| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
+| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
+
+##### blacklisted-value
+
+Sometimes we have to check for values we don't want in our dataset. The check accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| blacklist | list of str | List of blacklisted values |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,John
+2,bug
+3,bad
+5,Alex
+```
+
+Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
+])
+# error on row 3 with code "blacklisted-value"
+# error on row 4 with code "blacklisted-value"
+```
+
+##### deviated-value
+
+This check helps to find outliers in a column containing positive numbers. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| average | str | Average type, either "mean", "median" or "mode" |
+| interval | int | Values must be inside the range `average ± standard deviation * interval` |
+
+Consider the following CSV file:
+
+```csv
+temperature
+1
+-2
+7
+0
+1
+2
+5
+-4
+100
+8
+3
+```
+
+We use the `median` as the average of the column values and allow an interval of 3 standard deviations. In this case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.
+
+```python
+report = validate('data.csv', checks=[
+    {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
+])
+# error on row 10 with code "deviated-value"
+```
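+
+The interval arithmetic can be reproduced with the standard library; a sketch assuming the check uses the sample standard deviation, which matches the numbers above:
+
+```python
+import statistics
+
+values = [1, -2, 7, 0, 1, 2, 5, -4, 100, 8, 3]
+
+average = statistics.median(values)   # 2.0
+deviation = statistics.stdev(values)  # ~29.73, the sample standard deviation
+
+low, high = average - 3 * deviation, average + 3 * deviation
+# low, high ~ -87.2, 91.2, i.e. the interval above up to rounding
+print(low <= 100 <= high)  # False, so the row holding 100 is reported
+```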
+
+##### foreign-key
+
+> Relative paths are supported here, so this check MUST be used only with trusted data sources.
+
+This check validates foreign keys within a data package. Consider the data package defined below:
+
+```python
+DESCRIPTOR = {
+    'resources': [
+        {
+            'name': 'cities',
+            'data': [
+                ['id', 'name', 'next_id'],
+                [1, 'london', 2],
+                [2, 'paris', 3],
+                [3, 'rome', 4],
+                # [4, 'rio', None],
+            ],
+            'schema': {
+                'fields': [
+                    {'name': 'id', 'type': 'integer'},
+                    {'name': 'name', 'type': 'string'},
+                    {'name': 'next_id', 'type': 'integer'},
+                ],
+                'foreignKeys': [
+                    {
+                        'fields': 'next_id',
+                        'reference': {'resource': '', 'fields': 'id'},
+                    },
+                    {
+                        'fields': 'id',
+                        'reference': {'resource': 'people', 'fields': 'label'},
+                    },
+                ],
+            },
+        }, {
+            'name': 'people',
+            'data': [
+                ['label', 'population'],
+                [1, 8],
+                [2, 2],
+                # [3, 3],
+                # [4, 6],
+            ],
+        },
+    ],
+}
+```
+
+Running `goodtables` on it will raise a few `foreign-key` errors, because we have commented out some rows in the data package's data:
+
+```python
+report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
+print(report)
+```
+
+```
+{'error-count': 2,
+ 'preset': 'datapackage',
+ 'table-count': 2,
+ 'tables': [{'datapackage': '...',
+             'error-count': 2,
+             'errors': [{'code': 'foreign-key',
+                         'message': 'Foreign key "[\'next_id\']" violation in '
+                                    'row 4',
+                         'message-data': {'fields': ['next_id']},
+                         'row-number': 4},
+                        {'code': 'foreign-key',
+                         'message': 'Foreign key "[\'id\']" violation in row 4',
+                         'message-data': {'fields': ['id']},
+                         'row-number': 4}],
+             'format': 'inline',
+             'headers': ['id', 'name', 'next_id'],
+             'resource-name': 'cities',
+             'row-count': 4,
+             'schema': 'table-schema',
+             'source': 'inline',
+             'time': 0.031,
+             'valid': False},
+            {'datapackage': '...',
+             'error-count': 0,
+             'errors': [],
+             'format': 'inline',
+             'headers': ['label', 'population'],
+             'resource-name': 'people',
+             'row-count': 3,
+             'source': 'inline',
+             'time': 0.038,
+             'valid': True}],
+ 'time': 0.117,
+ 'valid': False,
+ 'warnings': []}
+```
+
+It experimentally supports external resource checks, for example, for `foreignKey` definitions like these:
+
+```json
+{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
+{"package": "http://example.com/datapackage.json", "resource": "people", "fields": "label"}
+```
+
+##### sequential-value
+
+This check is for the pretty common case where a column should contain integers that increment sequentially. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,one
+2,two
+3,three
+5,five
+```
+
+Let's check if the `id` column contains sequential integers:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'sequential-value': {'column': 'id'}},
+])
+# error on row 5 with code "sequential-value"
+```
+
+##### truncated-value
+
+Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are some well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
+
+Consider the following CSV file:
+
+```csv
+id,amount,comment
+1,14000000,good
+2,2147483647,bad
+3,32767,bad
+4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
+```
+
+To detect all probably-truncated values, we can use the `truncated-value` check:
+
+```python
+report = validate('data.csv', checks=[
+    'truncated-value',
+])
+# error on row 3 with code "truncated-value"
+# error on row 4 with code "truncated-value"
+# error on row 5 with code "truncated-value"
+```
+
+##### custom-constraint
+
+With Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom constraint check, every row can be checked against a given (limited) Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following options:
+
+<dl>
+  <dt>constraint (str)</dt>
+  <dd>Constraint definition (e.g. <code>col1 + col2 == col3</code>)</dd>
+</dl>
+
+Consider a CSV file like this:
+
+```csv
+id,name,salary,bonus
+1,Alex,1000,200
+2,Sam,2500,500
+3,Ray,1350,500
+4,John,5000,1000
+```
+
+Let's say our business rule is to be shy on bonuses:
+
+```python
+report = validate('data.csv', checks=[
+    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
+])
+# error on row 4 with code "custom-constraint"
+```
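+
+Custom checks can be combined with built-in ones in a single `checks` list. A sketch, assuming string and dictionary entries can be mixed as in the examples above:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    'structure',                                  # a whole check type
+    {'sequential-value': {'column': 'id'}},       # a check with options
+    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
+])
+```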
+
+### Frequently Asked Questions
+
+#### How can I add a new custom check?
+
+To create a custom check, use the `check` decorator. This way a built-in check can be overridden (use a spec error code like `duplicate-row`), or a check for a custom error can be added (use the `type`, `context` and `position` arguments):
+
+```python
+from goodtables import validate, check, Error
+
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    errors = []
+    for cell in cells:
+        message = 'Custom error on column {column_number} and row {row_number}'
+        error = Error(
+            'custom-error',
+            cell,
+            message=message  # pass by keyword; the third positional argument is row_number
+        )
+        errors.append(error)
+    return errors
+
+report = validate('data.csv', checks=['custom-check'])
+```
+
+Recommended steps:
+- discuss the proposed check in a comment first
+- select a name for the new check, like `possible-noise-text`
+- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to a new check module
+- add the new check module to the configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
+- write the actual code for the new check
+- write tests and a README entry for the new check (see the sketch below)
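+
+A quick way to exercise a check in a test is to validate inline data; a sketch, assuming `validate()` accepts a list of rows through tabulator's inline support:
+
+```python
+from goodtables import validate
+
+# The duplicated row should be reported by the built-in check
+report = validate([['id', 'name'], [1, 'x'], [1, 'x']],
+                  checks=['duplicate-row'])
+assert not report['valid']
+assert report['tables'][0]['errors'][0]['code'] == 'duplicate-row'
+```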
+
+#### How can I add support for a new tabular file type?
+
+To create a custom preset, use the `preset` decorator. This way a built-in preset can be overridden, or a custom preset can be added.
+
+```python
+from tabulator import Stream
+from tableschema import Schema
+from goodtables import validate, preset
+
+@preset('custom-preset')
+def custom_preset(source, **options):
+    warnings = []
+    tables = []
+    for table in source:
+        try:
+            tables.append({
+                'source': str(source),
+                'stream': Stream(...),
+                'schema': Schema(...),
+                'extra': {...},
+            })
+        except Exception:
+            warnings.append('Warning message')
+    return warnings, tables
+
+report = validate(source, preset='custom-preset')
+```
+
+For now, this documentation section is incomplete. Please see the built-in presets to learn more about the dataset extraction protocol.
+
+## API Reference
+
+### `cli`
+```python
+cli()
+```
+Command-line interface
+
+```
+Usage: cli.py [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --version  Show the version and exit.
+  --help     Show this message and exit.
+
+Commands:
+  validate*  Validate tabular files (default).
+  init       Init data package from list of files.
+```
+
+
+### `validate`
+```python
+validate(source, **options)
+```
+Validates a source file and returns a report.
+
+__Arguments__
+
+- __source (Union[str, Dict, List[Dict], IO])__:
+    The source to be validated.
+    It can be a local file path, URL, dict, list of dicts, or a
+    file-like object. If it's a list of dicts and the `preset` is
+    "nested", each of the dict's keys will be used as if it was passed
+    as a keyword argument to this method.
+    The file can be a CSV, XLS, JSON, or any other format supported by
+    `tabulator`_.
+- __checks (List[str])__:
+    List of check names to be enabled. They can be
+    individual check names (e.g. `blank-header`), or check types (e.g.
+    `structure`).
+- __skip_checks (List[str])__:
+    List of check names to be skipped. They can
+    be individual check names (e.g. `blank-header`), or check types
+    (e.g. `structure`).
+- __infer_schema (bool)__:
+    Infer a schema if one wasn't passed as an argument.
+- __infer_fields (bool)__:
+    Infer a schema for columns not present in the received schema.
+- __order_fields (bool)__:
+    Order source columns based on the schema fields order.
+    This is useful when you don't want to validate that the data
+    columns' order is the same as the schema's.
+- __error_limit (int)__:
+    Stop validation if the number of errors per table exceeds this value.
+- __table_limit (int)__:
+    Maximum number of tables to validate.
+- __row_limit (int)__:
+    Maximum number of rows to validate.
+- __preset (str)__:
+    Dataset type: `table` (default), `datapackage`,
+    `nested` or custom. Usually, the preset can be inferred from the
+    source, so you don't need to define it.
+- __Any (Any)__:
+    Any additional arguments not defined here will be passed on,
+    depending on the chosen `preset`. If the `preset` is `table`, the
+    extra arguments will be passed on to `tabulator`_; if it is
+    `datapackage`, they will be passed on to the `datapackage`_
+    constructor.
+
+__Raises__
+- `GoodtablesException`: Raised on any non-tabular error.
+
+__Returns__
+
+`dict`: The validation report.
+
+
+### `preset`
+```python
+preset(name)
+```
+Register a custom preset (decorator).
+
+__Example__
+
+```python
+@preset('custom-preset')
+def custom_preset(source, **options):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: preset name
+
+
+### `check`
+```python
+check(name, type=None, context=None, position=None)
+```
+Register a custom check (decorator).
+
+__Example__
+
+```python
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: check name
+- __type (str)__: has to be `custom`
+- __context (str)__: has to be `head` or `body`
+- __position (str)__: has to be `before:<check-name>` or `after:<check-name>`
+
+
+### `Error`
+```python
+Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
+```
+Describes a validation check error.
+
+__Arguments__
+- __code (str)__: The error code. Must be one in the spec.
+- __cell (dict, optional)__: The cell where the error occurred.
+- __row_number (int, optional)__: The row number where the error occurs.
+- __message (str, optional)__:
+    The error message. Defaults to the message from the Data Quality Spec.
+- __message_substitutions (dict, optional)__:
+    Dictionary with substitutions to be used when
+    generating the error message and description.
+
+__Raises__
+- `KeyError`: Raised if the error code isn't known.
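+
+A sketch of constructing an error by hand; the `cell` contents and the substitution key are hypothetical, chosen to match the `{...}` placeholder style used in the custom check example above:
+
+```python
+from goodtables import Error
+
+# 'blank-header' is a Data Quality Spec code, so no KeyError is raised
+error = Error(
+    'blank-header',
+    cell={'column-number': 3},  # hypothetical cell dict
+    message='Header in column {column_number} is blank',
+    message_substitutions={'column_number': 3},
+)
+```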
+
+
+### `spec`
+The Data Quality Spec contents, exposed as a plain Python `dict`.
+
+### `GoodtablesException`
+```python
+GoodtablesException(self, /, *args, **kwargs)
+```
+Base goodtables exception.
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
+
+##### v2.5
+
+- Added `check.check_headers_hook` to support header checks for body-context checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
+
+##### v2.4
+
+- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided, it will be verified against the actual values
+
+##### v2.3
+
+- Added a [foreign keys check](#foreign-key)
+
+##### v2.2
+
+- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
+
+##### v2.1
+
+- A new key added to the `error.to_dict` return: `message-data`
+
+##### v2.0
+
+Breaking changes:
+
+- Check method signatures now only receive the current row's `cells` list
+- Checks raise errors by returning an array of `Error` objects
+- Cells have the row number in the `row-number` key
+- Files with a ZIP extension are presumed to be data packages, so `goodtables mydatapackage.zip` works
+- Improvements to the goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
+- New `goodtables init <data paths>` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas.
+
+Bug fixes:
+- Fixed a bug with the `truncated-values` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
+
+##### v1.5
+
+New API added:
+- Validation `source` can now be a `pathlib.Path`
+
+##### v1.4
+
+Improved behaviour:
+- rebased on Data Quality Spec v1
+- rebased on Data Package Spec v1
+- rebased on Table Schema Spec v1
+- treat a primary key as a required/unique field
+
+##### v1.3
+
+New advanced checks added:
+- `blacklisted-value`
+- `custom-constraint`
+- `deviated-value`
+- `sequential-value`
+- `truncated-value`
+
+##### v1.2
+
+New API added:
+- `report.preset`
+- `report.tables[].schema`
+
+##### v1.1
+
+New API added:
+- `report.tables[].scheme`
+- `report.tables[].format`
+- `report.tables[].encoding`
+
+##### v1.0
+
+This version includes various big changes. A migration guide is under development and will be published here.
+
+##### v0.6
+
+First version of `goodtables`.
+
+[tableschema]: https://specs.frictionlessdata.io/table-schema/
+[tabulator]: https://github.com/frictionlessdata/tabulator-py/
+[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
+[semver]: https://semver.org/ "Semantic Versioning"
+[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
+
+
+
+%package -n python3-goodtables
+Summary: Goodtables is a framework to inspect tabular data.
+Provides: python-goodtables
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-goodtables
+# goodtables-py
+
+[](https://travis-ci.org/frictionlessdata/goodtables-py)
+[](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
+[](https://pypi.python.org/pypi/goodtables)
+[](https://github.com/frictionlessdata/goodtables-py)
+[](https://gitter.im/frictionlessdata/chat)
+
+Goodtables is a framework to validate tabular data. It can check the structure
+of your data (e.g. all rows have the same number of columns) and its contents
+(e.g. all dates are valid).
+
+> **[Important Notice]** `goodtables` was renamed to `frictionless` as of version 3. The framework got various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
+> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it remains available on [PyPI](https://pypi.org/project/goodtables/) as before
+> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable
+> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version
+
+## Features
+
+* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
+* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
+* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
+* **Parallelized validations for multi-table datasets**
+* **Command line interface**
+
+## Contents
+
+<!--TOC-->
+
+  - [Getting Started](#getting-started)
+    - [Installing](#installing)
+    - [Running on CLI](#running-on-cli)
+    - [Running on Python](#running-on-python)
+  - [Documentation](#documentation)
+    - [Report](#report)
+    - [Checks](#checks)
+    - [Presets](#presets)
+    - [Data Quality Errors](#data-quality-errors)
+    - [Frequently Asked Questions](#frequently-asked-questions)
+  - [API Reference](#api-reference)
+    - [`cli`](#cli)
+    - [`validate`](#validate)
+    - [`preset`](#preset)
+    - [`check`](#check)
+    - [`Error`](#error)
+    - [`spec`](#spec)
+    - [`GoodtablesException`](#goodtablesexception)
+  - [Contributing](#contributing)
+  - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting Started
+
+> For faster validation of goodtables-compatible Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
+
+### Installing
+
+```
+pip install goodtables
+pip install goodtables[ods] # If you need LibreOffice's ODS file support
+```
+
+### Running on CLI
+
+```
+goodtables data.csv
+```
+
+Use `goodtables --help` to see the different options.
+
+### Running on Python
+
+```python
+from goodtables import validate
+
+report = validate('invalid.csv')
+report['valid'] # false
+report['table-count'] # 1
+report['error-count'] # 3
+report['tables'][0]['valid'] # false
+report['tables'][0]['source'] # 'invalid.csv'
+report['tables'][0]['errors'][0]['code'] # 'blank-header'
+```
+
+You can read a more in-depth explanation of using goodtables with Python in
+the [developer documentation](#developer-documentation) section. Also check
+the [examples](examples) folder for other examples.
+
+## Documentation
+
+Goodtables validates your tabular dataset to find structural and content
+errors. Say you have a file named `invalid.csv`. Let's validate it:
+
+```python
+report = validate('invalid.csv')
+```
+
+We could also pass a remote URI instead of a local path. It supports CSV, XLS,
+XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
+library.
+
+### Report
+
+> The validation report follows the JSON Schema defined in [goodtables/schemas/report.json][validation-jsonschema].
+
+The output of the `validate()` method is a report dictionary. It includes
+information on whether the data was valid, the error count, a list of table
+reports, which individual checks failed, etc. A report looks like this:
+
+```json
+{
+  "time": 0.009,
+  "error-count": 1,
+  "warnings": [
+    "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
+  ],
+  "preset": "table",
+  "valid": false,
+  "tables": [
+    {
+      "errors": [
+        {
+          "row-number": null,
+          "message": "Header in column 3 is blank",
+          "row": null,
+          "column-number": 3,
+          "code": "blank-header"
+        }
+      ],
+      "error-count": 1,
+      "headers": [
+        "id",
+        "name",
+        "",
+        "name"
+      ],
+      "scheme": "file",
+      "row-count": 2,
+      "valid": false,
+      "encoding": "utf-8",
+      "time": 0.007,
+      "schema": null,
+      "format": "csv",
+      "source": "data/invalid"
+    }
+  ],
+  "table-count": 1
+}
+```
+
+The errors are divided into the following categories:
+
+- `source` - data can't be loaded or parsed
+- `structure` - general tabular errors like duplicate headers
+- `schema` - errors from checks against a [Table Schema](http://specs.frictionlessdata.io/table-schema/)
+- `custom` - custom checks errors
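+
+The `error(s) limit` warning in the report above comes from validation limits; a sketch tuning them with the documented `error_limit` and `row_limit` options (`big.csv` is hypothetical):
+
+```python
+from goodtables import validate
+
+# Stop after 10 errors per table and only read the first 1000 rows
+report = validate('big.csv', error_limit=10, row_limit=1000)
+print(report['warnings'])  # lists any limits that were reached
+```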
+
+### Checks
+
+Checks are the main validation actors in goodtables. The list of enabled checks can
+be changed using the `checks` and `skip_checks` arguments. Let's explore the options
+with an example:
+
+```python
+report = validate('data.csv') # by default structure and schema (if available) checks
+report = validate('data.csv', checks=['structure']) # only structure checks
+report = validate('data.csv', checks=['schema']) # only schema (if available) checks
+report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
+report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
+```
+
+By default, a dataset is validated against all available Data Quality Spec
+errors. Some checks can be unavailable for validation. For example, if the
+schema isn't provided, only the `structure` checks will be done.
+
+### Presets
+
+Goodtables supports different formats of tabular datasets. They're called
+presets. A tabular dataset is some data that can be split into a list of data
+tables, as:
+
+
+
+We can change the preset using the `preset` argument for `validate()`. By
+default, it'll be inferred from the source, falling back to `table`. To validate
+a [data package][datapackage], we can do:
+
+```python
+report = validate('datapackage.json') # implicit preset
+report = validate('datapackage.json', preset='datapackage') # explicit preset
+```
+
+This will validate all tabular resources in the datapackage.
+
+It's also possible to validate a list of files using the "nested" preset. To do
+so, the first argument to `validate()` should be a list of dictionaries, where
+each key in the dictionary is named after a parameter on `validate()`. For example:
+
+```python
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
+```
+
+This is similar to:
+
+```python
+report_data1 = validate('data1.csv')
+report_data2 = validate('data2.csv')
+```
+
+The difference is that goodtables validates multiple tables in parallel, so
+using the "nested" preset should run faster.
+
+### Data Quality Errors
+
+Base report errors are standardized and described in the
+[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
+
+#### Source errors
+
+These basic checks can't be disabled, as they deal with goodtables being able to read the files.
+
+| check | description |
+| --- | --- |
+| io-error | Data reading error because of an IO error. |
+| http-error | Data reading error because of an HTTP error. |
+| source-error | Data reading error because of unsupported or inconsistent contents. |
+| scheme-error | Data reading error because of an incorrect scheme. |
+| format-error | Data reading error because of an incorrect format. |
+| encoding-error | Data reading error because of an encoding problem. |
+
+#### Structure errors
+
+These checks validate that the structure of the file is valid.
+
+| check | description |
+| --- | --- |
+| blank-header | There is a blank header name. All cells in the header row must have a value. |
+| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
+| blank-row | Rows must have at least one non-blank cell. |
+| duplicate-row | Rows can't be duplicated. |
+| extra-value | A row has more columns than the header. |
+| missing-value | A row has fewer columns than the header. |
+
+#### Schema errors
+
+These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.
+
+If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
+
+Lastly, if the order of the fields in the data is different from your schema's, enable the `order_fields` option.
+
+| check | description |
+| --- | --- |
+| schema-error | Schema is not valid. |
+| non-matching-header | The header's name in the schema is different from what's in the data. |
+| extra-header | The data contains a header not defined in the schema. |
+| missing-header | The data doesn't contain a header defined in the schema. |
+| type-or-format-error | The value can't be cast based on the schema type and format for this field. |
+| required-constraint | This field is a required field, but it contains no value. |
+| pattern-constraint | This field's value should conform to the defined pattern. |
+| unique-constraint | This field is a unique field, but it contains a value that has been used in another row. |
+| enumerable-constraint | This field's value should be equal to one of the values in the enumeration constraint. |
+| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
+| maximum-constraint | This field's value should be less than or equal to the constraint value. |
+| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
+| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
+
+#### Custom errors
+
+| check | description |
+| --- | --- |
+| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
+| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
+| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package. |
+| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
+| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
+| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
+
+##### blacklisted-value
+
+Sometimes we have to check for values we don't want in our dataset.
+The check accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| blacklist | list of str | List of blacklisted values |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,John
+2,bug
+3,bad
+5,Alex
+```
+
+Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
+])
+# error on row 3 with code "blacklisted-value"
+# error on row 4 with code "blacklisted-value"
+```
+
+##### deviated-value
+
+This check helps to find outliers in a column containing positive numbers. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| average | str | Average type, either "mean", "median" or "mode" |
+| interval | int | Values must be inside the range `average ± standard deviation * interval` |
+
+Consider the following CSV file:
+
+```csv
+temperature
+1
+-2
+7
+0
+1
+2
+5
+-4
+100
+8
+3
+```
+
+We use the `median` as the average of the column values and allow an interval of 3 standard deviations. In this case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.
+
+```python
+report = validate('data.csv', checks=[
+    {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
+])
+# error on row 10 with code "deviated-value"
+```
+
+##### foreign-key
+
+> Relative paths are supported here, so this check MUST be used only with trusted data sources.
+
+This check validates foreign keys within a data package. Consider the data package defined below:
+
+```python
+DESCRIPTOR = {
+    'resources': [
+        {
+            'name': 'cities',
+            'data': [
+                ['id', 'name', 'next_id'],
+                [1, 'london', 2],
+                [2, 'paris', 3],
+                [3, 'rome', 4],
+                # [4, 'rio', None],
+            ],
+            'schema': {
+                'fields': [
+                    {'name': 'id', 'type': 'integer'},
+                    {'name': 'name', 'type': 'string'},
+                    {'name': 'next_id', 'type': 'integer'},
+                ],
+                'foreignKeys': [
+                    {
+                        'fields': 'next_id',
+                        'reference': {'resource': '', 'fields': 'id'},
+                    },
+                    {
+                        'fields': 'id',
+                        'reference': {'resource': 'people', 'fields': 'label'},
+                    },
+                ],
+            },
+        }, {
+            'name': 'people',
+            'data': [
+                ['label', 'population'],
+                [1, 8],
+                [2, 2],
+                # [3, 3],
+                # [4, 6],
+            ],
+        },
+    ],
+}
+```
+
+Running `goodtables` on it will raise a few `foreign-key` errors, because we have commented out some rows in the data package's data:
+
+```python
+report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
+print(report)
+```
+
+```
+{'error-count': 2,
+ 'preset': 'datapackage',
+ 'table-count': 2,
+ 'tables': [{'datapackage': '...',
+             'error-count': 2,
+             'errors': [{'code': 'foreign-key',
+                         'message': 'Foreign key "[\'next_id\']" violation in '
+                                    'row 4',
+                         'message-data': {'fields': ['next_id']},
+                         'row-number': 4},
+                        {'code': 'foreign-key',
+                         'message': 'Foreign key "[\'id\']" violation in row 4',
+                         'message-data': {'fields': ['id']},
+                         'row-number': 4}],
+             'format': 'inline',
+             'headers': ['id', 'name', 'next_id'],
+             'resource-name': 'cities',
+             'row-count': 4,
+             'schema': 'table-schema',
+             'source': 'inline',
+             'time': 0.031,
+             'valid': False},
+            {'datapackage': '...',
+             'error-count': 0,
+             'errors': [],
+             'format': 'inline',
+             'headers': ['label', 'population'],
+             'resource-name': 'people',
+             'row-count': 3,
+             'source': 'inline',
+             'time': 0.038,
+             'valid': True}],
+ 'time': 0.117,
+ 'valid': False,
+ 'warnings': []}
+```
+
+It experimentally supports external resource checks, for example, for `foreignKey` definitions like these:
+
+```json
+{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
+{"package": "http://example.com/datapackage.json", "resource": "people", "fields": "label"}
+```
+
+##### sequential-value
+
+This check is for the pretty common case where a column should contain integers that increment sequentially. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,one
+2,two
+3,three
+5,five
+```
+
+Let's check if the `id` column contains sequential integers:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'sequential-value': {'column': 'id'}},
+])
+# error on row 5 with code "sequential-value"
+```
+
+##### truncated-value
+
+Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are some well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
+
+Consider the following CSV file:
+
+```csv
+id,amount,comment
+1,14000000,good
+2,2147483647,bad
+3,32767,bad
+4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
+```
+
+To detect all probably-truncated values, we can use the `truncated-value` check:
+
+```python
+report = validate('data.csv', checks=[
+    'truncated-value',
+])
+# error on row 3 with code "truncated-value"
+# error on row 4 with code "truncated-value"
+# error on row 5 with code "truncated-value"
+```
+
+##### custom-constraint
+
+With Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom constraint check, every row can be checked against a given (limited) Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following options:
+
+<dl>
+  <dt>constraint (str)</dt>
+  <dd>Constraint definition (e.g. <code>col1 + col2 == col3</code>)</dd>
+</dl>
+
+Consider a CSV file like this:
+
+```csv
+id,name,salary,bonus
+1,Alex,1000,200
+2,Sam,2500,500
+3,Ray,1350,500
+4,John,5000,1000
+```
+
+Let's say our business rule is to be shy on bonuses:
+
+```python
+report = validate('data.csv', checks=[
+    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
+])
+# error on row 4 with code "custom-constraint"
+```
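+
+Note that reported row numbers count the header as row 1. A plain-Python rendering of the rule above, just to show the numbering:
+
+```python
+rows = [('Alex', 1000, 200), ('Sam', 2500, 500),
+        ('Ray', 1350, 500), ('John', 5000, 1000)]
+
+for index, (name, salary, bonus) in enumerate(rows):
+    row_number = index + 2  # data starts on row 2, after the header
+    if not salary > bonus * 4:
+        print(row_number, name)  # -> 4 Ray
+```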
+
+### Frequently Asked Questions
+
+#### How can I add a new custom check?
+
+To create a custom check, use the `check` decorator. This way a built-in check can be overridden (use a spec error code like `duplicate-row`), or a check for a custom error can be added (use the `type`, `context` and `position` arguments):
+
+```python
+from goodtables import validate, check, Error
+
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    errors = []
+    for cell in cells:
+        message = 'Custom error on column {column_number} and row {row_number}'
+        error = Error(
+            'custom-error',
+            cell,
+            message=message  # pass by keyword; the third positional argument is row_number
+        )
+        errors.append(error)
+    return errors
+
+report = validate('data.csv', checks=['custom-check'])
+```
+
+Recommended steps:
+- discuss the proposed check in a comment first
+- select a name for the new check, like `possible-noise-text`
+- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to a new check module
+- add the new check module to the configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
+- write the actual code for the new check
+- write tests and a README entry for the new check
+
+#### How can I add support for a new tabular file type?
+
+To create a custom preset, use the `preset` decorator. This way a built-in preset can be overridden, or a custom preset can be added.
+
+```python
+from tabulator import Stream
+from tableschema import Schema
+from goodtables import validate, preset
+
+@preset('custom-preset')
+def custom_preset(source, **options):
+    warnings = []
+    tables = []
+    for table in source:
+        try:
+            tables.append({
+                'source': str(source),
+                'stream': Stream(...),
+                'schema': Schema(...),
+                'extra': {...},
+            })
+        except Exception:
+            warnings.append('Warning message')
+    return warnings, tables
+
+report = validate(source, preset='custom-preset')
+```
+
+For now, this documentation section is incomplete. Please see the built-in presets to learn more about the dataset extraction protocol.
+
+## API Reference
+
+### `cli`
+```python
+cli()
+```
+Command-line interface
+
+```
+Usage: cli.py [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --version  Show the version and exit.
+  --help     Show this message and exit.
+
+Commands:
+  validate*  Validate tabular files (default).
+  init       Init data package from list of files.
+```
+
+
+### `validate`
+```python
+validate(source, **options)
+```
+Validates a source file and returns a report.
+
+__Arguments__
+
+- __source (Union[str, Dict, List[Dict], IO])__:
+    The source to be validated.
+    It can be a local file path, URL, dict, list of dicts, or a
+    file-like object. If it's a list of dicts and the `preset` is
+    "nested", each of the dict's keys will be used as if it was passed
+    as a keyword argument to this method.
+    The file can be a CSV, XLS, JSON, or any other format supported by
+    `tabulator`_.
+- __checks (List[str])__:
+    List of check names to be enabled. They can be
+    individual check names (e.g. `blank-header`), or check types (e.g.
+    `structure`).
+- __skip_checks (List[str])__:
+    List of check names to be skipped. They can
+    be individual check names (e.g. `blank-header`), or check types
+    (e.g. `structure`).
+- __infer_schema (bool)__:
+    Infer a schema if one wasn't passed as an argument.
+- __infer_fields (bool)__:
+    Infer a schema for columns not present in the received schema.
+- __order_fields (bool)__:
+    Order source columns based on the schema fields order.
+    This is useful when you don't want to validate that the data
+    columns' order is the same as the schema's.
+- __error_limit (int)__:
+    Stop validation if the number of errors per table exceeds this value.
+- __table_limit (int)__:
+    Maximum number of tables to validate.
+- __row_limit (int)__:
+    Maximum number of rows to validate.
+- __preset (str)__:
+    Dataset type: `table` (default), `datapackage`,
+    `nested` or custom. Usually, the preset can be inferred from the
+    source, so you don't need to define it.
+- __Any (Any)__:
+    Any additional arguments not defined here will be passed on,
+    depending on the chosen `preset`. If the `preset` is `table`, the
+    extra arguments will be passed on to `tabulator`_; if it is
+    `datapackage`, they will be passed on to the `datapackage`_
+    constructor.
+
+__Raises__
+- `GoodtablesException`: Raised on any non-tabular error.
+
+__Returns__
+
+`dict`: The validation report.
+
+
+### `preset`
+```python
+preset(name)
+```
+Register a custom preset (decorator).
+
+__Example__
+
+```python
+@preset('custom-preset')
+def custom_preset(source, **options):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: preset name
+
+
+### `check`
+```python
+check(name, type=None, context=None, position=None)
+```
+Register a custom check (decorator).
+
+__Example__
+
+```python
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: check name
+- __type (str)__: has to be `custom`
+- __context (str)__: has to be `head` or `body`
+- __position (str)__: has to be `before:<check-name>` or `after:<check-name>`
+
+
+### `Error`
+```python
+Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
+```
+Describes a validation check error.
+
+__Arguments__
+- __code (str)__: The error code. Must be one in the spec.
+- __cell (dict, optional)__: The cell where the error occurred.
+- __row_number (int, optional)__: The row number where the error occurs.
+- __message (str, optional)__:
+    The error message. Defaults to the message from the Data Quality Spec.
+- __message_substitutions (dict, optional)__:
+    Dictionary with substitutions to be used when
+    generating the error message and description.
+
+__Raises__
+- `KeyError`: Raised if the error code isn't known.
+
+
+### `spec`
+The Data Quality Spec contents, exposed as a plain Python `dict`.
+
+### `GoodtablesException`
+```python
+GoodtablesException(self, /, *args, **kwargs)
+```
+Base goodtables exception.
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
+
+##### v2.5
+
+- Added `check.check_headers_hook` to support header checks for body-context checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
+
+##### v2.4
+
+- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided, it will be verified against the actual values
+
+##### v2.3
+
+- Added a [foreign keys check](#foreign-key)
+
+##### v2.2
+
+- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
+
+##### v2.1
+
+- A new key added to the `error.to_dict` return: `message-data`
+
+##### v2.0
+
+Breaking changes:
+
+- Check method signatures now only receive the current row's `cells` list
+- Checks raise errors by returning an array of `Error` objects
+- Cells have the row number in the `row-number` key
+- Files with a ZIP extension are presumed to be data packages, so `goodtables mydatapackage.zip` works
+- Improvements to the goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
+- New `goodtables init <data paths>` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas.
+
+Bug fixes:
+- Fixed a bug with the `truncated-values` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
+
+##### v1.5
+
+New API added:
+- Validation `source` can now be a `pathlib.Path`
+
+##### v1.4
+
+Improved behaviour:
+- rebased on Data Quality Spec v1
+- rebased on Data Package Spec v1
+- rebased on Table Schema Spec v1
+- treat a primary key as a required/unique field
+
+##### v1.3
+
+New advanced checks added:
+- `blacklisted-value`
+- `custom-constraint`
+- `deviated-value`
+- `sequential-value`
+- `truncated-value`
+
+##### v1.2
+
+New API added:
+- `report.preset`
+- `report.tables[].schema`
+
+##### v1.1
+
+New API added:
+- `report.tables[].scheme`
+- `report.tables[].format`
+- `report.tables[].encoding`
+
+##### v1.0
+
+This version includes various big changes. A migration guide is under development and will be published here.
+
+##### v0.6
+
+First version of `goodtables`.
+
+[tableschema]: https://specs.frictionlessdata.io/table-schema/
+[tabulator]: https://github.com/frictionlessdata/tabulator-py/
+[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
+[semver]: https://semver.org/ "Semantic Versioning"
+[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
+
+
+
+%package help
+Summary: Development documents and examples for goodtables
+Provides: python3-goodtables-doc
+%description help
+# goodtables-py
+
+[](https://travis-ci.org/frictionlessdata/goodtables-py)
+[](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)
+[](https://pypi.python.org/pypi/goodtables)
+[](https://github.com/frictionlessdata/goodtables-py)
+[](https://gitter.im/frictionlessdata/chat)
+
+Goodtables is a framework to validate tabular data. It can check the structure
+of your data (e.g. all rows have the same number of columns) and its contents
+(e.g. all dates are valid).
+
+> **[Important Notice]** `goodtables` was renamed to `frictionless` as of version 3. The framework got various improvements and was extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.
+
+## Features
+
+* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.
+* **Content checks**: Ensure that the values have the correct types ("string", "number", "date", etc.), that their format is valid ("string must be an e-mail"), and that they respect the constraints ("age must be a number greater than 18").
+* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.
+* **Parallelized validations for multi-table datasets**
+* **Command line interface**
+
+## Contents
+
+<!--TOC-->
+
+  - [Getting Started](#getting-started)
+    - [Installing](#installing)
+    - [Running on CLI](#running-on-cli)
+    - [Running on Python](#running-on-python)
+  - [Documentation](#documentation)
+    - [Report](#report)
+    - [Checks](#checks)
+    - [Presets](#presets)
+    - [Data Quality Errors](#data-quality-errors)
+    - [Frequently Asked Questions](#frequently-asked-questions)
+  - [API Reference](#api-reference)
+    - [`cli`](#cli)
+    - [`validate`](#validate)
+    - [`preset`](#preset)
+    - [`check`](#check)
+    - [`Error`](#error)
+    - [`spec`](#spec)
+    - [`GoodtablesException`](#goodtablesexception)
+  - [Contributing](#contributing)
+  - [Changelog](#changelog)
+
+<!--TOC-->
+
+## Getting Started
+
+> For faster goodtables-compatible validation of Pandas dataframes, take a look at https://github.com/ezwelty/goodtables-pandas-py
+
+### Installing
+
+```
+pip install goodtables
+pip install goodtables[ods] # If you need LibreOffice's ODS file support
+```
+
+### Running on CLI
+
+```
+goodtables data.csv
+```
+
+Use `goodtables --help` to see the different options.
+
+### Running on Python
+
+```python
+from goodtables import validate
+
+report = validate('invalid.csv')
+report['valid'] # false
+report['table-count'] # 1
+report['error-count'] # 3
+report['tables'][0]['valid'] # false
+report['tables'][0]['source'] # 'invalid.csv'
+report['tables'][0]['errors'][0]['code'] # 'blank-header'
+```
+
+You can read a more in-depth explanation of using goodtables with Python in
+the [developer documentation](#developer-documentation) section. Check also
+the [examples](examples) folder for other examples.
+
+## Documentation
+
+Goodtables validates your tabular dataset to find structural and content
+errors. Suppose you have a file named `invalid.csv`. Let's validate it:
+
+```python
+report = validate('invalid.csv')
+```
+
+We could also pass a remote URI instead of a local path. It supports CSV, XLS,
+XLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]
+library.
+
+### Report
+
+> The validation report follows the JSON Schema defined in [goodtables/schemas/report.json][validation-jsonschema].
+
+The output of the `validate()` method is a report dictionary. It includes
+information on whether the data was valid, the error count, a list of table
+reports, which individual checks failed, etc.
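+For instance, a minimal sketch that walks the report using only the keys shown in this section:
+
+```python
+from goodtables import validate
+
+report = validate('invalid.csv')
+# iterate over per-table reports and print each error's code and message
+for table in report['tables']:
+    for error in table['errors']:
+        print(table['source'], error['code'], error['message'])
+```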
+A report looks like this:
+
+```json
+{
+    "time": 0.009,
+    "error-count": 1,
+    "warnings": [
+        "Table \"data/invalid.csv\" inspection has reached 1 error(s) limit"
+    ],
+    "preset": "table",
+    "valid": false,
+    "tables": [
+        {
+            "errors": [
+                {
+                    "row-number": null,
+                    "message": "Header in column 3 is blank",
+                    "row": null,
+                    "column-number": 3,
+                    "code": "blank-header"
+                }
+            ],
+            "error-count": 1,
+            "headers": [
+                "id",
+                "name",
+                "",
+                "name"
+            ],
+            "scheme": "file",
+            "row-count": 2,
+            "valid": false,
+            "encoding": "utf-8",
+            "time": 0.007,
+            "schema": null,
+            "format": "csv",
+            "source": "data/invalid"
+        }
+    ],
+    "table-count": 1
+}
+```
+
+The errors are divided into the following categories:
+
+- `source` - data can't be loaded or parsed
+- `structure` - general tabular errors like duplicate headers
+- `schema` - errors from checks against a [Table Schema](http://specs.frictionlessdata.io/table-schema/)
+- `custom` - custom check errors
+
+### Checks
+
+Checks are the main validation actors in goodtables. The list of enabled checks
+can be changed using the `checks` and `skip_checks` arguments. Let's explore the
+options with an example:
+
+```python
+report = validate('data.csv') # by default structure and schema (if available) checks
+report = validate('data.csv', checks=['structure']) # only structure checks
+report = validate('data.csv', checks=['schema']) # only schema (if available) checks
+report = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'
+report = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'
+```
+
+By default a dataset will be validated against all available Data Quality Spec
+errors. Some checks may be unavailable for validation. For example, if no
+schema is provided, only the `structure` checks will be done.
+
+### Presets
+
+Goodtables supports different formats of tabular datasets, called presets. A
+tabular dataset is data that can be split into a list of data tables.
+
+We can change the preset using the `preset` argument for `validate()`. By
+default, it'll be inferred from the source, falling back to `table`. To validate
+a [data package][datapackage], we can do:
+
+```python
+report = validate('datapackage.json') # implicit preset
+report = validate('datapackage.json', preset='datapackage') # explicit preset
+```
+
+This will validate all tabular resources in the data package.
+
+It's also possible to validate a list of files using the "nested" preset. To do
+so, the first argument to `validate()` should be a list of dictionaries, where
+each key in a dictionary is named after a parameter on `validate()`. For example:
+
+```python
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset
+report = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset
+```
+
+This is similar to:
+
+```python
+report_data1 = validate('data1.csv')
+report_data2 = validate('data2.csv')
+```
+
+The difference is that goodtables validates multiple tables in parallel, so a
+call using the "nested" preset should run faster.
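+
+Since each dictionary's keys map to `validate()` parameters, the options can also
+differ per table within one nested run. A minimal sketch (the file names are
+hypothetical; the option names are documented in the [API Reference](#api-reference) below):
+
+```python
+from goodtables import validate
+
+# structure-only checks for the first file, a row cap for the second
+report = validate([
+    {'source': 'data1.csv', 'checks': ['structure']},
+    {'source': 'data2.csv', 'row_limit': 1000},
+], preset='nested')
+```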
+
+### Data Quality Errors
+
+Base report errors are standardized and described in the
+[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).
+
+#### Source errors
+
+The basic checks can't be disabled, as they deal with goodtables being able to read the files.
+
+| check | description |
+| --- | --- |
+| io-error | Data reading error because of an IO error. |
+| http-error | Data reading error because of an HTTP error. |
+| source-error | Data reading error because of unsupported or inconsistent contents. |
+| scheme-error | Data reading error because of an incorrect scheme. |
+| format-error | Data reading error because of an incorrect format. |
+| encoding-error | Data reading error because of an encoding problem. |
+
+#### Structure errors
+
+These checks validate that the structure of the file is valid.
+
+| check | description |
+| --- | --- |
+| blank-header | There is a blank header name. All cells in the header row must have a value. |
+| duplicate-header | There are multiple columns with the same name. All column names must be unique. |
+| blank-row | Rows must have at least one non-blank cell. |
+| duplicate-row | Rows can't be duplicated. |
+| extra-value | A row has more columns than the header. |
+| missing-value | A row has fewer columns than the header. |
+
+#### Schema errors
+
+These checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.
+
+If your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.
+
+Lastly, if the order of the fields in the data is different from the order in your schema, enable the `order_fields` option.
+
+| check | description |
+| --- | --- |
+| schema-error | The schema is not valid. |
+| non-matching-header | The header's name in the schema is different from what's in the data. |
+| extra-header | The data contains a header not defined in the schema. |
+| missing-header | The data doesn't contain a header defined in the schema. |
+| type-or-format-error | The value can't be cast based on the schema type and format for this field. |
+| required-constraint | This field is required, but it contains no value. |
+| pattern-constraint | This field's value should conform to the defined pattern. |
+| unique-constraint | This field is a unique field, but it contains a value that has been used in another row. |
+| enumerable-constraint | This field's value should be equal to one of the values in the enumeration constraint. |
+| minimum-constraint | This field's value should be greater than or equal to the constraint value. |
+| maximum-constraint | This field's value should be less than or equal to the constraint value. |
+| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |
+| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |
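+
+To run these schema checks against a schema you already have, pass it in. A minimal
+sketch (the two-field schema is hypothetical, and the `schema` option is an assumption
+here: extra arguments are forwarded depending on the preset, as the API Reference below
+describes; `order_fields` is documented above):
+
+```python
+from goodtables import validate
+
+# hypothetical Table Schema for a two-column file
+schema = {'fields': [
+    {'name': 'id', 'type': 'integer'},
+    {'name': 'name', 'type': 'string'},
+]}
+report = validate('data.csv', schema=schema, order_fields=True)
+```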
+
+#### Custom errors
+
+| check | description |
+| --- | --- |
+| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |
+| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |
+| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package. |
+| [sequential-value](#sequential-value) | Ensure numbers are sequential. |
+| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |
+| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |
+
+##### blacklisted-value
+
+Sometimes we have to check for values we don't want to have in our dataset. The check accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| blacklist | list of str | List of blacklisted values |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,John
+2,bug
+3,bad
+5,Alex
+```
+
+Let's check that the `name` column doesn't contain rows with `bug` or `bad`:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},
+])
+# error on row 3 with code "blacklisted-value"
+# error on row 4 with code "blacklisted-value"
+```
+
+##### deviated-value
+
+This check helps to find outliers in a column containing positive numbers. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+| average | str | Average type, either "mean", "median" or "mode" |
+| interval | int | Values must be inside the range `average ± standard deviation * interval` |
+
+Consider the following CSV file:
+
+```csv
+temperature
+1
+-2
+7
+0
+1
+2
+5
+-4
+100
+8
+3
+```
+
+We use `median` as the average of the column values and allow an interval of 3 standard deviations. In our case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.
+
+```python
+report = validate('data.csv', checks=[
+    {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},
+])
+# error on row 10 with code "deviated-value"
+```
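+
+The quoted interval can be reproduced with the standard library. A quick sketch that
+checks the arithmetic (it mirrors the documented numbers; it is not goodtables'
+internal code):
+
+```python
+import statistics
+
+values = [1, -2, 7, 0, 1, 2, 5, -4, 100, 8, 3]
+median = statistics.median(values)   # 2.0
+stdev = statistics.stdev(values)     # ~29.7355 (sample standard deviation)
+low, high = median - 3 * stdev, median + 3 * stdev
+print(round(low, 2), round(high, 2))
+# -87.21 91.21, matching the README's [-87.19, 91.19] up to rounding
+# (the README rounds the standard deviation to 29.73 before multiplying)
+```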
+
+##### foreign-key
+
+> Relative paths are supported here. This MUST be used only with trusted data sources.
+
+This check validates foreign keys within a data package. Consider the data package defined below:
+
+```python
+DESCRIPTOR = {
+    'resources': [
+        {
+            'name': 'cities',
+            'data': [
+                ['id', 'name', 'next_id'],
+                [1, 'london', 2],
+                [2, 'paris', 3],
+                [3, 'rome', 4],
+                # [4, 'rio', None],
+            ],
+            'schema': {
+                'fields': [
+                    {'name': 'id', 'type': 'integer'},
+                    {'name': 'name', 'type': 'string'},
+                    {'name': 'next_id', 'type': 'integer'},
+                ],
+                'foreignKeys': [
+                    {
+                        'fields': 'next_id',
+                        'reference': {'resource': '', 'fields': 'id'},
+                    },
+                    {
+                        'fields': 'id',
+                        'reference': {'resource': 'people', 'fields': 'label'},
+                    },
+                ],
+            },
+        }, {
+            'name': 'people',
+            'data': [
+                ['label', 'population'],
+                [1, 8],
+                [2, 2],
+                # [3, 3],
+                # [4, 6],
+            ],
+        },
+    ],
+}
+```
+
+Running `goodtables` on it will raise a few `foreign-key` errors because we have commented out some rows in the data package's data:
+
+```python
+report = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])
+print(report)
+```
+
+```
+{'error-count': 2,
+ 'preset': 'datapackage',
+ 'table-count': 2,
+ 'tables': [{'datapackage': '...',
+             'error-count': 2,
+             'errors': [{'code': 'foreign-key',
+                         'message': 'Foreign key "[\'next_id\']" violation in '
+                                    'row 4',
+                         'message-data': {'fields': ['next_id']},
+                         'row-number': 4},
+                        {'code': 'foreign-key',
+                         'message': 'Foreign key "[\'id\']" violation in row 4',
+                         'message-data': {'fields': ['id']},
+                         'row-number': 4}],
+             'format': 'inline',
+             'headers': ['id', 'name', 'next_id'],
+             'resource-name': 'cities',
+             'row-count': 4,
+             'schema': 'table-schema',
+             'source': 'inline',
+             'time': 0.031,
+             'valid': False},
+            {'datapackage': '...',
+             'error-count': 0,
+             'errors': [],
+             'format': 'inline',
+             'headers': ['label', 'population'],
+             'resource-name': 'people',
+             'row-count': 3,
+             'source': 'inline',
+             'time': 0.038,
+             'valid': True}],
+ 'time': 0.117,
+ 'valid': False,
+ 'warnings': []}
+```
+
+It experimentally supports external resource checks, for example, with `foreignKey` definitions like these:
+
+```json
+{"package": "../people/datapackage.json", "resource": "people", "fields": "label"}
+{"package": "http://example.com/datapackage.json", "resource": "people", "fields": "label"}
+```
+
+##### sequential-value
+
+This check handles the common case where a column should contain integers that increment sequentially. It accepts the following options:
+
+| option | type | description |
+| --- | --- | --- |
+| column | int/str | Column number or name |
+
+Consider the following CSV file:
+
+```csv
+id,name
+1,one
+2,two
+3,three
+5,five
+```
+
+Let's check whether the `id` column contains sequential integers:
+
+```python
+from goodtables import validate
+
+report = validate('data.csv', checks=[
+    {'sequential-value': {'column': 'id'}},
+])
+# error on row 5 with code "sequential-value"
+```
+
+##### truncated-value
+
+Some database or spreadsheet software (like MySQL or Excel) can cut off values on saving. There are well-known heuristics for finding these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.
+
+Consider the following CSV file:
+
+```csv
+id,amount,comment
+1,14000000,good
+2,2147483647,bad
+3,32767,bad
+4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad
+```
+
+To detect values that were probably truncated, we can use the `truncated-value` check:
+
+```python
+report = validate('data.csv', checks=[
+    'truncated-value',
+])
+# error on row 3 with code "truncated-value"
+# error on row 4 with code "truncated-value"
+# error on row 5 with code "truncated-value"
+```
+
+##### custom-constraint
+
+With Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom constraint check, every row can be checked against a limited Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following options:
+
+<dl>
+  <dt>constraint (str)</dt>
+  <dd>Constraint definition (e.g. <code>col1 + col2 == col3</code>)</dd>
+</dl>
+
+Consider a CSV file like this:
+
+```csv
+id,name,salary,bonus
+1,Alex,1000,200
+2,Sam,2500,500
+3,Ray,1350,500
+4,John,5000,1000
+```
+
+Let's say our business rule is to be conservative with bonuses:
+
+```python
+report = validate('data.csv', checks=[
+    {'custom-constraint': {'constraint': 'salary > bonus * 4'}},
+])
+# error on row 4 with code "custom-constraint"
+```
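+
+The linked operators page belongs to [simpleeval](https://github.com/danthedeckie/simpleeval),
+which evaluates such expressions. Conceptually, something like the following happens for
+each row; this is a sketch of the idea, not goodtables' actual implementation:
+
+```python
+from simpleeval import simple_eval
+
+# row 4 from the example above (row numbers count the header row)
+row = {'salary': 1350, 'bonus': 500}
+# variable names in the constraint resolve to this row's column values
+ok = simple_eval('salary > bonus * 4', names=row)
+print(ok)  # False, so the row would be reported as a custom-constraint error
+```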
+
+### Frequently Asked Questions
+
+#### How can I add a new custom check?
+
+To create a custom check, use the `check` decorator. This way a builtin check can be overridden (use a spec error code like `duplicate-row`), or a check for a custom error can be added (use the `type`, `context` and `position` arguments):
+
+```python
+from goodtables import validate, check, Error
+
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    errors = []
+    for cell in cells:
+        message = 'Custom error on column {column_number} and row {row_number}'
+        # pass message as a keyword argument: the positional slot after
+        # `cell` is `row_number`, not `message`; the code is assumed to be
+        # the name registered by the decorator above
+        error = Error('custom-check', cell=cell, message=message)
+        errors.append(error)
+    return errors
+
+report = validate('data.csv', checks=['custom-check'])
+```
+
+Recommended steps:
+- discuss the proposed check in an issue first
+- select a name for the new check, like `possible-noise-text`
+- copy https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/contrib/checks/blacklisted_value.py to a new check module
+- add the new check module to the configuration - https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/config.py
+- write the actual code for the new check
+- write tests and a README section for the new check
+
+#### How can I add support for a new tabular file type?
+
+To create a custom preset, use the `preset` decorator. This way a builtin preset can be overridden or a custom preset can be added:
+
+```python
+from tabulator import Stream
+from tableschema import Schema
+from goodtables import validate, preset
+
+@preset('custom-preset')
+def custom_preset(source, **options):
+    warnings = []
+    tables = []
+    for table in source:
+        try:
+            # build one stream/schema pair per extracted table
+            tables.append({
+                'source': str(source),
+                'stream': Stream(...),
+                'schema': Schema(...),
+                'extra': {...},
+            })
+        except Exception:
+            warnings.append('Warning message')
+    return warnings, tables
+
+report = validate(source, preset='custom-preset')
+```
+
+For now this documentation section is incomplete. Please see the builtin presets to learn more about the dataset extraction protocol.
+
+## API Reference
+
+### `cli`
+```python
+cli()
+```
+Command-line interface
+
+```
+Usage: cli.py [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --version  Show the version and exit.
+  --help     Show this message and exit.
+
+Commands:
+  validate*  Validate tabular files (default).
+  init       Init data package from list of files.
+```
+
+
+### `validate`
+```python
+validate(source, **options)
+```
+Validates a source file and returns a report.
+
+__Arguments__
+
+- __source (Union[str, Dict, List[Dict], IO])__:
+        The source to be validated.
+        It can be a local file path, URL, dict, list of dicts, or a
+        file-like object. If it's a list of dicts and the `preset` is
+        "nested", each of the dict's keys will be used as if it were passed
+        as a keyword argument to this method.
+
+        The file can be a CSV, XLS, JSON, or any other format supported by
+        `tabulator`.
+- __checks (List[str])__:
+        List of check names to be enabled. They can be
+        individual check names (e.g. `blank-header`), or check types (e.g.
+        `structure`).
+- __skip_checks (List[str])__:
+        List of check names to be skipped. They can
+        be individual check names (e.g. `blank-header`), or check types
+        (e.g. `structure`).
+- __infer_schema (bool)__:
+        Infer a schema if one wasn't passed as an argument.
+- __infer_fields (bool)__:
+        Infer the schema for columns not present in the received schema.
+- __order_fields (bool)__:
+        Order source columns based on the schema fields order.
+        This is useful when you don't want to validate that the data
+        columns' order is the same as the schema's.
+- __error_limit (int)__:
+        Stop validation if the number of errors per table exceeds this value.
+- __table_limit (int)__:
+        Maximum number of tables to validate.
+- __row_limit (int)__:
+        Maximum number of rows to validate.
+- __preset (str)__:
+        Dataset type: `table` (default), `datapackage`,
+        `nested` or custom. Usually the preset can be inferred from the
+        source, so you don't need to define it.
+- __Any (Any)__:
+        Any additional arguments not defined here will be passed on,
+        depending on the chosen `preset`. If the `preset` is `table`, the
+        extra arguments will be passed on to `tabulator`; if it is
+        `datapackage`, they will be passed on to the `datapackage`
+        constructor.
+
+__Raises__
+- `GoodtablesException`: Raised on any non-tabular error.
+
+__Returns__
+
+`dict`: The validation report.
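+
+Putting several of the documented options together, a short sketch:
+
+```python
+from goodtables import validate
+
+report = validate(
+    'data.csv',
+    checks=['structure', 'schema'],  # enable both check types
+    infer_schema=True,               # infer a schema if none is given
+    error_limit=10,                  # stop a table after 10 errors
+    row_limit=10000,                 # inspect the first 10000 rows only
+)
+```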
+
+### `preset`
+```python
+preset(name)
+```
+Register a custom preset (decorator)
+
+__Example__
+
+```python
+@preset('custom-preset')
+def custom_preset(source, **options):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: preset name
+
+
+### `check`
+```python
+check(name, type=None, context=None, position=None)
+```
+Register a custom check (decorator)
+
+__Example__
+
+```python
+@check('custom-check', type='custom', context='body')
+def custom_check(cells):
+    # ...
+```
+
+__Arguments__
+- __name (str)__: check name
+- __type (str)__: has to be `custom`
+- __context (str)__: has to be `head` or `body`
+- __position (str)__: has to be `before:<check-name>` or `after:<check-name>`
+
+
+### `Error`
+```python
+Error(self, code, cell=None, row_number=None, message=None, message_substitutions=None)
+```
+Describes a validation check error
+
+__Arguments__
+- __code (str)__: The error code. Must be one in the spec.
+- __cell (dict, optional)__: The cell where the error occurred.
+- __row_number (int, optional)__: The row number where the error occurs.
+- __message (str, optional)__:
+        The error message. Defaults to the message from the Data Quality Spec.
+- __message_substitutions (dict, optional)__:
+        Dictionary with substitutions to be used when
+        generating the error message and description.
+
+__Raises__
+- `KeyError`: Raised if the error code isn't known.
+
+
+### `spec`
+A dict with the contents of the Data Quality Spec: the error codes, messages, and descriptions that the checks report against.
+### `GoodtablesException`
+```python
+GoodtablesException(self, /, *args, **kwargs)
+```
+Base goodtables exception
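+
+Since `validate()` raises it on any non-tabular error, a defensive sketch:
+
+```python
+from goodtables import validate, GoodtablesException
+
+try:
+    report = validate('data.csv')
+except GoodtablesException as exception:
+    # e.g. the source couldn't be interpreted as a dataset at all
+    print('validation failed:', exception)
+```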
+
+## Contributing
+
+> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and development dependencies into the active environment:
+
+```bash
+$ make install
+```
+
+To run tests with linting and coverage:
+
+```bash
+$ make test
+```
+
+## Changelog
+
+Only breaking and the most important changes are described here. The full changelog
+and documentation for all released versions can be found in the nicely formatted
+[commit history](https://github.com/frictionlessdata/goodtables-py/commits/master).
+
+##### v2.5
+
+- Added `check.check_headers_hook` to support header checks for body-context checks (see https://github.com/frictionlessdata/goodtables-py/tree/v3 for native support)
+
+##### v2.4
+
+- Added integrity checks for data packages. If `resource.bytes` or `resource.hash` (sha256) is provided, it will be verified against the actual values
+
+##### v2.3
+
+- Added a [foreign keys check](#foreign-key)
+
+##### v2.2
+
+- Improved missing/non-matching-headers detection ([#298](https://github.com/frictionlessdata/goodtables-py/issues/298))
+
+##### v2.1
+
+- A new key was added to the `error.to_dict` return: `message-data`
+
+##### v2.0
+
+Breaking changes:
+
+- The check method signature now only receives the current row's `cells` list
+- Checks raise errors by returning an array of `Error` objects
+- Cells have the row number in the `row-number` key
+- Files with a ZIP extension are presumed to be data packages, so `goodtables mydatapackage.zip` works
+- Improvements to the goodtables CLI ([#233](https://github.com/frictionlessdata/goodtables-py/issues/233))
+- New `goodtables init <data paths>` command to create a new `datapackage.json` with the files passed as parameters and their inferred schemas
+
+Bug fixes:
+- Fixed a bug with the `truncated-value` check on date fields ([#250](https://github.com/frictionlessdata/goodtables-py/issues/250))
+
+##### v1.5
+
+New API added:
+- Validation `source` can now be a `pathlib.Path`
+
+##### v1.4
+
+Improved behaviour:
+- rebased on Data Quality Spec v1
+- rebased on Data Package Spec v1
+- rebased on Table Schema Spec v1
+- treat primary key as a required/unique field
+
+##### v1.3
+
+New advanced checks added:
+- `blacklisted-value`
+- `custom-constraint`
+- `deviated-value`
+- `sequential-value`
+- `truncated-value`
+
+##### v1.2
+
+New API added:
+- `report.preset`
+- `report.tables[].schema`
+
+##### v1.1
+
+New API added:
+- `report.tables[].scheme`
+- `report.tables[].format`
+- `report.tables[].encoding`
+
+##### v1.0
+
+This version includes various big changes. A migration guide is under development and will be published here.
+
+##### v0.6
+
+First version of `goodtables`.
+
+[tableschema]: https://specs.frictionlessdata.io/table-schema/
+[tabulator]: https://github.com/frictionlessdata/tabulator-py/
+[datapackage]: https://specs.frictionlessdata.io/data-package/ "Data Package specification"
+[semver]: https://semver.org/ "Semantic Versioning"
+[validation-jsonschema]: goodtables/schemas/report.json "Validation Report JSON Schema"
+
+
+
+%prep
+%autosetup -n goodtables-2.5.4
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+ +%files -n python3-goodtables -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Tue Apr 11 2023 Python_Bot <Python_Bot@openeuler.org> - 2.5.4-1 +- Package Spec generated |
