%global _empty_manifest_terminate_build 0
Name: python-datapackage
Version: 1.15.2
Release: 1
Summary: Utilities to work with Data Packages as defined on specs.frictionlessdata.io
License: MIT
URL: https://github.com/frictionlessdata/datapackage-py
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c7/83/ef759d618745503e66c33aa9e218a954c3637a1b9a700ac9b368b5c9908b/datapackage-1.15.2.tar.gz
BuildArch: noarch

Requires: python3-six
Requires: python3-click
Requires: python3-chardet
Requires: python3-requests
Requires: python3-jsonschema
Requires: python3-unicodecsv
Requires: python3-jsonpointer
Requires: python3-tableschema
Requires: python3-tabulator
Requires: python3-cchardet
Requires: python3-mock
Requires: python3-pylama
Requires: python3-pytest
Requires: python3-pytest-cov
Requires: python3-httpretty
Requires: python3-tableschema-sql

%description
# datapackage-py

[![Travis](https://travis-ci.org/frictionlessdata/datapackage-py.svg?branch=master)](https://travis-ci.org/frictionlessdata/datapackage-py)
[![Coveralls](https://coveralls.io/repos/github/frictionlessdata/datapackage-py/badge.svg?branch=master)](https://coveralls.io/github/frictionlessdata/datapackage-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/datapackage.svg)](https://pypi.python.org/pypi/datapackage)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/datapackage-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)

A library for working with [Data Packages](http://specs.frictionlessdata.io/data-package/).

> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework provides improved `datapackage` functionality extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `datapackage` to Frictionless Framework.
> - we continue to fix bugs in `datapackage@1.x` in this [repository](https://github.com/frictionlessdata/datapackage-py), and it remains available on [PyPi](https://pypi.org/project/datapackage/) as before
> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable
> - we will release `frictionless@4.x` by the end of 2020 to be the first SemVer/stable version

## Features

- `Package` class for working with data packages
- `Resource` class for working with data resources
- `Profile` class for working with profiles
- `validate` function for validating data package descriptors
- `infer` function for inferring data package descriptors

## Contents

- [Getting Started](#getting-started)
  - [Installation](#installation)
- [Documentation](#documentation)
  - [Introduction](#introduction)
  - [Working with Package](#working-with-package)
  - [Working with Resource](#working-with-resource)
  - [Working with Group](#working-with-group)
  - [Working with Profile](#working-with-profile)
  - [Working with Foreign Keys](#working-with-foreign-keys)
  - [Working with validate/infer](#working-with-validateinfer)
  - [Frequently Asked Questions](#frequently-asked-questions)
- [API Reference](#api-reference)
  - [`cli`](#cli)
  - [`Package`](#package)
  - [`Resource`](#resource)
  - [`Group`](#group)
  - [`Profile`](#profile)
  - [`validate`](#validate)
  - [`infer`](#infer)
  - [`DataPackageException`](#datapackageexception)
  - [`TableSchemaException`](#tableschemaexception)
  - [`LoadError`](#loaderror)
  - [`CastError`](#casterror)
  - [`IntegrityError`](#integrityerror)
  - [`RelationError`](#relationerror)
  - [`StorageError`](#storageerror)
- [Contributing](#contributing)
- [Changelog](#changelog)

## Getting Started

### Installation

The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a `datapackage` version range in your `setup/requirements` file, e.g. `datapackage>=1.0,<2.0`.

```bash
$ pip install datapackage
```

#### OSX 10.14+

If you receive an error about the `cchardet` package when installing datapackage on Mac OSX 10.14 (Mojave) or higher, follow these steps:

1. Make sure you have the latest Xcode by running `xcode-select --install` in a terminal.
2. Then go to [https://developer.apple.com/download/more/](https://developer.apple.com/download/more/) and download the `command line tools`. Note, this requires an Apple ID.
3. Then, in a terminal, run `open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg`

You can read more about these steps in this [post](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave).

## Documentation

### Introduction

Let's start with a simple example:

```python
from datapackage import Package

package = Package('datapackage.json')
package.get_resource('resource').read()
```

### Working with Package

A class for working with data packages. It provides various capabilities like loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor, and more.
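For orientation, here is a minimal sketch of the loading options (the inline descriptor below is illustrative):

```python
from datapackage import Package

# Load from a local path; a remote URL works the same way
package = Package('datapackage.json')

# Or build from an in-memory descriptor object; with strict=True any
# operation on an invalid descriptor raises DataPackageException
package = Package(
    {'name': 'package', 'resources': [{'name': 'resource', 'data': ['data']}]},
    strict=True,
)
```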
Consider we have some local csv files in a `data` directory. Let's create a data package based on this data using the `Package` class:

> data/cities.csv

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
```

> data/population.csv

```csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
```

First we create a blank data package:

```python
package = Package()
```

Now we're ready to infer a data package descriptor based on the data files we have. Because we have two csv files, we use the glob pattern `**/*.csv`:

```python
package.infer('**/*.csv')
package.descriptor
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }
```

The `infer` method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema, etc. Let's tweak it a little bit:

```python
package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year'
package.commit()
package.valid # True
```

Because our resources are tabular, we can read them as tabular data:

```python
package.get_resource('population').read(keyed=True)
#[{'city': 'london', 'year': 2017, 'population': 8780000},
# {'city': 'paris', 'year': 2017, 'population': 2240000},
# {'city': 'rome', 'year': 2017, 'population': 2860000}]
```

Let's save our data package to disk as a zip file:

```python
package.save('datapackage.zip')
```

To continue working with the data package, we just load it again, this time from the local `datapackage.zip`:

```python
package = Package('datapackage.zip')
# Continue the work
```

That was only a basic introduction to the `Package` class. To learn more, take a look at the `Package` class API reference below.

### Working with Resource

A class for working with data resources. You can read or iterate tabular resources using the `iter/read` methods, and read any resource as bytes using the `raw_iter/raw_read` methods.

Consider we have a local csv file. It could just as well be inline data or a remote link (both are supported by the `Resource` class), but say it's `data.csv` for now:

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
```

Let's create and read a resource. Because the resource is tabular, we can use the `resource.read` method with the `keyed` option to get an array of keyed rows:

```python
from datapackage import Resource

resource = Resource({'path': 'data.csv'})
resource.tabular # True
resource.read(keyed=True)
# [
#   {'city': 'london', 'location': '51.50,-0.11'},
#   {'city': 'paris', 'location': '48.85,2.30'},
#   {'city': 'rome', 'location': 'N/A'},
# ]
resource.headers
# ['city', 'location']
# (reading has to be started first)
```

As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, yet it's just an `'N/A'` string instead of Python `None`. First we have to infer resource metadata:

```python
resource.infer()
resource.descriptor
#{ path: 'data.csv',
#  profile: 'tabular-data-resource',
#  encoding: 'utf-8',
#  name: 'data',
#  format: 'csv',
#  mediatype: 'text/csv',
#  schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
resource.read(keyed=True)
# Fails with a data validation error
```
Let's fix the unavailable location. There is a `missingValues` property in the Table Schema specification for this purpose. As a first try, we set `missingValues` to `'N/A'` in `resource.descriptor['schema']`. The resource descriptor can be changed in place, but all changes have to be committed with `resource.commit()`:

```python
resource.descriptor['schema']['missingValues'] = 'N/A'
resource.commit()
resource.valid # False
resource.errors
# [<ValidationError: "'N/A' is not of type 'array'">]
```

Being good citizens, we've decided to check the resource descriptor's validity. And it's not valid! We should use an array for the `missingValues` property. Also, don't forget to keep the empty string as a missing value:

```python
resource.descriptor['schema']['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid # True
```

All good. It looks like we're ready to read our data again:

```python
resource.read(keyed=True)
# [
#   {'city': 'london', 'location': [51.50, -0.11]},
#   {'city': 'paris', 'location': [48.85, 2.30]},
#   {'city': 'rome', 'location': None},
# ]
```

Now we see that:

- locations are arrays with numeric latitude and longitude
- Rome's location is a native Python `None`

And because there are no errors on data reading, we can be sure that our data is valid against our schema. Let's save our resource descriptor:

```python
resource.save('dataresource.json')
```

Let's check the newly created `dataresource.json`. It contains the path to our data file, the inferred metadata, and our `missingValues` tweak:

```json
{
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "name": "data",
    "format": "csv",
    "mediatype": "text/csv",
    "schema": {
        "fields": [
            {
                "name": "city",
                "type": "string",
                "format": "default"
            },
            {
                "name": "location",
                "type": "geopoint",
                "format": "default"
            }
        ],
        "missingValues": [
            "",
            "N/A"
        ]
    }
}
```

If we decide to improve it even more, we can update the `dataresource.json` file and then open it again using the local file name:

```python
resource = Resource('dataresource.json')
# Continue the work
```

That was only a basic introduction to the `Resource` class. To learn more, take a look at the `Resource` class API reference below.

### Working with Group

A class representing a group of tabular resources. Groups can be used to read multiple resources as one, or to export them, for example, to a database as one table. To define a group, add a `group: <name>` field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).

Consider we have a data package with two tables partitioned by year and a shared schema stored separately:

> cars-2017.csv

```csv
name,value
bmw,2017
tesla,2017
nissan,2017
```

> cars-2018.csv

```csv
name,value
bmw,2018
tesla,2018
nissan,2018
```

> cars.schema.json

```json
{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}
```

> datapackage.json

```json
{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        }
    ]
}
```

Let's read the resources separately:

```python
package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```
On the other hand, these resources are defined with a `group: cars` field, which means we can treat them as a group:

```python
package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

We can use this approach when we need to save the data package to a storage, for example, to a SQL database. There is a `merge_groups` flag to enable the grouping behaviour:

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars
```

### Working with Profile

A component representing a JSON Schema profile from the [Profiles Registry](https://specs.frictionlessdata.io/schemas/registry.json):

```python
from datapackage import Profile, exceptions

profile = Profile('data-package')
profile.name # data-package
profile.jsonschema # JSON Schema contents

try:
   valid = profile.validate(descriptor)
except exceptions.ValidationError as exception:
   for error in exception.errors:
       pass  # handle individual error
```

### Working with Foreign Keys

The library supports foreign keys described in the [Table Schema](http://specs.frictionlessdata.io/table-schema/#foreign-keys) specification. This means that if your data package descriptor uses the `resources[].schema.foreignKeys` property for some resources, data integrity will be checked on reading operations.

Consider we have a data package:

```python
DESCRIPTOR = {
  'resources': [
    {
      'name': 'teams',
      'data': [
        ['id', 'name', 'city'],
        ['1', 'Arsenal', 'London'],
        ['2', 'Real', 'Madrid'],
        ['3', 'Bayern', 'Munich'],
      ],
      'schema': {
        'fields': [
          {'name': 'id', 'type': 'integer'},
          {'name': 'name', 'type': 'string'},
          {'name': 'city', 'type': 'string'},
        ],
        'foreignKeys': [
          {
            'fields': 'city',
            'reference': {'resource': 'cities', 'fields': 'name'},
          },
        ],
      },
    },
    {
      'name': 'cities',
      'data': [
        ['name', 'country'],
        ['London', 'England'],
        ['Madrid', 'Spain'],
      ],
    },
  ],
}
```

Let's check relations for the `teams` resource:

```python
from datapackage import Package

package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
```

As we can see, there is a foreign key violation. That's because our lookup table `cities` doesn't have the city `Munich`, yet we have a team from there. We need to fix it in the `cities` resource:

```python
package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True
```

Fixed! But a check is not the only operation available. We can use the `relations` argument of the `resource.iter/read` methods to dereference the resource's relations:

```python
teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
```

Instead of a plain city name, we've got a dictionary containing the city data. These `resource.iter/read` methods will fail with the same error as `resource.check_relations` if there is an integrity issue, but only if the `relations=True` flag is passed.
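If we would rather collect integrity problems than stop at the first one, the `exc_handler` option documented under `resource.iter` below can defer errors. A sketch, assuming the `teams` resource from above:

```python
# Collect relation errors instead of raising on the first one
errors = []

def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    errors.append((exc, row_number, error_data))

for row in teams.iter(keyed=True, relations=True, exc_handler=exc_handler):
    pass  # process valid rows; problems accumulate in `errors`
```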
### Working with validate/infer

A standalone function to validate a data package descriptor:

```python
from datapackage import validate, exceptions

try:
   valid = validate(descriptor)
except exceptions.ValidationError as exception:
   for error in exception.errors:
       pass  # handle individual error
```

A standalone function to infer a data package descriptor:

```python
from datapackage import infer

descriptor = infer('**/*.csv')
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }
```

### Frequently Asked Questions

#### Accessing data behind a proxy server?

Before the `package = Package("https://xxx.json")` call, set these environment variables:

```python
import os

os.environ["HTTP_PROXY"] = 'xxx'
os.environ["HTTPS_PROXY"] = 'xxx'
```

## API Reference

### `cli`

```python
cli()
```

Command-line interface

```
Usage: datapackage [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  infer
  validate
```

### `Package`

```python
Package(self, descriptor=None, base_path=None, strict=False, unsafe=False, storage=None, schema=None, default_base_path=None, **options)
```

Package representation

__Arguments__
- __descriptor (str/dict)__: data package descriptor as a local path, url or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` leads to throwing errors on any operation with an invalid descriptor
- __unsafe (bool)__: if `True`, unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `package.base_path`

Package's base path

__Returns__

`str/None`: returns the data package base path

#### `package.descriptor`

Package's descriptor

__Returns__

`dict`: descriptor

#### `package.errors`

Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors

#### `package.profile`

Package's profile

__Returns__

`Profile`: an instance of the `Profile` class

#### `package.resource_names`

Package's resource names

__Returns__

`str[]`: returns an array of resource names

#### `package.resources`

Package's resources

__Returns__

`Resource[]`: returns an array of `Resource` instances

#### `package.valid`

Validation status

Always true in strict mode.

__Returns__

`bool`: validation status

#### `package.get_resource`

```python
package.get_resource(name)
```

Get a data package resource by name.

__Arguments__
- __name (str)__: data resource name

__Returns__

`Resource/None`: returns a `Resource` instance or `None` if not found
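__Example__

A minimal sketch (the inline-data descriptor is illustrative):

```python
from datapackage import Package

package = Package({'resources': [{'name': 'resource', 'data': ['data']}]})
package.get_resource('resource')  # a Resource instance
package.get_resource('missing')   # None
```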
#### `package.add_resource`

```python
package.add_resource(descriptor)
```

Add a new resource to the data package. The data package descriptor will be validated with the newly added resource descriptor.

__Arguments__
- __descriptor (dict)__: data resource descriptor

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the added `Resource` instance or `None` if not added

#### `package.remove_resource`

```python
package.remove_resource(name)
```

Remove a data package resource by name. The data package descriptor will be validated after the resource descriptor's removal.

__Arguments__
- __name (str)__: data resource name

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the removed `Resource` instance or `None` if not found

#### `package.get_group`

```python
package.get_group(name)
```

Returns a group of tabular resources by name. For more information about groups see [Group](#group).

__Arguments__
- __name (str)__: name of a group of resources

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Group/None`: returns a `Group` instance or `None` if not found

#### `package.infer`

```python
package.infer(pattern=False)
```

Infer the data package metadata.

> Argument `pattern` works only for local files

If `pattern` is not provided, only existing resources will be inferred (metadata like encoding, profile etc. will be added). If `pattern` is provided, new resources with file names matching the pattern will be added and inferred. It commits changes to the data package instance.

__Arguments__
- __pattern (str)__: glob pattern for new resources

__Returns__

`dict`: returns the data package descriptor

#### `package.commit`

```python
package.commit(strict=None)
```

Update the data package instance if there are in-place changes in the descriptor.

__Example__

```python
package = Package({
    'name': 'package',
    'resources': [{'name': 'resource', 'data': ['data']}]
})

package.name # package
package.descriptor['name'] = 'renamed-package'
package.name # package
package.commit()
package.name # renamed-package
```

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified

#### `package.save`

```python
package.save(target=None, storage=None, merge_groups=False, to_base_path=False, **options)
```

Saves this data package. It saves to a storage if the `storage` argument is passed, saves this data package's descriptor to a json file if the `target` argument ends with `.json`, or saves this data package to a zip file otherwise.

__Example__

It creates a zip file at `target` with the contents of this Data Package and its resources. Every resource whose content lives in the local filesystem will be copied into the zip file. Consider the following Data Package descriptor:

```json
{
    "name": "gdp",
    "resources": [
        {"name": "local", "format": "CSV", "path": "data.csv"},
        {"name": "inline", "data": [4, 8, 15, 16, 23, 42]},
        {"name": "remote", "url": "http://someplace.com/data.csv"}
    ]
}
```

The final structure of the zip file will be:

```
./datapackage.json
./data/local.csv
```

The contents of `datapackage.json` will be the same as returned by the package's `descriptor`. The resources' file names are generated based on their `name` and `format` fields if they exist. If the resource has no `name`, `resource-X` will be used, where `X` is the index of the resource in the `resources` list (starting at zero). If the resource has a `format`, it'll be lowercased and appended to the `name`, becoming "`name.format`".

__Arguments__
- __target (string/filelike)__: the file path or a file-like object where the contents of this Data Package will be saved
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __merge_groups (bool)__: save all of a group's tabular resources into one bucket if a storage is provided (for example into one SQL table). Read more about [Group](#group).
- __to_base_path (bool)__: save the package to the package's base path using the "/" route
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises if there was some error writing the package

__Returns__

`bool/Storage`: on success, returns `True` or a `Storage` instance
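For illustration, a minimal sketch of the three save targets (the SQLite engine is illustrative and requires the `sql` storage plugin):

```python
from sqlalchemy import create_engine

package.save('datapackage.json')  # target ends with .json: descriptor only
package.save('datapackage.zip')   # any other target: full zip file
package.save(storage='sql', engine=create_engine('sqlite://'))  # to a storage
```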
### `Resource`

```python
Resource(self, descriptor={}, base_path=None, strict=False, unsafe=False, storage=None, package=None, **options)
```

Resource representation

__Arguments__
- __descriptor (str/dict)__: data resource descriptor as a local path, url or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` leads to throwing errors on any operation with an invalid descriptor
- __unsafe (bool)__: if `True`, unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `resource.data`

Return resource data

#### `resource.descriptor`

Resource's descriptor

__Returns__

`dict`: descriptor

#### `resource.errors`

Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors

#### `resource.group`

Group name

__Returns__

`str`: group name

#### `resource.headers`

Resource's headers

> Only for tabular resources (reading has to be started first or it's `None`)

__Returns__

`str[]/None`: returns data source headers

#### `resource.inline`

Whether the resource is inline

__Returns__

`bool`: returns true if the resource is inline

#### `resource.local`

Whether the resource is local

__Returns__

`bool`: returns true if the resource is local

#### `resource.multipart`

Whether the resource is multipart

__Returns__

`bool`: returns true if the resource is multipart

#### `resource.name`

Resource name

__Returns__

`str`: name

#### `resource.package`

Package instance if the resource belongs to some package

__Returns__

`Package/None`: a package instance if available

#### `resource.profile`

Resource's profile

__Returns__

`Profile`: an instance of the `Profile` class

#### `resource.remote`

Whether the resource is remote

__Returns__

`bool`: returns true if the resource is remote

#### `resource.schema`

Resource's schema

> Only for tabular resources

For tabular resources it returns a `Schema` instance to interact with the data schema. Read the API documentation: [tableschema.Schema](https://github.com/frictionlessdata/tableschema-py#schema).

__Returns__

`tableschema.Schema`: schema

#### `resource.source`

Resource's source

The combination of `resource.source` and `resource.inline/local/remote/multipart` provides a predictable interface to work with resource data.

__Returns__

`list/str`: returns the `data` or `path` property
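__Example__

A minimal sketch of how the flags combine with `resource.source` (both descriptors are illustrative):

```python
from datapackage import Resource

inline = Resource({'name': 'inline', 'data': [['id'], [1]]})
inline.inline  # True
inline.source  # [['id'], [1]] (the `data` property)

local = Resource({'name': 'local', 'path': 'data.csv'})
local.local    # True
local.source   # path to data.csv (the `path` property)
```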
#### `resource.table`

Return resource table

#### `resource.tabular`

Whether the resource is tabular

__Returns__

`bool`: returns true if the resource is tabular

#### `resource.valid`

Validation status

Always true in strict mode.

__Returns__

`bool`: validation status

#### `resource.iter`

```python
resource.iter(integrity=False, relations=False, **options)
```

Iterates through the resource data and emits rows cast based on the table schema.

> Only for tabular resources

__Arguments__
- __keyed (bool)__: yield keyed rows in the form of `{header1: value1, header2: value2}` (default is false; the form of rows is `[value1, value2]`)
- __extended (bool)__: yield extended rows in the form of `[rowNumber, [header1, header2], [value1, value2]]` (default is false; the form of rows is `[value1, value2]`)
- __cast (bool)__: disable data casting if false (default is true)
- __integrity (bool)__: if true, the actual size in BYTES and the SHA256 hash of the file will be checked against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently)
- __relations (bool)__: if true, foreign key fields will be checked and resolved to their references
- __foreign_keys_values (dict)__: three-level dictionary of foreign key references optimized to speed up the validation process, in the form of `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`. If not provided but `relations` is true, it will be created before the validation process by the *index_foreign_keys_values* method
- __exc_handler (func)__: optional custom exception handler callable. Can be used to defer raising errors (i.e. "fail late"), e.g. for data validation purposes. Must support the signature below

__Custom exception handler__

```python
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception): Deferred exception instance
        row_number(int): Data row number that triggers exception exc
        row_data(OrderedDict): Invalid data row source data
        error_data(OrderedDict): Data row source data field subset responsible
            for the error, if applicable (e.g. invalid primary or foreign key
            fields). May be identical to row_data.
    '''
    # ...
```

__Raises__
- `DataPackageException`: base class of any error
- `CastError`: data cast error
- `IntegrityError`: integrity checking error
- `UniqueKeyError`: unique key constraint violation
- `UnresolvedFKError`: unresolved foreign key reference error

__Returns__

`Iterator[list]`: yields rows

#### `resource.read`

```python
resource.read(integrity=False, relations=False, foreign_keys_values=False, **options)
```

Read the whole resource and return it as an array of rows

> Only for tabular resources
> It has the same API as `resource.iter`, except for:

__Arguments__
- __limit (int)__: limit the count of rows to read and return

__Returns__

`list[]`: returns rows
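__Example__

A minimal sketch of the row forms (inline data assumed; exact representations may vary):

```python
from datapackage import Resource

resource = Resource({'name': 'rows', 'data': [['id', 'name'], [1, 'english']]})
resource.infer()
list(resource.iter())               # [[1, 'english']]
list(resource.iter(keyed=True))     # [{'id': 1, 'name': 'english'}]
list(resource.iter(extended=True))  # [(2, ['id', 'name'], [1, 'english'])]
resource.read(limit=1)              # at most one row
```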
#### `resource.check_integrity`

```python
resource.check_integrity()
```

Checks resource integrity

> Only for tabular resources

It checks the size in BYTES and the SHA256 hash of the file against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently).

__Raises__
- `exceptions.IntegrityError`: raises if there are integrity issues

__Returns__

`bool`: returns `True` if there are no issues

#### `resource.check_relations`

```python
resource.check_relations(foreign_keys_values=False)
```

Check relations

> Only for tabular resources

It checks foreign keys and raises an exception if there are integrity issues.

__Raises__
- `exceptions.RelationError`: raises if there are relation issues

__Returns__

`bool`: returns `True` if there are no issues

#### `resource.drop_relations`

```python
resource.drop_relations()
```

Drop relations

> Only for tabular resources

Remove relations data from memory

__Returns__

`bool`: returns `True`

#### `resource.raw_iter`

```python
resource.raw_iter(stream=False)
```

Iterate over data chunks as bytes. If `stream` is `True`, a file-like object is returned.

__Arguments__
- __stream (bool)__: return a file-like object

__Returns__

`bytes[]/filelike`: returns bytes[]/filelike

#### `resource.raw_read`

```python
resource.raw_read()
```

Returns the resource data as bytes.

__Returns__

`bytes`: returns the resource data in bytes

#### `resource.infer`

```python
resource.infer(**options)
```

Infer resource metadata, like name, format, mediatype, encoding, schema and profile. It commits these changes to the resource instance.

__Arguments__
- __options__: options will be passed to the `tableschema.infer` call for more control over the results (e.g. for setting `limit`, `confidence` etc.)

__Returns__

`dict`: returns the resource descriptor

#### `resource.commit`

```python
resource.commit(strict=None)
```

Update the resource instance if there are in-place changes in the descriptor.

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified

#### `resource.save`

```python
resource.save(target, storage=None, to_base_path=False, **options)
```

Saves this resource. It saves to a storage if the `storage` argument is passed, or saves this resource's descriptor to a json file otherwise.

__Arguments__
- __target (str)__: path where to save the resource
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __to_base_path (bool)__: save the resource to the resource's base path using the "/" route
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success

### `Group`

```python
Group(self, resources)
```

Group representation

__Arguments__
- __resources (Resource[])__: list of TABULAR resources

#### `group.headers`

Group's headers

__Returns__

`str[]/None`: returns headers

#### `group.name`

Group name

__Returns__

`str`: name

#### `group.schema`

Group's schema

__Returns__

`tableschema.Schema`: schema

#### `group.iter`

```python
group.iter(**options)
```

Iterates through the group data and emits rows cast based on the table schema.

> It concatenates all the resources and has the same API as `resource.iter`

#### `group.read`

```python
group.read(limit=None, **options)
```

Read the whole group and return it as an array of rows

> It concatenates all the resources and has the same API as `resource.read`

#### `group.check_relations`

```python
group.check_relations()
```

Check the group's relations

The same as `resource.check_relations`, but without the optional argument *foreign_keys_values*. This method tests the foreignKeys of the whole group at once, optimizing the process by creating the foreign_keys_values hashmap only once before testing the set of resources.
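For illustration, a sketch of group usage (assuming the `cars` data package from [Working with Group](#working-with-group)):

```python
from datapackage import Package

package = Package('datapackage.json')
group = package.get_group('cars')
group.name  # 'cars'
group.read(keyed=True)  # rows of cars-2017 and cars-2018, concatenated
```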
### `Profile`

```python
Profile(self, profile)
```

Profile representation

__Arguments__
- __profile (str)__: profile name in the registry or a URL to a JSON Schema

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `profile.jsonschema`

JSONSchema content

__Returns__

`dict`: returns the profile's JSON Schema contents

#### `profile.name`

Profile name

__Returns__

`str/None`: name if available

#### `profile.validate`

```python
profile.validate(descriptor)
```

Validate a data package `descriptor` against the profile.

__Arguments__
- __descriptor (dict)__: retrieved and dereferenced data package descriptor

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns `True` if valid

### `validate`

```python
validate(descriptor)
```

Validate a data package descriptor.

__Arguments__
- __descriptor (str/dict)__: package descriptor (one of):
  - local path
  - remote url
  - object

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns `True` if valid

### `infer`

```python
infer(pattern, base_path=None)
```

Infer a data package descriptor.

> Argument `pattern` works only for local files

__Arguments__
- __pattern (str)__: glob file pattern

__Returns__

`dict`: returns a data package descriptor

### `DataPackageException`

```python
DataPackageException(self, message, errors=[])
```

Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

```python
try:
    # lib action
    ...
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            pass  # handle error
```

#### `datapackageexception.errors`

List of nested errors

__Returns__

`DataPackageException[]`: list of nested errors

#### `datapackageexception.multiple`

Whether it's a nested exception

__Returns__

`bool`: whether it's a nested exception

### `TableSchemaException`

```python
TableSchemaException(self, message, errors=[])
```

Base class for all TableSchema exceptions.

### `LoadError`

```python
LoadError(self, message, errors=[])
```

All loading errors.

### `CastError`

```python
CastError(self, message, errors=[])
```

All value cast errors.

### `IntegrityError`

```python
IntegrityError(self, message, errors=[])
```

All integrity errors.

### `RelationError`

```python
RelationError(self, message, errors=[])
```

All relation errors.

### `StorageError`

```python
StorageError(self, message, errors=[])
```

All storage errors.

## Contributing

> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

```bash
$ make install
```

To run tests with linting and coverage:

```bash
$ make test
```

## Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/datapackage-py/commits/master).
#### v1.15

> WARNING: it can be breaking for some setups, please read the discussions below

- Fixed header management according to the specs:
  - https://github.com/frictionlessdata/datapackage-py/pull/257
  - https://github.com/frictionlessdata/datapackage-py/issues/256
  - https://github.com/frictionlessdata/forum/issues/1

#### v1.14

- Added experimental options for picking/skipping fields/rows

#### v1.13

- Added the `unsafe` option to Package and Resource (#262)

#### v1.12

- Use `chardet` for encoding detection by default. For `cchardet`: `pip install datapackage[cchardet]`

#### v1.11

- `resource/package.save` now accepts a `to_base_path` argument (#254)
- `package.save` now returns a `Storage` instance if available

#### v1.10

- Added the ability to check a tabular resource's integrity

#### v1.9

- Added the `resource.package` property

#### v1.8

- Added support for [groups of resources](#group)

#### v1.7

- Added support for [compression of resources](https://frictionlessdata.io/specs/patterns/#compression-of-resources)

#### v1.6

- Added support for a custom request session

#### v1.5

Updated behaviour:

- Added support for Python 3.7

#### v1.4

New API added:

- Added `skip_rows` support to the resource descriptor

#### v1.3

New API added:

- The `package.base_path` property is now publicly available

#### v1.2

Updated behaviour:

- The CLI command `$ datapackage infer` now outputs only a JSON-formatted data package descriptor.

#### v1.1

New API added:

- Added an integration between `Package/Resource` and `tableschema.Storage` (https://github.com/frictionlessdata/tableschema-py#storage). It allows loading and saving data packages from/to different storages like SQL/BigQuery/etc.

%package -n python3-datapackage
Summary: Utilities to work with Data Packages as defined on specs.frictionlessdata.io
Provides: python-datapackage
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-datapackage
A Python library for working with Frictionless Data Packages as defined on
specs.frictionlessdata.io. It provides the `Package`, `Resource`, `Group` and
`Profile` classes, plus the standalone `validate` and `infer` functions; see
the main package description above for the full documentation.
> - we continue to bug-fix `datapackage@1.x` in this [repository](https://github.com/frictionlessdata/datapackage-py) as well as it's available on [PyPi](https://pypi.org/project/datapackage/) as it was before > - please note that `frictionless@3.x` version's API, we're working on at the moment, is not stable > - we will release `frictionless@4.x` by the end of 2020 to be the first SemVer/stable version ## Features - `Package` class for working with data packages - `Resource` class for working with data resources - `Profile` class for working with profiles - `validate` function for validating data package descriptors - `infer` function for inferring data package descriptors ## Contents - [Getting Started](#getting-started) - [Installation](#installation) - [Documentation](#documentation) - [Introduction](#introduction) - [Working with Package](#working-with-package) - [Working with Resource](#working-with-resource) - [Working with Group](#working-with-group) - [Working with Profile](#working-with-profile) - [Working with Foreign Keys](#working-with-foreign-keys) - [Working with validate/infer](#working-with-validateinfer) - [Frequently Asked Questions](#frequently-asked-questions) - [API Reference](#api-reference) - [`cli`](#cli) - [`Package`](#package) - [`Resource`](#resource) - [`Group`](#group) - [`Profile`](#profile) - [`validate`](#validate) - [`infer`](#infer) - [`DataPackageException`](#datapackageexception) - [`TableSchemaException`](#tableschemaexception) - [`LoadError`](#loaderror) - [`CastError`](#casterror) - [`IntegrityError`](#integrityerror) - [`RelationError`](#relationerror) - [`StorageError`](#storageerror) - [Contributing](#contributing) - [Changelog](#changelog) ## Getting Started ### Installation The package use semantic versioning. It means that major versions could include breaking changes. It's highly recommended to specify `datapackage` version range in your `setup/requirements` file e.g. `datapackage>=1.0,<2.0`. ```bash $ pip install datapackage ``` #### OSX 10.14+ If you receive an error about the `cchardet` package when installing datapackage on Mac OSX 10.14 (Mojave) or higher, follow these steps: 1. Make sure you have the latest x-code by running the following in terminal: `xcode-select --install` 2. Then go to [https://developer.apple.com/download/more/](https://developer.apple.com/download/more/) and download the `command line tools`. Note, this requires an Apple ID. 3. Then, in terminal, run `open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg` You can read more about these steps in this [post.](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave) ## Documentation ### Introduction Let's start with a simple example: ```python from datapackage import Package package = Package('datapackage.json') package.get_resource('resource').read() ``` ### Working with Package A class for working with data packages. It provides various capabilities like loading local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more. Consider we have some local csv files in a `data` directory. 
Let's create a data package based on this data using a `Package` class: > data/cities.csv ```csv city,location london,"51.50,-0.11" paris,"48.85,2.30" rome,"41.89,12.51" ``` > data/population.csv ```csv city,year,population london,2017,8780000 paris,2017,2240000 rome,2017,2860000 ``` First we create a blank data package: ```python package = Package() ``` Now we're ready to infer a data package descriptor based on data files we have. Because we have two csv files we use glob pattern `**/*.csv`: ```python package.infer('**/*.csv') package.descriptor #{ profile: 'tabular-data-package', # resources: # [ { path: 'data/cities.csv', # profile: 'tabular-data-resource', # encoding: 'utf-8', # name: 'cities', # format: 'csv', # mediatype: 'text/csv', # schema: [Object] }, # { path: 'data/population.csv', # profile: 'tabular-data-resource', # encoding: 'utf-8', # name: 'population', # format: 'csv', # mediatype: 'text/csv', # schema: [Object] } ] } ``` An `infer` method has found all our files and inspected it to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak it a little bit: ```python package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year' package.commit() package.valid # true ``` Because our resources are tabular we could read it as a tabular data: ```python package.get_resource('population').read(keyed=True) #[ { city: 'london', year: 2017, population: 8780000 }, # { city: 'paris', year: 2017, population: 2240000 }, # { city: 'rome', year: 2017, population: 2860000 } ] ``` Let's save our descriptor on the disk as a zip-file: ```python package.save('datapackage.zip') ``` To continue the work with the data package we just load it again but this time using local `datapackage.zip`: ```python package = Package('datapackage.zip') # Continue the work ``` It was onle basic introduction to the `Package` class. To learn more let's take a look on `Package` class API reference. ### Working with Resource A class for working with data resources. You can read or iterate tabular resources using the `iter/read` methods and all resource as bytes using `row_iter/row_read` methods. Consider we have some local csv file. It could be inline data or remote link - all supported by `Resource` class (except local files for in-brower usage of course). But say it's `data.csv` for now: ```csv city,location london,"51.50,-0.11" paris,"48.85,2.30" rome,N/A ``` Let's create and read a resource. Because resource is tabular we could use `resource.read` method with a `keyed` option to get an array of keyed rows: ```python resource = Resource({path: 'data.csv'}) resource.tabular # true resource.read(keyed=True) # [ # {city: 'london', location: '51.50,-0.11'}, # {city: 'paris', location: '48.85,2.30'}, # {city: 'rome', location: 'N/A'}, # ] resource.headers # ['city', 'location'] # (reading has to be started first) ``` As we could see our locations are just a strings. But it should be geopoints. Also Rome's location is not available but it's also just a `N/A` string instead of Python `None`. First we have to infer resource metadata: ```python resource.infer() resource.descriptor #{ path: 'data.csv', # profile: 'tabular-data-resource', # encoding: 'utf-8', # name: 'data', # format: 'csv', # mediatype: 'text/csv', # schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } } resource.read(keyed=True) # Fails with a data validation error ``` Let's fix not available location. There is a `missingValues` property in Table Schema specification. 
As a first try we set `missingValues` to `N/A` in `resource.descriptor.schema`. Resource descriptor could be changed in-place but all changes should be commited by `resource.commit()`: ```python resource.descriptor['schema']['missingValues'] = 'N/A' resource.commit() resource.valid # False resource.errors # [] ``` As a good citiziens we've decided to check out recource descriptor validity. And it's not valid! We should use an array for `missingValues` property. Also don't forget to have an empty string as a missing value: ```python resource.descriptor['schema']['missingValues'] = ['', 'N/A'] resource.commit() resource.valid # true ``` All good. It looks like we're ready to read our data again: ```python resource.read(keyed=True) # [ # {city: 'london', location: [51.50,-0.11]}, # {city: 'paris', location: [48.85,2.30]}, # {city: 'rome', location: null}, # ] ``` Now we see that: - locations are arrays with numeric lattide and longitude - Rome's location is a native JavaScript `null` And because there are no errors on data reading we could be sure that our data is valid againt our schema. Let's save our resource descriptor: ```python resource.save('dataresource.json') ``` Let's check newly-crated `dataresource.json`. It contains path to our data file, inferred metadata and our `missingValues` tweak: ```json { "path": "data.csv", "profile": "tabular-data-resource", "encoding": "utf-8", "name": "data", "format": "csv", "mediatype": "text/csv", "schema": { "fields": [ { "name": "city", "type": "string", "format": "default" }, { "name": "location", "type": "geopoint", "format": "default" } ], "missingValues": [ "", "N/A" ] } } ``` If we decide to improve it even more we could update the `dataresource.json` file and then open it again using local file name: ```python resource = Resource('dataresource.json') # Continue the work ``` It was onle basic introduction to the `Resource` class. To learn more let's take a look on `Resource` class API reference. ### Working with Group A class representing a group of tabular resources. Groups can be used to read multiple resource as one or to export them, for example, to a database as one table. To define a group add the `group: ` field to corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name). Consider we have a data package with two tables partitioned by a year and a shared schema stored separately: > cars-2017.csv ```csv name,value bmw,2017 tesla,2017 nissan,2017 ``` > cars-2018.csv ```csv name,value bmw,2018 tesla,2018 nissan,2018 ``` > cars.schema.json ```json { "fields": [ { "name": "name", "type": "string" }, { "name": "value", "type": "integer" } ] } ``` > datapackage.json ```json { "name": "datapackage", "resources": [ { "group": "cars", "name": "cars-2017", "path": "cars-2017.csv", "profile": "tabular-data-resource", "schema": "cars.schema.json" }, { "group": "cars", "name": "cars-2018", "path": "cars-2018.csv", "profile": "tabular-data-resource", "schema": "cars.schema.json" } ] } ``` Let's read the resources separately: ```python package = Package('datapackage.json') package.get_resource('cars-2017').read(keyed=True) == [ {'name': 'bmw', 'value': 2017}, {'name': 'tesla', 'value': 2017}, {'name': 'nissan', 'value': 2017}, ] package.get_resource('cars-2018').read(keyed=True) == [ {'name': 'bmw', 'value': 2018}, {'name': 'tesla', 'value': 2018}, {'name': 'nissan', 'value': 2018}, ] ``` On the other hand, these resources defined with a `group: cars` field. 
It means we can treat them as a group: ```python package = Package('datapackage.json') package.get_group('cars').read(keyed=True) == [ {'name': 'bmw', 'value': 2017}, {'name': 'tesla', 'value': 2017}, {'name': 'nissan', 'value': 2017}, {'name': 'bmw', 'value': 2018}, {'name': 'tesla', 'value': 2018}, {'name': 'nissan', 'value': 2018}, ] ``` We can use this approach when we need to save the data package to a storage, for example, to a SQL database. There is the `merge_groups` flag to enable groupping behaviour: ```python package = Package('datapackage.json') package.save(storage='sql', engine=engine) # SQL tables: # - cars-2017 # - cars-2018 package.save(storage='sql', engine=engine, merge_groups=True) # SQL tables: # - cars ``` ### Working with Profile A component to represent JSON Schema profile from [Profiles Registry]( https://specs.frictionlessdata.io/schemas/registry.json): ```python profile = Profile('data-package') profile.name # data-package profile.jsonschema # JSON Schema contents try: valid = profile.validate(descriptor) except exceptions.ValidationError as exception: for error in exception.errors: # handle individual error ``` ### Working with Foreign Keys The library supports foreign keys described in the [Table Schema](http://specs.frictionlessdata.io/table-schema/#foreign-keys) specification. It means if your data package descriptor use `resources[].schema.foreignKeys` property for some resources a data integrity will be checked on reading operations. Consider we have a data package: ```python DESCRIPTOR = { 'resources': [ { 'name': 'teams', 'data': [ ['id', 'name', 'city'], ['1', 'Arsenal', 'London'], ['2', 'Real', 'Madrid'], ['3', 'Bayern', 'Munich'], ], 'schema': { 'fields': [ {'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}, {'name': 'city', 'type': 'string'}, ], 'foreignKeys': [ { 'fields': 'city', 'reference': {'resource': 'cities', 'fields': 'name'}, }, ], }, }, { 'name': 'cities', 'data': [ ['name', 'country'], ['London', 'England'], ['Madrid', 'Spain'], ], }, ], } ``` Let's check relations for a `teams` resource: ```python from datapackage import Package package = Package(DESCRIPTOR) teams = package.get_resource('teams') teams.check_relations() # tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4" ``` As we could see there is a foreign key violation. That's because our lookup table `cities` doesn't have a city of `Munich` but we have a team from there. We need to fix it in `cities` resource: ```python package.descriptor['resources'][1]['data'].append(['Munich', 'Germany']) package.commit() teams = package.get_resource('teams') teams.check_relations() # True ``` Fixed! But not only a check operation is available. We could use `relations` argument for `resource.iter/read` methods to dereference a resource relations: ```python teams.read(keyed=True, relations=True) #[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England}}, # {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain}}, # {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany}}] ``` Instead of plain city name we've got a dictionary containing a city data. These `resource.iter/read` methods will fail with the same as `resource.check_relations` error if there is an integrity issue. But only if `relations=True` flag is passed. 
### Working with validate/infer A standalone function to validate a data package descriptor: ```python from datapackage import validate, exceptions try: valid = validate(descriptor) except exceptions.ValidationError as exception: for error in exception.errors: # handle individual error ``` A standalone function to infer a data package descriptor. ```python descriptor = infer('**/*.csv') #{ profile: 'tabular-data-resource', # resources: # [ { path: 'data/cities.csv', # profile: 'tabular-data-resource', # encoding: 'utf-8', # name: 'cities', # format: 'csv', # mediatype: 'text/csv', # schema: [Object] }, # { path: 'data/population.csv', # profile: 'tabular-data-resource', # encoding: 'utf-8', # name: 'population', # format: 'csv', # mediatype: 'text/csv', # schema: [Object] } ] } ``` ### Frequently Asked Questions #### Accessing data behind a proxy server? Before the `package = Package("https://xxx.json")` call set these environment variables: ```python import os os.environ["HTTP_PROXY"] = 'xxx' os.environ["HTTPS_PROXY"] = 'xxx' ``` ## API Reference ### `cli` ```python cli() ``` Command-line interface ``` Usage: datapackage [OPTIONS] COMMAND [ARGS]... Options: --version Show the version and exit. --help Show this message and exit. Commands: infer validate ``` ### `Package` ```python Package(self, descriptor=None, base_path=None, strict=False, unsafe=False, storage=None, schema=None, default_base_path=None, **options) ``` Package representation __Arguments__ - __descriptor (str/dict)__: data package descriptor as local path, url or object - __base_path (str)__: base path for all relative paths - __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` leads to throwing errors on any operation with invalid descriptor - __unsafe (bool)__: if `True` unsafe paths will be allowed. For more inforamtion https://specs.frictionlessdata.io/data-resource/#data-location. Default to `False` - __storage (str/tableschema.Storage)__: storage name like `sql` or storage instance - __options (dict)__: storage options to use for storage creation __Raises__ - `DataPackageException`: raises error if something goes wrong #### `package.base_path` Package's base path __Returns__ `str/None`: returns the data package base path #### `package.descriptor` Package's descriptor __Returns__ `dict`: descriptor #### `package.errors` Validation errors Always empty in strict mode. __Returns__ `Exception[]`: validation errors #### `package.profile` Package's profile __Returns__ `Profile`: an instance of `Profile` class #### `package.resource_names` Package's resource names __Returns__ `str[]`: returns an array of resource names #### `package.resources` Package's resources __Returns__ `Resource[]`: returns an array of `Resource` instances #### `package.valid` Validation status Always true in strict mode. __Returns__ `bool`: validation status #### `package.get_resource` ```python package.get_resource(name) ``` Get data package resource by name. __Arguments__ - __name (str)__: data resource name __Returns__ `Resource/None`: returns `Resource` instances or null if not found #### `package.add_resource` ```python package.add_resource(descriptor) ``` Add new resource to data package. The data package descriptor will be validated with newly added resource descriptor. 
__Arguments__ - __descriptor (dict)__: data resource descriptor __Raises__ - `DataPackageException`: raises error if something goes wrong __Returns__ `Resource/None`: returns added `Resource` instance or null if not added #### `package.remove_resource` ```python package.remove_resource(name) ``` Remove data package resource by name. The data package descriptor will be validated after resource descriptor removal. __Arguments__ - __name (str)__: data resource name __Raises__ - `DataPackageException`: raises error if something goes wrong __Returns__ `Resource/None`: returns removed `Resource` instances or null if not found #### `package.get_group` ```python package.get_group(name) ``` Returns a group of tabular resources by name. For more information about groups see [Group](#group). __Arguments__ - __name (str)__: name of a group of resources __Raises__ - `DataPackageException`: raises error if something goes wrong __Returns__ `Group/None`: returns a `Group` instance or null if not found #### `package.infer` ```python package.infer(pattern=False) ``` Infer a data package metadata. > Argument `pattern` works only for local files If `pattern` is not provided only existent resources will be inferred (added metadata like encoding, profile etc). If `pattern` is provided new resoures with file names mathing the pattern will be added and inferred. It commits changes to data package instance. __Arguments__ - __pattern (str)__: glob pattern for new resources __Returns__ `dict`: returns data package descriptor #### `package.commit` ```python package.commit(strict=None) ``` Update data package instance if there are in-place changes in the descriptor. __Example__ ```python package = Package({ 'name': 'package', 'resources': [{'name': 'resource', 'data': ['data']}] }) package.name # package package.descriptor['name'] = 'renamed-package' package.name # package package.commit() package.name # renamed-package ``` __Arguments__ - __strict (bool)__: alter `strict` mode for further work __Raises__ - `DataPackageException`: raises error if something goes wrong __Returns__ `bool`: returns true on success and false if not modified #### `package.save` ```python package.save(target=None, storage=None, merge_groups=False, to_base_path=False, **options) ``` Saves this data package It saves it to storage if `storage` argument is passed or saves this data package's descriptor to json file if `target` arguments ends with `.json` or saves this data package to zip file otherwise. __Example__ It creates a zip file into ``file_or_path`` with the contents of this Data Package and its resources. Every resource which content lives in the local filesystem will be copied to the zip file. Consider the following Data Package descriptor: ```json { "name": "gdp", "resources": [ {"name": "local", "format": "CSV", "path": "data.csv"}, {"name": "inline", "data": [4, 8, 15, 16, 23, 42]}, {"name": "remote", "url": "http://someplace.com/data.csv"} ] } ``` The final structure of the zip file will be: ``` ./datapackage.json ./data/local.csv ``` With the contents of `datapackage.json` being the same as returned `datapackage.descriptor`. The resources' file names are generated based on their `name` and `format` fields if they exist. If the resource has no `name`, it'll be used `resource-X`, where `X` is the index of the resource in the `resources` list (starting at zero). If the resource has `format`, it'll be lowercased and appended to the `name`, becoming "`name.format`". 
### `Resource`

```python
Resource(self, descriptor={}, base_path=None, strict=False, unsafe=False, storage=None, package=None, **options)
```

Resource representation

__Arguments__
- __descriptor (str/dict)__: data resource descriptor as a local path, URL or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` raises errors on any operation with an invalid descriptor
- __unsafe (bool)__: if `True`, unsafe paths are allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `resource.data`

Return resource data

#### `resource.descriptor`

Resource's descriptor

__Returns__

`dict`: descriptor

#### `resource.errors`

Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors

#### `resource.group`

Group name

__Returns__

`str`: group name

#### `resource.headers`

Resource's headers

> Only for tabular resources (reading has to be started first or it's `None`)

__Returns__

`str[]/None`: returns data source headers

#### `resource.inline`

Whether the resource is inline

__Returns__

`bool`: returns true if the resource is inline

#### `resource.local`

Whether the resource is local

__Returns__

`bool`: returns true if the resource is local

#### `resource.multipart`

Whether the resource is multipart

__Returns__

`bool`: returns true if the resource is multipart

#### `resource.name`

Resource name

__Returns__

`str`: name

#### `resource.package`

Package instance if the resource belongs to some package

__Returns__

`Package/None`: a package instance if available

#### `resource.profile`

Resource's profile

__Returns__

`Profile`: an instance of the `Profile` class

#### `resource.remote`

Whether the resource is remote

__Returns__

`bool`: returns true if the resource is remote

#### `resource.schema`

Resource's schema

> Only for tabular resources

For tabular resources it returns a `Schema` instance to interact with the data schema. Read the API documentation - [tableschema.Schema](https://github.com/frictionlessdata/tableschema-py#schema).

__Returns__

`tableschema.Schema`: schema

#### `resource.source`

Resource's source

The combination of `resource.source` and `resource.inline/local/remote/multipart` provides a predictable interface to work with resource data.

__Returns__

`list/str`: returns the `data` or `path` property

#### `resource.table`

Return resource table

#### `resource.tabular`

Whether the resource is tabular

__Returns__

`bool`: returns true if the resource is tabular

#### `resource.valid`

Validation status

Always true in strict mode.

__Returns__

`bool`: validation status
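To tie the flags above together, a minimal sketch with an inline resource (the `numbers` name and data are invented for this example):

```python
from datapackage import Resource

resource = Resource({'name': 'numbers', 'data': [['id'], [1], [2]]})

resource.inline   # True
resource.tabular  # True
resource.infer()  # fills in profile, encoding, schema, etc.
resource.read(keyed=True)  # [{'id': 1}, {'id': 2}]
```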
#### `resource.iter`

```python
resource.iter(integrity=False, relations=False, **options)
```

Iterates through the resource data and emits rows cast based on the table schema.

> Only for tabular resources

__Arguments__
- __keyed (bool)__: yield keyed rows in the form of `{header1: value1, header2: value2}` (default is false; the form of rows is `[value1, value2]`)
- __extended (bool)__: yield extended rows in the form of `[rowNumber, [header1, header2], [value1, value2]]` (default is false; the form of rows is `[value1, value2]`)
- __cast (bool)__: disable data casting if false (default is true)
- __integrity (bool)__: if true, the actual size in BYTES and the SHA256 hash of the file will be checked against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently)
- __relations (bool)__: if true, foreign key fields will be checked and resolved to their references
- __foreign_keys_values (dict)__: three-level dictionary of foreign key references optimized to speed up the validation process, in the form of `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`. If not provided but `relations` is true, it will be created before the validation process by the *index_foreign_keys_values* method
- __exc_handler (func)__: optional custom exception handler callable. Can be used to defer raising errors (i.e. "fail late"), e.g. for data validation purposes. Must support the signature below

__Custom exception handler__

```python
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception): Deferred exception instance
        row_number(int): Data row number that triggers exception exc
        row_data(OrderedDict): Invalid data row source data
        error_data(OrderedDict): Data row source data field subset
            responsible for the error, if applicable (e.g. invalid
            primary or foreign key fields). May be identical to row_data.
    '''
    # ...
```

__Raises__
- `DataPackageException`: base class of any error
- `CastError`: data cast error
- `IntegrityError`: integrity checking error
- `UniqueKeyError`: unique key constraint violation
- `UnresolvedFKError`: unresolved foreign key reference error

__Returns__

`Iterator[list]`: yields rows

#### `resource.read`

```python
resource.read(integrity=False, relations=False, foreign_keys_values=False, **options)
```

Read the whole resource and return it as an array of rows

> Only for tabular resources
> It has the same API as `resource.iter`, except for:

__Arguments__
- __limit (int)__: limit the count of rows to read and return

__Returns__

`list[]`: returns rows
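A short, hedged sketch of the row forms produced by the `keyed` and `extended` options (assuming `resource` is a tabular `Resource` with an inferred schema):

```python
for row in resource.iter(keyed=True):
    print(row)  # {'header1': value1, 'header2': value2}

for number, headers, values in resource.iter(extended=True):
    print(number, headers, values)  # 1 ['header1', ...] [value1, ...]

rows = resource.read(limit=2)  # at most two cast rows as [value1, value2]
```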
#### `resource.check_integrity`

```python
resource.check_integrity()
```

Checks resource integrity

> Only for tabular resources

It checks the size in BYTES and the SHA256 hash of the file against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently).

__Raises__
- `exceptions.IntegrityError`: raises if there are integrity issues

__Returns__

`bool`: returns True if there are no issues

#### `resource.check_relations`

```python
resource.check_relations(foreign_keys_values=False)
```

Check relations

> Only for tabular resources

It checks foreign keys and raises an exception if there are integrity issues.

__Raises__
- `exceptions.RelationError`: raises if there are relation issues

__Returns__

`bool`: returns True if there are no issues

#### `resource.drop_relations`

```python
resource.drop_relations()
```

Drop relations

> Only for tabular resources

Remove relations data from memory

__Returns__

`bool`: returns True

#### `resource.raw_iter`

```python
resource.raw_iter(stream=False)
```

Iterate over data chunks as bytes. If `stream` is true, a file-like object will be returned.

__Arguments__
- __stream (bool)__: return a file-like object

__Returns__

`bytes[]/filelike`: returns bytes[]/filelike

#### `resource.raw_read`

```python
resource.raw_read()
```

Returns resource data as bytes.

__Returns__

`bytes`: returns resource data in bytes

#### `resource.infer`

```python
resource.infer(**options)
```

Infer resource metadata

Like name, format, mediatype, encoding, schema and profile. It commits these changes to the resource instance.

__Arguments__
- __options__: options will be passed to the `tableschema.infer` call, for more control on the results (e.g. for setting `limit`, `confidence` etc.).

__Returns__

`dict`: returns the resource descriptor

#### `resource.commit`

```python
resource.commit(strict=None)
```

Update the resource instance if there are in-place changes in the descriptor.

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified

#### `resource.save`

```python
resource.save(target, storage=None, to_base_path=False, **options)
```

Saves this resource

It saves to a storage if the `storage` argument is passed, or saves this resource's descriptor to a JSON file otherwise.

__Arguments__
- __target (str)__: path where to save the resource
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __to_base_path (bool)__: save the resource to the resource's base path using the "/" route
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success

### `Group`

```python
Group(self, resources)
```

Group representation

__Arguments__
- __resources (Resource[])__: list of TABULAR resources

#### `group.headers`

Group's headers

__Returns__

`str[]/None`: returns headers

#### `group.name`

Group name

__Returns__

`str`: name

#### `group.schema`

Group's schema

__Returns__

`tableschema.Schema`: schema

#### `group.iter`

```python
group.iter(**options)
```

Iterates through the group data and emits rows cast based on the table schema.

> It concatenates all the resources and has the same API as `resource.iter`

#### `group.read`

```python
group.read(limit=None, **options)
```

Read the whole group and return it as an array of rows

> It concatenates all the resources and has the same API as `resource.read`

#### `group.check_relations`

```python
group.check_relations()
```

Check the group's relations

The same as `resource.check_relations` but without the optional argument *foreign_keys_values*. This method will test the foreignKeys of the whole group at once, optimizing the process by creating the foreign_keys_values hashmap only once before testing the set of resources.
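A hedged usage sketch, assuming a `datapackage.json` whose resources share a `group` field named `cars` (as in the [Working with Group](#working-with-group) section):

```python
from datapackage import Package

package = Package('datapackage.json')
group = package.get_group('cars')

group.name                     # 'cars'
rows = group.read(keyed=True)  # rows of all grouped resources, concatenated
```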
### `Profile`

```python
Profile(self, profile)
```

Profile representation

__Arguments__
- __profile (str)__: profile name in the registry or a URL to a JSON Schema

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `profile.jsonschema`

JSONSchema content

__Returns__

`dict`: returns the profile's JSON Schema contents

#### `profile.name`

Profile name

__Returns__

`str/None`: name if available

#### `profile.validate`

```python
profile.validate(descriptor)
```

Validate a data package `descriptor` against the profile.

__Arguments__
- __descriptor (dict)__: retrieved and dereferenced data package descriptor

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns True if valid

### `validate`

```python
validate(descriptor)
```

Validate a data package descriptor.

__Arguments__
- __descriptor (str/dict)__: package descriptor (one of):
  - local path
  - remote url
  - object

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns True if valid

### `infer`

```python
infer(pattern, base_path=None)
```

Infer a data package descriptor.

> Argument `pattern` works only for local files

__Arguments__
- __pattern (str)__: glob file pattern

__Returns__

`dict`: returns the data package descriptor

### `DataPackageException`

```python
DataPackageException(self, message, errors=[])
```

Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

```python
try:
    pass  # lib action
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            pass  # handle error
```

#### `datapackageexception.errors`

List of nested errors

__Returns__

`DataPackageException[]`: list of nested errors

#### `datapackageexception.multiple`

Whether it's a nested exception

__Returns__

`bool`: whether it's a nested exception

### `TableSchemaException`

```python
TableSchemaException(self, message, errors=[])
```

Base class for all TableSchema exceptions.

### `LoadError`

```python
LoadError(self, message, errors=[])
```

All loading errors.

### `CastError`

```python
CastError(self, message, errors=[])
```

All value cast errors.

### `IntegrityError`

```python
IntegrityError(self, message, errors=[])
```

All integrity errors.

### `RelationError`

```python
RelationError(self, message, errors=[])
```

All relations errors.

### `StorageError`

```python
StorageError(self, message, errors=[])
```

All storage errors.
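Because the specific exception classes above subclass `DataPackageException`, callers can catch narrow errors first and fall back to the base class; a hedged sketch (the file and resource names are illustrative):

```python
from datapackage import Package, exceptions

try:
    package = Package('datapackage.json', strict=True)
    package.get_resource('data').read(integrity=True, relations=True)
except exceptions.CastError as exception:
    print('cast error:', exception)
except exceptions.DataPackageException as exception:
    print('other error:', exception)
```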
## Contributing

> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

```bash
$ make install
```

To run tests with linting and coverage:

```bash
$ make test
```

## Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/datapackage-py/commits/master).

#### v1.15

> WARNING: it can be breaking for some setups, please read the discussions below

- Fixed header management according to the specs:
  - https://github.com/frictionlessdata/datapackage-py/pull/257
  - https://github.com/frictionlessdata/datapackage-py/issues/256
  - https://github.com/frictionlessdata/forum/issues/1

#### v1.14

- Added experimental options for picking/skipping fields/rows

#### v1.13

- Added the `unsafe` option to Package and Resource (#262)

#### v1.12

- Use `chardet` for encoding detection by default. For `cchardet`: `pip install datapackage[cchardet]`

#### v1.11

- `resource/package.save` now accepts a `to_base_path` argument (#254)
- `package.save` now returns a `Storage` instance if available

#### v1.10

- Added the ability to check a tabular resource's integrity

#### v1.9

- Added the `resource.package` property

#### v1.8

- Added support for [groups of resources](#group)

#### v1.7

- Added support for [compression of resources](https://frictionlessdata.io/specs/patterns/#compression-of-resources)

#### v1.6

- Added support for a custom request session

#### v1.5

Updated behaviour:

- Added support for Python 3.7

#### v1.4

New API added:

- Added `skip_rows` support to the resource descriptor

#### v1.3

New API added:

- The `package.base_path` property is now publicly available

#### v1.2

Updated behaviour:

- The CLI command `$ datapackage infer` now outputs only a JSON-formatted data package descriptor.

#### v1.1

New API added:

- Added an integration between `Package/Resource` and `tableschema.Storage` - https://github.com/frictionlessdata/tableschema-py#storage. It allows loading and saving data packages from/to different storages like SQL/BigQuery/etc.

%package help
Summary: Development documents and examples for datapackage
Provides: python3-datapackage-doc
%description help
# datapackage-py

[![Travis](https://travis-ci.org/frictionlessdata/datapackage-py.svg?branch=master)](https://travis-ci.org/frictionlessdata/datapackage-py)
[![Coveralls](https://coveralls.io/repos/github/frictionlessdata/datapackage-py/badge.svg?branch=master)](https://coveralls.io/github/frictionlessdata/datapackage-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/datapackage.svg)](https://pypi.python.org/pypi/datapackage)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/datapackage-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)

A library for working with [Data Packages](http://specs.frictionlessdata.io/data-package/).

> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework provides improved `datapackage` functionality extended to be a complete data solution. The change is not breaking for existing software, so no actions are required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `datapackage` to Frictionless Framework.
> - we continue to bug-fix `datapackage@1.x` in this [repository](https://github.com/frictionlessdata/datapackage-py), and it remains available on [PyPi](https://pypi.org/project/datapackage/) as before
> - please note that the `frictionless@3.x` version's API, which we're working on at the moment, is not stable
> - we will release `frictionless@4.x` by the end of 2020 to be the first SemVer/stable version

## Features

- `Package` class for working with data packages
- `Resource` class for working with data resources
- `Profile` class for working with profiles
- `validate` function for validating data package descriptors
- `infer` function for inferring data package descriptors

## Contents

- [Getting Started](#getting-started)
  - [Installation](#installation)
- [Documentation](#documentation)
  - [Introduction](#introduction)
  - [Working with Package](#working-with-package)
  - [Working with Resource](#working-with-resource)
  - [Working with Group](#working-with-group)
  - [Working with Profile](#working-with-profile)
  - [Working with Foreign Keys](#working-with-foreign-keys)
  - [Working with validate/infer](#working-with-validateinfer)
  - [Frequently Asked Questions](#frequently-asked-questions)
- [API Reference](#api-reference)
  - [`cli`](#cli)
  - [`Package`](#package)
  - [`Resource`](#resource)
  - [`Group`](#group)
  - [`Profile`](#profile)
  - [`validate`](#validate)
  - [`infer`](#infer)
  - [`DataPackageException`](#datapackageexception)
  - [`TableSchemaException`](#tableschemaexception)
  - [`LoadError`](#loaderror)
  - [`CastError`](#casterror)
  - [`IntegrityError`](#integrityerror)
  - [`RelationError`](#relationerror)
  - [`StorageError`](#storageerror)
- [Contributing](#contributing)
- [Changelog](#changelog)

## Getting Started

### Installation

The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a `datapackage` version range in your `setup/requirements` file, e.g. `datapackage>=1.0,<2.0`.

```bash
$ pip install datapackage
```

#### OSX 10.14+

If you receive an error about the `cchardet` package when installing datapackage on Mac OSX 10.14 (Mojave) or higher, follow these steps:

1. Make sure you have the latest Xcode by running the following in a terminal: `xcode-select --install`
2. Then go to [https://developer.apple.com/download/more/](https://developer.apple.com/download/more/) and download the `command line tools`. Note, this requires an Apple ID.
3. Then, in a terminal, run `open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg`

You can read more about these steps in this [post](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave).

## Documentation

### Introduction

Let's start with a simple example:

```python
from datapackage import Package

package = Package('datapackage.json')
package.get_resource('resource').read()
```

### Working with Package

A class for working with data packages. It provides various capabilities like loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more.

Consider we have some local csv files in a `data` directory.
Let's create a data package based on this data using a `Package` class:

> data/cities.csv

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
```

> data/population.csv

```csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
```

First we create a blank data package:

```python
package = Package()
```

Now we're ready to infer a data package descriptor based on the data files we have. Because we have two csv files we use the glob pattern `**/*.csv`:

```python
package.infer('**/*.csv')
package.descriptor
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }
```

The `infer` method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak it a little bit:

```python
package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year'
package.commit()
package.valid # true
```

Because our resources are tabular we can read them as tabular data:

```python
package.get_resource('population').read(keyed=True)
#[ { city: 'london', year: 2017, population: 8780000 },
#  { city: 'paris', year: 2017, population: 2240000 },
#  { city: 'rome', year: 2017, population: 2860000 } ]
```

Let's save our descriptor to disk as a zip file:

```python
package.save('datapackage.zip')
```

To continue the work with the data package we just load it again, but this time using the local `datapackage.zip`:

```python
package = Package('datapackage.zip')
# Continue the work
```

That was only a basic introduction to the `Package` class. To learn more, take a look at the `Package` class API reference.

### Working with Resource

A class for working with data resources. You can read or iterate tabular resources using the `iter/read` methods, and read a whole resource as bytes using the `raw_iter/raw_read` methods.

Consider we have some local csv file. It could also be inline data or a remote link - all are supported by the `Resource` class. But say it's `data.csv` for now:

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
```

Let's create and read a resource. Because the resource is tabular we can use the `resource.read` method with a `keyed` option to get an array of keyed rows:

```python
resource = Resource({'path': 'data.csv'})
resource.tabular # true
resource.read(keyed=True)
# [
#   {city: 'london', location: '51.50,-0.11'},
#   {city: 'paris', location: '48.85,2.30'},
#   {city: 'rome', location: 'N/A'},
# ]
resource.headers
# ['city', 'location']
# (reading has to be started first)
```

As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, but it's just an `N/A` string instead of Python `None`. First we have to infer the resource metadata:

```python
resource.infer()
resource.descriptor
#{ path: 'data.csv',
#  profile: 'tabular-data-resource',
#  encoding: 'utf-8',
#  name: 'data',
#  format: 'csv',
#  mediatype: 'text/csv',
#  schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
resource.read(keyed=True)
# Fails with a data validation error
```

Let's fix the not-available location. There is a `missingValues` property in the Table Schema specification.
As a first try we set `missingValues` to `N/A` in `resource.descriptor.schema`. The resource descriptor can be changed in-place, but all changes have to be committed by `resource.commit()`:

```python
resource.descriptor['schema']['missingValues'] = 'N/A'
resource.commit()
resource.valid # False
resource.errors # [ValidationError, ...]
```

Being good citizens, we've decided to check our resource descriptor's validity. And it's not valid! We should use an array for the `missingValues` property. Also, don't forget to include the empty string as a missing value:

```python
resource.descriptor['schema']['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid # true
```

All good. It looks like we're ready to read our data again:

```python
resource.read(keyed=True)
# [
#   {city: 'london', location: [51.50, -0.11]},
#   {city: 'paris', location: [48.85, 2.30]},
#   {city: 'rome', location: None},
# ]
```

Now we see that:
- locations are arrays with numeric latitude and longitude
- Rome's location is a Python `None`

And because there are no errors on data reading, we can be sure that our data is valid against our schema. Let's save our resource descriptor:

```python
resource.save('dataresource.json')
```

Let's check the newly-created `dataresource.json`. It contains the path to our data file, the inferred metadata and our `missingValues` tweak:

```json
{
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "name": "data",
    "format": "csv",
    "mediatype": "text/csv",
    "schema": {
        "fields": [
            {
                "name": "city",
                "type": "string",
                "format": "default"
            },
            {
                "name": "location",
                "type": "geopoint",
                "format": "default"
            }
        ],
        "missingValues": [
            "",
            "N/A"
        ]
    }
}
```

If we decide to improve it even more, we can update the `dataresource.json` file and then open it again using the local file name:

```python
resource = Resource('dataresource.json')
# Continue the work
```

That was only a basic introduction to the `Resource` class. To learn more, take a look at the `Resource` class API reference.

### Working with Group

A class representing a group of tabular resources. Groups can be used to read multiple resources as one, or to export them, for example, to a database as one table. To define a group, add a `group` field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).

Consider we have a data package with two tables partitioned by year and a shared schema stored separately:

> cars-2017.csv

```csv
name,value
bmw,2017
tesla,2017
nissan,2017
```

> cars-2018.csv

```csv
name,value
bmw,2018
tesla,2018
nissan,2018
```

> cars.schema.json

```json
{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}
```

> datapackage.json

```json
{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        }
    ]
}
```

Let's read the resources separately:

```python
package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

On the other hand, these resources are defined with a `group: cars` field.
It means we can treat them as a group:

```python
package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

We can use this approach when we need to save the data package to a storage, for example, to a SQL database. There is a `merge_groups` flag to enable the grouping behaviour:

```python
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars
```

### Working with Profile

A component representing a JSON Schema profile from the [Profiles Registry](https://specs.frictionlessdata.io/schemas/registry.json):

```python
profile = Profile('data-package')

profile.name # data-package
profile.jsonschema # JSON Schema contents

try:
    valid = profile.validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        pass  # handle individual error
```

### Working with Foreign Keys

The library supports foreign keys described in the [Table Schema](http://specs.frictionlessdata.io/table-schema/#foreign-keys) specification. It means that if your data package descriptor uses the `resources[].schema.foreignKeys` property for some resources, data integrity will be checked on reading operations.

Consider we have a data package:

```python
DESCRIPTOR = {
    'resources': [
        {
            'name': 'teams',
            'data': [
                ['id', 'name', 'city'],
                ['1', 'Arsenal', 'London'],
                ['2', 'Real', 'Madrid'],
                ['3', 'Bayern', 'Munich'],
            ],
            'schema': {
                'fields': [
                    {'name': 'id', 'type': 'integer'},
                    {'name': 'name', 'type': 'string'},
                    {'name': 'city', 'type': 'string'},
                ],
                'foreignKeys': [
                    {
                        'fields': 'city',
                        'reference': {'resource': 'cities', 'fields': 'name'},
                    },
                ],
            },
        },
        {
            'name': 'cities',
            'data': [
                ['name', 'country'],
                ['London', 'England'],
                ['Madrid', 'Spain'],
            ],
        },
    ],
}
```

Let's check relations for the `teams` resource:

```python
from datapackage import Package

package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
```

As we can see, there is a foreign key violation. That's because our lookup table `cities` doesn't have the city of `Munich`, but we have a team from there. We need to fix it in the `cities` resource:

```python
package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True
```

Fixed! But not only the check operation is available. We can use the `relations` argument for the `resource.iter/read` methods to dereference a resource's relations:

```python
teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
```

Instead of a plain city name we've got a dictionary containing the city data. These `resource.iter/read` methods will fail with the same error as `resource.check_relations` if there is an integrity issue, but only if the `relations=True` flag is passed.
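If you'd rather collect all violations than stop on the first one, the `exc_handler` hook documented under `resource.iter` can be used to fail late; a hedged sketch, reusing the `teams` resource from above:

```python
collected = []

def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    # record the deferred exception instead of raising it immediately
    collected.append((row_number, exc))

rows = teams.read(keyed=True, relations=True, exc_handler=exc_handler)
for row_number, exc in collected:
    print(row_number, exc)
```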
### Working with validate/infer

A standalone function to validate a data package descriptor:

```python
from datapackage import validate, exceptions

try:
    valid = validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        pass  # handle individual error
```

A standalone function to infer a data package descriptor:

```python
from datapackage import infer

descriptor = infer('**/*.csv')
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }
```

### Frequently Asked Questions

#### Accessing data behind a proxy server?

Before the `package = Package("https://xxx.json")` call, set these environment variables:

```python
import os
os.environ["HTTP_PROXY"] = 'xxx'
os.environ["HTTPS_PROXY"] = 'xxx'
```

## API Reference

### `cli`

```python
cli()
```

Command-line interface

```
Usage: datapackage [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  infer
  validate
```

### `Package`

```python
Package(self, descriptor=None, base_path=None, strict=False, unsafe=False, storage=None, schema=None, default_base_path=None, **options)
```

Package representation

__Arguments__
- __descriptor (str/dict)__: data package descriptor as a local path, URL or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` raises errors on any operation with an invalid descriptor
- __unsafe (bool)__: if `True`, unsafe paths are allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `package.base_path`

Package's base path

__Returns__

`str/None`: returns the data package base path

#### `package.descriptor`

Package's descriptor

__Returns__

`dict`: descriptor

#### `package.errors`

Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors

#### `package.profile`

Package's profile

__Returns__

`Profile`: an instance of the `Profile` class

#### `package.resource_names`

Package's resource names

__Returns__

`str[]`: returns an array of resource names

#### `package.resources`

Package's resources

__Returns__

`Resource[]`: returns an array of `Resource` instances

#### `package.valid`

Validation status

Always true in strict mode.

__Returns__

`bool`: validation status

#### `package.get_resource`

```python
package.get_resource(name)
```

Get a data package resource by name.

__Arguments__
- __name (str)__: data resource name

__Returns__

`Resource/None`: returns a `Resource` instance or `None` if not found

#### `package.add_resource`

```python
package.add_resource(descriptor)
```

Add a new resource to the data package. The data package descriptor will be validated with the newly added resource descriptor.
__Arguments__
- __descriptor (dict)__: data resource descriptor

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the added `Resource` instance or `None` if not added

#### `package.remove_resource`

```python
package.remove_resource(name)
```

Remove a data package resource by name. The data package descriptor will be validated after the resource descriptor's removal.

__Arguments__
- __name (str)__: data resource name

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the removed `Resource` instance or `None` if not found

#### `package.get_group`

```python
package.get_group(name)
```

Returns a group of tabular resources by name. For more information about groups see [Group](#group).

__Arguments__
- __name (str)__: name of a group of resources

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Group/None`: returns a `Group` instance or `None` if not found

#### `package.infer`

```python
package.infer(pattern=False)
```

Infer data package metadata.

> Argument `pattern` works only for local files

If `pattern` is not provided, only existing resources will be inferred (adding metadata like encoding, profile etc). If `pattern` is provided, new resources with file names matching the pattern will be added and inferred. It commits changes to the data package instance.

__Arguments__
- __pattern (str)__: glob pattern for new resources

__Returns__

`dict`: returns the data package descriptor

#### `package.commit`

```python
package.commit(strict=None)
```

Update the data package instance if there are in-place changes in the descriptor.

__Example__

```python
package = Package({
    'name': 'package',
    'resources': [{'name': 'resource', 'data': ['data']}]
})

package.name # package
package.descriptor['name'] = 'renamed-package'
package.name # package
package.commit()
package.name # renamed-package
```

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified

#### `package.save`

```python
package.save(target=None, storage=None, merge_groups=False, to_base_path=False, **options)
```

Saves this data package

It saves to a storage if the `storage` argument is passed, saves this data package's descriptor to a JSON file if the `target` argument ends with `.json`, or saves this data package to a zip file otherwise.

__Example__

It creates a zip file into `target` with the contents of this Data Package and its resources. Every resource whose content lives in the local filesystem will be copied to the zip file. Consider the following Data Package descriptor:

```json
{
    "name": "gdp",
    "resources": [
        {"name": "local", "format": "CSV", "path": "data.csv"},
        {"name": "inline", "data": [4, 8, 15, 16, 23, 42]},
        {"name": "remote", "url": "http://someplace.com/data.csv"}
    ]
}
```

The final structure of the zip file will be:

```
./datapackage.json
./data/local.csv
```

With the contents of `datapackage.json` being the same as returned by `datapackage.descriptor`. The resources' file names are generated based on their `name` and `format` fields if they exist. If a resource has no `name`, `resource-X` is used instead, where `X` is the index of the resource in the `resources` list (starting at zero). If a resource has a `format`, it is lowercased and appended to the `name`, becoming "`name.format`".
__Arguments__
- __target (string/filelike)__: the file path or a file-like object where the contents of this Data Package will be saved into.
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __merge_groups (bool)__: save all of a group's tabular resources into one bucket if a storage is provided (for example into one SQL table). Read more about [Group](#group).
- __to_base_path (bool)__: save the package to the package's base path using the "/" route
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises if there was some error writing the package

__Returns__

`bool/Storage`: on success returns true or a `Storage` instance

### `Resource`

```python
Resource(self, descriptor={}, base_path=None, strict=False, unsafe=False, storage=None, package=None, **options)
```

Resource representation

__Arguments__
- __descriptor (str/dict)__: data resource descriptor as a local path, URL or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior. Setting it to `True` raises errors on any operation with an invalid descriptor
- __unsafe (bool)__: if `True`, unsafe paths are allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `resource.data`

Return resource data

#### `resource.descriptor`

Resource's descriptor

__Returns__

`dict`: descriptor

#### `resource.errors`

Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors

#### `resource.group`

Group name

__Returns__

`str`: group name

#### `resource.headers`

Resource's headers

> Only for tabular resources (reading has to be started first or it's `None`)

__Returns__

`str[]/None`: returns data source headers

#### `resource.inline`

Whether the resource is inline

__Returns__

`bool`: returns true if the resource is inline

#### `resource.local`

Whether the resource is local

__Returns__

`bool`: returns true if the resource is local

#### `resource.multipart`

Whether the resource is multipart

__Returns__

`bool`: returns true if the resource is multipart

#### `resource.name`

Resource name

__Returns__

`str`: name

#### `resource.package`

Package instance if the resource belongs to some package

__Returns__

`Package/None`: a package instance if available

#### `resource.profile`

Resource's profile

__Returns__

`Profile`: an instance of the `Profile` class

#### `resource.remote`

Whether the resource is remote

__Returns__

`bool`: returns true if the resource is remote

#### `resource.schema`

Resource's schema

> Only for tabular resources

For tabular resources it returns a `Schema` instance to interact with the data schema. Read the API documentation - [tableschema.Schema](https://github.com/frictionlessdata/tableschema-py#schema).

__Returns__

`tableschema.Schema`: schema

#### `resource.source`

Resource's source

The combination of `resource.source` and `resource.inline/local/remote/multipart` provides a predictable interface to work with resource data.

__Returns__

`list/str`: returns the `data` or `path` property

#### `resource.table`

Return resource table

#### `resource.tabular`

Whether the resource is tabular

__Returns__

`bool`: returns true if the resource is tabular

#### `resource.valid`

Validation status

Always true in strict mode.
__Returns__

`bool`: validation status

#### `resource.iter`

```python
resource.iter(integrity=False, relations=False, **options)
```

Iterates through the resource data and emits rows cast based on the table schema.

> Only for tabular resources

__Arguments__
- __keyed (bool)__: yield keyed rows in the form of `{header1: value1, header2: value2}` (default is false; the form of rows is `[value1, value2]`)
- __extended (bool)__: yield extended rows in the form of `[rowNumber, [header1, header2], [value1, value2]]` (default is false; the form of rows is `[value1, value2]`)
- __cast (bool)__: disable data casting if false (default is true)
- __integrity (bool)__: if true, the actual size in BYTES and the SHA256 hash of the file will be checked against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently)
- __relations (bool)__: if true, foreign key fields will be checked and resolved to their references
- __foreign_keys_values (dict)__: three-level dictionary of foreign key references optimized to speed up the validation process, in the form of `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`. If not provided but `relations` is true, it will be created before the validation process by the *index_foreign_keys_values* method
- __exc_handler (func)__: optional custom exception handler callable. Can be used to defer raising errors (i.e. "fail late"), e.g. for data validation purposes. Must support the signature below

__Custom exception handler__

```python
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception): Deferred exception instance
        row_number(int): Data row number that triggers exception exc
        row_data(OrderedDict): Invalid data row source data
        error_data(OrderedDict): Data row source data field subset
            responsible for the error, if applicable (e.g. invalid
            primary or foreign key fields). May be identical to row_data.
    '''
    # ...
```

__Raises__
- `DataPackageException`: base class of any error
- `CastError`: data cast error
- `IntegrityError`: integrity checking error
- `UniqueKeyError`: unique key constraint violation
- `UnresolvedFKError`: unresolved foreign key reference error

__Returns__

`Iterator[list]`: yields rows

#### `resource.read`

```python
resource.read(integrity=False, relations=False, foreign_keys_values=False, **options)
```

Read the whole resource and return it as an array of rows

> Only for tabular resources
> It has the same API as `resource.iter`, except for:

__Arguments__
- __limit (int)__: limit the count of rows to read and return

__Returns__

`list[]`: returns rows

#### `resource.check_integrity`

```python
resource.check_integrity()
```

Checks resource integrity

> Only for tabular resources

It checks the size in BYTES and the SHA256 hash of the file against `descriptor.bytes` and `descriptor.hash` (other hashing algorithms are not supported and will be skipped silently).

__Raises__
- `exceptions.IntegrityError`: raises if there are integrity issues

__Returns__

`bool`: returns True if there are no issues

#### `resource.check_relations`

```python
resource.check_relations(foreign_keys_values=False)
```

Check relations

> Only for tabular resources

It checks foreign keys and raises an exception if there are integrity issues.
__Raises__
- `exceptions.RelationError`: raises if there are relation issues

__Returns__

`bool`: returns True if there are no issues

#### `resource.drop_relations`

```python
resource.drop_relations()
```

Drop relations

> Only for tabular resources

Remove relations data from memory

__Returns__

`bool`: returns True

#### `resource.raw_iter`

```python
resource.raw_iter(stream=False)
```

Iterate over data chunks as bytes. If `stream` is true, a file-like object will be returned.

__Arguments__
- __stream (bool)__: return a file-like object

__Returns__

`bytes[]/filelike`: returns bytes[]/filelike

#### `resource.raw_read`

```python
resource.raw_read()
```

Returns resource data as bytes.

__Returns__

`bytes`: returns resource data in bytes

#### `resource.infer`

```python
resource.infer(**options)
```

Infer resource metadata

Like name, format, mediatype, encoding, schema and profile. It commits these changes to the resource instance.

__Arguments__
- __options__: options will be passed to the `tableschema.infer` call, for more control on the results (e.g. for setting `limit`, `confidence` etc.).

__Returns__

`dict`: returns the resource descriptor

#### `resource.commit`

```python
resource.commit(strict=None)
```

Update the resource instance if there are in-place changes in the descriptor.

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified

#### `resource.save`

```python
resource.save(target, storage=None, to_base_path=False, **options)
```

Saves this resource

It saves to a storage if the `storage` argument is passed, or saves this resource's descriptor to a JSON file otherwise.

__Arguments__
- __target (str)__: path where to save the resource
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __to_base_path (bool)__: save the resource to the resource's base path using the "/" route
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success

### `Group`

```python
Group(self, resources)
```

Group representation

__Arguments__
- __resources (Resource[])__: list of TABULAR resources

#### `group.headers`

Group's headers

__Returns__

`str[]/None`: returns headers

#### `group.name`

Group name

__Returns__

`str`: name

#### `group.schema`

Group's schema

__Returns__

`tableschema.Schema`: schema

#### `group.iter`

```python
group.iter(**options)
```

Iterates through the group data and emits rows cast based on the table schema.

> It concatenates all the resources and has the same API as `resource.iter`

#### `group.read`

```python
group.read(limit=None, **options)
```

Read the whole group and return it as an array of rows

> It concatenates all the resources and has the same API as `resource.read`

#### `group.check_relations`

```python
group.check_relations()
```

Check the group's relations

The same as `resource.check_relations` but without the optional argument *foreign_keys_values*. This method will test the foreignKeys of the whole group at once, optimizing the process by creating the foreign_keys_values hashmap only once before testing the set of resources.
### `Profile`

```python
Profile(self, profile)
```

Profile representation

__Arguments__
- __profile (str)__: profile name in the registry or a URL to a JSON Schema

__Raises__
- `DataPackageException`: raises an error if something goes wrong

#### `profile.jsonschema`

JSONSchema content

__Returns__

`dict`: returns the profile's JSON Schema contents

#### `profile.name`

Profile name

__Returns__

`str/None`: name if available

#### `profile.validate`

```python
profile.validate(descriptor)
```

Validate a data package `descriptor` against the profile.

__Arguments__
- __descriptor (dict)__: retrieved and dereferenced data package descriptor

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns True if valid

### `validate`

```python
validate(descriptor)
```

Validate a data package descriptor.

__Arguments__
- __descriptor (str/dict)__: package descriptor (one of):
  - local path
  - remote url
  - object

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns True if valid

### `infer`

```python
infer(pattern, base_path=None)
```

Infer a data package descriptor.

> Argument `pattern` works only for local files

__Arguments__
- __pattern (str)__: glob file pattern

__Returns__

`dict`: returns the data package descriptor

### `DataPackageException`

```python
DataPackageException(self, message, errors=[])
```

Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

```python
try:
    pass  # lib action
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            pass  # handle error
```

#### `datapackageexception.errors`

List of nested errors

__Returns__

`DataPackageException[]`: list of nested errors

#### `datapackageexception.multiple`

Whether it's a nested exception

__Returns__

`bool`: whether it's a nested exception

### `TableSchemaException`

```python
TableSchemaException(self, message, errors=[])
```

Base class for all TableSchema exceptions.

### `LoadError`

```python
LoadError(self, message, errors=[])
```

All loading errors.

### `CastError`

```python
CastError(self, message, errors=[])
```

All value cast errors.

### `IntegrityError`

```python
IntegrityError(self, message, errors=[])
```

All integrity errors.

### `RelationError`

```python
RelationError(self, message, errors=[])
```

All relations errors.

### `StorageError`

```python
StorageError(self, message, errors=[])
```

All storage errors.

## Contributing

> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

```bash
$ make install
```

To run tests with linting and coverage:

```bash
$ make test
```

## Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/datapackage-py/commits/master).
#### v1.15

> WARNING: it can be breaking for some setups, please read the discussions below

- Fixed header management according to the specs:
  - https://github.com/frictionlessdata/datapackage-py/pull/257
  - https://github.com/frictionlessdata/datapackage-py/issues/256
  - https://github.com/frictionlessdata/forum/issues/1

#### v1.14

- Added experimental options for picking/skipping fields/rows

#### v1.13

- Added the `unsafe` option to Package and Resource (#262)

#### v1.12

- Use `chardet` for encoding detection by default. For `cchardet`: `pip install datapackage[cchardet]`

#### v1.11

- `resource/package.save` now accepts a `to_base_path` argument (#254)
- `package.save` now returns a `Storage` instance if available

#### v1.10

- Added the ability to check a tabular resource's integrity

#### v1.9

- Added the `resource.package` property

#### v1.8

- Added support for [groups of resources](#group)

#### v1.7

- Added support for [compression of resources](https://frictionlessdata.io/specs/patterns/#compression-of-resources)

#### v1.6

- Added support for a custom request session

#### v1.5

Updated behaviour:

- Added support for Python 3.7

#### v1.4

New API added:

- Added `skip_rows` support to the resource descriptor

#### v1.3

New API added:

- The `package.base_path` property is now publicly available

#### v1.2

Updated behaviour:

- The CLI command `$ datapackage infer` now outputs only a JSON-formatted data package descriptor.

#### v1.1

New API added:

- Added an integration between `Package/Resource` and `tableschema.Storage` - https://github.com/frictionlessdata/tableschema-py#storage. It allows loading and saving data packages from/to different storages like SQL/BigQuery/etc.

%prep
%autosetup -n datapackage-1.15.2

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-datapackage -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot - 1.15.2-1
- Package Spec generated