| author | CoprDistGit <infra@openeuler.org> | 2023-05-18 05:34:31 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-18 05:34:31 +0000 |
| commit | bebd90bc66cbcfb0590683f370e21103d70ee3c9 (patch) | |
| tree | 1394a19a700c2824010e5ad614ab88ce6a74804c | |
| parent | 622b5c2729b799428fb2cfe53b24023b3084b153 (diff) | |
automatic import of python-datavalid
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-datavalid.spec | 550 |
| -rw-r--r-- | sources | 1 |
3 files changed, 552 insertions, 0 deletions
@@ -0,0 +1 @@
+/datavalid-0.3.6.tar.gz
diff --git a/python-datavalid.spec b/python-datavalid.spec
new file mode 100644
index 0000000..a0955d9
--- /dev/null
+++ b/python-datavalid.spec
@@ -0,0 +1,550 @@
+%global _empty_manifest_terminate_build 0
+Name: python-datavalid
+Version: 0.3.6
+Release: 1
+Summary: Data validation library
+License: MIT License
+URL: https://github.com/pckhoi/datavalid
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/9d/26/458a8714b9eda5a7670af3ff7e86d1edd25885d8080c35ad17efa1e9bcaf/datavalid-0.3.6.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-pandas
+Requires: python3-pyyaml
+Requires: python3-termcolor
+
+%description
+# Datavalid
+
+This library allow you to declare validation tasks to check for CSV files. This ensure data correctness for ETL pipeline that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+A config file is a file named `datavalid.yml` and it must be placed in your root data folder. Your root data folder is the folder that contain all of your data files. Config file contains [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined then bad rows will just be output to terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column can not contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers. Lower bound and higher bound of what values are considered valid. Setting this imply `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generate a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensure that no row fulfill this condition.
+- **no_more_than_once_per_30_days**: optional, ensure that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensure that no row occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare
+- **op**: optional, compare operation to use. Possible value are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+Finally the last way is to provide `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together which mean the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package -n python3-datavalid
+Summary: Data validation library
+Provides: python-datavalid
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-datavalid
+# Datavalid
+
+This library allow you to declare validation tasks to check for CSV files. This ensure data correctness for ETL pipeline that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+A config file is a file named `datavalid.yml` and it must be placed in your root data folder. Your root data folder is the folder that contain all of your data files. Config file contains [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined then bad rows will just be output to terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column can not contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers. Lower bound and higher bound of what values are considered valid. Setting this imply `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generate a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensure that no row fulfill this condition.
+- **no_more_than_once_per_30_days**: optional, ensure that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensure that no row occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare
+- **op**: optional, compare operation to use. Possible value are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+Finally the last way is to provide `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together which mean the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package help
+Summary: Development documents and examples for datavalid
+Provides: python3-datavalid-doc
+%description help
+# Datavalid
+
+This library allow you to declare validation tasks to check for CSV files. This ensure data correctness for ETL pipeline that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+A config file is a file named `datavalid.yml` and it must be placed in your root data folder. Your root data folder is the folder that contain all of your data files. Config file contains [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined then bad rows will just be output to terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column can not contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers. Lower bound and higher bound of what values are considered valid. Setting this imply `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generate a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensure that no row fulfill this condition.
+- **no_more_than_once_per_30_days**: optional, ensure that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensure that no row occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare
+- **op**: optional, compare operation to use. Possible value are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+Finally the last way is to provide `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together which mean the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%prep
+%autosetup -n datavalid-0.3.6
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-datavalid -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.6-1
+- Package Spec generated
@@ -0,0 +1 @@
+b5c8731e77c657f3cf7b04228cd7008c datavalid-0.3.6.tar.gz
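
Reviewer note: the "Condition object" section of the imported README describes three condition forms (`column`/`op`/`value`, `and`, `or`) and an `empty` checker that passes only when no row fulfills the condition. The following is a minimal sketch of those semantics in plain Python, written only from that description. It is not datavalid's implementation, and the helper names `matches` and `offending_rows` are hypothetical. Rows are modelled as dictionaries, whereas the package itself depends on pandas.

```python
import operator
from typing import Any, Dict, List

# Operator names as listed under "Condition object" in the README.
OPS = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_equal": operator.ge,
    "less_equal": operator.le,
}


def matches(row: Dict[str, Any], cond: Dict[str, Any]) -> bool:
    """Return True if `row` fulfills the condition object `cond` (hypothetical helper)."""
    if "and" in cond:
        # Fulfilled only when every sub-condition is fulfilled.
        return all(matches(row, sub) for sub in cond["and"])
    if "or" in cond:
        # Fulfilled when any sub-condition is fulfilled.
        return any(matches(row, sub) for sub in cond["or"])
    # Simple form: compare one column against a value.
    return OPS[cond["op"]](row[cond["column"]], cond["value"])


def offending_rows(rows: List[Dict[str, Any]], cond: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Rows that would make an `empty` check fail, i.e. rows fulfilling `cond`."""
    return [row for row in rows if matches(row, cond)]


# The `empty` task from the README example: a "sustained" finding
# must not be paired with a non-"sustained" disposition.
condition = {
    "and": [
        {"column": "allegation_finding", "op": "equal", "value": "sustained"},
        {"column": "disposition", "op": "not_equal", "value": "sustained"},
    ]
}
rows = [
    {"allegation_finding": "sustained", "disposition": "sustained"},
    {"allegation_finding": "sustained", "disposition": "exonerated"},
    {"allegation_finding": "unfounded", "disposition": "exonerated"},
]
bad = offending_rows(rows, condition)
```

Under these assumptions only the second row is reported, which is the kind of row the README says would be written to the file named by `save_bad_rows_to`.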
