authorCoprDistGit <infra@openeuler.org>2023-05-18 05:34:31 +0000
committerCoprDistGit <infra@openeuler.org>2023-05-18 05:34:31 +0000
commitbebd90bc66cbcfb0590683f370e21103d70ee3c9 (patch)
tree1394a19a700c2824010e5ad614ab88ce6a74804c
parent622b5c2729b799428fb2cfe53b24023b3084b153 (diff)
automatic import of python-datavalid
-rw-r--r--.gitignore1
-rw-r--r--python-datavalid.spec550
-rw-r--r--sources1
3 files changed, 552 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..1d5b10b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/datavalid-0.3.6.tar.gz
diff --git a/python-datavalid.spec b/python-datavalid.spec
new file mode 100644
index 0000000..a0955d9
--- /dev/null
+++ b/python-datavalid.spec
@@ -0,0 +1,550 @@
+%global _empty_manifest_terminate_build 0
+Name: python-datavalid
+Version: 0.3.6
+Release: 1
+Summary: Data validation library
+License: MIT License
+URL: https://github.com/pckhoi/datavalid
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/9d/26/458a8714b9eda5a7670af3ff7e86d1edd25885d8080c35ad17efa1e9bcaf/datavalid-0.3.6.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-pandas
+Requires: python3-pyyaml
+Requires: python3-termcolor
+
+%description
+# Datavalid
+
+This library lets you declare validation tasks to run against CSV files, ensuring data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+ fuse/complaint.csv:
+ schema:
+ uid:
+ description: >
+ accused officer's unique identifier. This references the `uid` column in personnel.csv
+ tracking_number:
+ description: >
+ complaint tracking number from the agency the data originate from
+ complaint_uid:
+ description: >
+ complaint unique identifier
+ unique: true
+ no_na: true
+ validation_tasks:
+ - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+ unique:
+ - complaint_uid
+ - uid
+ - allegation
+ - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+ empty:
+ and:
+ - column: allegation_finding
+ op: equal
+ value: sustained
+ - column: disposition
+ op: not_equal
+ value: sustained
+ fuse/event.csv:
+ schema:
+ event_uid:
+ description: >
+ unique identifier for each event
+ unique: true
+ no_na: true
+ kind:
+ options:
+ - officer_level_1_cert
+ - officer_pc_12_qualification
+ - officer_rank
+ validation_tasks:
+ - name: no officer with more than 1 left date in a calendar month
+ where:
+ column: kind
+ op: equal
+ value: officer_left
+ group_by: uid
+ no_more_than_once_per_30_days:
+ date_from:
+ year_column: year
+ month_column: month
+ day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file is named `datavalid.yml` and must be placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower and upper bounds of valid values. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
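+
+As an illustration (the column names below are hypothetical, not part of the library), a schema fragment combining several of these fields might look like:
+
+```yaml
+age:
+  description: >
+    officer age at the time of the event
+  range:
+    - 18
+    - 80
+rank:
+  options:
+    - officer
+    - sergeant
+    - lieutenant
+last_name:
+  title_case: true
+  match_regex: "^[A-Z][a-z]+$"
+```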
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names whose combined values must be unique.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no two rows occur within 30 days of each other. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no two rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are three ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, name of the column to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+ - _equal_
+ - _not_equal_
+ - _greater_than_
+ - _less_than_
+ - _greater_equal_
+ - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field that is valid for a [condition object](#condition-object).
+
+Finally, the last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
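+
+For instance, a condition using all three forms together might look like this (column names here are illustrative):
+
+```yaml
+or:
+  - column: disposition
+    op: equal
+    value: sustained
+  - and:
+      - column: allegation_finding
+        op: equal
+        value: sustained
+      - column: action
+        op: not_equal
+        value: none
+```
+
+This condition is fulfilled when `disposition` equals "sustained", or when `allegation_finding` equals "sustained" and `action` differs from "none".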
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package -n python3-datavalid
+Summary: Data validation library
+Provides: python-datavalid
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-datavalid
+# Datavalid
+
+This library lets you declare validation tasks to run against CSV files, ensuring data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+ fuse/complaint.csv:
+ schema:
+ uid:
+ description: >
+ accused officer's unique identifier. This references the `uid` column in personnel.csv
+ tracking_number:
+ description: >
+ complaint tracking number from the agency the data originate from
+ complaint_uid:
+ description: >
+ complaint unique identifier
+ unique: true
+ no_na: true
+ validation_tasks:
+ - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+ unique:
+ - complaint_uid
+ - uid
+ - allegation
+ - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+ empty:
+ and:
+ - column: allegation_finding
+ op: equal
+ value: sustained
+ - column: disposition
+ op: not_equal
+ value: sustained
+ fuse/event.csv:
+ schema:
+ event_uid:
+ description: >
+ unique identifier for each event
+ unique: true
+ no_na: true
+ kind:
+ options:
+ - officer_level_1_cert
+ - officer_pc_12_qualification
+ - officer_rank
+ validation_tasks:
+ - name: no officer with more than 1 left date in a calendar month
+ where:
+ column: kind
+ op: equal
+ value: officer_left
+ group_by: uid
+ no_more_than_once_per_30_days:
+ date_from:
+ year_column: year
+ month_column: month
+ day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file is named `datavalid.yml` and must be placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower and upper bounds of valid values. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
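+
+As an illustration (the column names below are hypothetical, not part of the library), a schema fragment combining several of these fields might look like:
+
+```yaml
+age:
+  description: >
+    officer age at the time of the event
+  range:
+    - 18
+    - 80
+rank:
+  options:
+    - officer
+    - sergeant
+    - lieutenant
+last_name:
+  title_case: true
+  match_regex: "^[A-Z][a-z]+$"
+```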
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names whose combined values must be unique.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no two rows occur within 30 days of each other. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no two rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are three ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, name of the column to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+ - _equal_
+ - _not_equal_
+ - _greater_than_
+ - _less_than_
+ - _greater_equal_
+ - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field that is valid for a [condition object](#condition-object).
+
+Finally, the last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
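+
+For instance, a condition using all three forms together might look like this (column names here are illustrative):
+
+```yaml
+or:
+  - column: disposition
+    op: equal
+    value: sustained
+  - and:
+      - column: allegation_finding
+        op: equal
+        value: sustained
+      - column: action
+        op: not_equal
+        value: none
+```
+
+This condition is fulfilled when `disposition` equals "sustained", or when `allegation_finding` equals "sustained" and `action` differs from "none".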
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package help
+Summary: Development documents and examples for datavalid
+Provides: python3-datavalid-doc
+%description help
+# Datavalid
+
+This library lets you declare validation tasks to run against CSV files, ensuring data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+ fuse/complaint.csv:
+ schema:
+ uid:
+ description: >
+ accused officer's unique identifier. This references the `uid` column in personnel.csv
+ tracking_number:
+ description: >
+ complaint tracking number from the agency the data originate from
+ complaint_uid:
+ description: >
+ complaint unique identifier
+ unique: true
+ no_na: true
+ validation_tasks:
+ - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+ unique:
+ - complaint_uid
+ - uid
+ - allegation
+ - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+ empty:
+ and:
+ - column: allegation_finding
+ op: equal
+ value: sustained
+ - column: disposition
+ op: not_equal
+ value: sustained
+ fuse/event.csv:
+ schema:
+ event_uid:
+ description: >
+ unique identifier for each event
+ unique: true
+ no_na: true
+ kind:
+ options:
+ - officer_level_1_cert
+ - officer_pc_12_qualification
+ - officer_rank
+ validation_tasks:
+ - name: no officer with more than 1 left date in a calendar month
+ where:
+ column: kind
+ op: equal
+ value: officer_left
+ group_by: uid
+ no_more_than_once_per_30_days:
+ date_from:
+ year_column: year
+ month_column: month
+ day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the datavalid command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file is named `datavalid.yml` and must be placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower and upper bounds of valid values. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
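+
+As an illustration (the column names below are hypothetical, not part of the library), a schema fragment combining several of these fields might look like:
+
+```yaml
+age:
+  description: >
+    officer age at the time of the event
+  range:
+    - 18
+    - 80
+rank:
+  options:
+    - officer
+    - sergeant
+    - lieutenant
+last_name:
+  title_case: true
+  match_regex: "^[A-Z][a-z]+$"
+```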
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names whose combined values must be unique.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no two rows occur within 30 days of each other. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no two rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse a date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are three ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, name of the column to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+ - _equal_
+ - _not_equal_
+ - _greater_than_
+ - _less_than_
+ - _greater_equal_
+ - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field that is valid for a [condition object](#condition-object).
+
+Finally, the last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
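+
+For instance, a condition using all three forms together might look like this (column names here are illustrative):
+
+```yaml
+or:
+  - column: disposition
+    op: equal
+    value: sustained
+  - and:
+      - column: allegation_finding
+        op: equal
+        value: sustained
+      - column: action
+        op: not_equal
+        value: none
+```
+
+This condition is fulfilled when `disposition` equals "sustained", or when `allegation_finding` equals "sustained" and `action` differs from "none".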
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%prep
+%autosetup -n datavalid-0.3.6
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-datavalid -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.6-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..025b770
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+b5c8731e77c657f3cf7b04228cd7008c datavalid-0.3.6.tar.gz