automatic import of python-datavalid

author: CoprDistGit <infra@openeuler.org> 2023-05-18 05:34:31 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-18 05:34:31 +0000
commit: bebd90bc66cbcfb0590683f370e21103d70ee3c9 (patch)
tree: 1394a19a700c2824010e5ad614ab88ce6a74804c
parent: 622b5c2729b799428fb2cfe53b24023b3084b153 (diff)
3 files changed, 552 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..1d5b10b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/datavalid-0.3.6.tar.gz
diff --git a/python-datavalid.spec b/python-datavalid.spec
new file mode 100644
index 0000000..a0955d9
--- /dev/null
+++ b/python-datavalid.spec
@@ -0,0 +1,550 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-datavalid
+Version:	0.3.6
+Release:	1
+Summary:	Data validation library
+License:	MIT License
+URL:		https://github.com/pckhoi/datavalid
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/9d/26/458a8714b9eda5a7670af3ff7e86d1edd25885d8080c35ad17efa1e9bcaf/datavalid-0.3.6.tar.gz
+BuildArch:	noarch
+
+Requires:	python3-numpy
+Requires:	python3-pandas
+Requires:	python3-pyyaml
+Requires:	python3-termcolor
+
+%description
+# Datavalid
+
+This library allows you to declare validation tasks that check CSV files. This ensures data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the `datavalid` command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file must be named `datavalid.yml` and placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to the root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are only printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower bound and upper bound of the values considered valid. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+The last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package -n python3-datavalid
+Summary:	Data validation library
+Provides:	python-datavalid
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-datavalid
+# Datavalid
+
+This library allows you to declare validation tasks that check CSV files. This ensures data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the `datavalid` command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file must be named `datavalid.yml` and placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to the root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are only printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower bound and upper bound of the values considered valid. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+The last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%package help
+Summary:	Development documents and examples for datavalid
+Provides:	python3-datavalid-doc
+%description help
+# Datavalid
+
+This library allows you to declare validation tasks that check CSV files. This ensures data correctness for ETL pipelines that update frequently.
+
+## Installation
+
+```bash
+pip install datavalid
+```
+
+## Usage
+
+Create a `datavalid.yml` file in your data folder:
+
+```yaml
+files:
+  fuse/complaint.csv:
+    schema:
+      uid:
+        description: >
+          accused officer's unique identifier. This references the `uid` column in personnel.csv
+      tracking_number:
+        description: >
+          complaint tracking number from the agency the data originate from
+      complaint_uid:
+        description: >
+          complaint unique identifier
+        unique: true
+        no_na: true
+    validation_tasks:
+      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
+        unique:
+          - complaint_uid
+          - uid
+          - allegation
+      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
+        empty:
+          and:
+            - column: allegation_finding
+              op: equal
+              value: sustained
+            - column: disposition
+              op: not_equal
+              value: sustained
+  fuse/event.csv:
+    schema:
+      event_uid:
+        description: >
+          unique identifier for each event
+        unique: true
+        no_na: true
+      kind:
+        options:
+          - officer_level_1_cert
+          - officer_pc_12_qualification
+          - officer_rank
+    validation_tasks:
+      - name: no officer with more than 1 left date in a calendar month
+        where:
+          column: kind
+          op: equal
+          value: officer_left
+        group_by: uid
+        no_more_than_once_per_30_days:
+          date_from:
+            year_column: year
+            month_column: month
+            day_column: day
+save_bad_rows_to: invalid_rows.csv
+```
+
+Then run the `datavalid` command in that folder:
+
+```bash
+python -m datavalid
+```
+
+You can also specify a data folder that isn't the current working directory:
+
+```bash
+python -m datavalid --dir my_data_folder
+```
+
+## Config specification
+
+The config file must be named `datavalid.yml` and placed in your root data folder, which is the folder that contains all of your data files. The config file contains a [config object](#config-object) in YAML format.
+
+### Config object
+
+- **files**: required, a mapping between file names and file configurations. Each file path is evaluated relative to the root data folder and each file must be in CSV format. Refer to [file object](#file-object) to learn more about file configuration.
+- **save_bad_rows_to**: optional, which file to save offending rows to. If not defined, bad rows are only printed to the terminal.
+
+### File object
+
+- **schema**: optional, description of each column in this file. This field accepts a [column schema object](#column-schema-object).
+- **validation_tasks**: optional, additional validation tasks to perform on this file. Refer to [task object](#task-object) to learn more.
+
+### Column schema object
+
+- **description**: optional, textual description of this column.
+- **unique**: optional, if set to true then this column cannot contain duplicates.
+- **no_na**: optional, if set to true then this column cannot contain empty values.
+- **integer**: optional, if set to true then this column can only contain integers.
+- **float**: optional, if set to true then this column can only contain floats.
+- **options**: optional, list of valid values for this column.
+- **range**: optional, list of 2 numbers: the lower bound and upper bound of the values considered valid. Setting this implies `float: true`.
+- **title_case**: optional, if set to true then all words in this column must begin with an upper case letter.
+- **match_regex**: optional, regexp pattern to match against all values.
+
+### Task object
+
+Common fields:
+
+- **name**: required, name of validation task.
+- **where**: optional, how to filter the data. This field accepts a [condition object](#condition-object).
+- **group_by**: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
+- **warn_only**: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.
+
+Checker fields (define exactly one of these fields):
+
+- **unique**: optional, column name or list of column names to ensure uniqueness.
+- **empty**: optional, accepts a [condition object](#condition-object) and ensures that no row fulfills this condition.
+- **no_more_than_once_per_30_days**: optional, ensures that no 2 rows occur closer than 30 days apart. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+- **no_consecutive_date**: optional, ensures that no rows occur on consecutive days. Accepts the following fields:
+  - **date_from**: required, how to parse date from the given data. Accepts a [date parser](#date-parser) object.
+
+### Condition object
+
+There are 3 ways to define a condition. The first way is to provide `column`, `op` and `value`:
+
+- **column**: optional, column name to compare.
+- **op**: optional, comparison operation to use. Possible values are:
+  - _equal_
+  - _not_equal_
+  - _greater_than_
+  - _less_than_
+  - _greater_equal_
+  - _less_equal_
+- **value**: optional, the value to compare with.
+
+The second way is to provide the `and` field:
+
+- **and**: optional, list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field which is valid for a [condition object](#condition-object).
+
+The last way is to provide the `or` field:
+
+- **or**: optional, same as `and` except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
+
+### Date parser
+
+Combines multiple columns to create dates.
+
+- **year_column**: required, year column name.
+- **month_column**: required, month column name.
+- **day_column**: required, day column name.
+
+
+
+
+%prep
+%autosetup -n datavalid-0.3.6
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-datavalid -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.6-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..025b770
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+b5c8731e77c657f3cf7b04228cd7008c  datavalid-0.3.6.tar.gz