| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-pandera.spec | 978 |
| -rw-r--r-- | sources | 1 |
3 files changed, 980 insertions, 0 deletions
@@ -0,0 +1 @@
+/pandera-0.14.5.tar.gz
diff --git a/python-pandera.spec b/python-pandera.spec
new file mode 100644
index 0000000..71e3568
--- /dev/null
+++ b/python-pandera.spec
@@ -0,0 +1,978 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pandera
+Version: 0.14.5
+Release: 1
+Summary: A light-weight and flexible data validation and testing tool for statistical data objects.
+License: MIT
+URL: https://github.com/pandera-dev/pandera
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/7b/09/ce690eb6248a37a773e975998fd291e3094c2410649a61ac0c3378814e50/pandera-0.14.5.tar.gz
+BuildArch: noarch
+
+Requires: python3-multimethod
+Requires: python3-numpy
+Requires: python3-packaging
+Requires: python3-pandas
+Requires: python3-pydantic
+Requires: python3-typing-inspect
+Requires: python3-wrapt
+Requires: python3-typing-extensions
+Requires: python3-black
+Requires: python3-pandas-stubs
+Requires: python3-fastapi
+Requires: python3-ray
+Requires: python3-dask
+Requires: python3-geopandas
+Requires: python3-pyspark
+Requires: python3-scipy
+Requires: python3-pyyaml
+Requires: python3-shapely
+Requires: python3-modin
+Requires: python3-frictionless
+Requires: python3-hypothesis
+Requires: python3-dask
+Requires: python3-fastapi
+Requires: python3-geopandas
+Requires: python3-shapely
+Requires: python3-scipy
+Requires: python3-pyyaml
+Requires: python3-black
+Requires: python3-frictionless
+Requires: python3-modin
+Requires: python3-ray
+Requires: python3-dask
+Requires: python3-modin
+Requires: python3-dask
+Requires: python3-modin
+Requires: python3-ray
+Requires: python3-pandas-stubs
+Requires: python3-pyspark
+Requires: python3-hypothesis
+
+%description
+<br>
+<div align="center"><img src="https://raw.githubusercontent.com/pandera-dev/pandera/main/docs/source/_static/pandera-banner.png" width="400"></div>
+
+<hr>
+
+# A Statistical Data Testing Toolkit
+
+*A data validation library for scientists, engineers, and analysts seeking
+correctness.*
+
+<br>
+
+[](https://github.com/pandera-dev/pandera/actions?query=workflow%3A%22CI+Tests%22+branch%3Amain) +[](https://pandera.readthedocs.io/en/stable/?badge=stable) +[](https://pypi.org/project/pandera/) +[](https://pypi.python.org/pypi/) +[](https://github.com/pyOpenSci/software-review/issues/12) +[](https://www.repostatus.org/#active) +[](https://pandera.readthedocs.io/en/latest/?badge=latest) +[](https://codecov.io/gh/pandera-dev/pandera) +[](https://pypi.python.org/pypi/pandera/) +[](https://doi.org/10.5281/zenodo.3385265) +[](https://pandera-dev.github.io/pandera-asv-logs/) +[](https://pepy.tech/project/pandera) +[](https://pepy.tech/project/pandera) +[](https://anaconda.org/conda-forge/pandera) +[](https://discord.gg/vyanhWuaKB) + +`pandera` provides a flexible and expressive API for performing data +validation on dataframe-like objects to make data processing pipelines more +readable and robust. + +Dataframes contain information that `pandera` explicitly validates at runtime. +This is useful in production-critical or reproducible research settings. With +`pandera`, you can: + +1. Define a schema once and use it to validate + [different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html) + including [pandas](http://pandas.pydata.org), [dask](https://dask.org), + [modin](https://modin.readthedocs.io/), and [pyspark](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/index.html). +1. [Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and + properties of columns in a `DataFrame` or values in a `Series`. +1. Perform more complex statistical validation like + [hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis). +1. Seamlessly integrate with existing data analysis/processing pipelines + via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators). +1. 
Define dataframe models with the + [class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models) + with pydantic-style syntax and validate dataframes using the typing syntax. +1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies) + from schema objects for property-based testing with pandas data structures. +1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html) + dataframes so that all validation checks are executed before raising an error. +1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with + a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io), + [fastapi](https://fastapi.tiangolo.com/), and [mypy](http://mypy-lang.org/). + +## Documentation + +The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io + + +## Install + +Using pip: + +``` +pip install pandera +``` + +Using conda: + +``` +conda install -c conda-forge pandera +``` + +### Extras + +Installing additional functionality: + +<details> + +<summary><i>pip</i></summary> + +```bash +pip install pandera[hypotheses] # hypothesis checks +pip install pandera[io] # yaml/script schema io utilities +pip install pandera[strategies] # data synthesis strategies +pip install pandera[mypy] # enable static type-linting of pandas +pip install pandera[fastapi] # fastapi integration +pip install pandera[dask] # validate dask dataframes +pip install pandera[pyspark] # validate pyspark dataframes +pip install pandera[modin] # validate modin dataframes +pip install pandera[modin-ray] # validate modin dataframes with ray +pip install pandera[modin-dask] # validate modin dataframes with dask +pip install pandera[geopandas] # validate geopandas geodataframes +``` + +</details> + +<details> + +<summary><i>conda</i></summary> + +```bash +conda install -c conda-forge pandera-hypotheses # hypothesis checks +conda 
install -c conda-forge pandera-io # yaml/script schema io utilities +conda install -c conda-forge pandera-strategies # data synthesis strategies +conda install -c conda-forge pandera-mypy # enable static type-linting of pandas +conda install -c conda-forge pandera-fastapi # fastapi integration +conda install -c conda-forge pandera-dask # validate dask dataframes +conda install -c conda-forge pandera-pyspark # validate pyspark dataframes +conda install -c conda-forge pandera-modin # validate modin dataframes +conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray +conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask +conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes +``` + +</details> + +## Quick Start + +```python +import pandas as pd +import pandera as pa + + +# data to validate +df = pd.DataFrame({ + "column1": [1, 4, 0, 10, 9], + "column2": [-1.3, -1.4, -2.9, -10.1, -20.4], + "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"] +}) + +# define schema +schema = pa.DataFrameSchema({ + "column1": pa.Column(int, checks=pa.Check.le(10)), + "column2": pa.Column(float, checks=pa.Check.lt(-1.2)), + "column3": pa.Column(str, checks=[ + pa.Check.str_startswith("value_"), + # define custom checks as functions that take a series as input and + # outputs a boolean or boolean Series + pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2) + ]), +}) + +validated_df = schema(df) +print(validated_df) + +# column1 column2 column3 +# 0 1 -1.3 value_1 +# 1 4 -1.4 value_2 +# 2 0 -2.9 value_3 +# 3 10 -10.1 value_2 +# 4 9 -20.4 value_1 +``` + +## DataFrame Model + +`pandera` also provides an alternative API for expressing schemas inspired +by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and +[pydantic](https://pydantic-docs.helpmanual.io/). 
The equivalent `DataFrameModel` +for the above `DataFrameSchema` would be: + + +```python +from pandera.typing import Series + +class Schema(pa.DataFrameModel): + + column1: Series[int] = pa.Field(le=10) + column2: Series[float] = pa.Field(lt=-1.2) + column3: Series[str] = pa.Field(str_startswith="value_") + + @pa.check("column3") + def column_3_check(cls, series: Series[str]) -> Series[bool]: + """Check that values have two elements after being split with '_'""" + return series.str.split("_", expand=True).shape[1] == 2 + +Schema.validate(df) +``` + +## Development Installation + +``` +git clone https://github.com/pandera-dev/pandera.git +cd pandera +pip install -r requirements-dev.txt +pip install -e . +``` + +## Tests + +``` +pip install pytest +pytest tests +``` + +## Contributing to pandera [](https://github.com/pandera-dev/pandera/graphs/contributors) + +All contributions, bug reports, bug fixes, documentation improvements, +enhancements and ideas are welcome. + +A detailed overview on how to contribute can be found in the +[contributing guide](https://github.com/pandera-dev/pandera/blob/main/.github/CONTRIBUTING.md) +on GitHub. + +## Issues + +Go [here](https://github.com/pandera-dev/pandera/issues) to submit feature +requests or bugfixes. + +## Need Help? + +There are many ways of getting help with your questions. You can ask a question +on [Github Discussions](https://github.com/pandera-dev/pandera/discussions/categories/q-a) +page or reach out to the maintainers and pandera community on +[Discord](https://discord.gg/vyanhWuaKB) + +## Why `pandera`? + +- [dataframe-centric data types](https://pandera.readthedocs.io/en/stable/dtypes.html), + [column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns), + and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns) + are first-class concepts. 
+- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with + [pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax. +- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration) + enable seamless integration with existing code. +- [`Check`s](https://pandera.readthedocs.io/en/stable/checks.html) provide flexibility and performance by providing access to `pandas` + API by design and offers built-in checks for common data tests. +- [`Hypothesis`](https://pandera.readthedocs.io/en/stable/hypothesis.html) class provides a tidy-first interface for statistical hypothesis + testing. +- `Check`s and `Hypothesis` objects support both [tidy and wide data validation](https://pandera.readthedocs.io/en/stable/checks.html#wide-checks). +- Use schemas as generative contracts to [synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) for unit testing. +- [Schema inference](https://pandera.readthedocs.io/en/stable/schema_inference.html) allows you to bootstrap schemas from data. + +## Alternative Data Validation Libraries + +Here are a few other alternatives for validating Python data structures. 
+
+**Generic Python object data validation**
+
+- [voloptuous](https://github.com/alecthomas/voluptuous)
+- [schema](https://github.com/keleshev/schema)
+
+**`pandas`-specific data validation**
+
+- [opulent-pandas](https://github.com/danielvdende/opulent-pandas)
+- [PandasSchema](https://github.com/TMiguelT/PandasSchema)
+- [pandas-validator](https://github.com/c-data/pandas-validator)
+- [table_enforcer](https://github.com/xguse/table_enforcer)
+- [dataenforce](https://github.com/CedricFR/dataenforce)
+- [strictly typed pandas](https://github.com/nanne-aben/strictly_typed_pandas)
+- [marshmallow-dataframe](https://github.com/facultyai/marshmallow-dataframe)
+
+**Other tools for data validation**
+
+- [great_expectations](https://github.com/great-expectations/great_expectations)
+- [frictionless schema](https://framework.frictionlessdata.io/docs/guides/framework/schema-guide/)
+
+## How to Cite
+
+If you use `pandera` in the context of academic or industry research, please
+consider citing the **paper** and/or **software package**.
+
+### [Paper](https://conference.scipy.org/proceedings/scipy2020/niels_bantilan.html)
+
+```
+@InProceedings{ niels_bantilan-proc-scipy-2020,
+ author = { {N}iels {B}antilan },
+ title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
+ booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
+ pages = { 116 - 124 },
+ year = { 2020 },
+ editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
+ doi = { 10.25080/Majora-342d178e-010 }
+}
+```
+
+### Software Package
+
+[](https://doi.org/10.5281/zenodo.3385265)
+
+
+## License and Credits
+
+`pandera` is licensed under the [MIT license](license.txt) and is written and
+maintained by Niels Bantilan (niels@pandera.ci)
+
+
+%package -n python3-pandera
+Summary: A light-weight and flexible data validation and testing tool for statistical data objects.
+Provides: python-pandera +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-pandera +<br> +<div align="center"><img src="https://raw.githubusercontent.com/pandera-dev/pandera/main/docs/source/_static/pandera-banner.png" width="400"></div> + +<hr> + +# A Statistical Data Testing Toolkit + +*A data validation library for scientists, engineers, and analysts seeking +correctness.* + +<br> + +[](https://github.com/pandera-dev/pandera/actions?query=workflow%3A%22CI+Tests%22+branch%3Amain) +[](https://pandera.readthedocs.io/en/stable/?badge=stable) +[](https://pypi.org/project/pandera/) +[](https://pypi.python.org/pypi/) +[](https://github.com/pyOpenSci/software-review/issues/12) +[](https://www.repostatus.org/#active) +[](https://pandera.readthedocs.io/en/latest/?badge=latest) +[](https://codecov.io/gh/pandera-dev/pandera) +[](https://pypi.python.org/pypi/pandera/) +[](https://doi.org/10.5281/zenodo.3385265) +[](https://pandera-dev.github.io/pandera-asv-logs/) +[](https://pepy.tech/project/pandera) +[](https://pepy.tech/project/pandera) +[](https://anaconda.org/conda-forge/pandera) +[](https://discord.gg/vyanhWuaKB) + +`pandera` provides a flexible and expressive API for performing data +validation on dataframe-like objects to make data processing pipelines more +readable and robust. + +Dataframes contain information that `pandera` explicitly validates at runtime. +This is useful in production-critical or reproducible research settings. With +`pandera`, you can: + +1. Define a schema once and use it to validate + [different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html) + including [pandas](http://pandas.pydata.org), [dask](https://dask.org), + [modin](https://modin.readthedocs.io/), and [pyspark](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/index.html). +1. 
[Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and + properties of columns in a `DataFrame` or values in a `Series`. +1. Perform more complex statistical validation like + [hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis). +1. Seamlessly integrate with existing data analysis/processing pipelines + via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators). +1. Define dataframe models with the + [class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models) + with pydantic-style syntax and validate dataframes using the typing syntax. +1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies) + from schema objects for property-based testing with pandas data structures. +1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html) + dataframes so that all validation checks are executed before raising an error. +1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with + a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io), + [fastapi](https://fastapi.tiangolo.com/), and [mypy](http://mypy-lang.org/). 
+ +## Documentation + +The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io + + +## Install + +Using pip: + +``` +pip install pandera +``` + +Using conda: + +``` +conda install -c conda-forge pandera +``` + +### Extras + +Installing additional functionality: + +<details> + +<summary><i>pip</i></summary> + +```bash +pip install pandera[hypotheses] # hypothesis checks +pip install pandera[io] # yaml/script schema io utilities +pip install pandera[strategies] # data synthesis strategies +pip install pandera[mypy] # enable static type-linting of pandas +pip install pandera[fastapi] # fastapi integration +pip install pandera[dask] # validate dask dataframes +pip install pandera[pyspark] # validate pyspark dataframes +pip install pandera[modin] # validate modin dataframes +pip install pandera[modin-ray] # validate modin dataframes with ray +pip install pandera[modin-dask] # validate modin dataframes with dask +pip install pandera[geopandas] # validate geopandas geodataframes +``` + +</details> + +<details> + +<summary><i>conda</i></summary> + +```bash +conda install -c conda-forge pandera-hypotheses # hypothesis checks +conda install -c conda-forge pandera-io # yaml/script schema io utilities +conda install -c conda-forge pandera-strategies # data synthesis strategies +conda install -c conda-forge pandera-mypy # enable static type-linting of pandas +conda install -c conda-forge pandera-fastapi # fastapi integration +conda install -c conda-forge pandera-dask # validate dask dataframes +conda install -c conda-forge pandera-pyspark # validate pyspark dataframes +conda install -c conda-forge pandera-modin # validate modin dataframes +conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray +conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask +conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes +``` + +</details> + +## Quick Start + +```python +import 
pandas as pd +import pandera as pa + + +# data to validate +df = pd.DataFrame({ + "column1": [1, 4, 0, 10, 9], + "column2": [-1.3, -1.4, -2.9, -10.1, -20.4], + "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"] +}) + +# define schema +schema = pa.DataFrameSchema({ + "column1": pa.Column(int, checks=pa.Check.le(10)), + "column2": pa.Column(float, checks=pa.Check.lt(-1.2)), + "column3": pa.Column(str, checks=[ + pa.Check.str_startswith("value_"), + # define custom checks as functions that take a series as input and + # outputs a boolean or boolean Series + pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2) + ]), +}) + +validated_df = schema(df) +print(validated_df) + +# column1 column2 column3 +# 0 1 -1.3 value_1 +# 1 4 -1.4 value_2 +# 2 0 -2.9 value_3 +# 3 10 -10.1 value_2 +# 4 9 -20.4 value_1 +``` + +## DataFrame Model + +`pandera` also provides an alternative API for expressing schemas inspired +by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and +[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel` +for the above `DataFrameSchema` would be: + + +```python +from pandera.typing import Series + +class Schema(pa.DataFrameModel): + + column1: Series[int] = pa.Field(le=10) + column2: Series[float] = pa.Field(lt=-1.2) + column3: Series[str] = pa.Field(str_startswith="value_") + + @pa.check("column3") + def column_3_check(cls, series: Series[str]) -> Series[bool]: + """Check that values have two elements after being split with '_'""" + return series.str.split("_", expand=True).shape[1] == 2 + +Schema.validate(df) +``` + +## Development Installation + +``` +git clone https://github.com/pandera-dev/pandera.git +cd pandera +pip install -r requirements-dev.txt +pip install -e . 
+``` + +## Tests + +``` +pip install pytest +pytest tests +``` + +## Contributing to pandera [](https://github.com/pandera-dev/pandera/graphs/contributors) + +All contributions, bug reports, bug fixes, documentation improvements, +enhancements and ideas are welcome. + +A detailed overview on how to contribute can be found in the +[contributing guide](https://github.com/pandera-dev/pandera/blob/main/.github/CONTRIBUTING.md) +on GitHub. + +## Issues + +Go [here](https://github.com/pandera-dev/pandera/issues) to submit feature +requests or bugfixes. + +## Need Help? + +There are many ways of getting help with your questions. You can ask a question +on [Github Discussions](https://github.com/pandera-dev/pandera/discussions/categories/q-a) +page or reach out to the maintainers and pandera community on +[Discord](https://discord.gg/vyanhWuaKB) + +## Why `pandera`? + +- [dataframe-centric data types](https://pandera.readthedocs.io/en/stable/dtypes.html), + [column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns), + and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns) + are first-class concepts. +- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with + [pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax. +- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration) + enable seamless integration with existing code. +- [`Check`s](https://pandera.readthedocs.io/en/stable/checks.html) provide flexibility and performance by providing access to `pandas` + API by design and offers built-in checks for common data tests. 
+- [`Hypothesis`](https://pandera.readthedocs.io/en/stable/hypothesis.html) class provides a tidy-first interface for statistical hypothesis + testing. +- `Check`s and `Hypothesis` objects support both [tidy and wide data validation](https://pandera.readthedocs.io/en/stable/checks.html#wide-checks). +- Use schemas as generative contracts to [synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) for unit testing. +- [Schema inference](https://pandera.readthedocs.io/en/stable/schema_inference.html) allows you to bootstrap schemas from data. + +## Alternative Data Validation Libraries + +Here are a few other alternatives for validating Python data structures. + +**Generic Python object data validation** + +- [voloptuous](https://github.com/alecthomas/voluptuous) +- [schema](https://github.com/keleshev/schema) + +**`pandas`-specific data validation** + +- [opulent-pandas](https://github.com/danielvdende/opulent-pandas) +- [PandasSchema](https://github.com/TMiguelT/PandasSchema) +- [pandas-validator](https://github.com/c-data/pandas-validator) +- [table_enforcer](https://github.com/xguse/table_enforcer) +- [dataenforce](https://github.com/CedricFR/dataenforce) +- [strictly typed pandas](https://github.com/nanne-aben/strictly_typed_pandas) +- [marshmallow-dataframe](https://github.com/facultyai/marshmallow-dataframe) + +**Other tools for data validation** + +- [great_expectations](https://github.com/great-expectations/great_expectations) +- [frictionless schema](https://framework.frictionlessdata.io/docs/guides/framework/schema-guide/) + +## How to Cite + +If you use `pandera` in the context of academic or industry research, please +consider citing the **paper** and/or **software package**. 
+ +### [Paper](https://conference.scipy.org/proceedings/scipy2020/niels_bantilan.html) + +``` +@InProceedings{ niels_bantilan-proc-scipy-2020, + author = { {N}iels {B}antilan }, + title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes }, + booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference }, + pages = { 116 - 124 }, + year = { 2020 }, + editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe }, + doi = { 10.25080/Majora-342d178e-010 } +} +``` + +### Software Package + +[](https://doi.org/10.5281/zenodo.3385265) + + +## License and Credits + +`pandera` is licensed under the [MIT license](license.txt) and is written and +maintained by Niels Bantilan (niels@pandera.ci) + + +%package help +Summary: Development documents and examples for pandera +Provides: python3-pandera-doc +%description help +<br> +<div align="center"><img src="https://raw.githubusercontent.com/pandera-dev/pandera/main/docs/source/_static/pandera-banner.png" width="400"></div> + +<hr> + +# A Statistical Data Testing Toolkit + +*A data validation library for scientists, engineers, and analysts seeking +correctness.* + +<br> + +[](https://github.com/pandera-dev/pandera/actions?query=workflow%3A%22CI+Tests%22+branch%3Amain) +[](https://pandera.readthedocs.io/en/stable/?badge=stable) +[](https://pypi.org/project/pandera/) +[](https://pypi.python.org/pypi/) +[](https://github.com/pyOpenSci/software-review/issues/12) +[](https://www.repostatus.org/#active) +[](https://pandera.readthedocs.io/en/latest/?badge=latest) +[](https://codecov.io/gh/pandera-dev/pandera) +[](https://pypi.python.org/pypi/pandera/) +[](https://doi.org/10.5281/zenodo.3385265) +[](https://pandera-dev.github.io/pandera-asv-logs/) +[](https://pepy.tech/project/pandera) +[](https://pepy.tech/project/pandera) +[](https://anaconda.org/conda-forge/pandera) +[](https://discord.gg/vyanhWuaKB) + +`pandera` provides a flexible and expressive API for 
performing data +validation on dataframe-like objects to make data processing pipelines more +readable and robust. + +Dataframes contain information that `pandera` explicitly validates at runtime. +This is useful in production-critical or reproducible research settings. With +`pandera`, you can: + +1. Define a schema once and use it to validate + [different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html) + including [pandas](http://pandas.pydata.org), [dask](https://dask.org), + [modin](https://modin.readthedocs.io/), and [pyspark](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/index.html). +1. [Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and + properties of columns in a `DataFrame` or values in a `Series`. +1. Perform more complex statistical validation like + [hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis). +1. Seamlessly integrate with existing data analysis/processing pipelines + via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators). +1. Define dataframe models with the + [class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models) + with pydantic-style syntax and validate dataframes using the typing syntax. +1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies) + from schema objects for property-based testing with pandas data structures. +1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html) + dataframes so that all validation checks are executed before raising an error. +1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with + a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io), + [fastapi](https://fastapi.tiangolo.com/), and [mypy](http://mypy-lang.org/). 
+ +## Documentation + +The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io + + +## Install + +Using pip: + +``` +pip install pandera +``` + +Using conda: + +``` +conda install -c conda-forge pandera +``` + +### Extras + +Installing additional functionality: + +<details> + +<summary><i>pip</i></summary> + +```bash +pip install pandera[hypotheses] # hypothesis checks +pip install pandera[io] # yaml/script schema io utilities +pip install pandera[strategies] # data synthesis strategies +pip install pandera[mypy] # enable static type-linting of pandas +pip install pandera[fastapi] # fastapi integration +pip install pandera[dask] # validate dask dataframes +pip install pandera[pyspark] # validate pyspark dataframes +pip install pandera[modin] # validate modin dataframes +pip install pandera[modin-ray] # validate modin dataframes with ray +pip install pandera[modin-dask] # validate modin dataframes with dask +pip install pandera[geopandas] # validate geopandas geodataframes +``` + +</details> + +<details> + +<summary><i>conda</i></summary> + +```bash +conda install -c conda-forge pandera-hypotheses # hypothesis checks +conda install -c conda-forge pandera-io # yaml/script schema io utilities +conda install -c conda-forge pandera-strategies # data synthesis strategies +conda install -c conda-forge pandera-mypy # enable static type-linting of pandas +conda install -c conda-forge pandera-fastapi # fastapi integration +conda install -c conda-forge pandera-dask # validate dask dataframes +conda install -c conda-forge pandera-pyspark # validate pyspark dataframes +conda install -c conda-forge pandera-modin # validate modin dataframes +conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray +conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask +conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes +``` + +</details> + +## Quick Start + +```python +import 
pandas as pd +import pandera as pa + + +# data to validate +df = pd.DataFrame({ + "column1": [1, 4, 0, 10, 9], + "column2": [-1.3, -1.4, -2.9, -10.1, -20.4], + "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"] +}) + +# define schema +schema = pa.DataFrameSchema({ + "column1": pa.Column(int, checks=pa.Check.le(10)), + "column2": pa.Column(float, checks=pa.Check.lt(-1.2)), + "column3": pa.Column(str, checks=[ + pa.Check.str_startswith("value_"), + # define custom checks as functions that take a series as input and + # outputs a boolean or boolean Series + pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2) + ]), +}) + +validated_df = schema(df) +print(validated_df) + +# column1 column2 column3 +# 0 1 -1.3 value_1 +# 1 4 -1.4 value_2 +# 2 0 -2.9 value_3 +# 3 10 -10.1 value_2 +# 4 9 -20.4 value_1 +``` + +## DataFrame Model + +`pandera` also provides an alternative API for expressing schemas inspired +by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and +[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel` +for the above `DataFrameSchema` would be: + + +```python +from pandera.typing import Series + +class Schema(pa.DataFrameModel): + + column1: Series[int] = pa.Field(le=10) + column2: Series[float] = pa.Field(lt=-1.2) + column3: Series[str] = pa.Field(str_startswith="value_") + + @pa.check("column3") + def column_3_check(cls, series: Series[str]) -> Series[bool]: + """Check that values have two elements after being split with '_'""" + return series.str.split("_", expand=True).shape[1] == 2 + +Schema.validate(df) +``` + +## Development Installation + +``` +git clone https://github.com/pandera-dev/pandera.git +cd pandera +pip install -r requirements-dev.txt +pip install -e . 
+``` + +## Tests + +``` +pip install pytest +pytest tests +``` + +## Contributing to pandera [](https://github.com/pandera-dev/pandera/graphs/contributors) + +All contributions, bug reports, bug fixes, documentation improvements, +enhancements and ideas are welcome. + +A detailed overview on how to contribute can be found in the +[contributing guide](https://github.com/pandera-dev/pandera/blob/main/.github/CONTRIBUTING.md) +on GitHub. + +## Issues + +Go [here](https://github.com/pandera-dev/pandera/issues) to submit feature +requests or bugfixes. + +## Need Help? + +There are many ways of getting help with your questions. You can ask a question +on [Github Discussions](https://github.com/pandera-dev/pandera/discussions/categories/q-a) +page or reach out to the maintainers and pandera community on +[Discord](https://discord.gg/vyanhWuaKB) + +## Why `pandera`? + +- [dataframe-centric data types](https://pandera.readthedocs.io/en/stable/dtypes.html), + [column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns), + and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns) + are first-class concepts. +- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with + [pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax. +- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration) + enable seamless integration with existing code. +- [`Check`s](https://pandera.readthedocs.io/en/stable/checks.html) provide flexibility and performance by providing access to `pandas` + API by design and offers built-in checks for common data tests. 
+- The [`Hypothesis`](https://pandera.readthedocs.io/en/stable/hypothesis.html) class provides a tidy-first interface for statistical hypothesis
+  testing.
+- `Check`s and `Hypothesis` objects support both [tidy and wide data validation](https://pandera.readthedocs.io/en/stable/checks.html#wide-checks).
+- Use schemas as generative contracts to [synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) for unit testing.
+- [Schema inference](https://pandera.readthedocs.io/en/stable/schema_inference.html) allows you to bootstrap schemas from data.
+
+## Alternative Data Validation Libraries
+
+Here are a few other alternatives for validating Python data structures.
+
+**Generic Python object data validation**
+
+- [voluptuous](https://github.com/alecthomas/voluptuous)
+- [schema](https://github.com/keleshev/schema)
+
+**`pandas`-specific data validation**
+
+- [opulent-pandas](https://github.com/danielvdende/opulent-pandas)
+- [PandasSchema](https://github.com/TMiguelT/PandasSchema)
+- [pandas-validator](https://github.com/c-data/pandas-validator)
+- [table_enforcer](https://github.com/xguse/table_enforcer)
+- [dataenforce](https://github.com/CedricFR/dataenforce)
+- [strictly typed pandas](https://github.com/nanne-aben/strictly_typed_pandas)
+- [marshmallow-dataframe](https://github.com/facultyai/marshmallow-dataframe)
+
+**Other tools for data validation**
+
+- [great_expectations](https://github.com/great-expectations/great_expectations)
+- [frictionless schema](https://framework.frictionlessdata.io/docs/guides/framework/schema-guide/)
+
+## How to Cite
+
+If you use `pandera` in the context of academic or industry research, please
+consider citing the **paper** and/or **software package**.
+
+### [Paper](https://conference.scipy.org/proceedings/scipy2020/niels_bantilan.html)
+
+```
+@InProceedings{ niels_bantilan-proc-scipy-2020,
+  author = { {N}iels {B}antilan },
+  title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
+  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
+  pages = { 116 - 124 },
+  year = { 2020 },
+  editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
+  doi = { 10.25080/Majora-342d178e-010 }
+}
+```
+
+### Software Package
+
+[](https://doi.org/10.5281/zenodo.3385265)
+
+
+## License and Credits
+
+`pandera` is licensed under the [MIT license](license.txt) and is written and
+maintained by Niels Bantilan (niels@pandera.ci)
+
+
+%prep
+%autosetup -n pandera-0.14.5
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pandera -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.14.5-1
+- Package Spec generated
@@ -0,0 +1 @@
+6c34a4674fa3df5ec0303b08fcb0c49a pandera-0.14.5.tar.gz
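For readers less familiar with the `find ... -printf "/%h/%f\n" >> filelist.lst` idiom in `%install`, the following is a rough Python sketch of what that step appears to compute: the absolute install path of every regular file under the selected buildroot subdirectories. It is an illustration only and not part of the spec; the function name `list_installed_files` and the demo paths are hypothetical, and unlike `find` the sketch sorts its output to make it deterministic.

```python
import os
import tempfile
from pathlib import Path


def list_installed_files(buildroot, subdirs=("usr/lib", "usr/lib64", "usr/bin", "usr/sbin")):
    """Return install paths such as /usr/bin/tool for regular files under buildroot.

    Hypothetical helper mirroring the shell loop in %install; not part of the spec.
    """
    root = Path(buildroot)
    paths = []
    for sub in subdirs:
        base = root / sub
        if not base.is_dir():
            # Same guard as `if [ -d usr/lib ]; then ... fi`
            continue
        for dirpath, _dirnames, filenames in os.walk(base):
            for name in filenames:
                full = Path(dirpath) / name
                # `find -type f` matches regular files only, so skip symlinks
                if full.is_file() and not full.is_symlink():
                    # Path relative to the buildroot with a leading slash,
                    # like the "/%h/%f" format string
                    paths.append("/" + full.relative_to(root).as_posix())
    return sorted(paths)


if __name__ == "__main__":
    # Build a throwaway buildroot with made-up contents and list it
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp, "usr/lib/python3/site-packages/pandera")
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("")
        for path in list_installed_files(tmp):
            print(path)
```

The resulting list plays the role of `filelist.lst`, which `%files -n python3-pandera -f filelist.lst` then reads to decide what goes into the package.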
