author | CoprDistGit <infra@openeuler.org> | 2023-05-15 06:41:56 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-05-15 06:41:56 +0000 |
commit | c8f5a9116503e61c8479c066c4e2a39064650d07 | |
tree | b310526206ecb4cddcbc3f5189f3c16ae1c6b2c7 | |
parent | 3c9a15b7925b87eb9ef50682822634abffafd397 | |
automatic import of python-piicatcher
-rw-r--r-- | .gitignore | 1 |
-rw-r--r-- | python-piicatcher.spec | 528 |
-rw-r--r-- | sources | 1 |
3 files changed, 530 insertions, 0 deletions
@@ -0,0 +1 @@ +/piicatcher-0.20.2.tar.gz diff --git a/python-piicatcher.spec b/python-piicatcher.spec new file mode 100644 index 0000000..ebbc8b9 --- /dev/null +++ b/python-piicatcher.spec @@ -0,0 +1,528 @@ +%global _empty_manifest_terminate_build 0 +Name: python-piicatcher +Version: 0.20.2 +Release: 1 +Summary: Find PII data in databases +License: Apache 2.0 +URL: https://tokern.io/ +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c6/4c/c4557ff1c8d7fc52a0706ce71154f2e84b428688c5b48f323b24aa347375/piicatcher-0.20.2.tar.gz +BuildArch: noarch + +Requires: python3-pyyaml +Requires: python3-click +Requires: python3-json-logger +Requires: python3-commonregex +Requires: python3-dbcat +Requires: python3-typer +Requires: python3-tabulate +Requires: python3-dataclasses +Requires: python3-great-expectations +Requires: python3-acryl-datahub +Requires: python3-tqdm +Requires: python3-catalogue + +%description +[](https://github.com/tokern/piicatcher/actions/workflows/ci.yml) +[](https://pypi.python.org/pypi/piicatcher) +[](https://pypi.org/project/piicatcher/) +[](https://pypi.org/project/piicatcher/) +[](https://hub.docker.com/r/tokern/piicatcher) + +# PII Catcher for Databases and Data Warehouses + +## Overview + +PIICatcher is a scanner for PII and PHI information. It finds PII data in your databases and file systems +and tracks critical data. PIICatcher uses two techniques to detect PII: + +* Match regular expressions with column names +* Match regular expressions and using NLP libraries to match sample data in columns. + +Read more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies. + +PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as metadata. +For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect +PII in column data. + +PIICatcher supports incremental scans and will only scan new or not-yet scanned columns. 
Incremental scans allow easy +scheduling of scans. It also provides powerful options to include or exclude schema and tables to manage compute resources. + +There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) which will tag columns +and tables with PII and the type of PII tags. + + + + +## Resources + +* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production. +* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/) + +## Quick Start + +PIICatcher is available as a docker image or command-line application. + +### Installation + +Docker: + + alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest' + + +Pypi: + # Install development libraries for compiling dependencies. + # On Amazon Linux + sudo yum install mysql-devel gcc gcc-devel python-devel + + python3 -m venv .env + source .env/bin/activate + pip install piicatcher + + # Install Spacy plugin + pip install piicatcher_spacy + + +### Command Line Usage + # add a sqlite source + piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb' + + # run piicatcher on a sqlite db and print report to console + piicatcher detect --source-name sqldb + ╭─────────────┬─────────────┬─────────────┬─────────────╮ + │ schema │ table │ column │ has_pii │ + ├─────────────┼─────────────┼─────────────┼─────────────┤ + │ main │ full_pii │ a │ 1 │ + │ main │ full_pii │ b │ 1 │ + │ main │ no_pii │ a │ 0 │ + │ main │ no_pii │ b │ 0 │ + │ main │ partial_pii │ a │ 1 │ + │ main │ partial_pii │ b │ 0 │ + ╰─────────────┴─────────────┴─────────────┴─────────────╯ + + +### API Usage +```python3 +from dbcat.api import open_catalog, add_postgresql_source +from piicatcher.api import scan_database + +# PIICatcher uses a catalog to store its 
state. +# The easiest option is to use a sqlite memory database. +# For production usage check, https://tokern.io/docs/data-catalog +catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret') + +with catalog.managed_session: + # Add a postgresql source + source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser", + password="p11secret", database="piidb") + output = scan_database(catalog=catalog, source=source) + +print(output) + +# Example Output +[['public', 'sample', 'gender', 'PiiTypes.GENDER'], + ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'], + ['public', 'sample', 'lname', 'PiiTypes.PERSON'], + ['public', 'sample', 'fname', 'PiiTypes.PERSON'], + ['public', 'sample', 'address', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'city', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'email', 'PiiTypes.EMAIL']] +``` + +## Plugins + +PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques: +* Metadata +* Data + +Plugins can be created for either of these two techniques. Plugins are then registered using an API or using +[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/). + +To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py) +or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py). + +In the new class, define a function `detect` that will return a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py) +If you are detecting a new PII type, then you can define a new class that inherits from PIIType. + +For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins). 
+ + +## Supported Databases + +PIICatcher supports the following databases: +1. **Sqlite3** v3.24.0 or greater +2. **MySQL** 5.6 or greater +3. **PostgreSQL** 9.4 or greater +4. **AWS Redshift** +5. **AWS Athena** +6. **Snowflake** + +## Documentation + +For advanced usage refer documentation [PIICatcher Documentation](https://tokern.io/docs/piicatcher). + +## Survey + +Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher. +The responses will help to prioritize improvements to the project. + +## Contributing + +For Contribution guidelines, [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development). + + + +%package -n python3-piicatcher +Summary: Find PII data in databases +Provides: python-piicatcher +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-piicatcher +[](https://github.com/tokern/piicatcher/actions/workflows/ci.yml) +[](https://pypi.python.org/pypi/piicatcher) +[](https://pypi.org/project/piicatcher/) +[](https://pypi.org/project/piicatcher/) +[](https://hub.docker.com/r/tokern/piicatcher) + +# PII Catcher for Databases and Data Warehouses + +## Overview + +PIICatcher is a scanner for PII and PHI information. It finds PII data in your databases and file systems +and tracks critical data. PIICatcher uses two techniques to detect PII: + +* Match regular expressions with column names +* Match regular expressions and using NLP libraries to match sample data in columns. + +Read more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies. + +PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as metadata. +For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect +PII in column data. 
+ +PIICatcher supports incremental scans and will only scan new or not-yet scanned columns. Incremental scans allow easy +scheduling of scans. It also provides powerful options to include or exclude schema and tables to manage compute resources. + +There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) which will tag columns +and tables with PII and the type of PII tags. + + + + +## Resources + +* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production. +* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/) + +## Quick Start + +PIICatcher is available as a docker image or command-line application. + +### Installation + +Docker: + + alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest' + + +Pypi: + # Install development libraries for compiling dependencies. 
+ # On Amazon Linux + sudo yum install mysql-devel gcc gcc-devel python-devel + + python3 -m venv .env + source .env/bin/activate + pip install piicatcher + + # Install Spacy plugin + pip install piicatcher_spacy + + +### Command Line Usage + # add a sqlite source + piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb' + + # run piicatcher on a sqlite db and print report to console + piicatcher detect --source-name sqldb + ╭─────────────┬─────────────┬─────────────┬─────────────╮ + │ schema │ table │ column │ has_pii │ + ├─────────────┼─────────────┼─────────────┼─────────────┤ + │ main │ full_pii │ a │ 1 │ + │ main │ full_pii │ b │ 1 │ + │ main │ no_pii │ a │ 0 │ + │ main │ no_pii │ b │ 0 │ + │ main │ partial_pii │ a │ 1 │ + │ main │ partial_pii │ b │ 0 │ + ╰─────────────┴─────────────┴─────────────┴─────────────╯ + + +### API Usage +```python3 +from dbcat.api import open_catalog, add_postgresql_source +from piicatcher.api import scan_database + +# PIICatcher uses a catalog to store its state. +# The easiest option is to use a sqlite memory database. 
+# For production usage check, https://tokern.io/docs/data-catalog +catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret') + +with catalog.managed_session: + # Add a postgresql source + source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser", + password="p11secret", database="piidb") + output = scan_database(catalog=catalog, source=source) + +print(output) + +# Example Output +[['public', 'sample', 'gender', 'PiiTypes.GENDER'], + ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'], + ['public', 'sample', 'lname', 'PiiTypes.PERSON'], + ['public', 'sample', 'fname', 'PiiTypes.PERSON'], + ['public', 'sample', 'address', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'city', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'email', 'PiiTypes.EMAIL']] +``` + +## Plugins + +PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques: +* Metadata +* Data + +Plugins can be created for either of these two techniques. Plugins are then registered using an API or using +[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/). + +To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py) +or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py). + +In the new class, define a function `detect` that will return a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py) +If you are detecting a new PII type, then you can define a new class that inherits from PIIType. + +For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins). + + +## Supported Databases + +PIICatcher supports the following databases: +1. **Sqlite3** v3.24.0 or greater +2. 
**MySQL** 5.6 or greater +3. **PostgreSQL** 9.4 or greater +4. **AWS Redshift** +5. **AWS Athena** +6. **Snowflake** + +## Documentation + +For advanced usage refer documentation [PIICatcher Documentation](https://tokern.io/docs/piicatcher). + +## Survey + +Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher. +The responses will help to prioritize improvements to the project. + +## Contributing + +For Contribution guidelines, [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development). + + + +%package help +Summary: Development documents and examples for piicatcher +Provides: python3-piicatcher-doc +%description help +[](https://github.com/tokern/piicatcher/actions/workflows/ci.yml) +[](https://pypi.python.org/pypi/piicatcher) +[](https://pypi.org/project/piicatcher/) +[](https://pypi.org/project/piicatcher/) +[](https://hub.docker.com/r/tokern/piicatcher) + +# PII Catcher for Databases and Data Warehouses + +## Overview + +PIICatcher is a scanner for PII and PHI information. It finds PII data in your databases and file systems +and tracks critical data. PIICatcher uses two techniques to detect PII: + +* Match regular expressions with column names +* Match regular expressions and using NLP libraries to match sample data in columns. + +Read more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies. + +PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as metadata. +For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect +PII in column data. + +PIICatcher supports incremental scans and will only scan new or not-yet scanned columns. Incremental scans allow easy +scheduling of scans. It also provides powerful options to include or exclude schema and tables to manage compute resources. 
+ +There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) which will tag columns +and tables with PII and the type of PII tags. + + + + +## Resources + +* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production. +* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/) + +## Quick Start + +PIICatcher is available as a docker image or command-line application. + +### Installation + +Docker: + + alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest' + + +Pypi: + # Install development libraries for compiling dependencies. + # On Amazon Linux + sudo yum install mysql-devel gcc gcc-devel python-devel + + python3 -m venv .env + source .env/bin/activate + pip install piicatcher + + # Install Spacy plugin + pip install piicatcher_spacy + + +### Command Line Usage + # add a sqlite source + piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb' + + # run piicatcher on a sqlite db and print report to console + piicatcher detect --source-name sqldb + ╭─────────────┬─────────────┬─────────────┬─────────────╮ + │ schema │ table │ column │ has_pii │ + ├─────────────┼─────────────┼─────────────┼─────────────┤ + │ main │ full_pii │ a │ 1 │ + │ main │ full_pii │ b │ 1 │ + │ main │ no_pii │ a │ 0 │ + │ main │ no_pii │ b │ 0 │ + │ main │ partial_pii │ a │ 1 │ + │ main │ partial_pii │ b │ 0 │ + ╰─────────────┴─────────────┴─────────────┴─────────────╯ + + +### API Usage +```python3 +from dbcat.api import open_catalog, add_postgresql_source +from piicatcher.api import scan_database + +# PIICatcher uses a catalog to store its state. +# The easiest option is to use a sqlite memory database. 
+# For production usage check, https://tokern.io/docs/data-catalog +catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret') + +with catalog.managed_session: + # Add a postgresql source + source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser", + password="p11secret", database="piidb") + output = scan_database(catalog=catalog, source=source) + +print(output) + +# Example Output +[['public', 'sample', 'gender', 'PiiTypes.GENDER'], + ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'], + ['public', 'sample', 'lname', 'PiiTypes.PERSON'], + ['public', 'sample', 'fname', 'PiiTypes.PERSON'], + ['public', 'sample', 'address', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'city', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], + ['public', 'sample', 'email', 'PiiTypes.EMAIL']] +``` + +## Plugins + +PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques: +* Metadata +* Data + +Plugins can be created for either of these two techniques. Plugins are then registered using an API or using +[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/). + +To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py) +or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py). + +In the new class, define a function `detect` that will return a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py) +If you are detecting a new PII type, then you can define a new class that inherits from PIIType. + +For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins). + + +## Supported Databases + +PIICatcher supports the following databases: +1. **Sqlite3** v3.24.0 or greater +2. 
**MySQL** 5.6 or greater +3. **PostgreSQL** 9.4 or greater +4. **AWS Redshift** +5. **AWS Athena** +6. **Snowflake** + +## Documentation + +For advanced usage refer documentation [PIICatcher Documentation](https://tokern.io/docs/piicatcher). + +## Survey + +Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher. +The responses will help to prioritize improvements to the project. + +## Contributing + +For Contribution guidelines, [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development). + + + +%prep +%autosetup -n piicatcher-0.20.2 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-piicatcher -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 0.20.2-1 +- Package Spec generated @@ -0,0 +1 @@ +6822aade2ca650b57886f63727dfdb70 piicatcher-0.20.2.tar.gz |
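
Review note: the README text embedded in the spec's `%description` names "match regular expressions with column names" as the first of PIICatcher's two detection techniques. A minimal sketch of that idea in plain Python follows. It is not PIICatcher's implementation and does not import the package; the pattern table, the `detect_column` helper, and the string labels are all assumptions made up for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern table (an assumption, not taken from piicatcher):
# each label maps to a regex that is searched for inside the column name.
COLUMN_NAME_PATTERNS = {
    "PERSON": re.compile(r"first_?name|fname|last_?name|lname|maiden_?name", re.IGNORECASE),
    "EMAIL": re.compile(r"e-?mail", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|city|state|zip|postal", re.IGNORECASE),
    "GENDER": re.compile(r"gender|sex", re.IGNORECASE),
}


def detect_column(name: str) -> Optional[str]:
    """Return the first label whose pattern matches the column name, else None."""
    for label, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(name):
            return label
    return None


if __name__ == "__main__":
    # Column names borrowed from the example output in the README text.
    for column in ["gender", "maiden_name", "address", "email", "id"]:
        print(column, detect_column(column))
```

Name-based matching like this only inspects metadata, so it is cheap but misses columns with uninformative names such as `a` or `b`; the README's second technique, sampling column data, covers that case.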