author: CoprDistGit <infra@openeuler.org> 2023-05-15 06:41:56 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-15 06:41:56 +0000
commit: c8f5a9116503e61c8479c066c4e2a39064650d07
tree: b310526206ecb4cddcbc3f5189f3c16ae1c6b2c7
parent: 3c9a15b7925b87eb9ef50682822634abffafd397
automatic import of python-piicatcher
-rw-r--r-- .gitignore | 1
-rw-r--r-- python-piicatcher.spec | 528
-rw-r--r-- sources | 1
3 files changed, 530 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..32b0e96 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/piicatcher-0.20.2.tar.gz
diff --git a/python-piicatcher.spec b/python-piicatcher.spec
new file mode 100644
index 0000000..ebbc8b9
--- /dev/null
+++ b/python-piicatcher.spec
@@ -0,0 +1,528 @@
+%global _empty_manifest_terminate_build 0
+Name: python-piicatcher
+Version: 0.20.2
+Release: 1
+Summary: Find PII data in databases
+License: Apache-2.0
+URL: https://tokern.io/
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c6/4c/c4557ff1c8d7fc52a0706ce71154f2e84b428688c5b48f323b24aa347375/piicatcher-0.20.2.tar.gz
+BuildArch: noarch
+
+Requires: python3-pyyaml
+Requires: python3-click
+Requires: python3-json-logger
+Requires: python3-commonregex
+Requires: python3-dbcat
+Requires: python3-typer
+Requires: python3-tabulate
+Requires: python3-dataclasses
+Requires: python3-great-expectations
+Requires: python3-acryl-datahub
+Requires: python3-tqdm
+Requires: python3-catalogue
+
+%description
+[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
+[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)
+
+# PII Catcher for Databases and Data Warehouses
+
+## Overview
+
+PIICatcher is a scanner for PII and PHI. It finds PII data in your databases and file systems
+and tracks critical data. PIICatcher uses two techniques to detect PII:
+
+* Match regular expressions against column names.
+* Match regular expressions and apply NLP libraries to sample data in columns.
+
+Read more about both strategies in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/).
+
+PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.
+For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect
+PII in column data.
+
+PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans allow easy
+scheduling of scans. It also provides options to include or exclude schemas and tables to manage compute resources.
+
+There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) that tag columns
+and tables as PII and record the PII type.
+
+![PIIcatcher Screencast](https://user-images.githubusercontent.com/1638298/143765818-87c7059a-f971-447b-83ca-e21182e28051.gif)
+
+
+## Resources
+
+* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.
+* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)
+
+## Quick Start
+
+PIICatcher is available as a Docker image or command-line application.
+
+### Installation
+
+Docker:
+
+    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
+
+PyPI:
+
+    # Install development libraries for compiling dependencies.
+    # On Amazon Linux
+    sudo yum install mysql-devel gcc gcc-devel python-devel
+
+    python3 -m venv .env
+    source .env/bin/activate
+    pip install piicatcher
+
+    # Install Spacy plugin
+    pip install piicatcher_spacy
+
+### Command Line Usage
+
+    # add a sqlite source
+    piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb'
+
+    # run piicatcher on a sqlite db and print report to console
+    piicatcher detect --source-name sqldb
+    ╭─────────────┬─────────────┬─────────────┬─────────────╮
+    │ schema      │ table       │ column      │ has_pii     │
+    ├─────────────┼─────────────┼─────────────┼─────────────┤
+    │ main        │ full_pii    │ a           │ 1           │
+    │ main        │ full_pii    │ b           │ 1           │
+    │ main        │ no_pii      │ a           │ 0           │
+    │ main        │ no_pii      │ b           │ 0           │
+    │ main        │ partial_pii │ a           │ 1           │
+    │ main        │ partial_pii │ b           │ 0           │
+    ╰─────────────┴─────────────┴─────────────┴─────────────╯
+
+
+### API Usage
+```python3
+from dbcat.api import open_catalog, add_postgresql_source
+from piicatcher.api import scan_database
+
+# PIICatcher uses a catalog to store its state.
+# The easiest option is to use an in-memory SQLite database.
+# For production usage, see https://tokern.io/docs/data-catalog
+catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')
+
+with catalog.managed_session:
+    # Add a postgresql source
+    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
+                                   password="p11secret", database="piidb")
+    output = scan_database(catalog=catalog, source=source)
+
+print(output)
+
+# Example Output
+[['public', 'sample', 'gender', 'PiiTypes.GENDER'],
+ ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'lname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'fname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'email', 'PiiTypes.EMAIL']]
+```
+
+## Plugins
+
+PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
+* Metadata
+* Data
+
+Plugins can be created for either of these two techniques. Plugins are then registered using an API or using
+[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).
+
+To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)
+or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).
+
+In the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).
+If you are detecting a new PII type, you can define a new class that inherits from `PIIType`.
+
+For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).
+
+
+## Supported Databases
+
+PIICatcher supports the following databases:
+1. **Sqlite3** v3.24.0 or greater
+2. **MySQL** 5.6 or greater
+3. **PostgreSQL** 9.4 or greater
+4. **AWS Redshift**
+5. **AWS Athena**
+6. **Snowflake**
+
+## Documentation
+
+For advanced usage, refer to the [PIICatcher Documentation](https://tokern.io/docs/piicatcher).
+
+## Survey
+
+Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher.
+The responses will help to prioritize improvements to the project.
+
+## Contributing
+
+For contribution guidelines, see the [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development).
+
+
+
+%package -n python3-piicatcher
+Summary: Find PII data in databases
+Provides: python-piicatcher
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-piicatcher
+[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
+[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)
+
+# PII Catcher for Databases and Data Warehouses
+
+## Overview
+
+PIICatcher is a scanner for PII and PHI. It finds PII data in your databases and file systems
+and tracks critical data. PIICatcher uses two techniques to detect PII:
+
+* Match regular expressions against column names.
+* Match regular expressions and apply NLP libraries to sample data in columns.
+
+Read more about both strategies in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/).
+
+PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.
+For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect
+PII in column data.
+
+PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans allow easy
+scheduling of scans. It also provides options to include or exclude schemas and tables to manage compute resources.
+
+There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) that tag columns
+and tables as PII and record the PII type.
+
+![PIIcatcher Screencast](https://user-images.githubusercontent.com/1638298/143765818-87c7059a-f971-447b-83ca-e21182e28051.gif)
+
+
+## Resources
+
+* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.
+* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)
+
+## Quick Start
+
+PIICatcher is available as a Docker image or command-line application.
+
+### Installation
+
+Docker:
+
+    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
+
+PyPI:
+
+    # Install development libraries for compiling dependencies.
+    # On Amazon Linux
+    sudo yum install mysql-devel gcc gcc-devel python-devel
+
+    python3 -m venv .env
+    source .env/bin/activate
+    pip install piicatcher
+
+    # Install Spacy plugin
+    pip install piicatcher_spacy
+
+### Command Line Usage
+
+    # add a sqlite source
+    piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb'
+
+    # run piicatcher on a sqlite db and print report to console
+    piicatcher detect --source-name sqldb
+    ╭─────────────┬─────────────┬─────────────┬─────────────╮
+    │ schema      │ table       │ column      │ has_pii     │
+    ├─────────────┼─────────────┼─────────────┼─────────────┤
+    │ main        │ full_pii    │ a           │ 1           │
+    │ main        │ full_pii    │ b           │ 1           │
+    │ main        │ no_pii      │ a           │ 0           │
+    │ main        │ no_pii      │ b           │ 0           │
+    │ main        │ partial_pii │ a           │ 1           │
+    │ main        │ partial_pii │ b           │ 0           │
+    ╰─────────────┴─────────────┴─────────────┴─────────────╯
+
+
+### API Usage
+```python3
+from dbcat.api import open_catalog, add_postgresql_source
+from piicatcher.api import scan_database
+
+# PIICatcher uses a catalog to store its state.
+# The easiest option is to use an in-memory SQLite database.
+# For production usage, see https://tokern.io/docs/data-catalog
+catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')
+
+with catalog.managed_session:
+    # Add a postgresql source
+    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
+                                   password="p11secret", database="piidb")
+    output = scan_database(catalog=catalog, source=source)
+
+print(output)
+
+# Example Output
+[['public', 'sample', 'gender', 'PiiTypes.GENDER'],
+ ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'lname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'fname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'email', 'PiiTypes.EMAIL']]
+```
+
+## Plugins
+
+PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
+* Metadata
+* Data
+
+Plugins can be created for either of these two techniques. Plugins are then registered using an API or using
+[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).
+
+To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)
+or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).
+
+In the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).
+If you are detecting a new PII type, you can define a new class that inherits from `PIIType`.
+
+For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).
+
+
+## Supported Databases
+
+PIICatcher supports the following databases:
+1. **Sqlite3** v3.24.0 or greater
+2. **MySQL** 5.6 or greater
+3. **PostgreSQL** 9.4 or greater
+4. **AWS Redshift**
+5. **AWS Athena**
+6. **Snowflake**
+
+## Documentation
+
+For advanced usage, refer to the [PIICatcher Documentation](https://tokern.io/docs/piicatcher).
+
+## Survey
+
+Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher.
+The responses will help to prioritize improvements to the project.
+
+## Contributing
+
+For contribution guidelines, see the [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development).
+
+
+
+%package help
+Summary: Development documents and examples for piicatcher
+Provides: python3-piicatcher-doc
+%description help
+[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
+[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)
+[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)
+
+# PII Catcher for Databases and Data Warehouses
+
+## Overview
+
+PIICatcher is a scanner for PII and PHI. It finds PII data in your databases and file systems
+and tracks critical data. PIICatcher uses two techniques to detect PII:
+
+* Match regular expressions against column names.
+* Match regular expressions and apply NLP libraries to sample data in columns.
+
+Read more about both strategies in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/).
+
+PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.
+For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect
+PII in column data.
+
+PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans allow easy
+scheduling of scans. It also provides options to include or exclude schemas and tables to manage compute resources.
+
+There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) that tag columns
+and tables as PII and record the PII type.
+
+![PIIcatcher Screencast](https://user-images.githubusercontent.com/1638298/143765818-87c7059a-f971-447b-83ca-e21182e28051.gif)
+
+
+## Resources
+
+* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.
+* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)
+
+## Quick Start
+
+PIICatcher is available as a Docker image or command-line application.
+
+### Installation
+
+Docker:
+
+    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
+
+PyPI:
+
+    # Install development libraries for compiling dependencies.
+    # On Amazon Linux
+    sudo yum install mysql-devel gcc gcc-devel python-devel
+
+    python3 -m venv .env
+    source .env/bin/activate
+    pip install piicatcher
+
+    # Install Spacy plugin
+    pip install piicatcher_spacy
+
+### Command Line Usage
+
+    # add a sqlite source
+    piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb'
+
+    # run piicatcher on a sqlite db and print report to console
+    piicatcher detect --source-name sqldb
+    ╭─────────────┬─────────────┬─────────────┬─────────────╮
+    │ schema      │ table       │ column      │ has_pii     │
+    ├─────────────┼─────────────┼─────────────┼─────────────┤
+    │ main        │ full_pii    │ a           │ 1           │
+    │ main        │ full_pii    │ b           │ 1           │
+    │ main        │ no_pii      │ a           │ 0           │
+    │ main        │ no_pii      │ b           │ 0           │
+    │ main        │ partial_pii │ a           │ 1           │
+    │ main        │ partial_pii │ b           │ 0           │
+    ╰─────────────┴─────────────┴─────────────┴─────────────╯
+
+
+### API Usage
+```python3
+from dbcat.api import open_catalog, add_postgresql_source
+from piicatcher.api import scan_database
+
+# PIICatcher uses a catalog to store its state.
+# The easiest option is to use an in-memory SQLite database.
+# For production usage, see https://tokern.io/docs/data-catalog
+catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')
+
+with catalog.managed_session:
+    # Add a postgresql source
+    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
+                                   password="p11secret", database="piidb")
+    output = scan_database(catalog=catalog, source=source)
+
+print(output)
+
+# Example Output
+[['public', 'sample', 'gender', 'PiiTypes.GENDER'],
+ ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'lname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'fname', 'PiiTypes.PERSON'],
+ ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
+ ['public', 'sample', 'email', 'PiiTypes.EMAIL']]
+```
+
+## Plugins
+
+PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
+* Metadata
+* Data
+
+Plugins can be created for either of these two techniques. Plugins are then registered using an API or using
+[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).
+
+To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)
+or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).
+
+In the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).
+If you are detecting a new PII type, you can define a new class that inherits from `PIIType`.
+
+For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).
+
+
+## Supported Databases
+
+PIICatcher supports the following databases:
+1. **Sqlite3** v3.24.0 or greater
+2. **MySQL** 5.6 or greater
+3. **PostgreSQL** 9.4 or greater
+4. **AWS Redshift**
+5. **AWS Athena**
+6. **Snowflake**
+
+## Documentation
+
+For advanced usage, refer to the [PIICatcher Documentation](https://tokern.io/docs/piicatcher).
+
+## Survey
+
+Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher.
+The responses will help to prioritize improvements to the project.
+
+## Contributing
+
+For contribution guidelines, see the [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development).
+
+
+
+%prep
+%autosetup -n piicatcher-0.20.2
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-piicatcher -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 0.20.2-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..de4f6ab
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+6822aade2ca650b57886f63727dfdb70 piicatcher-0.20.2.tar.gz