author | CoprDistGit <infra@openeuler.org> | 2023-04-10 15:44:19 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-04-10 15:44:19 +0000 |
commit | 19a90b2fd12b8e976e25f82fd26012cb6e4267fe (patch) | |
tree | 269be72ba38a3ded13e93f42fdd5f288862ad9b1 | |
parent | 0570092288748d61afcdd65ce415934235f836b3 (diff) |
automatic import of python-tentaclio
-rw-r--r-- | .gitignore | 1 | ||||
-rw-r--r-- | python-tentaclio.spec | 1014 | ||||
-rw-r--r-- | sources | 1 |
3 files changed, 1016 insertions, 0 deletions
@@ -0,0 +1 @@ +/tentaclio-1.1.0.tar.gz diff --git a/python-tentaclio.spec b/python-tentaclio.spec new file mode 100644 index 0000000..c844b9f --- /dev/null +++ b/python-tentaclio.spec @@ -0,0 +1,1014 @@ +%global _empty_manifest_terminate_build 0 +Name: python-tentaclio +Version: 1.1.0 +Release: 1 +Summary: Unification of data connectors for distributed data tasks +License: MIT +URL: https://github.com/octoenergy/tentaclio +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/8c/a9/f3af00d2f1a5cc15e9301851e40e2191c7609958bdaee547719dd6b15598/tentaclio-1.1.0.tar.gz +BuildArch: noarch + +Requires: python3-urllib3 +Requires: python3-requests +Requires: python3-sqlalchemy +Requires: python3-pysftp +Requires: python3-pandas +Requires: python3-click +Requires: python3-pyyaml +Requires: python3-importlib-metadata +Requires: python3-tentaclio-athena +Requires: python3-tentaclio-databricks +Requires: python3-tentaclio-gdrive +Requires: python3-tentaclio-gs +Requires: python3-tentaclio-postgres +Requires: python3-tentaclio-s3 +Requires: python3-tentaclio-snowflake + +%description +# Tentaclio + +[](https://circleci.com/gh/octoenergy/tentaclio/tree/master) +[](https://tentaclio.readthedocs.io/en/latest/?badge=latest) + +Python library that simplifies: +* Handling streams from different protocols such as `file:`, `ftp:`, `sftp:`, `s3:`, ... +* Opening database connections. +* Managing the credentials in distributed systems. + +Main considerations in the design: +* Easy to use: all streams are open via `tentaclio.open`, all database connections through `tentaclio.db`. +* URLs are the basic resource locator and db connection string. +* Automagic authentication for protected resources. +* Extensible: you can add your own handlers for other schemes. +* Pandas interaction. + +# Quick Examples. + +## Read and write streams. 
+```python +import tentaclio +contents = "👋 🐙" + +with tentaclio.open("ftp://localhost:2021/upload/file.txt", mode="w") as writer: + writer.write(contents) + +# Using boto3 authentication under the hood. +bucket = "s3://my-bucket/octopus/hello.txt" +with tentaclio.open(bucket) as reader: + print(reader.read()) +``` + +## Copy streams +```python +import tentaclio + +tentaclio.copy("/home/constantine/data.csv", "sftp://constantine:tentacl3@sftp.octoenergy.com/uploads/data.csv") +``` +## Delete resources +```python +import tentaclio + +tentaclio.remove("s3://my-bucket/octopus/the-9th-tentacle.txt") +``` +## List resources +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +## Authenticated resources. +```python +import os + +import tentaclio + +print("env ftp credentials", os.getenv("TENTACLIO__CONN__OCTOENERGY_FTP")) +# This prints `sftp://constantine:tentacl3@sftp.octoenergy.com/` + +# Credentials get automatically injected. + +with tentaclio.open("sftp://sftp.octoenergy.com/uploads/data.csv") as reader: + print(reader.read()) +``` + +## Database connections. +```python +import os + +import tentaclio + +print("env TENTACLIO__CONN__DB", os.getenv("TENTACLIO__CONN__DB")) + +# This prints `postgresql://octopus:tentacle@localhost:5444/example` + +# hostname is a wildcard, the credentials get injected. +with tentaclio.db("postgresql://hostname/example") as pg: + results = pg.query("select * from my_table") +``` + +## Pandas interaction. 
+```python +import pandas as pd # 🐼🐼 +import tentaclio # 🐙 + +df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=["col_1", "col_2", "col_3"]) + +bucket = "s3://my-bucket/data/pandas.csv" + +with tentaclio.open(bucket, mode="w") as writer: # supports more pandas readers + df.to_csv(writer, index=False) + +with tentaclio.open(bucket) as reader: + new_df = pd.read_csv(reader) + +# another example: using pandas.DataFrame.to_sql() with tentaclio to upload +with tentaclio.db( + connection_info, + connect_args={'options': '-csearch_path=schema_name'} + ) as client: + df.to_sql( + name='observations', # table name + con=client.conn, + ) +``` + +# Installation + +You can get tentaclio using pip: + +```sh +pip install tentaclio +``` +or pipenv: +```sh +pipenv install tentaclio +``` + +## Developing. + +Clone this repo and install [pipenv](https://pipenv.readthedocs.io/en/latest/). + +In the `Makefile` you'll find some useful targets for linting, testing, etc., e.g.: +```sh +make test +``` + + +## How to use +This is how to use `tentaclio` for your daily data ingestion and storing needs. + +### Streams +The universal function for opening streams to load or store data is: + +```python +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + contents = reader.read() + +with tentaclio.open("s3://bucket/file", mode='w') as writer: + writer.write(contents) + +``` +Allowed modes are `r`, `w`, `rb`, and `wb`. You can use `t` instead of `b` to indicate text streams, but that's the default. + +In order to keep tentaclio as light as possible, it only includes `file`, `ftp`, `sftp`, `http` and `https` schemes by default. 
+However, many more are easily available by installing extra packages: + +Default: +* `/local/file` +* `file:///local/file` +* `ftp://path/to/file` +* `sftp://path/to/file` +* `http://host.com/path/to/resource` +* `https://host.com/path/to/resource` + +[tentaclio-s3](https://github.com/octoenergy/tentaclio-s3) +* `s3://bucket/file` + +[tentaclio-gs](https://github.com/octoenergy/tentaclio-gs) +* `gs://bucket/file` +* `gsc://bucket/file` + +[tentaclio-gdrive](https://github.com/octoenergy/tentaclio-gdrive) +* `gdrive:/My Drive/file` +* `googledrive:/My Drive/file` + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://host/database::table` lets you write CSV-formatted data into a database table with the same column names (note that the table goes after `::` :warning:). + + +You can add the credentials for any of the urls in order to access protected resources. + + +You can use these readers and writers with pandas functions like: + +```python +import pandas as pd +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + df = pd.read_csv(reader) + +[...] + +with tentaclio.open("s3://path/to/my/file", mode='wb') as writer: + df.to_parquet(writer) +``` +`Readers`, `Writers` and their closeable versions can be used anywhere a file-like object is expected; pandas and pickle functions are examples. + +##### Notes on writing files for Spark, Presto, and similar downstream systems + +The default behaviour for the `open` context manager in Python is to create an empty file when opening +it in writable mode. This can be annoying if the process that creates the data within the `with` clause +yields empty dataframes and nothing gets written. The resulting empty file will make Spark and Presto panic. + +To avoid this, we can make the stream _empty safe_: if no writes have been performed, the empty buffer is not flushed and no empty file is created. 
+ + +``` +with tentaclio.make_empty_safe(tentaclio.open("s3://bucket/file.parquet", mode="wb")) as writer: + if not df.empty: + df.to_parquet(writer) +``` + +### File system like operations to resources +#### Listing resources +Some URL schemes allow listing resources in a pythonic way: +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +While `listdir` is convenient, we also offer `scandir`, which returns a list of [DirEntry](https://github.com/octoenergy/tentaclio/blob/ddbc28615de4b99106b956556db74a20e4761afe/src/tentaclio/fs/scanner.py#L13)s, and `walk`. All functions follow their standard library definitions as closely as possible. + + +### Database access + +In order to open db connections you can use `tentaclio.db` and have instant access to postgres, sqlite, athena and mssql. + +```python +import tentaclio + +[...] + +query = "select 1" +with tentaclio.db(POSTGRES_TEST_URL) as client: + result = client.query(query) +[...] +``` + +The supported db schemes are: + +Default: +* `sqlite://` +* `mssql://` +* + Any other scheme supported by sqlalchemy. + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://` + +[tentaclio-athena](https://github.com/octoenergy/tentaclio-athena) +* `awsathena+rest://` + +[tentaclio-databricks](https://github.com/octoenergy/tentaclio-databricks) +* `databricks+thrift://` + +[tentaclio-snowflake](https://github.com/octoenergy/tentaclio-snowflake) +* `snowflake://` + + +#### Extras for databases +For postgres you can set the variable `TENTACLIO__PG_APPLICATION_NAME` and the value will be injected +when connecting to the database. + +### Automatic credentials injection + +1. Configure credentials by using environment variables prefixed with `TENTACLIO__CONN__` (e.g. `TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@ftp.octoenergy.com`). + +2. 
Open a stream: +```python +with tentaclio.open("sftp://ftp.octoenergy.com/file.csv") as reader: + reader.read() +``` +The credentials get injected into the url. + +3. Open a db client: +```python +import tentaclio + +with tentaclio.db("postgresql://hostname/my_data_base") as client: + client.query("select 1") +``` +Note that `hostname` in the url to be authenticated is a wildcard that will match any hostname. So `authenticate("http://hostname/file.txt")` will be expanded to `http://user:pass@octo.co/file.txt` if the credential for `http://user:pass@octo.co/` exists. + +Different components of the URL are set differently: +- Scheme and path will be set from the URL, and null if missing. +- Username, password and hostname will be set from the stored credentials. +- Port will be set from the stored credentials if it exists, otherwise from the URL. +- Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be + overridden) + +#### Credentials file + +You can also set a credentials file that looks like: +``` +secrets: + db_1: postgresql://user1:pass1@myhost.com/database_1 + db_2: mssql://user2:pass2@otherhost.com/database_2?driver=ODBC+Driver+17+for+SQL+Server + ftp_server: ftp://fuser:fpass@ftp.myhost.com +``` +And make it accessible to tentaclio by setting the environment variable `TENTACLIO__SECRETS_FILE`. The actual name of each url is for traceability and has no effect on the functionality. + +(Note that you may need to add `?driver={driver from /usr/local/etc/odbcinst.ini}` for mssql database connection strings; see above example) + +Alternatively you can run `curl https://raw.githubusercontent.com/octoenergy/tentaclio/master/extras/init_tentaclio.sh` to create a secrets file in `~/.tentaclio.yml` and +automatically configure your environment. + +## Quick note on protocols structural subtyping. 
+ +In order to abstract concrete dependencies from the implementation of data related functions (or in any part of the system really) we use typed [protocols](https://mypy.readthedocs.io/en/latest/protocols.html#simple-user-defined-protocols). This allows a more flexible dependency injection than using subclassing or [more complex approaches](http://code.activestate.com/recipes/413268/). This idea is heavily inspired by how this exact thing is done in [Go](https://www.youtube.com/watch?v=ifBUfIb7kdo). Learn more about this principle in our [tech blog](https://tech.octopus.energy/news/2019/03/21/python-interfaces-a-la-go.html). + + + + +%package -n python3-tentaclio +Summary: Unification of data connectors for distributed data tasks +Provides: python-tentaclio +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-tentaclio +# Tentaclio + +[](https://circleci.com/gh/octoenergy/tentaclio/tree/master) +[](https://tentaclio.readthedocs.io/en/latest/?badge=latest) + +Python library that simplifies: +* Handling streams from different protocols such as `file:`, `ftp:`, `sftp:`, `s3:`, ... +* Opening database connections. +* Managing the credentials in distributed systems. + +Main considerations in the design: +* Easy to use: all streams are open via `tentaclio.open`, all database connections through `tentaclio.db`. +* URLs are the basic resource locator and db connection string. +* Automagic authentication for protected resources. +* Extensible: you can add your own handlers for other schemes. +* Pandas interaction. + +# Quick Examples. + +## Read and write streams. +```python +import tentaclio +contents = "👋 🐙" + +with tentaclio.open("ftp://localhost:2021/upload/file.txt", mode="w") as writer: + writer.write(contents) + +# Using boto3 authentication under the hood. 
+bucket = "s3://my-bucket/octopus/hello.txt" +with tentaclio.open(bucket) as reader: + print(reader.read()) +``` + +## Copy streams +```python +import tentaclio + +tentaclio.copy("/home/constantine/data.csv", "sftp://constantine:tentacl3@sftp.octoenergy.com/uploads/data.csv") +``` +## Delete resources +```python +import tentaclio + +tentaclio.remove("s3://my-bucket/octopus/the-9th-tentacle.txt") +``` +## List resources +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +## Authenticated resources. +```python +import os + +import tentaclio + +print("env ftp credentials", os.getenv("TENTACLIO__CONN__OCTOENERGY_FTP")) +# This prints `sftp://constantine:tentacl3@sftp.octoenergy.com/` + +# Credentials get automatically injected. + +with tentaclio.open("sftp://sftp.octoenergy.com/uploads/data.csv") as reader: + print(reader.read()) +``` + +## Database connections. +```python +import os + +import tentaclio + +print("env TENTACLIO__CONN__DB", os.getenv("TENTACLIO__CONN__DB")) + +# This prints `postgresql://octopus:tentacle@localhost:5444/example` + +# hostname is a wildcard, the credentials get injected. +with tentaclio.db("postgresql://hostname/example") as pg: + results = pg.query("select * from my_table") +``` + +## Pandas interaction. 
+```python +import pandas as pd # 🐼🐼 +import tentaclio # 🐙 + +df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=["col_1", "col_2", "col_3"]) + +bucket = "s3://my-bucket/data/pandas.csv" + +with tentaclio.open(bucket, mode="w") as writer: # supports more pandas readers + df.to_csv(writer, index=False) + +with tentaclio.open(bucket) as reader: + new_df = pd.read_csv(reader) + +# another example: using pandas.DataFrame.to_sql() with tentaclio to upload +with tentaclio.db( + connection_info, + connect_args={'options': '-csearch_path=schema_name'} + ) as client: + df.to_sql( + name='observations', # table name + con=client.conn, + ) +``` + +# Installation + +You can get tentaclio using pip: + +```sh +pip install tentaclio +``` +or pipenv: +```sh +pipenv install tentaclio +``` + +## Developing. + +Clone this repo and install [pipenv](https://pipenv.readthedocs.io/en/latest/). + +In the `Makefile` you'll find some useful targets for linting, testing, etc., e.g.: +```sh +make test +``` + + +## How to use +This is how to use `tentaclio` for your daily data ingestion and storing needs. + +### Streams +The universal function for opening streams to load or store data is: + +```python +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + contents = reader.read() + +with tentaclio.open("s3://bucket/file", mode='w') as writer: + writer.write(contents) + +``` +Allowed modes are `r`, `w`, `rb`, and `wb`. You can use `t` instead of `b` to indicate text streams, but that's the default. + +In order to keep tentaclio as light as possible, it only includes `file`, `ftp`, `sftp`, `http` and `https` schemes by default. 
+However, many more are easily available by installing extra packages: + +Default: +* `/local/file` +* `file:///local/file` +* `ftp://path/to/file` +* `sftp://path/to/file` +* `http://host.com/path/to/resource` +* `https://host.com/path/to/resource` + +[tentaclio-s3](https://github.com/octoenergy/tentaclio-s3) +* `s3://bucket/file` + +[tentaclio-gs](https://github.com/octoenergy/tentaclio-gs) +* `gs://bucket/file` +* `gsc://bucket/file` + +[tentaclio-gdrive](https://github.com/octoenergy/tentaclio-gdrive) +* `gdrive:/My Drive/file` +* `googledrive:/My Drive/file` + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://host/database::table` lets you write CSV-formatted data into a database table with the same column names (note that the table goes after `::` :warning:). + + +You can add the credentials for any of the urls in order to access protected resources. + + +You can use these readers and writers with pandas functions like: + +```python +import pandas as pd +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + df = pd.read_csv(reader) + +[...] + +with tentaclio.open("s3://path/to/my/file", mode='wb') as writer: + df.to_parquet(writer) +``` +`Readers`, `Writers` and their closeable versions can be used anywhere a file-like object is expected; pandas and pickle functions are examples. + +##### Notes on writing files for Spark, Presto, and similar downstream systems + +The default behaviour for the `open` context manager in Python is to create an empty file when opening +it in writable mode. This can be annoying if the process that creates the data within the `with` clause +yields empty dataframes and nothing gets written. The resulting empty file will make Spark and Presto panic. + +To avoid this, we can make the stream _empty safe_: if no writes have been performed, the empty buffer is not flushed and no empty file is created. 
+ + +``` +with tentaclio.make_empty_safe(tentaclio.open("s3://bucket/file.parquet", mode="wb")) as writer: + if not df.empty: + df.to_parquet(writer) +``` + +### File system like operations to resources +#### Listing resources +Some URL schemes allow listing resources in a pythonic way: +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +While `listdir` is convenient, we also offer `scandir`, which returns a list of [DirEntry](https://github.com/octoenergy/tentaclio/blob/ddbc28615de4b99106b956556db74a20e4761afe/src/tentaclio/fs/scanner.py#L13)s, and `walk`. All functions follow their standard library definitions as closely as possible. + + +### Database access + +In order to open db connections you can use `tentaclio.db` and have instant access to postgres, sqlite, athena and mssql. + +```python +import tentaclio + +[...] + +query = "select 1" +with tentaclio.db(POSTGRES_TEST_URL) as client: + result = client.query(query) +[...] +``` + +The supported db schemes are: + +Default: +* `sqlite://` +* `mssql://` +* + Any other scheme supported by sqlalchemy. + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://` + +[tentaclio-athena](https://github.com/octoenergy/tentaclio-athena) +* `awsathena+rest://` + +[tentaclio-databricks](https://github.com/octoenergy/tentaclio-databricks) +* `databricks+thrift://` + +[tentaclio-snowflake](https://github.com/octoenergy/tentaclio-snowflake) +* `snowflake://` + + +#### Extras for databases +For postgres you can set the variable `TENTACLIO__PG_APPLICATION_NAME` and the value will be injected +when connecting to the database. + +### Automatic credentials injection + +1. Configure credentials by using environment variables prefixed with `TENTACLIO__CONN__` (e.g. `TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@ftp.octoenergy.com`). + +2. 
Open a stream: +```python +with tentaclio.open("sftp://ftp.octoenergy.com/file.csv") as reader: + reader.read() +``` +The credentials get injected into the url. + +3. Open a db client: +```python +import tentaclio + +with tentaclio.db("postgresql://hostname/my_data_base") as client: + client.query("select 1") +``` +Note that `hostname` in the url to be authenticated is a wildcard that will match any hostname. So `authenticate("http://hostname/file.txt")` will be expanded to `http://user:pass@octo.co/file.txt` if the credential for `http://user:pass@octo.co/` exists. + +Different components of the URL are set differently: +- Scheme and path will be set from the URL, and null if missing. +- Username, password and hostname will be set from the stored credentials. +- Port will be set from the stored credentials if it exists, otherwise from the URL. +- Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be + overridden) + +#### Credentials file + +You can also set a credentials file that looks like: +``` +secrets: + db_1: postgresql://user1:pass1@myhost.com/database_1 + db_2: mssql://user2:pass2@otherhost.com/database_2?driver=ODBC+Driver+17+for+SQL+Server + ftp_server: ftp://fuser:fpass@ftp.myhost.com +``` +And make it accessible to tentaclio by setting the environment variable `TENTACLIO__SECRETS_FILE`. The actual name of each url is for traceability and has no effect on the functionality. + +(Note that you may need to add `?driver={driver from /usr/local/etc/odbcinst.ini}` for mssql database connection strings; see above example) + +Alternatively you can run `curl https://raw.githubusercontent.com/octoenergy/tentaclio/master/extras/init_tentaclio.sh` to create a secrets file in `~/.tentaclio.yml` and +automatically configure your environment. + +## Quick note on protocols structural subtyping. 
+ +In order to abstract concrete dependencies from the implementation of data related functions (or in any part of the system really) we use typed [protocols](https://mypy.readthedocs.io/en/latest/protocols.html#simple-user-defined-protocols). This allows a more flexible dependency injection than using subclassing or [more complex approaches](http://code.activestate.com/recipes/413268/). This idea is heavily inspired by how this exact thing is done in [Go](https://www.youtube.com/watch?v=ifBUfIb7kdo). Learn more about this principle in our [tech blog](https://tech.octopus.energy/news/2019/03/21/python-interfaces-a-la-go.html). + + + + +%package help +Summary: Development documents and examples for tentaclio +Provides: python3-tentaclio-doc +%description help +# Tentaclio + +[](https://circleci.com/gh/octoenergy/tentaclio/tree/master) +[](https://tentaclio.readthedocs.io/en/latest/?badge=latest) + +Python library that simplifies: +* Handling streams from different protocols such as `file:`, `ftp:`, `sftp:`, `s3:`, ... +* Opening database connections. +* Managing the credentials in distributed systems. + +Main considerations in the design: +* Easy to use: all streams are open via `tentaclio.open`, all database connections through `tentaclio.db`. +* URLs are the basic resource locator and db connection string. +* Automagic authentication for protected resources. +* Extensible: you can add your own handlers for other schemes. +* Pandas interaction. + +# Quick Examples. + +## Read and write streams. +```python +import tentaclio +contents = "👋 🐙" + +with tentaclio.open("ftp://localhost:2021/upload/file.txt", mode="w") as writer: + writer.write(contents) + +# Using boto3 authentication under the hood. 
+bucket = "s3://my-bucket/octopus/hello.txt" +with tentaclio.open(bucket) as reader: + print(reader.read()) +``` + +## Copy streams +```python +import tentaclio + +tentaclio.copy("/home/constantine/data.csv", "sftp://constantine:tentacl3@sftp.octoenergy.com/uploads/data.csv") +``` +## Delete resources +```python +import tentaclio + +tentaclio.remove("s3://my-bucket/octopus/the-9th-tentacle.txt") +``` +## List resources +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +## Authenticated resources. +```python +import os + +import tentaclio + +print("env ftp credentials", os.getenv("TENTACLIO__CONN__OCTOENERGY_FTP")) +# This prints `sftp://constantine:tentacl3@sftp.octoenergy.com/` + +# Credentials get automatically injected. + +with tentaclio.open("sftp://sftp.octoenergy.com/uploads/data.csv") as reader: + print(reader.read()) +``` + +## Database connections. +```python +import os + +import tentaclio + +print("env TENTACLIO__CONN__DB", os.getenv("TENTACLIO__CONN__DB")) + +# This prints `postgresql://octopus:tentacle@localhost:5444/example` + +# hostname is a wildcard, the credentials get injected. +with tentaclio.db("postgresql://hostname/example") as pg: + results = pg.query("select * from my_table") +``` + +## Pandas interaction. 
+```python +import pandas as pd # 🐼🐼 +import tentaclio # 🐙 + +df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=["col_1", "col_2", "col_3"]) + +bucket = "s3://my-bucket/data/pandas.csv" + +with tentaclio.open(bucket, mode="w") as writer: # supports more pandas readers + df.to_csv(writer, index=False) + +with tentaclio.open(bucket) as reader: + new_df = pd.read_csv(reader) + +# another example: using pandas.DataFrame.to_sql() with tentaclio to upload +with tentaclio.db( + connection_info, + connect_args={'options': '-csearch_path=schema_name'} + ) as client: + df.to_sql( + name='observations', # table name + con=client.conn, + ) +``` + +# Installation + +You can get tentaclio using pip: + +```sh +pip install tentaclio +``` +or pipenv: +```sh +pipenv install tentaclio +``` + +## Developing. + +Clone this repo and install [pipenv](https://pipenv.readthedocs.io/en/latest/). + +In the `Makefile` you'll find some useful targets for linting, testing, etc., e.g.: +```sh +make test +``` + + +## How to use +This is how to use `tentaclio` for your daily data ingestion and storing needs. + +### Streams +The universal function for opening streams to load or store data is: + +```python +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + contents = reader.read() + +with tentaclio.open("s3://bucket/file", mode='w') as writer: + writer.write(contents) + +``` +Allowed modes are `r`, `w`, `rb`, and `wb`. You can use `t` instead of `b` to indicate text streams, but that's the default. + +In order to keep tentaclio as light as possible, it only includes `file`, `ftp`, `sftp`, `http` and `https` schemes by default. 
+However, many more are easily available by installing extra packages: + +Default: +* `/local/file` +* `file:///local/file` +* `ftp://path/to/file` +* `sftp://path/to/file` +* `http://host.com/path/to/resource` +* `https://host.com/path/to/resource` + +[tentaclio-s3](https://github.com/octoenergy/tentaclio-s3) +* `s3://bucket/file` + +[tentaclio-gs](https://github.com/octoenergy/tentaclio-gs) +* `gs://bucket/file` +* `gsc://bucket/file` + +[tentaclio-gdrive](https://github.com/octoenergy/tentaclio-gdrive) +* `gdrive:/My Drive/file` +* `googledrive:/My Drive/file` + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://host/database::table` lets you write CSV-formatted data into a database table with the same column names (note that the table goes after `::` :warning:). + + +You can add the credentials for any of the urls in order to access protected resources. + + +You can use these readers and writers with pandas functions like: + +```python +import pandas as pd +import tentaclio + +with tentaclio.open("/path/to/my/file") as reader: + df = pd.read_csv(reader) + +[...] + +with tentaclio.open("s3://path/to/my/file", mode='wb') as writer: + df.to_parquet(writer) +``` +`Readers`, `Writers` and their closeable versions can be used anywhere a file-like object is expected; pandas and pickle functions are examples. + +##### Notes on writing files for Spark, Presto, and similar downstream systems + +The default behaviour for the `open` context manager in Python is to create an empty file when opening +it in writable mode. This can be annoying if the process that creates the data within the `with` clause +yields empty dataframes and nothing gets written. The resulting empty file will make Spark and Presto panic. + +To avoid this, we can make the stream _empty safe_: if no writes have been performed, the empty buffer is not flushed and no empty file is created. 
+ + +``` +with tentaclio.make_empty_safe(tentaclio.open("s3://bucket/file.parquet", mode="wb")) as writer: + if not df.empty: + df.to_parquet(writer) +``` + +### File system like operations to resources +#### Listing resources +Some URL schemes allow listing resources in a pythonic way: +```python +import tentaclio + +for entry in tentaclio.listdir("s3://mybucket/path/to/dir"): + print("Entry", entry) +``` + +While `listdir` is convenient, we also offer `scandir`, which returns a list of [DirEntry](https://github.com/octoenergy/tentaclio/blob/ddbc28615de4b99106b956556db74a20e4761afe/src/tentaclio/fs/scanner.py#L13)s, and `walk`. All functions follow their standard library definitions as closely as possible. + + +### Database access + +In order to open db connections you can use `tentaclio.db` and have instant access to postgres, sqlite, athena and mssql. + +```python +import tentaclio + +[...] + +query = "select 1" +with tentaclio.db(POSTGRES_TEST_URL) as client: + result = client.query(query) +[...] +``` + +The supported db schemes are: + +Default: +* `sqlite://` +* `mssql://` +* + Any other scheme supported by sqlalchemy. + +[tentaclio-postgres](https://github.com/octoenergy/tentaclio-postgres) +* `postgresql://` + +[tentaclio-athena](https://github.com/octoenergy/tentaclio-athena) +* `awsathena+rest://` + +[tentaclio-databricks](https://github.com/octoenergy/tentaclio-databricks) +* `databricks+thrift://` + +[tentaclio-snowflake](https://github.com/octoenergy/tentaclio-snowflake) +* `snowflake://` + + +#### Extras for databases +For postgres you can set the variable `TENTACLIO__PG_APPLICATION_NAME` and the value will be injected +when connecting to the database. + +### Automatic credentials injection + +1. Configure credentials by using environment variables prefixed with `TENTACLIO__CONN__` (e.g. `TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@ftp.octoenergy.com`). + +2. 
Open a stream: +```python +with tentaclio.open("sftp://ftp.octoenergy.com/file.csv") as reader: + reader.read() +``` +The credentials get injected into the url. + +3. Open a db client: +```python +import tentaclio + +with tentaclio.db("postgresql://hostname/my_data_base") as client: + client.query("select 1") +``` +Note that `hostname` in the url to be authenticated is a wildcard that will match any hostname. So `authenticate("http://hostname/file.txt")` will be expanded to `http://user:pass@octo.co/file.txt` if the credential for `http://user:pass@octo.co/` exists. + +Different components of the URL are set differently: +- Scheme and path will be set from the URL, and null if missing. +- Username, password and hostname will be set from the stored credentials. +- Port will be set from the stored credentials if it exists, otherwise from the URL. +- Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be + overridden) + +#### Credentials file + +You can also set a credentials file that looks like: +``` +secrets: + db_1: postgresql://user1:pass1@myhost.com/database_1 + db_2: mssql://user2:pass2@otherhost.com/database_2?driver=ODBC+Driver+17+for+SQL+Server + ftp_server: ftp://fuser:fpass@ftp.myhost.com +``` +And make it accessible to tentaclio by setting the environment variable `TENTACLIO__SECRETS_FILE`. The actual name of each url is for traceability and has no effect on the functionality. + +(Note that you may need to add `?driver={driver from /usr/local/etc/odbcinst.ini}` for mssql database connection strings; see above example) + +Alternatively you can run `curl https://raw.githubusercontent.com/octoenergy/tentaclio/master/extras/init_tentaclio.sh` to create a secrets file in `~/.tentaclio.yml` and +automatically configure your environment. + +## Quick note on protocols structural subtyping. 
+ +In order to abstract concrete dependencies from the implementation of data related functions (or in any part of the system really) we use typed [protocols](https://mypy.readthedocs.io/en/latest/protocols.html#simple-user-defined-protocols). This allows a more flexible dependency injection than using subclassing or [more complex approaches](http://code.activestate.com/recipes/413268/). This idea is heavily inspired by how this exact thing is done in [Go](https://www.youtube.com/watch?v=ifBUfIb7kdo). Learn more about this principle in our [tech blog](https://tech.octopus.energy/news/2019/03/21/python-interfaces-a-la-go.html). + + + + +%prep +%autosetup -n tentaclio-1.1.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-tentaclio -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 1.1.0-1 +- Package Spec generated @@ -0,0 +1 @@ +10078558b8b3523e84b0be856df7198f tentaclio-1.1.0.tar.gz