%global _empty_manifest_terminate_build 0
Name: python-pyarrowfs-adlgen2
Version: 0.2.4
Release: 1
Summary: Use pyarrow with Azure Data Lake gen2
License: MIT
URL: https://github.com/kaaveland/pyarrowfs-adlgen2
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d5/87/0b46d3f3781591604d54a9d15771f2a1c5133291cc1a177de3d7e9289b42/pyarrowfs-adlgen2-0.2.4.tar.gz
BuildArch: noarch

Requires: python3-pyarrow
Requires: python3-azure-storage-file-datalake
Requires: python3-pandas
Requires: python3-pytest

%description
pyarrowfs-adlgen2
==

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without copying files to local storage first.

Installation
--

`pip install pyarrowfs-adlgen2`

Reading datasets
--

Example usage with a pandas dataframe:

```python
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
```

Example usage with arrow tables:

```python
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
```

Configuring timeouts
--

Timeouts are passed to azure-storage-file-datalake SDK methods. The timeout unit is seconds.

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20
```

Writing datasets
--

With pyarrow version 3 or greater, you can write datasets from arrow tables:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# table and handler as created in the examples above
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]),
        flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)
```

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, `pyarrow.parquet.ParquetWriter` does not support `pyarrow.fs.PyFileSystem`, but data can be written to open files:

```python
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
```

Or with arrow tables:

```python
import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
```

Accessing only a single container/file-system
--

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use `pyarrowfs_adlgen2.FilesystemHandler` to view a single file system within an account:

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
```

All access is then done through that file system within the storage account.
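
As a hedged aside (not part of the upstream README), reading through such a handler looks much like the account-wide examples above. The sketch below assumes that, because the handler is already scoped to one file system, paths are given without a container prefix; the account, file system, and dataset names are placeholders.

```python
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

# Handler scoped to a single file system (container); all names are placeholders.
handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)

# Assumption: paths resolve inside FS_NAME, so no container prefix is used here.
df = pd.read_parquet('dataset.parq', filesystem=fs)
```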

Set http headers for files for pyarrow >= 5
--

You can set headers for any output files by using the `metadata` argument to `handler.open_output_stream`:

```python
import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata) as out:
    out.write("{}")
```

Note that the spelling is different from what you might expect! For a list of valid keys, see [ContentSettings](https://docs.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.contentsettings?view=azure-python).

You can do this with pyarrow >= 5 when using `pyarrow.fs.PyFileSystem`, and with any pyarrow version when using the handlers from pyarrowfs_adlgen2 directly.

Running tests
--

To run the integration tests, you need:

- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake gen2 account)
- A configured Azure login (e.g. use `$ az login` or set up environment variables; see `azure.identity.DefaultAzureCredential`)
- pytest installed, e.g. `pip install pytest`

**NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT**

```
AZUREARROWFS_TEST_ACT=thestorageaccount pytest
```

%package -n python3-pyarrowfs-adlgen2
Summary: Use pyarrow with Azure Data Lake gen2
Provides: python-pyarrowfs-adlgen2
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-pyarrowfs-adlgen2
pyarrowfs-adlgen2
==

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without copying files to local storage first.

Installation
--

`pip install pyarrowfs-adlgen2`

Reading datasets
--

Example usage with a pandas dataframe:

```python
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
```

Example usage with arrow tables:

```python
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
```
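
As a brief, hedged addition (not part of the upstream README): `ds` in the snippet above is a regular `pyarrow.dataset.Dataset`, so the usual pyarrow scan options apply, for example projecting only the columns you need before converting to pandas. The column name below is a hypothetical placeholder.

```python
# Continues from the snippet above; 'year' is a hypothetical column name.
table = ds.to_table(columns=['year'])  # materialize only the listed columns
df = table.to_pandas()                 # convert the Arrow table to a pandas DataFrame
```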

Configuring timeouts
--

Timeouts are passed to azure-storage-file-datalake SDK methods. The timeout unit is seconds.

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20
```

Writing datasets
--

With pyarrow version 3 or greater, you can write datasets from arrow tables:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# table and handler as created in the examples above
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]),
        flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)
```

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, `pyarrow.parquet.ParquetWriter` does not support `pyarrow.fs.PyFileSystem`, but data can be written to open files:

```python
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
```

Or with arrow tables:

```python
import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
```

Accessing only a single container/file-system
--

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use `pyarrowfs_adlgen2.FilesystemHandler` to view a single file system within an account:

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
```

All access is then done through that file system within the storage account.

Set http headers for files for pyarrow >= 5
--

You can set headers for any output files by using the `metadata` argument to `handler.open_output_stream`:

```python
import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata) as out:
    out.write("{}")
```

Note that the spelling is different from what you might expect! For a list of valid keys, see [ContentSettings](https://docs.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.contentsettings?view=azure-python).

You can do this with pyarrow >= 5 when using `pyarrow.fs.PyFileSystem`, and with any pyarrow version when using the handlers from pyarrowfs_adlgen2 directly.

Running tests
--

To run the integration tests, you need:

- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake gen2 account)
- A configured Azure login (e.g. use `$ az login` or set up environment variables; see `azure.identity.DefaultAzureCredential`)
- pytest installed, e.g. `pip install pytest`

**NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT**

```
AZUREARROWFS_TEST_ACT=thestorageaccount pytest
```

%package help
Summary: Development documents and examples for pyarrowfs-adlgen2
Provides: python3-pyarrowfs-adlgen2-doc

%description help
pyarrowfs-adlgen2
==

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without copying files to local storage first.

Installation
--

`pip install pyarrowfs-adlgen2`

Reading datasets
--

Example usage with a pandas dataframe:

```python
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
```

Example usage with arrow tables:

```python
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
```

Configuring timeouts
--

Timeouts are passed to azure-storage-file-datalake SDK methods. The timeout unit is seconds.

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20
```

Writing datasets
--

With pyarrow version 3 or greater, you can write datasets from arrow tables:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# table and handler as created in the examples above
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]),
        flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)
```

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, `pyarrow.parquet.ParquetWriter` does not support `pyarrow.fs.PyFileSystem`, but data can be written to open files:

```python
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
```

Or with arrow tables:

```python
import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
```

Accessing only a single container/file-system
--

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use `pyarrowfs_adlgen2.FilesystemHandler` to view a single file system within an account:

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
```

All access is then done through that file system within the storage account.

Set http headers for files for pyarrow >= 5
--

You can set headers for any output files by using the `metadata` argument to `handler.open_output_stream`:

```python
import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata) as out:
    out.write("{}")
```

Note that the spelling is different from what you might expect! For a list of valid keys, see [ContentSettings](https://docs.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.contentsettings?view=azure-python).

You can do this with pyarrow >= 5 when using `pyarrow.fs.PyFileSystem`, and with any pyarrow version when using the handlers from pyarrowfs_adlgen2 directly.

Running tests
--

To run the integration tests, you need:

- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake gen2 account)
- A configured Azure login (e.g. use `$ az login` or set up environment variables; see `azure.identity.DefaultAzureCredential`)
- pytest installed, e.g. `pip install pytest`

**NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT**

```
AZUREARROWFS_TEST_ACT=thestorageaccount pytest
```

%prep
%autosetup -n pyarrowfs-adlgen2-0.2.4

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-pyarrowfs-adlgen2 -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Apr 11 2023 Python_Bot - 0.2.4-1
- Package Spec generated