%global _empty_manifest_terminate_build 0
Name:           python-spark-etl
Version:        0.0.122
Release:        1
Summary:        Generic ETL Pipeline Framework for Apache Spark
License:        MIT
URL:            https://github.com/stonezhong/spark_etl
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/9e/b4/940f4b3aea2b51b6358bfa552ea03c04001978cbd16694f666a129e5f97a/spark-etl-0.0.122.tar.gz
BuildArch:      noarch

Requires:       python3-requests
Requires:       python3-Jinja2
Requires:       python3-termcolor

%description
* [Overview](#overview)
  * [Goal](#goal)
  * [Benefit](#benefit)
  * [Application](#application)
  * [Build your application](#build_your_application)
  * [Deploy your application](#deploy_your_application)
  * [Run your application](#run_your_application)
  * [Supported platforms](#supported_platforms)
* [Demos](#demos)
* [APIs](#apis)
  * [Job Deployer](#job-deployer)
  * [Job Submitter](#job-submitter)

# Overview

## Goal

Many public clouds provide managed Apache Spark as a service, for example Databricks, AWS EMR and Oracle OCI Data Flow; see the table below for a detailed list. However, the way you deploy and launch a Spark application differs from one cloud Spark platform to another. spark-etl is a Python package that provides a standard way to build, deploy and run your Spark application across these platforms.

## Benefit

An application built with `spark-etl` can be deployed to and launched on different cloud Spark platforms without changing its source code.

## Application

An application is a Python program. It contains:
* a `main.py` file, which contains the application entry point;
* a `manifest.json` file, which specifies the application's metadata;
* a `requirements.txt` file, which specifies the application's dependencies.

### Application entry signature

In your application's `main.py`, you should have a `main` function with the following signature:
* `spark` is the Spark session object.
* `input_args`, a dict, holds the arguments the user specified when running the application.
* `sysops` holds platform-specific system options; the job submitter may inject platform-specific objects into it.
* Your `main` function's return value should be a JSON object; it will be returned from the job submitter to the caller.
```
def main(spark, input_args, sysops={}):
    # your code here
```
[Here](examples/apps/demo01) is an application example.
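For illustration, below is a slightly fuller sketch of such a `main.py`. Only the function signature comes from the contract above; the DataFrame contents, the `limit` argument and the returned fields are made-up placeholders, not part of the spark-etl API.
```
# main.py -- an illustrative sketch of the entry-point contract described above.
# The DataFrame contents, the "limit" argument and the returned fields are
# made-up placeholders, not part of the spark-etl API.

def main(spark, input_args, sysops={}):
    # `spark` is the Spark session provided by the platform's job submitter.
    # `input_args` is the dict passed via --run-args when the job is launched.
    limit = int(input_args.get("limit", 10))

    # Build a tiny DataFrame so the example is self-contained.
    df = spark.createDataFrame(
        [(i, i * i) for i in range(100)],
        ["n", "n_squared"],
    )
    rows = df.limit(limit).collect()

    # The return value must be JSON-serializable; the job submitter hands it
    # back to the caller.
    return {
        "row_count": len(rows),
        "first_row": rows[0].asDict() if rows else None,
    }
```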
## Build your application

`etl -a build -c <config-file> -p <application-dir>`

## Deploy your application

`etl -a deploy -c <config-file> -p <application-dir> -f <deployment-location>`

## Run your application

`etl -a run -c <config-file> -p <application-dir> -f <deployment-location> --run-args <run-args>`

## Supported platforms

| Platform                              | Notes                                                                                                                           |
|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| Apache Spark                          | You set up your own Apache Spark cluster.                                                                                         |
| PySpark (local)                       | Uses the PySpark package; fully compatible with the other Spark platforms, so you can test your pipeline on a single computer.    |
| Databricks                            | You host your Spark cluster in Databricks.                                                                                        |
| Amazon AWS EMR                        | You host your Spark cluster in Amazon AWS EMR.                                                                                    |
| Google Cloud                          | You host your Spark cluster in Google Cloud.                                                                                      |
| Microsoft Azure HDInsight             | You host your Spark cluster in Microsoft Azure HDInsight.                                                                         |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the Oracle Cloud Infrastructure Data Flow service.                                                 |
| IBM Cloud                             | You host your Spark cluster in IBM Cloud.                                                                                         |
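Since the PySpark target in the table above runs on a single computer, you can also exercise the `main` entry point directly with a locally created Spark session during development. The snippet below is only a sketch: it assumes the `pyspark` package is installed and that `main.py` is importable from the current directory; the argument values are placeholders, and this bypasses the `etl` CLI entirely.
```
# local_test.py -- a hypothetical local smoke test for the entry point above.
# Assumes `pip install pyspark` and that main.py sits next to this file;
# this bypasses the etl CLI and is for local development only.
from pyspark.sql import SparkSession

from main import main  # the application entry point described above

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("spark-etl-local-test")
        .getOrCreate()
    )
    try:
        result = main(spark, input_args={"limit": 5}, sysops={})
        print(result)  # main() returns a JSON-serializable object
    finally:
        spark.stop()
```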
# Demos

* [Using local PySpark, access data on local disk](examples/pyspark_local/readme.md)
* [Using local PySpark, access data on AWS S3](examples/pyspark_s3/readme.md)
* [Using on-premise Spark, access data on HDFS](examples/livy_hdfs1/readme.md)
* [Using on-premise Spark, access data on AWS S3](examples/livy_hdfs2/readme.md)
* [Using AWS EMR's Spark, access data on AWS S3](examples/aws_emr/readme.md)
* [Using Oracle OCI Data Flow with an API key, access data on Object Storage](examples/oci_dataflow1/readme.md)
* [Using Oracle OCI Data Flow with an instance principal, access data on Object Storage](examples/oci_dataflow2/readme.md)

# APIs

[pydocs for APIs](https://stonezhong.github.io/spark_etl/pydocs/spark_etl.html)

## Job Deployer

For job deployers, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-deployer-classes).

## Job Submitter

For job submitters, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-submitter-classes).

%package -n python3-spark-etl
Summary:        Generic ETL Pipeline Framework for Apache Spark
Provides:       python-spark-etl
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-spark-etl
* [Overview](#overview)
  * [Goal](#goal)
  * [Benefit](#benefit)
  * [Application](#application)
  * [Build your application](#build_your_application)
  * [Deploy your application](#deploy_your_application)
  * [Run your application](#run_your_application)
  * [Supported platforms](#supported_platforms)
* [Demos](#demos)
* [APIs](#apis)
  * [Job Deployer](#job-deployer)
  * [Job Submitter](#job-submitter)

# Overview

## Goal

Many public clouds provide managed Apache Spark as a service, for example Databricks, AWS EMR and Oracle OCI Data Flow; see the table below for a detailed list. However, the way you deploy and launch a Spark application differs from one cloud Spark platform to another. spark-etl is a Python package that provides a standard way to build, deploy and run your Spark application across these platforms.

## Benefit

An application built with `spark-etl` can be deployed to and launched on different cloud Spark platforms without changing its source code.

## Application

An application is a Python program. It contains:
* a `main.py` file, which contains the application entry point;
* a `manifest.json` file, which specifies the application's metadata;
* a `requirements.txt` file, which specifies the application's dependencies.

### Application entry signature

In your application's `main.py`, you should have a `main` function with the following signature:
* `spark` is the Spark session object.
* `input_args`, a dict, holds the arguments the user specified when running the application.
* `sysops` holds platform-specific system options; the job submitter may inject platform-specific objects into it.
* Your `main` function's return value should be a JSON object; it will be returned from the job submitter to the caller.
```
def main(spark, input_args, sysops={}):
    # your code here
```
[Here](examples/apps/demo01) is an application example.
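For illustration, below is a slightly fuller sketch of such a `main.py`. Only the function signature comes from the contract above; the DataFrame contents, the `limit` argument and the returned fields are made-up placeholders, not part of the spark-etl API.
```
# main.py -- an illustrative sketch of the entry-point contract described above.
# The DataFrame contents, the "limit" argument and the returned fields are
# made-up placeholders, not part of the spark-etl API.

def main(spark, input_args, sysops={}):
    # `spark` is the Spark session provided by the platform's job submitter.
    # `input_args` is the dict passed via --run-args when the job is launched.
    limit = int(input_args.get("limit", 10))

    # Build a tiny DataFrame so the example is self-contained.
    df = spark.createDataFrame(
        [(i, i * i) for i in range(100)],
        ["n", "n_squared"],
    )
    rows = df.limit(limit).collect()

    # The return value must be JSON-serializable; the job submitter hands it
    # back to the caller.
    return {
        "row_count": len(rows),
        "first_row": rows[0].asDict() if rows else None,
    }
```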
## Build your application

`etl -a build -c <config-file> -p <application-dir>`

## Deploy your application

`etl -a deploy -c <config-file> -p <application-dir> -f <deployment-location>`

## Run your application

`etl -a run -c <config-file> -p <application-dir> -f <deployment-location> --run-args <run-args>`

## Supported platforms

| Platform                              | Notes                                                                                                                           |
|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| Apache Spark                          | You set up your own Apache Spark cluster.                                                                                         |
| PySpark (local)                       | Uses the PySpark package; fully compatible with the other Spark platforms, so you can test your pipeline on a single computer.    |
| Databricks                            | You host your Spark cluster in Databricks.                                                                                        |
| Amazon AWS EMR                        | You host your Spark cluster in Amazon AWS EMR.                                                                                    |
| Google Cloud                          | You host your Spark cluster in Google Cloud.                                                                                      |
| Microsoft Azure HDInsight             | You host your Spark cluster in Microsoft Azure HDInsight.                                                                         |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the Oracle Cloud Infrastructure Data Flow service.                                                 |
| IBM Cloud                             | You host your Spark cluster in IBM Cloud.                                                                                         |
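Since the PySpark target in the table above runs on a single computer, you can also exercise the `main` entry point directly with a locally created Spark session during development. The snippet below is only a sketch: it assumes the `pyspark` package is installed and that `main.py` is importable from the current directory; the argument values are placeholders, and this bypasses the `etl` CLI entirely.
```
# local_test.py -- a hypothetical local smoke test for the entry point above.
# Assumes `pip install pyspark` and that main.py sits next to this file;
# this bypasses the etl CLI and is for local development only.
from pyspark.sql import SparkSession

from main import main  # the application entry point described above

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("spark-etl-local-test")
        .getOrCreate()
    )
    try:
        result = main(spark, input_args={"limit": 5}, sysops={})
        print(result)  # main() returns a JSON-serializable object
    finally:
        spark.stop()
```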
# Demos

* [Using local PySpark, access data on local disk](examples/pyspark_local/readme.md)
* [Using local PySpark, access data on AWS S3](examples/pyspark_s3/readme.md)
* [Using on-premise Spark, access data on HDFS](examples/livy_hdfs1/readme.md)
* [Using on-premise Spark, access data on AWS S3](examples/livy_hdfs2/readme.md)
* [Using AWS EMR's Spark, access data on AWS S3](examples/aws_emr/readme.md)
* [Using Oracle OCI Data Flow with an API key, access data on Object Storage](examples/oci_dataflow1/readme.md)
* [Using Oracle OCI Data Flow with an instance principal, access data on Object Storage](examples/oci_dataflow2/readme.md)

# APIs

[pydocs for APIs](https://stonezhong.github.io/spark_etl/pydocs/spark_etl.html)

## Job Deployer

For job deployers, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-deployer-classes).

## Job Submitter

For job submitters, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-submitter-classes).

%package help
Summary:        Development documents and examples for spark-etl
Provides:       python3-spark-etl-doc

%description help
* [Overview](#overview)
  * [Goal](#goal)
  * [Benefit](#benefit)
  * [Application](#application)
  * [Build your application](#build_your_application)
  * [Deploy your application](#deploy_your_application)
  * [Run your application](#run_your_application)
  * [Supported platforms](#supported_platforms)
* [Demos](#demos)
* [APIs](#apis)
  * [Job Deployer](#job-deployer)
  * [Job Submitter](#job-submitter)

# Overview

## Goal

Many public clouds provide managed Apache Spark as a service, for example Databricks, AWS EMR and Oracle OCI Data Flow; see the table below for a detailed list. However, the way you deploy and launch a Spark application differs from one cloud Spark platform to another. spark-etl is a Python package that provides a standard way to build, deploy and run your Spark application across these platforms.

## Benefit

An application built with `spark-etl` can be deployed to and launched on different cloud Spark platforms without changing its source code.

## Application

An application is a Python program. It contains:
* a `main.py` file, which contains the application entry point;
* a `manifest.json` file, which specifies the application's metadata;
* a `requirements.txt` file, which specifies the application's dependencies.

### Application entry signature

In your application's `main.py`, you should have a `main` function with the following signature:
* `spark` is the Spark session object.
* `input_args`, a dict, holds the arguments the user specified when running the application.
* `sysops` holds platform-specific system options; the job submitter may inject platform-specific objects into it.
* Your `main` function's return value should be a JSON object; it will be returned from the job submitter to the caller.
```
def main(spark, input_args, sysops={}):
    # your code here
```
[Here](examples/apps/demo01) is an application example.
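For illustration, below is a slightly fuller sketch of such a `main.py`. Only the function signature comes from the contract above; the DataFrame contents, the `limit` argument and the returned fields are made-up placeholders, not part of the spark-etl API.
```
# main.py -- an illustrative sketch of the entry-point contract described above.
# The DataFrame contents, the "limit" argument and the returned fields are
# made-up placeholders, not part of the spark-etl API.

def main(spark, input_args, sysops={}):
    # `spark` is the Spark session provided by the platform's job submitter.
    # `input_args` is the dict passed via --run-args when the job is launched.
    limit = int(input_args.get("limit", 10))

    # Build a tiny DataFrame so the example is self-contained.
    df = spark.createDataFrame(
        [(i, i * i) for i in range(100)],
        ["n", "n_squared"],
    )
    rows = df.limit(limit).collect()

    # The return value must be JSON-serializable; the job submitter hands it
    # back to the caller.
    return {
        "row_count": len(rows),
        "first_row": rows[0].asDict() if rows else None,
    }
```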
## Build your application

`etl -a build -c <config-file> -p <application-dir>`

## Deploy your application

`etl -a deploy -c <config-file> -p <application-dir> -f <deployment-location>`

## Run your application

`etl -a run -c <config-file> -p <application-dir> -f <deployment-location> --run-args <run-args>`

## Supported platforms

| Platform                              | Notes                                                                                                                           |
|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| Apache Spark                          | You set up your own Apache Spark cluster.                                                                                         |
| PySpark (local)                       | Uses the PySpark package; fully compatible with the other Spark platforms, so you can test your pipeline on a single computer.    |
| Databricks                            | You host your Spark cluster in Databricks.                                                                                        |
| Amazon AWS EMR                        | You host your Spark cluster in Amazon AWS EMR.                                                                                    |
| Google Cloud                          | You host your Spark cluster in Google Cloud.                                                                                      |
| Microsoft Azure HDInsight             | You host your Spark cluster in Microsoft Azure HDInsight.                                                                         |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the Oracle Cloud Infrastructure Data Flow service.                                                 |
| IBM Cloud                             | You host your Spark cluster in IBM Cloud.                                                                                         |
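Since the PySpark target in the table above runs on a single computer, you can also exercise the `main` entry point directly with a locally created Spark session during development. The snippet below is only a sketch: it assumes the `pyspark` package is installed and that `main.py` is importable from the current directory; the argument values are placeholders, and this bypasses the `etl` CLI entirely.
```
# local_test.py -- a hypothetical local smoke test for the entry point above.
# Assumes `pip install pyspark` and that main.py sits next to this file;
# this bypasses the etl CLI and is for local development only.
from pyspark.sql import SparkSession

from main import main  # the application entry point described above

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("spark-etl-local-test")
        .getOrCreate()
    )
    try:
        result = main(spark, input_args={"limit": 5}, sysops={})
        print(result)  # main() returns a JSON-serializable object
    finally:
        spark.stop()
```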
# Demos

* [Using local PySpark, access data on local disk](examples/pyspark_local/readme.md)
* [Using local PySpark, access data on AWS S3](examples/pyspark_s3/readme.md)
* [Using on-premise Spark, access data on HDFS](examples/livy_hdfs1/readme.md)
* [Using on-premise Spark, access data on AWS S3](examples/livy_hdfs2/readme.md)
* [Using AWS EMR's Spark, access data on AWS S3](examples/aws_emr/readme.md)
* [Using Oracle OCI Data Flow with an API key, access data on Object Storage](examples/oci_dataflow1/readme.md)
* [Using Oracle OCI Data Flow with an instance principal, access data on Object Storage](examples/oci_dataflow2/readme.md)

# APIs

[pydocs for APIs](https://stonezhong.github.io/spark_etl/pydocs/spark_etl.html)

## Job Deployer

For job deployers, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-deployer-classes).

## Job Submitter

For job submitters, please check the [wiki](https://github.com/stonezhong/spark_etl/wiki#job-submitter-classes).

%prep
%autosetup -n spark-etl-0.0.122

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-spark-etl -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Mon May 15 2023 Python_Bot - 0.0.122-1
- Package Spec generated