PySpark Integration

pytd.spark.download_td_spark(spark_binary_version='2.11', version='latest', destination=None)[source]

Download a td-spark jar file from S3.

Parameters
spark_binary_version : string, default: ‘2.11’

Apache Spark binary version.

version : string, default: ‘latest’

td-spark version.

destination : string, optional

Where the downloaded jar file will be stored.
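A minimal usage sketch, assuming pytd is installed; the destination file name is an illustrative assumption, not a pytd default, and the import guard only keeps the example runnable where pytd is absent.

```python
# Sketch: fetch the td-spark jar for the Scala 2.11 Spark binaries.
# The guard is for environments where pytd is not installed.
try:
    import pytd.spark
    HAS_PYTD = True
except ImportError:
    HAS_PYTD = False

# Assumed target file name for the jar; adjust to your environment.
destination = "td-spark-assembly.jar"

if HAS_PYTD:
    pytd.spark.download_td_spark(
        spark_binary_version="2.11",  # Apache Spark binary (Scala) version
        version="latest",             # td-spark release to fetch
        destination=destination,
    )
```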

pytd.spark.fetch_td_spark_context(apikey=None, endpoint=None, td_spark_path=None, download_if_missing=True, spark_configs=None)[source]

Build TDSparkContext via td-pyspark.

Parameters
apikey : string, optional

Treasure Data API key. If not given, the value of the environment variable TD_API_KEY is used by default.

endpoint : string, optional

Treasure Data API server. If not given, https://api.treasuredata.com is used by default. A list of available endpoints can be found at: https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints

td_spark_path : string, optional

Path to td-spark-assembly_x.xx-x.x.x.jar. If not given, the path returned by TDSparkContextBuilder.default_jar_path() is used by default.

download_if_missing : boolean, default: True

Download td-spark if it does not exist at the time of initialization.

spark_configs : dict, optional

Additional Spark configurations to be set via SparkConf’s set method.

Returns
td_pyspark.TDSparkContext

Return type
td_pyspark.TDSparkContext

pytd.spark.fetch_td_spark_context() returns a td_pyspark.TDSparkContext instance. See the documentation below and the sample usage on Google Colab for more information.
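The pieces above can be put together as follows. This is a hedged sketch, assuming pytd and td-pyspark are installed and TD_API_KEY is set; the spark_configs keys are ordinary Spark properties chosen for illustration, and "sample_datasets.www_access" is Treasure Data's public sample table.

```python
import os

# Additional Spark settings, passed through to SparkConf's set method.
# These keys are standard Spark configuration properties.
spark_configs = {
    "spark.ui.enabled": "false",
    "spark.sql.shuffle.partitions": "8",
}

# The guard is for environments where pytd is not installed.
try:
    import pytd.spark
    HAS_PYTD = True
except ImportError:
    HAS_PYTD = False

if HAS_PYTD and os.getenv("TD_API_KEY"):
    td = pytd.spark.fetch_td_spark_context(
        apikey=os.environ["TD_API_KEY"],
        endpoint="https://api.treasuredata.com",
        download_if_missing=True,  # fetch the td-spark jar if absent
        spark_configs=spark_configs,
    )
    # td is a td_pyspark.TDSparkContext; read a sample table into a
    # PySpark DataFrame and preview a few rows.
    df = td.table("sample_datasets.www_access").df()
    df.show(5)
```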