
Understanding Spark’s SparkConf, SparkContext, SQLContext and HiveContext


The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's own cluster manager. To create a SparkContext you first create a SparkConf. The SparkConf stores the configuration parameters your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your Spark driver application, and some are used by Spark to allocate resources on the cluster, such as the number of executors running on the worker nodes and the memory and cores each one uses. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI. You can review the Spark 1.3.1 documentation for SparkConf to get a complete list of parameters: SparkConf documentation

 

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

 

Now that we have a SparkConf, we can pass it to a SparkContext so our driver application knows how to access the cluster.

 

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)

 

You can review the Spark 1.3.1 documentation for SparkContext to get a complete list of its parameters and methods: SparkContext documentation

Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's ResourceManager (head node) and NodeManagers (worker nodes) work together to allocate containers for the executors. If the resources are available on the cluster, the executors allocate memory and cores based on your configuration parameters. If you are using Spark's cluster manager, the Spark Master (head node) and Spark Slaves (worker nodes) are used to allocate the executors. Below is a diagram showing the relationship between the driver applications, the cluster resource manager and the executors.

[Diagram: Spark driver applications, the cluster resource manager, and their executors]
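As a sketch of how those executor resources are requested (the property values and the yarn-client master below are assumptions for illustration; the property names are standard Spark 1.3.1 settings), you can set them on the SparkConf before creating the SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

// Example values only; size these to your cluster's capacity.
val conf = new SparkConf()
  .setAppName("MySparkDriverApp")
  .setMaster("yarn-client")                 // ask YARN for resources instead of the standalone master
  .set("spark.executor.instances", "4")     // number of executors (YARN)
  .set("spark.executor.memory", "2g")       // memory per executor
  .set("spark.executor.cores", "2")         // cores per executor

val sc = new SparkContext(conf)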

Each Spark driver application has its own executors on the cluster, which keep running as long as the Spark driver application has an active SparkContext. The executors run your code, perform computations and can cache data for your application. For each action, the SparkContext creates a job that is broken into stages; the stages are broken into tasks, which the SparkContext schedules on the executors.
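For example, here is a minimal word count sketch (the input path is hypothetical): the count() action triggers a job, the shuffle introduced by reduceByKey splits that job into two stages, and each stage runs as tasks on the executors, which can also cache the intermediate result.

// Assumes sc is the SparkContext created above; the input path is a made-up example.
val counts = sc.textFile("/example/data/input.txt")   // transformations are lazy, no job yet
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                  // shuffle boundary: second stage

counts.cache()                                         // executors keep the result in memory for reuse
val distinctWords = counts.count()                     // action: the SparkContext schedules the job's tasks on executors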

One of Spark's modules is SparkSQL. SparkSQL is used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL provides DataFrames and a SQL query engine. SparkSQL has a SQLContext and a HiveContext, and the HiveContext is a superset of the SQLContext; Hortonworks and the Spark community suggest using the HiveContext. You can see below that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. You can review the Spark 1.3.1 documentation for SQLContext and HiveContext at SQLContext documentation and HiveContext documentation

[Screenshot: spark-shell startup output showing sc created as a SparkContext and sqlContext created as a HiveContext]
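As a sketch (the table name is an assumption; HDInsight clusters typically ship with a hivesampletable, but substitute your own), in spark-shell you can use the automatically created sc and sqlContext directly, while in your own driver application you create the HiveContext from the SparkContext:

import org.apache.spark.sql.hive.HiveContext

// In spark-shell this already exists as sqlContext; in your own driver application you create it yourself.
val sqlContext = new HiveContext(sc)

// Hive commands and SQL queries both run through the HiveContext and return DataFrames.
val df = sqlContext.sql("SELECT * FROM hivesampletable LIMIT 10")
df.show()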

HDInsight provides multiple end-user experiences for interacting with your cluster. It has a Spark Job Server (HDInsight spark job server) that allows you to remotely copy your jar to the cluster and submit it; this submission runs your driver application, so in your jar you have to create the SparkConf, SparkContext and HiveContext yourself. HDInsight also provides a notebook experience with Jupyter and Zeppelin. The Zeppelin notebook automatically creates the SparkContext and HiveContext for you; for Jupyter you must create them yourself. You can read more about Spark on HDInsight at HDInsight Spark Overview and HDInsight Spark Resource Management
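A minimal sketch of such a driver application (the object name and query are made up for illustration) that you would package into a jar and submit, or adapt for a notebook:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver application packaged into a jar and submitted to the cluster.
object MySparkDriverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MySparkDriverApp")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("SHOW TABLES").show()

    sc.stop()
  }
}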

 

I hope understanding the relationship between SparkConf, SparkContext, SQLContext, and HiveContext, and how a Spark driver application uses them, will make your Spark on HDInsight project a successful one!

 

 

Bill

 

 

 

