
How to use parameter substitution with Pig Latin and PowerShell


When running Pig in a production environment, you'll likely have one or more Pig Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to locate their input data based on when or where they are run. For example, you may have a Pig job that performs daily log ingestion by geographic region. It would be costly and error prone to manually edit the script to reference the location of the input data each time log data needs to be ingested. Ideally, you'd like to pass the date and geographic region to the Pig script as parameters at the time the script is executed. Fortunately, Pig provides this capability via parameter substitution. There are four different mechanisms to define parameters that can be referenced in a Pig Latin script:

  • Parameters can be defined as command line arguments; each parameter is passed to Pig as a separate argument using -param switches at script execution time
  • Parameters can be defined in a parameter file that's passed to Pig using the -param_file command line argument when the script is executed
  • Parameters can be defined inside Pig Latin scripts using the %declare preprocessor statement
  • Default parameter values can be defined inside Pig Latin scripts using the %default preprocessor statement

You can use none, one or any combination of the above options.

Let's look at an example Pig script that could be run to perform IIS log ingestion. The script loads and filters an IIS log, looking for requests that didn't complete with a status code of 200 or 201.

Note that parameter names in Pig Latin scripts are preceded by a dollar sign, $. For example, the LOAD statement references six parameters: $WASB_SCHEME, $ROOT_FOLDER, $YEAR, $MONTH, $DAY and $INPUTFILE.

Note also that the script makes use of the %default preprocessor statement to define default values for the WASB_SCHEME and ROOT_FOLDER parameters:
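As an illustration, a minimal sketch of such a script might look like the following; the storage account/container names and the position of the sc-status field are assumptions for this example, not the exact script from the original job:

%default WASB_SCHEME 'wasb://mycontainer@mystorageaccount.blob.core.windows.net';
%default ROOT_FOLDER '/iislogs';

-- The input path is assembled entirely from parameters
LOGS = LOAD '$WASB_SCHEME$ROOT_FOLDER/$YEAR/$MONTH/$DAY/$INPUTFILE' USING PigStorage(' ');

-- sc-status is assumed to be the 11th space-delimited field ($10); adjust for your log format
FAILED = FILTER LOGS BY (chararray)$10 != '200' AND (chararray)$10 != '201';

DUMP FAILED;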

Specifying Parameters in a Parameter File

Parameters are defined as key-value pairs. Below is an example parameter file that defines the four parameters referenced by the above script: YEAR, MONTH, DAY and INPUTFILE. The YEAR key has a value of 2014, the MONTH key has a value of 07, the DAY key has a value of 27, and the INPUTFILE key has a value of iis.log:
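With the values described above, the parameter file (for example, parameters.txt) simply contains one key-value pair per line:

YEAR=2014
MONTH=07
DAY=27
INPUTFILE=iis.log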

The Pig preprocessor locates parameters in the Pig script by searching for the parameter name prepended with a dollar sign, $, and substitutes the value of the key for the parameter. You can pass the parameter file to Pig using the -param_file command line argument:

pig -param_file d:\users\rdpuser\documents\parameters.txt -f d:\users\rdpuser\documents\LoadLog.pig

Specifying Parameters on the Command Line

The second method of passing parameters to your Pig script at execution time is to pass each parameter as a separate key-value pair using individual -param arguments. 

pig -param "YEAR=2014" -param "MONTH=07" -param "DAY=27"  -param "INPUTFILE=iis.log" -f  d:\users\rdpuser\documents\LoadLog.pig

 

Note: On Windows, key-value pairs must be enclosed in double quotes because the cmd shell treats the equals sign (=) as an argument delimiter.

Testing Parameter Substitution Using the -dryrun Command Line Option

Before submitting the Pig script to the cluster's Templeton endpoint for execution using PowerShell, let's make sure that parameter substitution will work as desired.  There's a useful Pig command line parameter, -dryrun, that can be used to test parameter substitution.  The -dryrun option directs Pig to substitute parameter values for parameters in the Pig script, write the resulting script to a file named <original_script_name>.substituted and shut down without executing the script. The best way to try -dryrun is to enable remote access to your cluster, and use RDP to log into your HDInsight cluster's active headnode.  Once you're logged in, you can execute PIG.CMD interactively as demonstrated below. Pig will report the name and location of the substituted file before it shuts down:

C:\apps\dist\pig-0.12.1.2.1.3.0-1887\bin>pig -param_file d:\users\rdpuser\documents\parameters.txt -param "MONTH=08" -param "DAY=24" -dryrun  -f d:\users\rdpuser\documents\LoadLog.pig

. . .

2014-08-24 15:58:37,625 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file D:\Users\dansha/.pigbootup not found
2014-08-24 15:58:37,638 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for MONTH. Using value 08
2014-08-24 15:58:37,638 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for DAY. Using value 24
2014-08-24 15:58:38,305 [main] INFO  org.apache.pig.Main - Dry run completed. Substituted pig script is at d:\users\dansha\documents\LoadLog.pig.substituted

Precedence Rules for Parameter Substitution

Note the "Warning" messages that showed up in the -dryrun output.  If a parameter is defined more than once, there are precedence rules that determine what the final value of the parameter will be.  The following precedence order is documented in the Pig parameter substitution documentation. The list is ordered from highest to lowest precedence. 

  1. Parameters defined using a declare preprocessor statement have the highest precedence
  2. Parameters defined on the command line using -param have the second highest precedence
  3. Parameters defined in parameter files have the third highest precedence
  4. Parameters defined using the default preprocessor statement have the lowest precedence

Given the above precedence rules, even though the MONTH and DAY parameters were defined in the parameter file, the individual command line parameters specified with the -param arguments overrode them.

Below please find the content of the LoadLog.pig.substituted file that was output by the -dryrun command. Note that all parameters were replaced with values. Some parameters were replaced by values specified in the parameter file, some were replaced by parameters passed via the -param argument, and others were replaced by parameters defined with the default preprocessor statements.

 

Submitting a Pig Job that Uses Parameters with PowerShell

Now, let's bring it all together with an example that demonstrates how to use the Azure HDInsight PowerShell cmdlets to submit a Pig job that uses command line parameters and a parameter file.

There are a couple of things in the script that are worthy of closer examination. First, if the job will reference any files, they need to be copied to one of the storage accounts the target HDInsight cluster is configured to use. This gives the Templeton server access to the files when setting the job up for execution. For the example we've been referring to, we needed to copy the Pig Latin script, LoadLog.pig, and the parameter file, Parameters.txt, to Azure blob storage using the Set-AzureStorageBlobContent cmdlet.

# Get storage context

$AzureStorageContext = New-AzureStorageContext -StorageAccountName $BlobStorageAccount -StorageAccountKey $PrimaryStorageKey

# Copy pig script and parameter file up to Azure storage where they can be accessed by the Templeton server while setting up the job for execution 

Set-AzureStorageBlobContent -File C:\src\Hadoop\Pig\LoadLog.pig -BlobType Block -Container $DefaultStorageContainer -Context $AzureStorageContext -Blob http://$BlobStorageAccount.blob.core.windows.net/$DefaultStorageContainer/$ScriptsFolder/$ScriptName

Set-AzureStorageBlobContent -File C:\src\Hadoop\Pig\ParamFile.txt -BlobType Block -Container $DefaultStorageContainer -Context $AzureStorageContext -Blob http://$BlobStorageAccount.blob.core.windows.net/$DefaultStorageContainer/$ScriptsFolder/$ParamFile

Passing Command Line Options via PowerShell

Passing parameters to Pig jobs via the PowerShell cmdlets can be a bit confusing, and we've received a number of inquiries about how to go about it. Keeping that in mind, the most important thing to call out from the job submission script is how to pass parameters to a Pig script using the -param and -param_file Pig command line arguments. Command line arguments are specified at the time the Pig job is defined with the New-AzureHDInsightPigJobDefinition cmdlet. A job's command line arguments must be passed to New-AzureHDInsightPigJobDefinition as an array of String objects using the -Arguments parameter. Each command line element that will be passed to Pig is stored as a separate array entry. This is straightforward for command line options that are "switches" with no associated arguments, like "-verbose", "-warning" and "-stop_on_failure"; each of these command line arguments is added as a separate entry to the $pigParams array:

$pigParams="-verbose","-warning","-stop_on_failure"

However, things get tricky when passing command line arguments that have associated values. Individual Pig parameters are passed using a -param command line argument followed directly by its associated key-value pair. The key-value pair is added to the $pigParams array as a separate, but adjacent, array entry.

For example, consider the first line of code below, where the INPUTFILE parameter is added to the $pigParams parameter array. First, the command line argument "-param" is added. Next, the key-value pair associated with the -param argument, "INPUTFILE=$InputFile", is added as the adjacent array entry. The pattern simply repeats for each successive command line parameter.

$pigParams+="-param","INPUTFILE=$InputFile"

$pigParams+="-param","MONTH=$Month"

$pigParams+="-param","DAY=$Day"

For the parameter file, the "-param_file" argument is added to the $pigParams array followed by a separate, but adjacent, array entry that specifies the parameter file name.  Finally, the $pigParams are passed to New-AzureHDInsightPigJobDefinition using the -Arguments parameter. 

$pigParams+="-param_file","$param_file"

# Create pig job definition

$pigJobDefinition = New-AzureHDInsightPigJobDefinition -File $PigScript -Arguments $PigParams

The job definition created by New-AzureHDInsightPigJobDefinition is then used by the Start-AzureHDInsightJob cmdlet to submit the Pig script to the Azure HDInsight cluster for execution:

$pigJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterDnsName -JobDefinition $pigJobDefinition

I hope this post clears up questions some have had about how to pass parameters to Pig jobs via PowerShell, and that you found it informative.  Please let us know how we are doing, and what kind of content you would like us to write about in the future.


How to use HBase Java API with HDInsight HBase cluster, part 1


Recently we worked with a customer who was trying to use the HBase JAVA API to interact with an HDInsight HBase cluster. Having worked with the customer and tried to follow our existing documentation here and here, we realized that it may be helpful if we clarify a few things around HBase JAVA API connectivity to an HBase cluster and show a simpler way of running a JAVA client application that uses the HBase JAVA APIs. In this blog, we will explain the recommended steps for using HBase JAVA APIs to interact with an HDInsight HBase cluster.

The Background:

Our existing documentation here does a nice job of explaining how to use Maven to develop a Java application and use the HBase JAVA API to interact with an HDInsight HBase cluster – but one may wonder why we are packaging the HBase Java client code as a MapReduce JAR and running the jar as a MapReduce job. This part begs a little more clarity. Remember that the HBase JAVA API uses RPC (Remote Procedure Call) to communicate with an HBase cluster, which means that the client application running HBase JAVA API code and the HBase cluster need to exist in the same network and subnet. In the absence of an Azure Virtual Network, aka VNet (I imagine the documentation was written before we introduced the capability of installing an HBase cluster in a Virtual Network), the example takes the approach of packaging the HBase client code as a MapReduce JAR and submitting it as a MapReduce job via WebHCat/Templeton. With this approach, the client Java JAR (containing HBase Java API calls) runs on one of the worker nodes in the HBase cluster and runs successfully. However, with the current capability of provisioning an HDInsight HBase cluster in a Virtual Network, as shown in this documentation, we feel that a more realistic and better approach for using the HBase JAVA APIs is to provision the HDInsight HBase cluster in a VNet, provision the client machine/VM in the same VNet, and then run the HBase Java API client on the client VM within the same VNet – this is shown in the diagram below –

We will touch on each of these steps below –

Provision HDInsight HBase cluster in a VNet:

You can follow our HDInsight HBase documentation which has very detailed steps on how we can do this either via Azure Portal or Azure PowerShell.

Provision a Microsoft Azure VM in the same VNet and subnet:

Following the same documentation above, provision a Microsoft Azure virtual machine in the same VNet and subnet as the HDInsight HBase cluster – A standard Windows Server 2012 image with a small VM size should be sufficient. Since we need JDK installed on the VM in order to use HBase JAVA API, we have found it convenient to use an Oracle JDK image from the gallery for our testing (this is not required though and may have special pricing), like below –

If you choose a standard windows server VM (that does not have JDK installed), you can install JDK from Zulu.

Get the DNS Suffix to build FQDN of ZooKeeper nodes:

When using the HBase Java API to connect to an HBase cluster remotely, we must use the fully qualified domain name (FQDN). To determine this, we need to get the connection-specific DNS suffix of the HDInsight HBase cluster. The documentation shows multiple ways to accomplish this. The simplest is to RDP into the HDInsight HBase cluster, execute ipconfig /all, and copy the connection-specific DNS suffix for the Ethernet adapter, as shown on the screenshot below –
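The relevant line of the ipconfig /all output will look similar to this (the suffix shown is from the example cluster used throughout this post):

   Connection-specific DNS Suffix  . : AzimHbaseTest.d3.internal.cloudapp.net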

So, in my cluster as shown above, the connection-specific DNS suffix is AzimHbaseTest.d3.internal.cloudapp.net. Please make a note of the value from your cluster; we will use it to build the ZooKeeper FQDNs in the next section. To verify that the virtual machine can communicate with the HBase cluster, run ping headnode0.<dns suffix> from the virtual machine, as shown below –

C:\Users\DBAdmin>ping headnode0.AzimHbaseTest.d3.internal.cloudapp.net
Pinging headnode0.AzimHbaseTest.d3.internal.cloudapp.net [10.0.0.6] with 32 bytes of data:
Reply from 10.0.0.6: bytes=32 time=3ms TTL=128
…...
 

 Develop/Test HBase JAVA API Client on the Azure VM:

Our documentation has detailed steps of how to use Maven to develop a Java Client using HBase JAVA APIs and we don't want to repeat all the steps here – but we would like to share our own experience and show a few different ways we can use Maven for developing the JAVA client. A few options (not limited to) are –

  1. Use Maven command line and a JAVA IDE (like Eclipse, IntelliJ etc)
  2. Use a JAVA IDE (that comes integrated with Maven) like Eclipse to develop the JAVA client

Using Maven command line and a Java IDE:

Note: I am using IntelliJ as an example - you can use your preferred JAVA IDE. Also, the steps below assume that you have installed IntelliJ on the Azure VM (client).

  1. From the command-line on your Azure VM, go to the folder where you wish to create the project. For example, cd C:\Maven\MavenProjects
  2. Use the mvn command, which is installed with Maven, to generate the project template, as shown below –

    mvn archetype:generate -DgroupId=com.microsoft.css -DartifactId=HBaseJavaApiTest -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false 

    This will create the src directory and POM.xml in the directory HbaseJavaApiTest (same as artifactId)

  3. Start the JAVA IDE IntelliJ and select 'Import Project' and point to the POM.xml created in the last step, as shown below –

     

  4. On the next window, in addition to the default settings enabled, also select the options 'Import Maven Projects automatically' and automatically download 1)sources and 2)documentation, as shown below –

     

  5. Select the default options on the next windows and the project will open inside IntelliJ – add the necessary JAVA source files and remove the 'test' folder if you don't plan to use it. In our case, we just tested the CreateTable.java from the above documentation page and it looks something like this –

     

  6. Modify the POM.xml file as shown in the documentation– something like this –
  7. Create a new directory named conf in the HbaseJavaApiTest directory. In the conf directory, create a new file named hbase-site.xml that uses the ZooKeeper FQDNs built from the DNS suffix you noted previously (a sketch of this file appears after this list):
  8. Open a command prompt and change directories to the HbaseJavaApiTest directory. Use the following command to build a JAR containing the application:

    mvn clean package 

    This will clean any previous build artifacts, download any dependencies that have not already been installed, then build and package the application. The command will create a jar file HBaseJavaApiTest-1.0-SNAPSHOT.jar in the directory HbaseJavaApiTest\target.
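For reference, a minimal hbase-site.xml for the conf directory created in step 7 might look like the following; the zookeeper0/1/2 host names are built from the DNS suffix of the example cluster above, so substitute the suffix from your own cluster:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper0.AzimHbaseTest.d3.internal.cloudapp.net,zookeeper1.AzimHbaseTest.d3.internal.cloudapp.net,zookeeper2.AzimHbaseTest.d3.internal.cloudapp.net</value>
  </property>
</configuration>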

 

Using Eclipse IDE to develop and build the HBase JAVA client application:

You can use the same steps as above for generating the project template using Maven and then import the project (POM.xml) in Eclipse. Alternatively, you can use the Eclipse IDE itself (without using Maven command line) to create the Maven project, as shown below-

1. Install a recent package of 'Eclipse IDE for Java EE Developers', such as Kepler SR2 or Luna SR1

2. Open the Eclipse IDE and select File -> New -> Project -> Maven -> New Maven Project. Leave the default options and enter the GroupId and ArtifactId

This will create a vanilla Maven project. You can then modify the project to add dependencies to pom.xml and add/modify the source code, like CreateTable.java, etc. When loading the project in the Eclipse IDE, you may notice errors such as "Missing artifact jdk.tools:jdk.tools.jar:1.7". This can be fixed either by modifying eclipse.ini to add the -vm argument pointing to the JDK\bin directory, or by including the following dependency within pom.xml -

    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>${java.version}</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>

This is due to a limitation with Maven support on Eclipse IDE. It is documented here. Once the above changes are done, you can build the project and run it from within Eclipse IDE and debug as needed.
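If you go the eclipse.ini route instead, the -vm entry might look like the following (the JDK path is an example for your installation, and the -vm lines must appear before -vmargs):

-vm
C:\Program Files\Java\jdk1.7.0_55\bin\javaw.exe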

Running the HBase JAVA API Client on Azure VM:

If you have made your JAR an executable one using a Maven build plugin (see the POM.xml file above) like this –

<plugin>
  <!-- Build an executable JAR -->
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <version>2.4</version>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <mainClass>com.microsoft.css.CreateTable</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
 

You can run the executable JAR from a command line. Change directory to HbaseJavaApiTest\target and run the following command –

java -jar HBaseJavaApiTest-1.0-SNAPSHOT.jar 

Alternatively, you can test and debug the code within the IDE itself, by setting a breakpoint and stepping through the code, as shown in the screenshot below-

 

I hope you find this blog helpful in using the HBase JAVA API to interact with an HDInsight HBase cluster; we would love to hear your feedback! In part 2, we will discuss some troubleshooting tools you can use for an HBase JAVA API client application.

Thanks Farooq for reviewing this!

- Azim Uddin and Dharshana Kumar

Some Commonly Used Yarn Memory Settings


We were recently working on an out-of-memory issue that was occurring with certain workloads on HDInsight clusters. I thought it might be a good time to write on this topic based on our recent experience troubleshooting memory issues. There are a few memory settings that can be tuned to suit your specific workloads. The nice thing about some of these settings is that they can be configured either at the Hadoop cluster level or for specific queries known to exceed the cluster's memory limits.

Some Key Memory Configuration Parameters

So, as we all know by now, Yarn is the new data operating system that handles resource management and serves batch workloads that use MapReduce, as well as other interactive and real-time workloads. There are memory settings that can be set at the Yarn container level and also at the mapper and reducer level. Memory is requested in increments of the Yarn container size. Mapper and reducer tasks run inside a container. Let us introduce some parameters here and understand what they mean.

mapreduce.map.memory.mb and mapreduce.reduce.memory.mb

Description: Upper memory limit for a map/reduce task; if the memory consumed by the task exceeds this limit, the corresponding container will be killed.

These parameters determine the maximum amount of memory that can be assigned to mapper and reducer tasks, respectively. Let us look at an example to understand this well. Say a Hive job runs on the MR framework and needs a mapper and a reducer. The mapper is bound by the upper memory limit defined in the configuration parameter mapreduce.map.memory.mb. However, if the value of yarn.scheduler.minimum-allocation-mb is greater than the value of mapreduce.map.memory.mb, then yarn.scheduler.minimum-allocation-mb is respected and containers of that size are given out.

This parameter needs to be set carefully, as it restricts the amount of memory that a mapper/reducer task has to work with; if not set properly, it can lead to slower performance or OOM errors. Also, if it is set to a large value that is not typically needed for your workloads, it can reduce concurrency on a busy system, as fewer applications can run in parallel due to larger container allocations. It is important to test workloads, set this parameter to an optimal value that serves most workloads, and tweak the value at the job level for unique memory requirements. Please note that for Hive queries on Tez, the configuration parameters hive.tez.container.size and hive.tez.java.opts can be used to set the container limits. By default these values are set to -1, which means they default to the mapreduce settings; however, there is an option to override this at the Tez level. Shanyu's blog covers this in greater detail.
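For example, a single Hive-on-Tez query could override the Tez container size and its JVM heap with set statements like these (the values are illustrative only):

set hive.tez.container.size=2048;
-- keep the heap (-Xmx) roughly 80% of the container size
set hive.tez.java.opts=-Xmx1640m;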

How to set this property: Can be set at site level with mapred-site.xml. This change does not require a service restart.

yarn.scheduler.minimum-allocation-mb

Description: This is the minimum allocation size of a container. All memory requests will be handed out as increments of this size.

Yarn uses the concept of containers for acquiring resources in increments. The minimum allocation size of a container is determined by the configuration property yarn.scheduler.minimum-allocation-mb; this is the smallest unit of container memory that can be granted. So, even if a certain workload needs only 100 MB of memory, it will still be granted 512 MB if that is the minimum size defined by the yarn.scheduler.minimum-allocation-mb property for the cluster. This parameter needs to be chosen carefully based on your workload to ensure that it provides the performance you need while also utilizing the cluster resources well. If the memory granted to a container cannot accommodate what a task actually needs, the task will fail with out-of-memory errors. The error below is from my cluster, where a container oversubscribed its memory, so the NodeManager killed the container; you will notice an error message like the one below in the logs. In that situation I would need to look at my workload and increase the value of this configuration property to allow my workloads to complete without OOM errors.

Vertex failed,vertexName=Reducer 2, vertexId=vertex_1410869767158_0011_1_00,diagnostics=[Task failed, taskId=task_1410869767158_0011_1_00_000006,diagnostics=[AttemptID:attempt_1410869767158_0011_1_00_000006_0
Info:Container container_1410869767158_0011_01_000009 COMPLETED with diagnostics set to [Container [pid=container_1410869767158_0011_01_000009,containerID=container_1410869767158_0011_01_000009] is running beyond physical memory limits.
Current usage: 512.3 MB of 512MB physical memory used; 517.0 MB of 1.0 GB virtual memory used. Killing container.
Dump of the process-tree for container_1410869767158_0011_01_000009 :
|- PID CPU_TIME(MILLIS) VMEM(BYTES) WORKING_SET(BYTES)
|- 7612 62 1699840 2584576
|- 9088 15 663552 2486272
|- 7620 3451328 539701248 532135936
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137

How to set this property: Can be set at the site level in yarn-site.xml. This change needs a recycle of the RM service. On HDInsight, this needs to be done when provisioning the cluster with custom configuration parameters.

yarn.scheduler.maximum-allocation-mb

Description: This is the maximum allocation size allowed for a container.

This property defines the maximum memory that can be granted for a single container request. Again, this needs to be chosen carefully: if the value is not large enough to accommodate what a task requests, the resource request will be rejected. Say mapreduce.map.memory.mb is set to 1024 and yarn.scheduler.maximum-allocation-mb is set to 300; this leads to a problem because the maximum allocation possible for a container is 300 MB, but the mapper task needs 1024 MB as defined by the mapreduce.map.memory.mb setting. Here is the error you would see in the logs:

org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1024, maxMemory=300

How to set this property: Can be set at the site level in yarn-site.xml. This change needs a recycle of the RM service. On HDInsight, this needs to be done when provisioning the cluster with custom configuration parameters.

mapreduce.reduce.java.opts and mapreduce.map.java.opts

Description: These allow you to configure the maximum and minimum JVM heap size for map and reduce tasks. Use -Xmx for the maximum and -Xms for the minimum.

These values need to be less than the upper bounds for the map/reduce tasks defined in mapreduce.map.memory.mb/mapreduce.reduce.memory.mb, as the heap must fit within the memory allocation for the task. These configuration parameters specify the amount of heap space available to the JVM process working within a container. If they are not properly configured, you will see Java heap space errors like the ones shown below.

You can get the Yarn logs by executing the following command, feeding in the applicationId.
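On the cluster this looks something like the following, where the application ID comes from the job being investigated and the output is redirected to a text file:

yarn logs -applicationId application_1410959438010_0051 > heaperr.txt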

Now, let us look at the relevant error messages from the Yarn logs that have been extracted to heaperr.txt by the command above –

2014-09-18 14:36:32,303 INFO [IPC Server handler 29 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1410959438010_0051_r_000004 asked for a task
2014-09-18 14:36:32,303 INFO [IPC Server handler 29 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1410959438010_0051_r_000004 given task: attempt_1410959438010_0051_r_000000_1
2014-09-18 14:41:11,809 FATAL [IPC Server handler 18 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1410959438010_0051_r_000000_1 - exited : Java heap space
2014-09-18 14:41:11,809 INFO [IPC Server handler 18 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1410959438010_0051_r_000000_1: Error: Java heap space
2014-09-18 14:41:11,809 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1410959438010_0051_r_000000_1: Error: Java heap space

As we can see, reduce attempt 1 exited with an error due to insufficient Java heap space. I can tell that this is a reduce attempt because of the letter 'r' in the task attempt ID highlighted here - jvm_1410959438010_0051_r_000004; it would be the letter 'm' for a mapper. This reduce task will be tried 4 times, as defined by the mapred.reduce.max.attempts config property in mapred-site.xml, and in my case all four attempts failed with the same error. When you are testing a workload to determine memory settings on a standalone one-node box, you can reduce the number of attempts by lowering mapred.reduce.max.attempts, then find the right amount of memory the workload needs by tweaking the different memory settings until you arrive at the right configuration for the cluster. From the above output it is clear that a reduce task has a problem with the available heap space, and I could solve this issue by increasing the heap space with a set statement just for this query, as most of my other queries were happy with the default heap space defined in mapred-site.xml for the cluster. The committed heap bytes counter can be used to look at the heap memory that the job eventually consumed; it is accessible from the job counters page.

How to set this property: Can be set at site level with mapred-site.xml, or can be set at the job level. This change does not require a service restart.
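For instance, the failing reduce task above could be given more heap for just that Hive query with set statements similar to these (the values are illustrative; keep the heap below the container limit):

set mapreduce.reduce.memory.mb=2048;
set mapreduce.reduce.java.opts=-Xmx1640m;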

yarn.app.mapreduce.am.resource.mb

Description: This is the amount of memory that the Application Master for the MR framework needs.

Again, this needs to be set with care: a larger allocation for the AM means less concurrency, as you can spin up only so many AMs before exhausting the containers on a busy system. This value also needs to be less than what is defined in yarn.scheduler.maximum-allocation-mb; if not, it will create an error condition – example below.

2014-10-23 13:31:13,816 ERROR [main]: exec.Task (TezTask.java:execute(186)) - Failed to execute tez graph.
org.apache.tez.dag.api.TezException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=699
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:385)

How to set this property: Can be set at site level with mapred-site.xml, or can be set at the job level. This change does not require a service restart.

This brings us to the end of our tour of some of the most commonly used memory configuration parameters in Yarn. There is a nice reference from Hortonworks here that describes a tool that gives best-practice suggestions for memory settings and also goes over how to set these values manually. It can be used as a starting point for testing your workloads and tuning them iteratively from there to suit your specific needs. Hope you found some of these nuggets helpful. We will try to explore some more tuning parameters in future posts.

-Dharshana

@dharshb

Thanks to Dan and JasonH for reviewing this post!

Loading data in HBase Tables on HDInsight using the built-in ImportTsv utility


Apache HBase can give random access to very large tables: billions of rows by millions of columns. But the question is, how do you get that kind of data into HBase tables in the first place? HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.

Overview

HBase ships with a built-in ImportTsv utility, and in many cases it is much faster and easier to upload data into HBase using ImportTsv than with other methods. As the name suggests, the ImportTsv tool lets you upload data in TSV format into HBase. In a TSV file each field value of a record is separated from the next by a tab character. However, the tool has an importtsv.separator option that allows you to specify a different separator if the fields are delimited by something other than a tab – for example pipes or commas. ImportTsv has two distinct usages.

  1. Loading data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)
  2. Preparing StoreFiles to be loaded via completebulkload (bulk loading)

If you don't have a huge amount of data, you can upload directly into HBase via Puts (#1). Bulk loading (#2), on the other hand, comes in handy when you have a huge amount of data to upload. Bulk loading will be faster as it uses less CPU and network resources than simply using the HBase API. However, keep in mind that bulk loading bypasses the write path: the Write Ahead Log (WAL) doesn't get written to as part of the process, which can cause issues for some other processes, for example replication. To find out more about HBase bulk loading please review the Bulk Loading page in the Apache HBase reference guide. The HBase bulk load process consists of two main steps.

  1. The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
  2. After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload.

 

Examples

HBase in HDInsight (Hadoop in Microsoft Azure) is, at its core, the same as HBase in any other environment. However, someone not familiar with the Microsoft Azure environment may get stuck on some minor differences when interacting with an HBase cluster in HDInsight. This is why the examples provided in this blog are specific to HBase clusters in HDInsight, and I hope they will make your experience with HBase clusters in HDInsight smoother. We will provide detailed steps for both usage scenarios of the ImportTsv utility.

Prerequisites

Before uploading the data to HBase we need to move the data to Windows Azure Storage Blob (WASB) first, and we also need to create an empty HBase table to load the data into. So let's do the following steps to get ready to upload the data into HBase using the ImportTsv utility.

  1. For this blog we will use the sample data.tsv shown below, where each field in a row is separated by a tab.

    row1    c1    c2

    row2    c1    c2

    row3    c1    c2

    row4    c1    c2

    row5    c1    c2

    row6    c1    c2

    row7    c1    c2

    row8    c1    c2

    row9    c1    c2

    row10    c1    c2

    row11    c1    c2

    row12    c1    c2

  2. Follow any of the methods/tools described in the Upload data for Hadoop jobs in HDInsight Azure document to upload the data.tsv file to WASB. For example, I used the PowerShell script sample provided in the above link to upload data.tsv to example/data/data.tsv, and used the Azure Storage Explorer tool to verify that the file was uploaded to the right location.
  3. Now we need to create the table from the HBase shell. We will call the table 't1' and our row key will be the first column. We will put the two remaining columns in a column family called 'cf1'.

    If you are preparing a lot of data for bulk loading, you need to make sure the target HBase table is pre-split appropriately. The best practice when creating a table is to split it according to the row key distribution. If your rowkeys start with a letter or number, you can split your table at letter or number boundaries. In our sample data.tsv file we only have 12 rows but we will use three splits just to show how it works.

    To open the HBase shell we need to RDP to the head node, open the Hadoop command line, navigate to %hbase_home%\bin and then type the following.

    C:\apps\dist\hbase-0.98.0.2.1.6.0-2103-hadoop2\bin>hbase shell

    Then run the following from Hbase shell to create the table with 3 splits.

    hbase(main):008:0> create 't1', {NAME => 'cf1'}, {SPLITS => ['row5', 'row9']}

  4. Now let's browse to the HBase dashboard from the head node, using the link below, to check the table we just created.

    http://zookeeper2.MyHbaseCluster.d3.internal.cloudapp.net:60010/master-status

    In the dashboard go to the Table Details tab and you will see the list of all tables, including the one we just created, 't1'. The names of all the tables are hyperlinked. Click 't1' and you should be able to view the three regions and other details as shown in the screenshot below.

Usage 1: Upload the data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)

Open a new Hadoop command line and type 'cd %hbase_home%\bin' to navigate to the HBase home, then run the following to upload the data from the tsv file data.tsv in HDFS to the HBase table t1.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" t1 /example/data/data.tsv

Note: If the fields in the file were separated by a comma instead of a tab, and the corresponding file name were data.csv, then we would have used the following to upload the data to the HBase table 't1', where the comma separator (",") is specified using the importtsv.separator option.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," t1 /example/data/data.csv

To verify that the data is uploaded open HBase shell again and run the following.

scan 't1'

You should see the rows as below.

Usage 2: Preparing StoreFiles to be loaded via the completebulkload (bulk Loading).

We will use the same table 't1' to bulk load the data from the same input file. So let's disable, drop and recreate table 't1' from HBase shell as shown in the screen shot below. Our input data file data.tsv will remain in the same location in WASB.

Now that table 't1' is recreated let's follow the steps to prepare StoreFiles and then load them to the Hbase table via the completebulkload tool.

  1. Run the following to transform the data file into StoreFiles and store them at the relative path specified by -Dimporttsv.bulk.output.

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.tsv

    You should see the output as below in WASB (this screen shot is taken using Azure Storage Explorer). Notice there are three files under "example/data/storeDataFileOutput/cf1/", one per region.

Note: If the fields in the file were separated by a comma instead of a tab, and the corresponding file name were data.csv, then we would have used the following.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.csv

  2. Now we need to use the completebulkload tool to complete the bulk upload. Run the following to upload the data from the HFiles located at /example/data/storeDataFileOutput to the HBase table t1.

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /example/data/storeDataFileOutput t1

Again, to verify that the data is uploaded, open the HBase shell and run the following.

scan 't1'

You can also use Hive and Pig to upload data into HBase tables on HDInsight; I intend to blog about those in the future. That is it for today, and I hope it was helpful.

Problems When Using a Shared Default Storage Container with Multiple HDInsight Clusters


We have seen several cases come in to Microsoft Support that ended up being caused by having multiple HDInsight clusters using the same Azure Blob Storage container for default storage. While we don't currently block you from creating clusters using the same default storage container, we do know that this can cause some specific problems. Many folks have been asking whether this configuration is supported, and the short answer is that it is not.

When it comes to determining whether a particular setup is supportable, we typically look at whether the configuration is tested and proven to work reliably. Since HDInsight is based on Apache Hadoop, this is obviously a bit more complex. If you look out into the Hadoop ecosystem there is not much precedent for primary storage being shared between multiple clusters. It just happens to be easy to manually configure HDInsight clusters in this way, and some customers have chosen to do so because it provides convenient access to shared data in the container. The problems may not manifest for many days or weeks, depending on some specific timing conditions on job completion and background maintenance, so it can appear to be working just fine for a while.

The types of problems that we have seen center around errors retrieving job status, which can cascade into unexpected errors, hangs or delays in Hive, Pig, WebHCat/Templeton, and Oozie. Each of these frameworks has different error handling and retry logic so the ways in which the problems surface are very broad.

What this means is that if you are using a shared default container between multiple HDInsight clusters and you call in to support, we will ask you to eliminate the shared default container configuration as a first troubleshooting step.

If you need to use a shared container to provide access to data for multiple HDInsight clusters then you should add it as an Additional Storage Account in the cluster configuration. This option is available when using the Azure Portal, PowerShell (Add-AzureHDInsightStorage), or the SDK (AdditionalStorageAccounts) to provision clusters.
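For example, using the same configuration cmdlets shown elsewhere on this blog, a shared account can be attached as additional (non-default) storage at provisioning time roughly like this (the account name and key are placeholders):

$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$config = Add-AzureHDInsightStorage -Config $config `
              -StorageAccountName "sharedstorage.blob.core.windows.net" `
              -StorageAccountKey "PLACEHOLDER_KEY"
# ... set the default storage account, then pass $config to New-AzureHDInsightCluster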

Note: For detailed information about how HDInsight uses Blob storage check out: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/

Azure PowerShell 0.8.14 Released, fixes problems with pipelining HDInsight configuration cmdlets


We recently pushed out the 0.8.14 release of Azure PowerShell. This release includes some updates to the following cmdlets to ensure that values passed in via the PowerShell pipeline, or via the -Config parameter, are maintained:

  • Set-AzureHDInsightDefaultStorage
  • Add-AzureHDInsightStorage
  • Add-AzureHDInsightMetastore

Previously if you had done something like:

$myconfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2 -ClusterType HBase
$myconfig = Set-AzureHDInsightDefaultStorage -Config $myconfig `
                  -StorageContainerName "somecontainer" `
                  -StorageAccountName "somedefaultstorage.blob.core.windows.net" `
                  -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg=="
$myconfig = Add-AzureHDInsightStorage -Config $myconfig `
                  -StorageAccountName "someaddedstorage.blob.core.windows.net" `
                  -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg=="
$myconfig = Add-AzureHDInsightMetastore -Config $myconfig `
                  -Credential (Get-Credential) -DatabaseName "somedatabase" `
                  -MetastoreType HiveMetastore -SqlAzureServerName "someserver"
$myconfig | Format-Custom # This is where you would usually call New-AzureHDInsightCluster

or

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2 -ClusterType HBase | `
Set-AzureHDInsightDefaultStorage -StorageContainerName "somecontainer" `
                  -StorageAccountName "somedefaultstorage.blob.core.windows.net" `
                  -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" | `
Add-AzureHDInsightStorage -StorageAccountName "someaddedstorage.blob.core.windows.net" `
                  -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" | `
Add-AzureHDInsightMetastore -Credential (Get-Credential) -DatabaseName "somedatabase" `
                  -MetastoreType HiveMetastore -SqlAzureServerName "someserver" | `
Format-Custom # This is where you would usually call New-AzureHDInsightCluster

You would have found that some elements, like the initial ClusterType of "HBase", would have been lost from the configuration. These values will now be maintained as you add elements to the configuration. This should also address some scenarios where people have found that they needed to set options in a particular order for them to be maintained.

Side Note: Passing a configuration object to Format-Custom before using it for New-AzureHDInsightCluster is a great way to troubleshoot whether the configuration object is set up as you expect.

Sqoop Job Performance Tuning in HDInsight (Hadoop)


Overview

Apache Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. HDInsight is a Hadoop cluster deployed in Microsoft Azure, and it includes Sqoop. When transferring small amounts of data, Sqoop performance is not an issue. However, when transferring huge amounts of data, it is important to consider the things that can improve performance and keep the execution time within the desired limit.

Increase the number of parallel tasks by using an appropriate value for the -m parameter

A Sqoop job essentially boils down to a bunch of map tasks (there is no reducer), so the performance tuning of any Sqoop job is much the same as optimizing a map-reduce job, or at least that is where one should start. Therefore, the first thing to consider when improving the performance of a Sqoop job is to increase the number of parallel tasks; in other words, increase the number of mappers to utilize the maximum available resources in the cluster. This may require some experimentation given the user's dataset and the system in which Sqoop is running. The argument is "-m, --num-mappers". By default -m is set to 4, so if not specified, Sqoop will use only four map tasks in parallel. In general you will want to use a higher value of -m to increase the degree of parallelism and hence the performance. However, it is not recommended to increase the degree of parallelism beyond the resources available in the cluster, because the extra mappers will run serially and will likely increase the amount of time required to complete the job.
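As an illustration, an export with 128 parallel map tasks might be launched with a command along these lines (the connection string, table name and export directory are placeholders, not values from the case described later in this post):

sqoop export --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=myuser@myserver;password=mypassword" --table mytable --export-dir /example/data/output -m 128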

Now the question arises: how do you determine the right value for -m? Actually, there is no perfect way to find that magic number, but you can determine an approximate range based on your cluster size and test to find out what gives you the best results. Hadoop 2.x (HDI 3.x) uses YARN, and each Yarn task is assigned a container which has a memory limit; in other words, each mapper requires a container to run. So if we can get a rough estimate of the maximum number of containers available in your cluster, then you can use that number for -m as a starting point, assuming there is no other job running in the cluster and you do not want to run multiple sets of mappers in serial. The number of available containers in a cluster depends on a few configuration settings. Based on the HDInsight release notes, the following are the default settings in HDInsight for the mapper, reducer and AM (Application Master) as of the 10/7/2014 release:

mapreduce.map.memory.mb = 768

mapreduce.reduce.memory.mb = 1536

yarn.app.mapreduce.am.resource.mb = 768

Currently each data node of an HDInsight cluster uses a Large size Azure PaaS VM, which has 4 cores and 7 GB of RAM. Of that, about 1 GB is used by the node manager (the NodeManager daemon's heap size is set via yarn-env.sh, through its YARN_NODEMANAGER_HEAPSIZE env-var). So with the remaining 6 GB you can have a maximum of (6*1024)/768 = 8 mapper containers per worker node. The reducers are configured to use twice as much memory (1536 MB) as mappers, but in a Sqoop job there is no reducer. Let's assume we have a 16-node cluster and there is no other job running. The total number of available containers, or the maximum number of parallel map tasks we can have, is 8x16=128. So if you do not want to run multiple sets of map tasks in serial and there is no other job running in the cluster, we can set -m to 128.

Use a smaller fs.azure.block.size to increase the number of mappers further

However, the value passed for the -m parameter is only a guide, and the actual number of mappers may be different based on other factors such as input file size and count, dfs.block.size (which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB, and set to 512 MB by default), max split size, etc. If the individual input files are smaller than the block size, we will have one map task for each input file. However, if an input file is bigger than the block size, the number of mappers for that input file will be (file size / block size). Therefore, if you have resources available in the cluster, you can try to increase the number of mappers by setting a smaller value for the block size (fs.azure.block.size in WASB) and see if that improves performance; this is the second thing you should consider when tuning the performance of a Sqoop job. For Hive or map-reduce jobs we can set the fs.azure.block.size property to a different value while running the job. Unfortunately, the HDInsight PowerShell cmdlet New-AzureHDInsightSqoopJobDefinition doesn't include the [-Defines <Hashtable>] parameter, which allows Hadoop configuration values to be set at job execution time. However, we can always provision a customized HDInsight cluster and set fs.azure.block.size to a smaller value when creating the HDInsight cluster if needed. This will change the default at the cluster level, and that value will be used for all jobs running in the cluster.
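As a sketch, and assuming your Azure PowerShell version includes the Add-AzureHDInsightConfigValues cmdlet, the cluster-level default could be overridden at provisioning time roughly like this (the 128 MB value and the names are only examples):

# Build a cluster config and override fs.azure.block.size in core-site (value is illustrative)
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 16
$config = Add-AzureHDInsightConfigValues -Config $config -Core @{ "fs.azure.block.size" = "134217728" }
# ... add default storage and other settings, then pass $config to New-AzureHDInsightCluster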

Is my cluster too small to handle the data?

After you have tested enough to find optimum values for the -m parameter and fs.azure.block.size and feel that you can't improve the performance of your Sqoop job any further, then maybe it is time to think about increasing the cluster resources, in other words increasing the size of your cluster; this is the third thing you should consider when tuning Sqoop job performance. You should especially consider this when you are transferring a huge amount of data and need to bring down the execution time significantly. To give you some idea, I recently worked on a case where the customer wanted to export about 75 GB of data and each input file was ~500 MB. Initially we used a 24-node cluster with -m set to 160, and the export took ~58 hours to complete. Then we tested with a 36-node cluster, setting -m to 300, and it took ~34 hours to complete. This customer didn't want to try a smaller value for fs.azure.block.size, as they didn't want to change the default at the cluster level. If you are transferring a reasonably huge amount of data, you should start with a reasonably sized cluster even before starting to play with the -m or fs.azure.block.size parameters. I hope the example of my customer's data and cluster size gives you some idea in that regard.

Is the database a bottleneck?

Increasing the cluster size or the degree of parallelism will not improve performance indefinitely. For example, if you increase the degree of parallelism beyond what your database can reasonably support, it will not improve the overall performance. Sqoop exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database, and these have separate transactions from one another. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result. This brings us to the fourth thing you should consider while tuning Sqoop job performance, and that is to check whether the database is the bottleneck. If logs captured from the database side show that is indeed the case, then you need to figure out if there is a way to scale up the database capabilities. The customer I mentioned earlier was using Azure SQL Database, and we found that the database was a significant bottleneck for performance. We scaled up their Azure SQL Database to a higher performance level with more Database Throughput Units (DTUs), and as a result we were able to improve the overall performance by 50%. The Azure SQL Database Service Tiers and Performance Levels MSDN article has more information on the different scale-up options for Azure SQL Database.

Is the storage a bottleneck?

HDInsight uses Windows Azure Blob Storage, WASB, for storing the data, and this Azure document details the benefits of using WASB. However, the HDInsight cluster can be throttled when the throughput rate of Windows Azure Storage blob exceeds the limits detailed in this blog post. Therefore, when running Sqoop jobs in an HDInsight cluster, another potential bottleneck is the WASB throughput, and that is the fifth thing you should consider while tuning the performance of a Sqoop job. You can use the Windows Azure Storage Log Analysis Tool detailed in this blog post to determine whether that is the case and then take appropriate measures to mitigate it. While importing data into WASB you also want to make sure the data size does not cross the WASB block size limits described in this MSDN document; otherwise you may see an error like the one below.

Caused by: com.microsoft.windowsazure.services.core.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.

Two other scenario specific Sqoop performance tips

Let's briefly discuss two other scenario-specific Sqoop performance tips. For Sqoop export you can use the --batch argument, which uses batch mode for the underlying statement execution and thus may improve performance (the number of records batched per statement can be tuned with the sqoop.export.records.per.statement property, for example 200 or higher). If the table has too many columns and you use a higher batch value, you may end up seeing OOM errors. The second tip is specific to running Sqoop jobs from Oozie. Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time it starts a Sqoop job. When launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache. Oozie localizes the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuses the jars for subsequent jobs. Using the --skip-dist-cache option in the Sqoop command when launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache and saves significant I/O.

Conclusion

I am sure there are other ways one can think of optimizing the performance of a Sqoop job in HDInsight. I tried to cover the main ones in this blog post and I hope it either helps you to improve the performance of your Sqoop job or at least serves as a starting point for you.

References:

Apache Sqoop User Guide (v1.4.5)

 

Understanding HDInsight Custom Node VM Sizes


With the 02/18/2015 update to HDInsight and Azure PowerShell 0.8.14, we introduced a lot more options for configuring a custom Head Node VM size as well as Data Node and Zookeeper VM sizes. Some workloads can benefit from increased CPU performance, increased local storage throughput, or larger memory configurations. You can only select custom node sizes when provisioning a new cluster, whereas you are able to change the number of Data Nodes on a running cluster with the Cluster Scaling feature.

We've seen some questions come up regarding how to properly select the various sizes, and what that means for the availability of memory within the cluster.

If you are creating an HDInsight cluster through the Azure Management Portal, you can select custom VM sizes by using the "Custom Create" option.

In the second page of the wizard, you will have the option to customize the Head Node and Data Node Size for Hadoop clusters, the Head Node, Data Node and Zookeeper Sizes for HBase clusters, and the Nimbus Node, Supervisor Node, and Zookeeper Sizes for Storm Clusters:

Portal Snapshot Node Sizes

If you are using Azure PowerShell, you can specify the node sizes directly in New-AzureHDInsightCluster like:

# Get HTTP Services Credential
$cred = Get-Credential -Message "Enter Credential for Hadoop HTTP Services"
# Set Cluster Name
$clustername = "mycustomizedcluster"
# Get Storage Account Details
$hdistorage = Get-AzureStorageAccount 'myazurestorage'
# Create Cluster
New-AzureHDInsightCluster -Name $clustername -ClusterSizeInNodes 2                    `
                                             -HeadNodeVMSize "A7"                     `
                                             -DataNodeVMSize "Standard_D3"            `
                                             -ZookeeperNodeVMSize "A7"                `
                                             -ClusterType HBase                       `
    -DefaultStorageAccountName $hdistorage.StorageAccountName                         `
    -DefaultStorageAccountKey                                                         `
     (Get-AzureStorageKey -StorageAccountName $hdistorage.StorageAccountName).Primary `
    -DefaultStorageContainerName $clustername                                         `
    -Location $hdistorage.Location -Credential $cred

or you can include them in New-AzureHDInsightClusterConfig like:

# Get HTTP Services Credential
$cred = Get-Credential -Message "Enter Credential for Hadoop HTTP Services"
# Set Cluster Name
$clustername = "mycustomizedcluster"
# Get Storage Account Details
$hdistorage = Get-AzureStorageAccount 'myazurestorage'
# Set up new HDInsightClusterConfig
$hdiconfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2         `
                                             -HeadNodeVMSize "A7"          `
                                             -DataNodeVMSize "Standard_D3" `
                                             -ZookeeperNodeVMSize "Large"  `
                                             -ClusterType HBase
# Add other options to hdiconfig
$hdiconfig = Set-AzureHDInsightDefaultStorage -StorageContainerName $clustername     `
    -StorageAccountName $hdistorage.StorageAccountName                               `
    -StorageAccountKey                                                               `
    (Get-AzureStorageKey -StorageAccountName $hdistorage.StorageAccountName).Primary `
    -Config $hdiconfig
# Create Cluster
New-AzureHDInsightCluster -Config $hdiconfig -Name $clustername `
    -Location $hdistorage.Location -Credential $cred

Note: The cmdlets only understand the "HeadNodeVMSize" and "DataNodeVMSize" parameters, but the naming conventions for these nodes are a bit different for HBase and Storm clusters. HeadNodeVMSize is used for the Hadoop Head Node, HBase Head Node, and Storm Nimbus Server node size.  DataNodeVMSize is used for the Hadoop Data/Worker Node, HBase Region Server, and Storm Supervisor node size.

Finding the right string values to specify the different VM sizes can be a bit tricky. If you specify an unrecognized VM size, you will get an error back that reads: "New-AzureHDInsightCluster : Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind state: 'Error'. Message: 'PreClusterCreationValidationFailure'.".  The allowed sizes are described on the Pricing page at: http://azure.microsoft.com/en-us/pricing/details/hdinsight/ but the string values that relate to the different sizes can be found at: https://msdn.microsoft.com/en-us/library/azure/dn197896.aspx. Note that the first column in the table of sizes has a heading "Size – Management Portal\cmdlets & APIs". You have to use the value to the right of the '\', unless of course it says "(same)" in which case you use the value on the left. This means that for an "A3" size, you have to specify "Large", "A7" is just "A7", and "D3" needs to be specified as "Standard_D3".
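As a quick, non-exhaustive illustration of that mapping, here is a small sketch; the entries are just the sizes discussed in this post, plus A4 as an assumed example taken from the same MSDN table.

# Portal size name -> string expected by -HeadNodeVMSize / -DataNodeVMSize / -ZookeeperNodeVMSize
$vmSizeMap = @{
    "A3" = "Large"          # portal shows A3, the cmdlets want "Large"
    "A4" = "ExtraLarge"     # assumed example based on the same MSDN size table
    "A7" = "A7"             # "(same)" in the table, so the portal name is used as-is
    "D3" = "Standard_D3"    # D-series sizes take the "Standard_" prefix
}

$vmSizeMap["D3"]    # returns "Standard_D3"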

Special Considerations for Memory Settings:

If you use the default cluster type, or specify 'Hadoop' as your cluster type, then the Hadoop, YARN, MapReduce & Hive settings will be modified to make use of the additional memory available on Data Nodes. If you specify 'HBase' or 'Storm' for the cluster type, these values currently remain at the lower defaults associated with the default node size and the additional memory is reserved for the HBase or Storm workload on the cluster. You can see the relevant settings by connecting to the cluster via RDP and examining yarn-site.xml, mapred-site.xml, and hive-site.xml in the respective configuration folders under C:\apps\dist.
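If you want a quick way to pull those values once you are connected over RDP, a small sketch like the one below works; the property names searched for are just examples, and the exact folder names under C:\apps\dist vary by cluster version.

# Search the YARN, MapReduce and Hive configuration files for memory-related settings
$patterns = "yarn.nodemanager.resource.memory-mb",
            "mapreduce.map.memory.mb",
            "mapreduce.reduce.memory.mb"

Get-ChildItem -Path "C:\apps\dist" -Recurse -Include yarn-site.xml, mapred-site.xml, hive-site.xml |
    Select-String -Pattern $patterns |
    Select-Object Filename, LineNumber, Line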

It is possible to alter all of these memory configuration settings by customizing them at cluster provisioning time, but this can be a complicated endeavor when you consider that you have to balance the different allocations to make sure that you don't allocate beyond the physical memory on the nodes.

Note: If you are actively working with Azure Powershell, be sure to update to the Latest Release in order to take advantage of new features and fixes.

Why are the Hadoop services disabled on my HDInsight cluster


I came across this question while working with a few customers recently and thought I would share a few tips with others who may find them helpful. There are times when we may need to check the status of a Hadoop service or restart the service as part of troubleshooting an issue – for example, you may want to restart the Oozie service or Hive Metastore service while troubleshooting an Oozie issue or a Hive issue. To restart a Hadoop service on an HDInsight cluster on Windows, currently we need to Remote Desktop (aka RDP) to the Headnode of the cluster. Consider a scenario where you have enabled Remote Desktop to the cluster via the 'Enable Remote' button on the Azure portal (https://manage.windowsazure.com) or other available alternatives. You then RDP to the Headnode of the HDInsight cluster and open the Windows services console via Start -> Run -> services.msc
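(As a side note, instead of services.msc you can also check service state from PowerShell. A quick sketch is below; the wildcard pattern is an assumption, since the exact service names vary by cluster version, and either way the state you see is the same.)

# List services whose display name looks Hadoop-related and show their current state
Get-Service | Where-Object { $_.DisplayName -match "hadoop|hive|oozie|templeton|metastore|namenode|datanode" } |
    Select-Object Status, Name, DisplayName |
    Format-Table -AutoSize

# On the active headnode you could then restart a specific service, for example:
# Restart-Service -Name <service name from the list above>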

And then you find that hadoop services are disabled, like below –

This is interesting, right? We would expect the Hadoop services to be running on the HDInsight Headnode. Consider another scenario: you have run a Hive query from the Hive CLI on the HDInsight Headnode and, as part of troubleshooting the query, you want to review the hive.log file. You go to the %hive_home%\logs folder and find that the logs folder is empty, like below –

Another possibility is that the log file does exist under the logs folder but we are not seeing the log entries for our specific execution or test. All of the above scenarios (Hadoop services being disabled, some of the logs folders being empty, or log entries being non-existent on the Headnode) have the same underlying reason, which is that when we connected via RDP we did not land on the currently active Headnode. Remember that, as described in the Azure HDInsight documentation here, for high availability each HDInsight cluster has two Headnodes – headnode0 and headnode1. Due to various reasons, a Headnode failover may occur, and either headnode0 or headnode1 can become the active Headnode at any given time. So, in order to find the running Hadoop services on the Headnode or to find the logs that exist on the Headnode, we need to log on to the currently active Headnode.

How do I find the current active headnode?

On the desktop of each HDInsight Headnode, you will see an icon called 'Hadoop service availability', as shown below -

Double clicking the 'Hadoop service availability' icon will show you the current active Headnode where the services are running. For example, we can see below that the active Headnode is headnode0.

Now open a Hadoop command prompt or any command prompt and run the hostname command to verify which node we are logged into – in the example below, we can see that we are on headnode1.

If the hostname and the Hadoop Service Availability page show the same Headnode name, then we are already on the active headnode. In my case, the current active headnode is headnode0 and RDP is connecting to headnode1.

How do I RDP to active Headnode?

If you run into the above scenario where the default RDP connection is not connecting you to the HDInsight active Headnode, you can use a simple workaround like the one below –

Click on the Connect button on the Azure portal and, at the prompt, save the RDP file (with the .rdp extension) on your workstation instead of opening it, like below –

Right click on the .rdp file you saved in the previous step, select 'Open with' and then select the 'Choose default program' option, as shown below –

After you select Notepad or any other text editor the first time, it will be shown in your available options the next time onwards. Open the file with any text editor like Notepad, as shown below-

All we need to do is switch the last digit (circled above) on line 3 between 1 and 0 in order to switch between headnode1 and headnode0. For example, the above .rdp file logged me into headnode1. If I need my RDP connection to log me into headnode0, we need to change the file like this (the only change is the last digit on line 3) –

Now, save the .rdp file. To connect via this saved .RDP file, either double click on the saved icon or right click on the .rdp file and select 'Connect'
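If you would rather not hand-edit the file each time, here is a minimal sketch that flips that digit from PowerShell. The file path is a placeholder, and it assumes, as described above, that line 3 of the saved .rdp file ends in the digit that selects headnode0 or headnode1.

# Toggle the headnode digit (0 <-> 1) at the end of line 3 of a saved HDInsight .rdp file
$rdpFile = "C:\Users\me\Documents\mycluster.rdp"   # path where you saved the .rdp file

$lines = Get-Content $rdpFile
if ($lines[2].EndsWith("0")) {
    $lines[2] = $lines[2].Substring(0, $lines[2].Length - 1) + "1"
}
elseif ($lines[2].EndsWith("1")) {
    $lines[2] = $lines[2].Substring(0, $lines[2].Length - 1) + "0"
}
Set-Content -Path $rdpFile -Value $lines

# Launch the updated connection
mstsc $rdpFile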

Once connected, we can verify that we are now on the active Headnode (in my example, headnode0) and that the Hadoop services are running on this active Headnode, as shown below –

I hope this helps to clarify some confusion around this! Ideally RDP should connect us to the active Headnode, and we have requests logged to change the behavior – until that is implemented, you can use this simple workaround if you run into the scenarios described above.

How to install Splunk on HDINSIGHT with a custom action script


 

Recently I worked with a customer that wanted to use Splunk Enterprise and Splunk Forwarder to monitor and manage their HDINSIGHT Storm cluster. You can learn more about Splunk at http://www.splunk.com/ . Splunk has a version called Splunk Light that you can download for free. There are some restrictions, so read the documentation and license agreement. Splunk Light offers real-time log search and analysis. In this post I will show you how to install Splunk Light on all nodes of an HDINSIGHT cluster.

HDINSIGHT has a feature called custom action scripts that allows you to customize an HDINSIGHT cluster during provisioning. With custom action scripts, you can do things like install software, change Hadoop configuration files, set environment variables, and much more. You can read more about HDINSIGHT's custom action scripts at http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/ and https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-script-actions/

 

Let's get started.

We first need to download the Splunk Light for Windows x64 MSI from the Splunk web site at http://www.splunk.com/en_us/products/splunk-light.html. I'll be using splunklight-6.2.2-255606-x64-release.msi for this article, but download the latest version from Splunk.

Next we need to create a public container in an Azure Storage account. The storage account and container must remain accessible throughout the lifetime of the cluster. HDINSIGHT can re-image nodes, and when it does, the custom action script will be executed again; the custom action script therefore has to be idempotent. The Azure Storage account can be the HDINSIGHT default account or an additional storage account. For example, I have created a scriptactions container in my storage account for this purpose (https://portalvhdszmhjyc3XXXXXX.blob.core.windows.net/scriptactions). A sketch of scripting this setup follows below.
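Here is a minimal sketch of that setup with the Azure storage cmdlets; the storage account name, key, and local file path are placeholders.

# Create a container with public (anonymous) blob read access and upload the Splunk MSI to it
$ctx = New-AzureStorageContext -StorageAccountName "portalvhdszmhjyc3XXXXXX" `
                               -StorageAccountKey "<your storage account key>"

New-AzureStorageContainer -Name "scriptactions" -Permission Blob -Context $ctx

Set-AzureStorageBlobContent -File "C:\Downloads\splunklight-6.2.2-255606-x64-release.msi" `
                            -Container "scriptactions" -Context $ctx

# The custom action script (splunk-installer-v1.ps1, described below) is uploaded the same way.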

Now that we have our .msi file in a container in our storage account, we will write and place our custom action script in the same container. HDINSIGHT custom action scripts are PowerShell scripts. I have named mine splunk-installer-v1.ps1. The script is described below.

The script checks for the existence of a c:\apps\dist\temp_splunk folder. C:\apps and D:\ are safe to write data to; the re-image process will delete or re-ACL files and folders in other locations. We will use this folder to copy our .msi file from the storage container to the HDINSIGHT node. If the folder does not exist we create it. The script then downloads the .msi to the c:\apps\dist\temp_splunk folder. It then executes the MSI with msiexec using the /lv, AGREETOLICENSE=Yes and /quiet switches. The /lv switch creates a verbose install log in case the install fails; we can search this log for "Return Value 3" to find the reason for a fatal error. AGREETOLICENSE=Yes indicates that we agree to Splunk's license agreement, and /quiet does a silent installation of the MSI package. The next code block loops for up to five minutes checking for the Splunkd service; this is the service name, not the display name. We need to give the MSI time to execute and install, and this code block allows for that. Finally, we clean up after ourselves by deleting the c:\apps\dist\temp_splunk folder. Go ahead and review the script sketched below. You can add exception handling, but I wanted to keep things simple for the article.
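The original script is not reproduced here, but a minimal sketch along the lines described above would look something like the following; the blob URL is a placeholder for wherever you uploaded the MSI, and the log path matches the one referenced later in this post.

# splunk-installer-v1.ps1 - sketch of a custom action script that installs Splunk Light
$tempFolder = "C:\apps\dist\temp_splunk"
$msiUrl     = "https://portalvhdszmhjyc3XXXXXX.blob.core.windows.net/scriptactions/splunklight-6.2.2-255606-x64-release.msi"
$msiPath    = Join-Path $tempFolder "splunklight.msi"
$logFolder  = "C:\applications"
$logPath    = Join-Path $logFolder "splunk-install.log"

# Create the working and log folders if they do not already exist (C:\apps is safe across re-images)
if (!(Test-Path $tempFolder)) { New-Item -Path $tempFolder -ItemType Directory | Out-Null }
if (!(Test-Path $logFolder))  { New-Item -Path $logFolder  -ItemType Directory | Out-Null }

# Only install if the Splunkd service is not already present (keeps the script idempotent)
if (!(Get-Service -Name "Splunkd" -ErrorAction SilentlyContinue)) {

    # Download the MSI from the storage container to the node
    (New-Object System.Net.WebClient).DownloadFile($msiUrl, $msiPath)

    # Silent install: accept the license and write a verbose log for troubleshooting
    Start-Process -FilePath "msiexec.exe" `
                  -ArgumentList "/i `"$msiPath`" AGREETOLICENSE=Yes /lv `"$logPath`" /quiet" `
                  -Wait

    # Wait up to five minutes for the Splunkd service to show up
    $elapsed = 0
    while (!(Get-Service -Name "Splunkd" -ErrorAction SilentlyContinue) -and $elapsed -lt 300) {
        Start-Sleep -Seconds 15
        $elapsed += 15
    }
}

# Clean up the working folder
Remove-Item -Path $tempFolder -Recurse -Force -ErrorAction SilentlyContinue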

 

   

  

You can execute the script either through the Azure PowerShell cmdlets or .NET code. You can use the Add-AzureHDInsightScriptAction cmdlet, documented at https://msdn.microsoft.com/en-us/library/dn858088.aspx.
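For reference, wiring the script action into a cluster configuration from PowerShell looks roughly like the sketch below, assuming $hdiconfig is a configuration object created with New-AzureHDInsightClusterConfig (as in the VM size examples earlier on this blog); the action name, node roles, and URI are placeholders.

# Attach the custom action script to the cluster configuration before provisioning
$hdiconfig = Add-AzureHDInsightScriptAction -Config $hdiconfig `
                 -Name "Install Splunk Light" `
                 -ClusterRoleCollection HeadNode, DataNode `
                 -Uri "https://portalvhdszmhjyc3XXXXXX.blob.core.windows.net/scriptactions/splunk-installer-v1.ps1"

# Then provision as usual with: New-AzureHDInsightCluster -Config $hdiconfig ...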

You can also use the Azure portal and do an HDINSIGHT custom create. The last form of the wizard gives you the option to add your custom action script, where you can give the script action a name and choose which nodes to run the script on.

 

 

The cluster customization stage is the last stage before the cluster becomes operational. If the cluster is created but the Splunk software is not installed under D:\Program Files\Splunk, you can review the install log at c:\applications\splunk-install.log. If the whole cluster creation failed due to your custom action script, you can review the HDINSIGHT install logs. Each HDINSIGHT cluster provision writes a setup log to Azure Table Storage, and you can review this log to troubleshoot the cluster provisioning failure. The following blog post discusses the logs in Azure Table Storage and how to access them: http://blogs.msdn.com/b/brian_swan/archive/2014/01/06/accessing-hadoop-logs-in-hdinsight.aspx. This is your best option to determine why your custom action script failed.

You can now remote desktop into a node, use Windows search, and launch Splunk Light. The default user is admin and the password is changeme; you will be required to change the password. You can now start to add data to monitor and perform searches and log analysis. To find out more about using Splunk, visit http://www.splunk.com/view/SP-CAAAG2R

 

 

I hope this shows how you can customize your HDINSIGHT cluster with Splunk or other software. Our development team has written examples of custom action scripts for Spark, R, Solr, and Giraph. These are good examples to review to learn more about HDINSIGHT's custom action scripts.

Install Spark - See Install and use Spark on HDInsight clusters.

Install R - See Install and use R on HDInsight clusters.

Install Solr - See Install and use Solr on HDInsight clusters.

Install Giraph - See Install and use Giraph on HDInsight clusters.

 

Happy Splunking!

Bill

How to access Hive using JDBC on HDInsight


While following up on a customer question recently on this topic, I realized that we have seen the same question come up from other users a few times, and thought I would share a simple example here on how to connect to HiveServer2 on Azure HDInsight using JDBC. For background, please review the Apache wiki and the Cloudera blog on the architecture and the benefits of HiveServer2 for applications connecting to Hive remotely via ODBC, JDBC, etc. There are also some good articles like this one which show a step-by-step example for an on-premises Hadoop cluster. For an Azure HDInsight cluster, it's worth noting the following points –

  1. HiveServer2 service has been started as part of cluster provisioning and is running on the active Headnode of the cluster, as shown below –

           You can verify this by RDP-ing to the active Headnode of the Azure HDInsight cluster. If you don't see the HiveServer2 service running or if you find it disabled, please review this blog.

  2. HiveServer2 is running in HTTP mode, on port 10001 and can be verified from hive-site.xml configuration file located under %HIVE_HOME%\conf folder, as shown below –

    But, as explained in this blog, HDInsight is a managed cloud service and is secured via a gateway which exposes the HiveServer2 endpoint (and other endpoints) on port 443. So, from your workstation, you may not be able to connect to HiveServer2 directly on port 10001; instead, client applications make a secure connection to port 443 and the gateway redirects the request to HiveServer2 on port 10001. A JDBC connection string for HDInsight may therefore look something like this –

    jdbc:hive2://myClusterName.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=/hive2 

With that said, I am going to show a simple example of accessing HiveServer2 via JDBC from a Java application, using Maven. If you have already used HiveServer2 via JDBC against an on-premises Hadoop cluster, you can skip the TL;DR part below :-) and just review the code sample to see the difference in the Hive JDBC connection string for Azure HDInsight. For others who like details, please review the steps below –

Developing a Java Application to access Hive via JDBC:

  1. It is assumed that Maven is installed on your workstation. Ensure that maven is added to your PATH environment variable. Open a command prompt on your workstation and change folder to where you wish to create the project. For example, cd C:\Maven\MavenProjects
  2. Use the mvn command, to generate the project template, as shown below –

    mvn archetype:generate -DgroupId=com.microsoft.css -DartifactId=HiveJdbcTest -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false 

    This will create the src directory and POM.xml in the directory HiveJdbcTest (same as artifactId)

  3. Open the POM.xml with your favorite Java IDE; I have used IntelliJ. You can review this blog for more details, with screenshots, on how to use Maven with IntelliJ or Eclipse.
  4. Open IntelliJ -> Import Project -> browse to POM.xml for your project created in step 2

    In IntelliJ, delete the default class created by the IDE called App, and delete the test folder if you don't plan to use it. Create a Java class called MyHiveJdbcTest.

  5. Modify the POM.xml to something like this –

    NOTE on POM.xml:

    a. How do we know which dependency JARs we need to add? This is somewhat trial and error. I typically go to http://mvnrepository.com, do a full text search, and start with the ones that seem relevant. In this case, I first started with the hive-jdbc JAR; at that point my code compiled, but I still got runtime errors like ClassNotFoundException, so I added the hadoop-common JAR. I was still getting a 'Connection RESET' error from JDBC, so I then added hive-exec and other dependency JARs. I have tried with both Hive 0.13 and 0.14 JARs and both versions worked.

    b. Even though it was not necessary, I added an entry for the shade plugin under <build> to make an 'uber' (or fat) JAR - this is useful when the versions of the dependency JARs in the runtime environment are not predictable or keep changing. An uber JAR contains all dependencies in itself and hence does not depend on the dependency JAR versions in the runtime environment.

  6. Add the following code in the class MyHiveJdbcTest –
  7. Build the project using maven command line:

    cd c:\Maven\MavenProjects\HiveJdbctest
    mvn clean package
     

  8. Once it builds successfully, we can step through and debug the code in IDE.

    We can also make it an executable JAR as shown in this blog.

I hope you find the blog useful! Please feel free to share your feedback.

Spark on Azure HDInsight is available


 

Spark on Azure HDInsight (public preview) is now available!

The following components are included as part of a Spark cluster on Azure HDInsight.

  • Spark 1.3.1, which comes with Spark Core, Spark SQL, the Spark Streaming APIs, GraphX, and MLlib.
  • Anaconda, a collection of powerful packages for Python.
  • Spark Job Server, which allows you to submit jars or Python scripts remotely.
  • Zeppelin Notebook for interactive querying.
  • IPython Notebook for interactive querying.
  • Spark in HDInsight also provides an ODBC driver for connectivity to Spark clusters in HDInsight from BI tools such as Microsoft Power BI and Tableau.

 

Below are articles and documentation on Spark on Azure HDInsight to get you started!

  • Overview: Apache Spark on Azure HDINSIGHT
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-overview/
  • Provision Apache Spark clusters in HDInsight using custom options
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-provision-clusters/
  • Quick Start: Provision Apache Spark on HDInsight and run interactive queries using Spark SQL
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
  • Use BI tools with Apache Spark on Azure HDInsight
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/
  • Spark Streaming: Process events from Azure Event Hubs with Apache Spark on HDInsight
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-csharp-apache-zeppelin-eventhub-streaming/
  • Build Machine Learning applications using Apache Spark on Azure HDInsight
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-ipython-notebook-machine-learning/
  • Manage resources for the Apache Spark cluster in Azure HDInsight
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-resource-manager/
  • Spark Job Server on Azure HDInsight clusters
    https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-job-server/

Azure Data Factory JSON Changes in July 2015


Azure Data Factory factories are defined with a series of fairly simple JSON documents and uploaded to Azure using the web interface, PowerShell, .NET, or Visual Studio.

If you were using the pre-release public preview of Azure Data Factory, you should be aware of a recent change in the SDK in order to make the transition as seamless as possible. There is a set of JSON schema changes in Azure Data Factory that take effect when the new PowerShell SDK (updated as of July 17, 2015) is downloaded. The various Azure SDKs are released roughly monthly, so the best way to know what you are using is to remember the date when you last downloaded it from http://azure.microsoft.com/en-us/downloads/

The hope is that the new JSON format makes more sense to users, so pardon our dust as we evolve toward making it more user friendly to develop with Azure Data Factory.

The changes should be easy to see and manage.

Three things you need to know:

  1. When you are in the web portal https://portal.azure.com inside the Data Factory scripts, the changes are signaled with the informative notice below, and your JSON should be upgraded automatically. When using the web portal's Author and Deploy blade to create new objects, the web editor knows the new version of the JSON schema and can auto-fix the documents to the new format.

The "Author and deploy" editor now uses new JSON format for data factory entities. Note:

  • Properties specific to an entity type are now specified in "typeProperties" object.
  • Type names have been changed.
  • When a draft containing the old JSON format is opened, you will be prompted to automatically upgrade the JSON to the new format.

   

 

2. The same JSON changes impact Visual Studio users. More info TBD.

3. We have a freely downloadable tool that helps convert all JSON design documents in one directory as a batch, from the command line console.

 

Documentation References to observe

The documentation team has done a great job documenting the exact changes in the JSON schema that may impact your designs.

  1. Release Notes https://azure.microsoft.com/en-us/documentation/articles/data-factory-release-notes/#notes-for-07172015-release-of-data-factory
  2. JSON Documentation https://msdn.microsoft.com/en-us/library/azure/dn835050.aspx
  3. The old schemas of the JSON format are mentioned in parallel for reference if you need to keep using the old PowerShell SDK version. https://msdn.microsoft.com/en-us/library/azure/mt185729.aspx 

Hope this helps – we will update this blog if we notice any issues! ~JasonH

Spark or Hadoop


 

Spark is the most active Apache project and gets a lot of press in the big data world. So how do you know if Spark is right for your project, and what is the difference between Spark and Hadoop when run on HDInsight? I'll cover some of the differences between Spark and Hadoop and some of the things to consider for your next project.

 

VS

 

Spark and Hadoop are both big data frameworks. Spark can run on top of Hadoop, but it does not have to. You can run Spark in local standalone mode on your laptop or in a distributed manner on a cluster. Spark has its own resource manager (the Standalone Scheduler) as well as support for other resource managers like Mesos and Yarn. On HDInsight, by default, Spark uses its own resource manager and not Yarn. On HDInsight, Spark has a SparkMaster service on the headnodes and a SparkSlave service on the workernodes. These services start and manage the JVMs for Spark.

One of the main reasons Spark is run on top of Hadoop is that Spark does not have a distributed file system like HDFS or Windows Azure Storage. Running Spark on top of Hadoop gives Spark access to distributed data that most big data projects require.

Spark can be faster for some circumstances and workloads. Spark can handle a lot of operations in memory, which reduces the time spent writing to and reading from physical disk; memory access is faster than disk access. MapReduce, on the other hand, writes data back to disk after operations in order to ensure recoverability on failure. Spark uses RDDs, Resilient Distributed Datasets. RDDs are datasets of objects that are distributed across the nodes of the cluster. RDDs are automatically recoverable on failure, so intermediate data does not have to be written to disk. RDDs are also partitioned; figuring out an RDD's correct partition size can be a challenge for optimal performance.

A Spark job is broken up into stages and tasks. Each task has its own thread and is scheduled on an executor. An executor can run multiple tasks, which means executors are multi-threaded. The executors also store Spark's cache, which holds the RDDs. As tasks are scheduled on an executor, they run code against an RDD's partitioned data. An executor's multi-threaded nature helps improve performance.

Both Spark and Hadoop have shuffle operations. Spark writes intermediate data to physical disk. On HDInsight the shuffle's intermediate data is written to disk locally on the virtual machines and not to the default storage account on Windows Azure Storage. The shuffle can be a bottleneck for both Spark and Hadoop.

Spark has a rich set of programming choices. It supports Java, Scala, Python, and, in Spark 1.4, R. This gives your development team a wide choice of languages. Spark is written in Scala; Scala is a functional programming language and is not as well-known as Java. Python is widely known and has a large developer base to work from. Python and R are widely used by data scientists for machine learning.

Spark uses "lazy execution". Spark commands are either transformations or actions. A transformation command builds up the plans lineage (metadata) and is not executed. The return type of a transformation is a RDD. Actions take the linage and executes it. Actions usually writes data back to the driver application or writes data to disk. Getting used to Sparks "lazy execution" can take some getting used to.

Spark can be a one-stop shop instead of stitching together multiple projects in Hadoop. Spark's core contains functionality for scheduling, memory management, fault tolerance, and different storage systems. It also has packages for Spark SQL, Spark Streaming, Spark machine learning, and Spark GraphX processing. Instead of using multiple Hadoop sub-projects like Storm, Hive, Sqoop, and others to create a solution, you might be able to use just Spark to create the same solution.

Spark moves big data closer to interactive processing. Spark on HDInsight has multiple ways an end user can interact with the cluster. It has a Spark Dashboard to help manage and troubleshoot. It has IPython and Zeppelin notebooks to run interactive queries from your desktop. It has a Spark Job Submission Service, so you can use its REST API to copy a local .jar or Python script to Windows Azure Storage and then execute it on the Spark cluster in batch mode; this can be done or scheduled remotely from your desktop so you don't have to remote desktop to the cluster to execute it. It also supports the Spark ODBC driver, so you can use Azure Power BI or Tableau to do interactive analysis. Spark on HDInsight gives end users a rich way to interact with the cluster.

 

This should give you a sense of some of the similarities and differences between Spark and Hadoop and how they interact with each other. Any big data project has a lot of challenges. For your next project, give Spark on HDInsight a look and see if it is right for you and your team!

 

 Bill

 

 

 

 

Using cross/outer apply in Azure Stream Analytics


 

Recently I worked on a problem where JSON data events contain an array of values. The goal was to read and process the entire JSON data event, including the array and the nested values, using the Microsoft Azure Stream Analytics service.

Below is the sample data as an example

{
    "Itemid": "0001",
    "Itemtype": "donut",
    "Itemname": "Cake",
    "Itemppu": 0.55,
    "Itemtopping":
        [
            { "Tid": "5001", "Ttype": "Liquid Eggs" },
            { "Tid": "5002", "Ttype": "Syrups" },
            { "Tid": "5005", "Ttype": "Cocoa" },
            { "Tid": "5007", "Ttype": "Powdered Sugar" },
            { "Tid": "5006", "Ttype": "Chocolate with Sprinkles" },
            { "Tid": "5003", "Ttype": "Chocolate" },
            { "Tid": "5004", "Ttype": "Vanilla" }
        ]

}

In the above sample data, you will notice that it contains an array (named Itemtopping). You can see the convention is to list multiple record objects, each one in curly braces {}, with commas between them, all inside the same array (denoted with square brackets [ ]). This is a standard way to make an array when following JSON conventions, as you can read at https://www.json.com/json-array

This blog explains how such a stream can be processed in the Microsoft Azure Stream Analytics service.

To process such streams, the CROSS APPLY or OUTER APPLY operator can be used. This operator flattens a stream containing arrays in one or more columns. The difference between CROSS APPLY and OUTER APPLY is that if the array is empty, no output is written with CROSS APPLY, while OUTER APPLY returns one row with ArrayIndex and ArrayValue as null. Below are the steps to test the scenario.

Sending Events to Event hub

In this example I am using a convenient and free tool called the Service Bus Explorer for sending my sample events to Microsoft Azure Event Hub. Stream Analytics has built-in adapters to read from and write to Event Hub as an input or output. Please make sure an Azure Event Hub exists before connecting/sending events to it (see Service Bus in the Azure portal). There are other methods to do the same via code or other tools, so you can use whatever method is comfortable to you.

image

Using the tool, I have sent a couple of events into the queue for this test scenario, editing the sample data a bit before sending each message.

Creating Input in Microsoft Azure Stream Analytics

  • Go to Microsoft Azure portal website
  • Click STREAM ANALYTICS
  • Click NEW. Provide a Job name, specify the Region, and select a Regional Monitoring Storage Account. Click CREATE STREAM ANALYTICS JOB

image

  • Click INPUTS and ADD AN INPUT

image

  • Click Data Stream and Next
  • Select Event Hub and Next
  • Provide the details for the Event Hub

image

  • Click Next
  • Select EVENT SERIALIZATION FORMAT as JSON
  • Select ENCODING as UTF8
  • Click Finish

Writing A Query

  • Click QUERY
  • Provide the query below. Notice the highlighted portion of the query where the nested JSON object is parsed and represented. The GetElements() function is used to expand the nested array of records, and Label.ArrayValue and Label.ArrayIndex retrieve metadata about the JSON array that is fetched.

SELECT  e.Itemid as Itemid, e.Itemtype as Itemtype, e.Itemname as Itemname, e.Itemppu as Itemppu,
toppings.ArrayValue AS Bvalue, toppings.ArrayIndex AS Bindex 
INTO  output 
FROM input AS e 
CROSS APPLY GetElements (e.Itemtopping) as toppings

image

  • Click SAVE

Setting the Output

  • Click OUTPUTS
  • Click ADD AN OUTPUT
  • Select Table storage
  • Click Next

image

Note: I am using Bindex as the PARTITION KEY and Itemid as the ROW KEY. For more information on partition and row keys please refer here.

  • Click Finish
  • Once output is created, click START to run ASA job
  • I am using CUSTOM TIME option since I already have events in event hub

image

  • Click Ok

Analyzing Output

After waiting for some time, the output will appear in table storage as shown below. In this example, I am using Azure Storage Explorer to view the output.

image

Notice the Bvalue column has the token "Record" rather than the actual data. The reason is that CROSS APPLY perceives the nested JSON object as a nested record, but this shape of table cannot represent all the nested properties of the record object in a single column. So we need to further expand the nested record by selecting the individual record fields explicitly to divide it into separate columns.

Change in Query

Let's change the query to get the data from the array.

  • Delete table storage from Azure Storage Explorer
  • Stop Azure Stream Analytic job
  • Switch to QUERY tab
  • Change the query to match the example. Notice the nested properties in the JSON record array are enumerated in the SELECT clause using the two-period notation Label.ArrayValue.property for the named nested record properties that we fetch using the GetElements() function.

SELECT  e.Itemid as Itemid, e.Itemtype as Itemtype, e.Itemname as Itemname, e.Itemppu as Itemppu,
toppings.ArrayIndex AS Bindex, toppings.ArrayValue.Tid as toppingid,  toppings.Arrayvalue.Ttype as toppingvalue
INTO output FROM input AS e 
CROSS APPLY GetElements (e.Itemtopping) as toppings

  • Click SAVE

image

  • Click START to start ASA job

Final Output

After processing the data, switch to the output and watch the records in Azure Storage Explorer. Notice that all the data details, including the nested JSON array of record objects, land successfully in table storage as separate columns and rows.

image

I hope this will be a useful post. Thanks for reading it. Have a nice time and happy learning!


Why is my Spark application running out of disk space?


 

In your Zeppelin notebook you have Scala code that loads Parquet data, compressed with Snappy, from two folders. You use Spark SQL to register one table named shutdown and another named census. You then use the SQLContext to join the two tables in a query and show the output. Below are the Zeppelin notebook and code.

 

 

import org.apache.spark.sql._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val dfshutdown = sqlContext.load("wasb://data@eastuswcarrollstorage.blob.core.windows.net/parquet/abnormal_shutdown_2_parquet_tbl/", "parquet")

val dfcensus = sqlContext.load("wasb://data@eastuswcarrollstorage.blob.core.windows.net/parquet/census_fact_parquet_tbl/dt=2015-07-22-02-04/", "parquet")

dfshutdown.registerTempTable("shutdown")

dfcensus.registerTempTable("census")

val q1 = sqlContext.sql("SELECT c.osskuid, s.deviceclass, COUNT(DISTINCT c.sqmid) as Cnt FROM shutdown s LEFT OUTER JOIN census c on s.sqmid = c.sqmid GROUP BY s.deviceclass, c.osskuid")

q1.show()

  

However, during execution, exceptions are raised in the Spark Dashboard and returned to the Zeppelin notebook, and eventually, after retries, the job fails. What is going on?

 

 

 

 

Exceptions:

org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3066934057 - discarded

    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)

    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)

    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)

    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)

    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

    at org.apache.spark.scheduler.Task.run(Task.scala:64)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

    at java.lang.Thread.run(Thread.java:745)

Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3066934057 - discarded

    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)

    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)

    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)

    at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)

    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)

    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)

    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)

    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)

    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)

    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)

    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)

    ... 1 more

 

java.io.IOException: There is not enough space on the disk

    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:74)

    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)

    at sun.nio.ch.IOUtil.write(IOUtil.java:51)

    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)

    at sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)

    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)

    at org.apache.spark.util.Utils$.copyStream(Utils.scala:326)

    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:736)

    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:734)

    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)

    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

    at org.apache.spark.scheduler.Task.run(Task.scala:64)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

    at java.lang.Thread.run(Thread.java:745)

 

What's going on?

First let's review some background information. The chart below describes attributes of different compression formats. Notice that Snappy is not splittable. In the example the Parquet files are compressed with Snappy. When a file is loaded from disk, Spark will try to split the file into blocks in order to distribute it across the cluster's worker nodes. Because Snappy and Gzip are not splittable, there is no marker within the file to show where it can be broken up into blocks, so the whole file ends up in one split.

 

Compression Format | Tool  | Algorithm | File Extension | Splittable
Gzip               | Gzip  | DEFLATE   | .gz            | No
Bzip2              | Bzip2 | Bzip2     | .bz2           | Yes
LZO                | Lzop  | LZO       | .lzo           | Yes, if indexed
Snappy             | n/a   | Snappy    | .snappy        | No

 

When the load method of the SQLContext is executed, a resilient distributed dataset (RDD) is created. An RDD is a collection of objects that are distributed across the cluster and partitioned. Because the Snappy file is not splittable, an RDD is created with only one partition. If the file were splittable, the RDD would be created with multiple partitions.

So what is going on? When the shuffle sort operation occurs, shuffle data is written to local disk on the worker nodes, and the size of the data to be sorted exceeds the available disk space on the worker node. You can see this in the first exception, where the fetch for the shuffle exceeds 2 GB, and then in the next exception, where there is not enough disk space. If the RDD had more partitions, the shuffle operations would be done against smaller datasets that stay under 2 GB.

So how do we resolve the problem? Fortunately we can repartition the RDD to create more partitions after the load method and before the shuffle operation. You can run tests with different numbers of partitions to find the right number for your cluster and data. Here we can call .repartition(100) on each loaded DataFrame (keeping the DataFrame it returns, since repartition does not modify the original), which creates an RDD with 100 partitions and allows the shuffle sort to succeed for this workload.

 

The new code looks like:

import org.apache.spark.sql._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// repartition returns a new DataFrame, so keep its result rather than discarding it
val dfshutdown = sqlContext.load("wasb://data@eastuswcarrollstorage.blob.core.windows.net/parquet/abnormal_shutdown_2_parquet_tbl/", "parquet").repartition(100)

val dfcensus = sqlContext.load("wasb://data@eastuswcarrollstorage.blob.core.windows.net/parquet/census_fact_parquet_tbl/dt=2015-07-22-02-04/", "parquet").repartition(100)

dfshutdown.registerTempTable("shutdown")

dfcensus.registerTempTable("census")

val q1 = sqlContext.sql("SELECT c.osskuid, s.deviceclass, COUNT(DISTINCT c.sqmid) as Cnt FROM shutdown s LEFT OUTER JOIN census c on s.sqmid = c.sqmid GROUP BY s.deviceclass, c.osskuid")

q1.show()

 

 

More Information:

https://forums.databricks.com/questions/111/why-am-i-seeing-fetchfailedexception-adjusted-fram.html

 

I hope this helps your Spark on HDInsight experience,

 

Bill

 

 

 

How to Access HDInsight Linux Web UI's using SSH Dynamic Tunneling


Scenario

One of the most important features of Azure HDInsight Linux (currently in preview) is Ambari Web, available on the portal. If you open the Azure portal and select your HDInsight Linux cluster, you will see AMBARI WEB, a web UI for cluster management and monitoring, in the bottom pane as below:

On the new Azure portal (https://ms.portal.azure.com/?r=1), AMBARI WEB is linked from the "Dashboard" link of the HDInsight Linux cluster as below:

Once you click on the link, the first authentication prompt will require your HTTPS user (admin) and the corresponding password. Then you will land on the main Ambari Web UI page with a prompt for user id and password (admin/admin). Once you enter this, you will reach the main page as below:

   

One of the issues you will face, though, is that the other UI pages you try to browse to from this main Ambari page (on Windows, for example) will error out as below

  •    Try browsing, without any changes made on the browser, to "NameNode UI"

  •    Error message:


 

Notice the URL it is trying to browse to:

http://headnode0.meerhdplinux-ssh.j7.internal.cloudapp.net:30070/

   

The problem is that the browser, in this case Chrome, cannot resolve the hostname (or FQDN) it needs to forward the call to.

   

To get around the issue, you will have to set up an SSH tunnel and configure a proxy so the browser uses the tunnel to reach the headnode. SSH tunneling, along with other Linux HDInsight topics, is discussed here in greater depth. Below, I set up the tunneling using MobaXterm to show how we can configure this with MobaXterm and Chrome from a Windows client.

   

Setting up tunneling in MobaXterm

  • To set up the tunnel, from the MobaXterm button toolbar (open MobaXterm if it is not already open), click on "Tunneling"

  • To create a new tunnel, click on "New SSH tunnel". I already have one, so to show what I have set to create my tunnel, I will click on the "Edit the Tunnel" button on my already created tunnel

   

 

  • The settings window pops up

   


   

  • Notice the settings for my tunnel
    • Dynamic port forwarding (SOCKS proxy) has to be chosen
    • Local clients will go through port 9876. This is an arbitrary port; you can set a different one, I just set 9876
    • For the remote destination, I entered following information:
      • Host: DNS name of my Linux cluster - meerhdplinux-ssh.azurehdinsight.net
      • User: hdpmeer
      • Port: 22 (ssh port)

   

  • Once this is set, you can start your tunnel, click on the start (play button)


   

   

Setting up the Proxy in the Browser

To demonstrate how to set up the proxy, I chose to use the Chrome browser and FoxyProxy. This is not available for IE for now, hence this route.

  •    Click on "Customize and control Google Chrome"->More Tools -> Extension

 

   

  1. I already have this installed; if I did not have it, I would click on "Get More Extensions" on the extensions page

       

     

  2. Then I would, in the search-the-store box, type "Foxy" and select "FoxyProxy Standard"

       

     

   

  1. This installs the FoxyProxy proxy tool in the browser
  2. Upon successful installation, back on the Extensions page, you would see your FoxyProxy enabled
  3. To create a new proxy, click on the "Options" link

   

   

  • Since I already have one set up, I will display its properties. To add a new proxy, though, you click on "Add New Proxy"


   

   

  1. First I select my existing proxy and then click on "Edit Selection"

       

    1. On the General Tab, Proxy Name is set to "localhost:9876". This was set automatically, once I set the other settings


    2. On the "Proxy Details" page, make sure "SOCKS proxy?" is checked. Provider the client name where you are browsing, using localhost. Provide the port number you used when you configured your SSH tunnel in mobaxterm

    3. On the "URL Patterns" tab, since I have the pattern created, I simple select the pattern and click on Edit selection. Otherwise I would click on "Add new pattern"

    4. Note the settings for my pattern:
      1. The pattern is enabled
      2. URL pattern: *headnode*. This uses the wildcard * to match any name with headnode in it.
      3. Also, I must whitelist the URLs matching this pattern.

         

       

  2. Once I have these set up, my proxy is ready to be turned on.
  • Now, before I start browsing Ambari Web, I turn on the proxy as below:


       

   

This will route my requested URLs through the SSH tunnel in MobaXterm, and I will be able to browse the Web UIs for the different Hadoop components (services).

Troubleshooting Hive query performance in HDInsight Hadoop cluster


One of the common support requests we get from customers using Apache Hive is: my Hive query is running slowly and I would like the job/query to complete much faster – or, in more quantifiable terms, my Hive query is taking 8 hours to complete and my SLA is 2 hours. Improving or tuning Hive query performance is a huge area. Depending on the data size and type and the complexity of the query, we may have to look from different perspectives and play with many different configuration settings in YARN, MapReduce, Hive, Tez, etc. Discussing all those options in detail would take much more than a blog post. So today I will discuss the steps we followed to troubleshoot a Hive query performance issue that I worked on recently, and I hope it will give you some pointers on how to troubleshoot the performance or slowness of a Hive query. I will talk about the logs we captured, the questions we tried to answer from the logs, and how that led us to the right cluster size and configurations to reduce the execution time significantly.

What was the issue?

This customer was running a Hive query that was taking three and a half hours, but the customer's SLA expectation was an hour or so. To reduce the execution time they scaled up the data node size of their 60-node HDInsight cluster from A3 (7 GB RAM and 4 cores) to A6 (28 GB RAM and 4 cores) but kept the number of nodes the same. However, running the query on the new A6 data node cluster didn't yield any performance gain.

 

 

What logs did we capture?

To troubleshoot the issue we collected the following logs/files after they ran the job on the cluster with 60 A6 data nodes.

  1. Hive query and explain extended for the hive query.
  2. Hive log after running the Hive query in DEBUG mode.

    To switch to DEBUG mode we had the customer change the following line in "C:\hdp\hive-*\conf\log4j.properties" file:

    From: hive.root.logger=INFO,DRFA

    To: hive.root.logger=ALL,DRFA

  3. Yarn log, captured by running the following command from the Hadoop command line with the application ID.

    yarn logs -applicationId <ApplicationId> -appOwner <cluster_user_id> > <path>\Yarnlogs.txt

  4. job_jobid_conf.xml file(s) from the WASB folder 'mapred/history/done/<year>/<month>/<execution date of the failed job>/'
  5. Configuration files hive-site.xml, mapred-site.xml, and yarn-site.xml

Where did the job spend most of the time, and was there any bottleneck?

First we wanted to find out where the job spent most of the time and whether there was any bottleneck. We checked the Hive query and the explain extended output for it. The query was very straightforward: they were inserting JSON data from one Hive table into another as ORC format (the data was getting converted in the process). They had about 5000 files of different sizes, from a few hundred KB to a few MB, with a total size of 400 GB. In the explain extended output we didn't see anything that stood out. Then we checked the Hive log; same thing, nothing unusual. Next was the yarn log, from which we found there were 677 grouped splits, so the job had 677 tasks.

2015-07-21 17:14:35,709 INFO [InputInitializer [Map 1] #0] tez.HiveSplitGenerator: Number of grouped splits: 677

The job was launched at around 2015-07-21 17:13:05. We checked a few tasks to see the time they took and when they were scheduled. Below is the yarn log for one of the first tasks:

Container: container_1437433182798_0013_01_000010

LogType:syslog_attempt_1437433182798_0013_1_00_000001_0

Log Upload Time:Tue Jul 21 20:24:59 +0000 2015

LogLength:34576

Log Contents:

2015-07-21 17:15:07,159 INFO [main] task.TezChild: Refreshing UGI since Credentials have changed

2015-07-21 17:15:07,159 INFO [main] task.TezChild: Credentials : #Tokens=2, #SecretKeys=1

......

2015-07-21 17:37:30,172 INFO [TezChild] task.TezTaskRunner: Task completed, taskAttemptId=attempt_1437433182798_0013_1_00_000001_0, fatalErrorOccurred=false

This task took about 22 minutes (17:37:30 - 17:15:07) to complete. We saw that at around 20:19 the App Master got the completion report for all tasks.

2015-07-21 20:19:28,372 INFO [AsyncDispatcher event handler] impl.VertexImpl: Num completed Tasks for vertex_1437433182798_0013_1_00 [Map 1] : 677

The job finally completed at around 2015-07-21 20:41:47. Let's line up the time stamps to better understand where the job spent most of the time.

Job started                                           17:13:05

First task launched at around               17:15:07

All tasks completed at around              20:19:28

The job finally completed at around     20:41:47

  • From start to end the job took about 3 hours and 28 minutes (20:41:47 - 17:13:05) to complete.
  • Before the 1st task launched the driver program took only 2 minutes (17:15:07- 17:13:05)
  • The Task Execution Phase took about 3 hours 4 minutes (20:19:28 - 17:15:07) to complete all tasks.
  • Finally, the time between all tasks completing and the job completing is about 22 minutes (20:41:47 - 20:19:28).

It is clear that the job spent most of the time in the Task Execution Phase, about 3 hours 4 minutes, and before and after the Task Execution Phase the job spent only 24 minutes (2 + 22). Clearly the driver program was not the bottleneck. Therefore, to improve performance we need to find out how to reduce the time spent in the Task Execution Phase.

Why did the Task Execution Phase take so long?

Next we wanted to find out why the Task Execution Phase took that long. The yarn log showed that the tasks executed later were waiting for containers to become available. We checked one of the last tasks and found it was launched more than two hours after the first task (which launched at 17:15:07, shown above).

LogType:syslog_attempt_1437433182798_0013_1_00_000676_0

Log Upload Time:Tue Jul 21 20:24:59 +0000 2015

LogLength:35060

Log Contents:

2015-07-21 19:37:14,225 INFO [main] task.TezChild: Refreshing UGI since Credentials have changed

2015-07-21 19:37:14,225 INFO [main] task.TezChild: Credentials : #Tokens=2, #SecretKeys=1

2015-07-21 19:37:14,225 INFO [main] task.TezChild: Localizing additional local resources for Task : {}

.....

2015-07-21 20:06:31,528 INFO [TezChild] task.TezTaskRunner: Task completed, taskAttemptId=attempt_1437433182798_0013_1_00_000676_0, fatalErrorOccurred=false

This task took 29 minutes to complete. The AppMaster log showed the delay as well – the task had to wait for an available container before it could run.

2015-07-21 17:14:37,881 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1437433182798_0013_1_00_000676_0 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE

...

2015-07-21 19:37:10,863 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Assigning container to task, container=Container: [ContainerId: container_1437433182798_0013_01_000626, NodeId:

workernode0.asurionprodhdc.b8.internal.cloudapp.net:45454, NodeHttpAddress: workernode0.asurionprodhdc.b8.internal.cloudapp.net:30060, Resource: <memory:9216, vCores:1>, Priority: 2, Token: Token { kind:

ContainerToken, service: 100.72.182.58:45454 }, ], task=attempt_1437433182798_0013_1_00_000676_0, containerHost=workernode0.asurionprodhdc.b8.internal.cloudapp.net, localityMatchType=NonLocal,

matchedLocation=*, honorLocalityFlags=false, reusedContainer=false, delayedContainers=0, containerResourceMemory=9216, containerResourceVCores=1

2015-07-21 19:37:14,252 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1437433182798_0013_1_00_000676_0 TaskAttempt Transitioned from START_WAIT to RUNNING due to event TA_STARTED_REMOTELY

So there weren't enough containers to run all 677 tasks at once; instead, the tasks were executed in multiple waves. To get an idea of how many containers were available, we checked the MR memory settings in the configuration files, which determine the container size and count.

In mapred-site.xml

<name>mapreduce.map.memory.mb</name>

<value>9216</value>

<name>mapreduce.reduce.memory.mb</name>

<value>9216</value>

So based on the above settings, both the map and reduce container sizes are set to 9 GB.

In yarn-site.xml

<name>yarn.nodemanager.resource.memory-mb</name>

<value>18432</value>

The yarn-site.xml property yarn.nodemanager.resource.memory-mb defines the total amount of memory the NodeManager can use for containers. So with the above settings:

Container size for both map and reduce tasks = 9 GB

Total amount of memory the NodeManager can use for containers = 18 GB

Number of containers per node = 18/9 = 2 containers

Total number of containers in a 60 node cluster = 60*2 = 120 containers

Earlier we saw there were about 677 tasks, so with a total of 120 containers we will need about 677/120 ~ 6 waves of task execution to complete the job. The two tasks we checked so far took 22 and 29 minutes respectively. We checked a few more and the times were about the same, so we didn't see any evidence of data skew. If we assume 30 minutes for each wave of tasks to complete, then with 6 waves we would need about 3 hours, which matches the 3 hours 4 minutes (20:19:28 - 17:15:07) we calculated earlier for the Task Execution Phase from the YARN log.
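To make this back-of-the-envelope estimate repeatable, here is a minimal Python sketch of the same arithmetic. The numbers are the ones observed in this case (18 GB usable memory per node, 9 GB containers, 60 nodes, 677 tasks, roughly 30 minutes per wave); the helper name is just for illustration.

import math

def estimate_waves(node_memory_mb, container_mb, nodes, tasks, minutes_per_wave):
    # Rough estimate of the Task Execution Phase from container sizing
    containers_per_node = node_memory_mb // container_mb   # 18432 // 9216 = 2
    total_containers = containers_per_node * nodes         # 2 * 60 = 120
    waves = math.ceil(tasks / total_containers)            # ceil(677 / 120) = 6
    return total_containers, waves, waves * minutes_per_wave

print(estimate_waves(18432, 9216, 60, 677, 30))            # (120, 6, 180) -> about 3 hours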

This also explains why the customer didn't see any improvement after scaling up the data nodes from A3 to A6. An A3 node has only 7 GB of memory, but the MR memory settings on it were also set to much smaller values, so moving to A6, where the container size was set to 9 GB, didn't actually increase the number of containers.

How can we improve the performance of the Task Execution Phase of a Hive job?

To improve the performance of the Task Execution Phase of a Hive job, we need to reduce the number of task execution waves by increasing the number of available containers. Here, each container was configured with 9 GB of RAM. Containers usually don't need more memory just because the data is big; they may need more memory depending on the query plan when the query is complex, and a 2 GB container is good in most cases. We already knew that the query was not complex, and we didn't see anything unusual in the Hive log or the explain extended output. We considered the following two options for increasing the number of containers.

  1. Scale up the cluster by increasing the available memory of each data node. The customer had already done that by moving the data nodes from A3 (7 GB RAM and 4 cores) to A6 (28 GB RAM and 4 cores). But the container sizes were set too high, which is why they didn't see any increase in the number of available containers. We can simply set the container size to a smaller value, for example 2 GB instead of 9 GB, and that should increase the number of containers significantly.
  2. Scale out the cluster by increasing the number of data nodes. For example, increasing the data nodes from 60 A3 nodes to 120 A3 nodes would double the available containers.

To implement option #1 we can simply set mapreduce.map.memory.mb and mapreduce.reduce.memory.mb to 2 GB (and the corresponding java.opts heap settings to a smaller value), which should increase the number of containers by more than 4 times. However, with this change each node will have 18/2 = 9 containers. Notice that an A6 data node has 4 times more memory than an A3 data node, but both have 4 cores. So if we have 9 containers on each node, they will have to share the 4 available cores in that node. That is more than 2 containers per core, which would not give us the best balance of resource utilization.

On the other hand, to implement option #2 we need to go back to the previous, smaller data node size A3 (7 GB RAM and 4 cores) and then scale out the cluster to 120 nodes instead of 60. This will double the amount of memory and at the same time double the available cores. With the smaller data nodes we can try setting the container size to an even smaller value, for example 1 GB, since we have 4 cores available in each node. It's worth mentioning that in HDInsight you can quickly delete and recreate a Hadoop cluster, so recreating a cluster with a different data node size and count is not a problem.
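To compare the two options side by side, here is a rough Python sketch using the same kind of arithmetic (assumed usable memory per node: 18 GB for A6 as configured above, and roughly 5.5 GB for A3, which is the default discussed below; both VM sizes have 4 cores):

def containers_and_cores(usable_memory_mb, container_mb, cores, nodes):
    per_node = usable_memory_mb // container_mb
    return per_node, per_node / cores, per_node * nodes

# Option 1: keep 60 A6 nodes, shrink containers to 2 GB
print(containers_and_cores(18432, 2048, cores=4, nodes=60))   # (9, 2.25, 540)

# Option 2: 120 A3 nodes with 1 GB containers
print(containers_and_cores(5632, 1024, cores=4, nodes=120))   # (5, 1.25, 600)

Option 1 gives more containers per node but more than two containers per core, while option 2 keeps the container-to-core ratio closer to one.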

Which option would be better for our scenario?

For our scenario, smaller containers on A3 data nodes made more sense given the data size and the query complexity. With A3 we can also make sure most containers get a separate core even with the smaller container size. So we decided to scale out the cluster from 60 A3 data nodes to 120 A3 data nodes and set the container size to 1 GB. An A3 data node has a total of 7 GB RAM and 4 cores. By default yarn.nodemanager.resource.memory-mb is set to 5.5 GB, so we should have 5 containers on each node.
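Plugging the recommended numbers into the same kind of estimate (a rough sketch; actual wave times vary, we simply reuse the ~30 minutes per wave observed earlier):

import math

nodes, node_memory_mb, container_mb = 120, 5632, 1024         # 120 x A3, ~5.5 GB usable, 1 GB containers
tasks, minutes_per_wave = 677, 30

total_containers = (node_memory_mb // container_mb) * nodes   # 5 * 120 = 600
waves = math.ceil(tasks / total_containers)                   # ceil(677 / 600) = 2
print(total_containers, waves, waves * minutes_per_wave)      # 600 2 60

So instead of 6 waves we would expect about 2 waves, or roughly an hour of task execution.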

Hive queries usually run faster on the Tez engine than on MR, so we wanted to recommend that the customer use Tez. It turned out the customer was already using Tez by setting hive.execution.engine=tez. When using Tez for Hive, the container sizes are determined by the Tez task memory setting hive.tez.container.size instead of the MR memory settings (mapreduce.map.memory.mb and mapreduce.reduce.memory.mb). From the hive-site.xml file we found that hive.tez.container.size was set to 9 GB, the same as the MR memory settings, so the container calculations we did above based on the MR memory settings are still valid.

In hive-site.xml

<name>hive.tez.container.size</name>

<value>9216</value>

Final recommendation and the results:

We recommended that the customer create a new HDInsight Hadoop cluster with 120 A3 data nodes and add the following set commands to the Hive query:

set hive.tez.container.size=1024;

set hive.tez.java.opts=-Xmx819m;

The customer ran the same Hive query with the same data set, and this time it took only 1 hour and 16 minutes, with the Task Execution Phase taking about an hour. So with the recommended cluster size and configuration changes we were able to reduce the execution time to about one third of the previous run.

Conclusion

The specific changes we made in this case to improve the performance of the Hive query may not be applicable to the Hive query you are struggling with. For example, in this case the number of tasks was larger than the number of available containers, so to improve performance we increased the number of available containers. However, in other cases you may see the opposite - the number of tasks is much smaller compared to the number of available containers. In those situations you need to increase the number of tasks to improve performance, and you can set mapreduce.input.fileinputformat.split.maxsize to a smaller value to increase the number of splits and thus the number of tasks. The split size is calculated by the formula:

max(minimumSize, min(maximumSize, blockSize))

By default minimumSize < blockSize < maximumSize, so by default the split size is the block size, and in HDInsight the default value for fs.azure.block.size is 512 MB.
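As a quick illustration of that formula (a minimal sketch, not the actual Hadoop implementation; the 10 GB input size is made up), lowering the maximum split size below the block size directly increases the number of splits and therefore the number of map tasks:

def split_size(minimum_size, maximum_size, block_size):
    # max(minimumSize, min(maximumSize, blockSize))
    return max(minimum_size, min(maximum_size, block_size))

MB = 1024 * 1024
block = 512 * MB                     # HDInsight default fs.azure.block.size
data = 10 * 1024 * MB                # a hypothetical 10 GB input

# Defaults: maximumSize is effectively unbounded, so split size == block size
print(data // split_size(1, float("inf"), block))    # 20 splits -> 20 map tasks

# After e.g. set mapreduce.input.fileinputformat.split.maxsize=134217728; (128 MB)
print(data // split_size(1, 128 * MB, block))        # 80 splits -> 80 map tasks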

I hope this blog has clarified some of the basic concepts and guiding principles you need to follow to improve the performance of a Hive query in a Hadoop cluster, especially in HDInsight. By following the same approach you should be able to find the right cluster size and configuration to optimize the performance of your own Hive queries.

Some things to consider for your Spark on HDInsight workload


When it comes time to provision your Spark cluster on HDInsight, we all want our workloads to execute fast. The Spark community has made some strong claims of better performance compared to MapReduce jobs. In this post I want to discuss two topics to consider when deploying your Spark application on an HDInsight cluster.

 

Azure VM Types

One of the first decisions you will have to make when provisioning a Spark cluster is which Azure VM types to choose. This link discusses the various cloud services and specification sizes for VMs: https://azure.microsoft.com/en-us/documentation/articles/cloud-services-sizes-specs/. The various Azure VM types are also priced differently, so review the pricing information as well.

Recent research from the University of California at Berkeley indicates that CPU and stragglers are bigger bottlenecks than disk IO or network (http://www.eecs.berkeley.edu/~keo/publications/nsdi15-final147.pdf). Internal testing at Microsoft also shows the importance of CPU: Spark speed improves as core count increases. Spark jobs are broken into stages, and stages are broken into tasks. Each task uses its own thread within the executors on the worker nodes, so more cores means better parallelization of task execution.
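As a quick way to see how work maps to cores in your own cluster, here is a small PySpark sketch (the input path is hypothetical; each partition of the RDD becomes a task, and the number of tasks that can run concurrently is bounded by the total executor cores):

from pyspark import SparkContext

sc = SparkContext(appName="parallelism-check")

# Default parallelism typically tracks the total number of cores available to the executors
print(sc.defaultParallelism)

# Each partition becomes a task within a stage
rdd = sc.textFile("wasb:///example/data/sample.log")
print(rdd.getNumPartitions())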

D-series VMs provide faster processors, a higher memory-to-core ratio, and a solid-state drive (SSD) for the temporary disk. Currently, Spark shuffle operations write intermediate data to the local disks on the VMs, not to Windows Azure Blob Storage, so the faster SSDs can help improve shuffle performance.

Spark executors can cache data, and this cached data is compressed. Some tests have shown a 4.5x compression ratio, and even better if the files are in Parquet format. The compressed data is distributed across the worker nodes. Your Spark cluster might not need as much memory as you think if your code caches tables and uses a columnar storage format like ORC or Parquet. You can review the Storage tab in the Spark UI to see information on your cached RDDs. You might get better performance from increasing your core count than from increasing the amount of memory on the worker nodes.
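For example, here is a minimal sketch of caching a table for Spark SQL using the Spark 1.x SQLContext API of the era (the path and table name are made up; after running it, the Storage tab in the Spark UI shows the size of the cached, compressed data):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-example")
sqlContext = SQLContext(sc)

# Register a DataFrame as a table and cache it in Spark's compressed,
# columnar in-memory format
events = sqlContext.jsonFile("wasb:///example/data/events.json")
events.registerTempTable("events")
sqlContext.cacheTable("events")

print(sqlContext.sql("SELECT COUNT(*) FROM events").collect())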

 

 

Data Formats

The most common data formats for Spark are text (JSON, CSV, TSV), ORC, and Parquet. ORC and Parquet are columnar formats and compress very well.

When choosing a file format, Parquet has the best overall performance compared to text or ORC; tests have shown a 3x improvement on average over the other file formats. However, once data is cached in Spark it is converted to Spark's own in-memory columnar format, and at that point all of the formats perform the same. Caching and partitioning are important to Spark applications.

If you have not worked with the Parquet file format before, it is worth taking a look at for your project. Spark has parquetFile() and saveAsParquetFile() functions, and Parquet is a self-describing file format. More details can be found at http://parquet.apache.org/documentation/latest/ and at http://spark.apache.org/
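Here is a minimal sketch of reading and writing Parquet with those Spark 1.x functions (the paths are hypothetical; newer Spark versions use read.parquet() and write.parquet() instead):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-example")
sqlContext = SQLContext(sc)

# Convert existing JSON data to Parquet
logs = sqlContext.jsonFile("wasb:///example/data/logs.json")
logs.saveAsParquetFile("wasb:///example/data/logs.parquet")

# Read it back; the schema travels with the files
parquet_logs = sqlContext.parquetFile("wasb:///example/data/logs.parquet")
parquet_logs.registerTempTable("logs")
print(sqlContext.sql("SELECT COUNT(*) FROM logs").collect())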

 

Conclusion

Spark has made a lot of promises about performance improvements for big data. However, every workload is different, and there are a lot of variables to consider to get optimal performance on big data projects. Testing with your own code and your own data is your best friend. Don't overlook the Azure VM types and the data file formats in your testing; they just might show big performance improvements for your project.

Bill

Troubleshooting Oozie or other Hadoop errors with DEBUG logging


In troubleshooting Hadoop issues, we often need to review the logging of a specific Hadoop component. By default, the logging level is set to INFO or WARN for many Hadoop components like Oozie, Hive etc. and in many cases this level of logging is sufficient to trace the issue. However, in certain cases, INFO or WARN level logging may not be good enough to diagnose the issue, specifically when the error is a generic error, like this one- "JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses" and when the error call stack does not provide a clue. Recently I worked with a customer who ran into this error while submitting an Oozie job to an HDInsight Linux cluster. As the error indicates, this is some sort of a configuration issue –but since there are quite a few places we can specify the same configuration, it is not always obvious where the misconfiguration is and it may sometimes become a tedious trial and error exercise. In this blog, I wanted to share an example of how DEBUG logging helped our troubleshooting and the steps to enable DEBUG logging for Oozie (we can use the same steps for other Hadoop components as well) on a Hadoop cluster. I used an HDInsight 3.2 Linux cluster, but these steps should apply to any Hadoop (2.x or later) cluster with Ambari 2.x.

 The Issue and initial troubleshooting:

The customer attempted to submit an Oozie job running a Java action to an HDInsight 3.2 Linux cluster, and the Java action failed with the error below –

org.apache.oozie.action.ActionExecutorException: JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

.....

Caused by: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)

at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)

at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)

at org.apache.hadoop.mapred.JobClient.init(JobClient.java:485)

at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:464)

at org.apache.oozie.service.HadoopAccessorService$2.run(HadoopAccessorService.java:436)

at org.apache.oozie.service.HadoopAccessorService$2.run(HadoopAccessorService.java:434)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)

at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:434)

at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1178)

at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:927)

... 10 more

We reviewed the configurations specified in workflow.xml and Oozie REST payload (equivalent of job.properties or job.xml in Oozie/REST), but those appeared to be correct. We then decided to enable DEBUG level logging for Oozie.

 Enabling Oozie DEBUG logging on HDInsight Linux:

For Oozie (as for other Hadoop components), logging properties are specified in the oozie-log4j.properties file. On an HDInsight Linux cluster, the oozie-log4j.properties file is located under /etc/oozie/conf on headnode0. However, we can enable DEBUG logging via the Ambari web UI, which is very simple from the Ambari dashboard.

1. Log in to 'Ambari web' from the Azure portal

2. Select Oozie on the left hand side and then select the Configs tab –

3. Expand 'Advanced oozie-log4j' section and scroll to the section where we can see the logging levels defined-


Change the logging level to DEBUG for the logging type you are interested in. For example, I wanted more detailed logging in oozie-log, so I changed the following entries to DEBUG -

log4j.logger.org.apache.oozie=DEBUG, oozie, ETW

log4j.logger.org.apache.hadoop=DEBUG, oozie, ETW

After the change, click on Save to save the changes.

4. The Ambari web UI will then indicate that services need to be restarted due to the stale config; restart the Oozie service –

5. Now that DEBUG level logging is in place, we can reproduce the issue and collect the logs. On an HDInsight Linux cluster, we can get oozie-log via the Oozie Web UI by selecting a job and then clicking the 'Get logs' button, as shown below. To access the Oozie Web UI, you will need to enable SSH tunneling, as shown in this blog.

Alternatively, we can get all the Oozie logs from the directory /var/log/oozie.

Reviewing Debug Log

In my example, oozie-log (with DEBUG logging) had the following relevant entry just before the error we were troubleshooting-

2015-08-18 15:15:39,972 INFO Cluster:113 - SERVER[headnode0.MyCluster-ssh.c2.internal.cloudapp.net] Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: java.lang.reflect.InvocationTargetException

2015-08-18 15:15:39,973 DEBUG UserGroupInformation:1632 - SERVER[headnode0.MyCluster-ssh.c2.internal.cloudapp.net] PrivilegedActionException as:oozie (auth:PROXY) via oozie (auth:SIMPLE) cause:java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

We also noted the following entry in the debug log -

2015-08-18 15:14:25,754 DEBUG AzureNativeFileSystemStore:1915 - SERVER[headnode0.myCluster-ssh.c2.internal.cloudapp.net] Found blob as a directory-using this file under it to infer its properties http://mystoragescct.blob.core.windows.net/mycontainername/user/admin/qemp-workflow/config-default.xml

This made us look into config-default.xml, located under 'qemp-workflow', which was the same as the workflow application path. Review of config-default.xml revealed that we had an incorrect JobTracker and NameNode specified in there. Looking back, we may wonder why we didn't look at config-default.xml in the first place, but that's the whole point :) there are quite a few places where configurations can be present, and the DEBUG logging helped reveal a config file that we were not initially aware of.

In another flavor of this error, DEBUG logging helped point to the root cause –

2015-08-17 15:02:06,817 DEBUG UserGroupInformation:1632 - SERVER[headnode0.AzimHdiLinuxAux-ssh.j10.internal.cloudapp.net] PrivilegedActionException as:admin (auth:PROXY) via oozie (auth:SIMPLE) cause:org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: wasbs

 Reverting back to previous configuration-

The nice thing about the Ambari web UI is that it stores multiple versions of the configs, which enables us to compare a given version with the current version and then revert to one of the previous versions, as shown below –


So, after we are done with troubleshooting, we can revert to INFO level logging by selecting the previous version and making it current. This requires a restart of the Oozie service, as prompted by Ambari web.

NOTE: We strongly recommend that you revert the logging level to the default once you are done, since DEBUG level logging is very verbose and may cause the disk to run out of space.

The Takeaways-

  1. My example was for Oozie, but using the steps above you can enable DEBUG logging for other Hadoop components (such as Hive, Pig, etc., whose log4j settings can be changed via Ambari) for troubleshooting when the default level of logging doesn't help.
  2. The above steps are not limited to changing the logging level - using the Ambari web UI, you can change various configurations of Hadoop components for quick testing and then revert them back.

 

Enabling Oozie DEBUG logging on HDInsight Windows:

I wanted to quickly touch on how we can enable DEBUG logging on an HDInsight Windows cluster. On HDInsight Windows we don't have Ambari 2.x or the Ambari web UI, so the steps are different. On an HDInsight Windows cluster, the oozie-log4j.properties file is located under the folder %OOZIE_ROOT%\conf, for example 'C:\apps\dist\oozie-4.1.0.2.2.7.1-0004\oozie-win-distro\conf'.

1. On the active headnode, go to the folder %OOZIE_ROOT%\conf or 'C:\apps\dist\oozie-4.1.0.2.2.7.1-0004\oozie-win-distro\conf' (the Oozie version will vary).

2. Manually edit the oozie-log4j.properties file, for example -

log4j.logger.org.apache.oozie=DEBUG, oozie, ETW

log4j.logger.org.apache.hadoop=DEBUG, oozie, ETW

3. Go to start -> Run -> Services.msc and restart the Apache Hadoop OozieService

4. After troubleshooting is done, change the log4j logging back to the default level and restart the Oozie service again.

Happy troubleshooting :)

Azim
