How to use parameter substitution with Pig Latin and PowerShell
When running Pig in a production environment, you'll likely have one or more Pig Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to locate their input data based on...
How to use HBase Java API with HDInsight HBase cluster, part 1
Recently we worked with a customer who was trying to use the HBase Java API to interact with an HDInsight HBase cluster. Having worked with the customer and trying to follow our existing documentation...
Some Commonly Used YARN Memory Settings
We were recently working on an out-of-memory issue that was occurring with certain workloads on HDInsight clusters. I thought it might be a good time to write on this topic based on all the current...
Loading data in HBase Tables on HDInsight using built-in ImportTsv utility
Apache HBase can give random access to very large tables: billions of rows by millions of columns. But the question is, how do you upload that kind of data into the HBase tables in the first place? HBase...
Problems When Using a Shared Default Storage Container with Multiple...
We have seen several cases come in to Microsoft Support that ended up being caused by having multiple HDInsight clusters using the same Azure Blob Storage container for default storage. While we don't...
Azure PowerShell 0.8.14 Released, fixes problems with pipelining HDInsight...
We recently pushed out the 0.8.14 release of Azure PowerShell. This release includes some updates to the following cmdlets to ensure that values passed in via the PowerShell pipeline, or via the...
Sqoop Job Performance Tuning in HDInsight (Hadoop)
Overview: Apache Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. HDInsight is a Hadoop cluster deployed in Microsoft...
Understanding HDInsight Custom Node VM Sizes
With the 02/18/2015 update to HDInsight and Azure PowerShell 0.8.14, we introduced many more options for configuring custom Head Node VM size as well as Data Node VM size and ZooKeeper VM size. Some...
Why are the Hadoop services disabled on my HDInsight cluster
I came across this question while working with a few customers recently and thought I would share a few tips with others who may find it helpful. There are times when we may need to check the status of...
How to install Splunk on HDInsight with a custom action script
Recently I worked with a customer who wanted to use Splunk Enterprise and Splunk Forwarder to monitor and manage their HDInsight Storm cluster. You can learn more about Splunk at...
How to access Hive using JDBC on HDInsight
While following up on a customer question recently on this topic, I realized that we have seen the same question coming up from other users a few times and thought I would share a simple example here...
Spark on Azure HDInsight is available
Spark on Azure HDInsight (public preview) is now available! The following components are included as part of a Spark cluster on Azure HDInsight. Spark 1.3.1: comes with Spark Core, Spark SQL, Spark...
Azure Data Factory JSON Changes in July 2015
Azure Data Factory factories are designed with a series of fairly simple JSON documents and uploaded to Azure using either the web interface, PowerShell, .NET, or Visual Studio. If you were using the...
Spark or Hadoop
Spark is the most active Apache project and gets a lot of press in the big data world. So how do you know if Spark is right for your project, and what is the difference between Spark and Hadoop...
Using cross/outer apply in Azure Stream Analytics
Recently I worked on a problem where JSON data events contain an array of values. The goal was to read and process the entire JSON data event, including the array and the nested values...
Why is my Spark application running out of disk space?
In your Zeppelin notebook you have Scala code that loads Snappy-compressed Parquet data from two folders. You use SparkSQL to register one table named shutdown and another named census....
How to Access HDInsight Linux Web UIs using SSH Dynamic Tunneling
Scenario: One of the most important features of Azure HDInsight Linux (currently in preview) is the feature available on the portal called Ambari Web. If you open up Azure Portal and select your HDI...
Troubleshooting Hive query performance in HDInsight Hadoop cluster
One of the common support requests we get from customers using Apache Hive is: my Hive query is running slow and I would like the job/query to complete much faster, or, in more quantifiable terms, my...
Some things to consider for your Spark on HDInsight workload
When it comes time to provision your Spark cluster on HDInsight, we all want our workloads to execute fast. The Spark community has made some strong claims for better performance compared to MapReduce...
Troubleshooting Oozie or other Hadoop errors with DEBUG logging
In troubleshooting Hadoop issues, we often need to review the logging of a specific Hadoop component. By default, the logging level is set to INFO or WARN for many Hadoop components like Oozie, Hive...