
Collecting logs from Apache Storm cluster in HDInsight

When an Apache Storm topology runs on a multi-node Storm cluster, the different components of the topology log to different files that are saved on different nodes in the cluster, depending on where each component is running. In this blog I will discuss the log files that are available in a Storm cluster, and where and how you can collect them.


In a storm cluster we mainly have the following three types of logs.

  1. Nimbus log
  2. Supervisor logs
  3. Worker process logs

Nimbus Log:

Nimbus runs on the head node, and the Nimbus logs are present on the active head node in the logs folder under storm-home, at C:\apps\dist\storm-<version>\logs. To get the Nimbus log you need to connect or RDP to the head node and then copy the file.


Supervisor logs:

Each worker node runs a supervisor daemon that handles the work assignments from the Nimbus. Therefore each worker node has its own set of supervisor logs. Supervisor logs are present in the logs folder under storm-home on the worker nodes. Supervisor logs are saved as a chain of 102 MB files, so you may see multiple supervisor log files.


To collect the supervisor logs you need to RDP to each worker node. Once you RDP to the head node, from there you can RDP to the worker nodes by specifying workernode0, workernode1, etc. When collecting the logs, if you are not sure which node you are on, you can always run "hostname" from a command line to find out.


Worker process logs

Worker processes are JVM processes, and they write to worker logs that are stored on the worker nodes in the same location as the supervisor logs. Worker processes run on predefined ports on the worker nodes. A worker process belongs to a specific topology, and all the executors (threads) and tasks within the same worker process log to the same file. Like the supervisor logs, worker process logs are saved as a chain of 102 MB files, so you may see multiple worker process log files for the same port.


You can RDP to the worker nodes and collect the worker process logs. However, there are more convenient ways to view and download worker process logs: you can view and download them from a) the Storm UI through the Azure portal and b) Visual Studio.

Viewing and downloading Worker process logs from Storm UI

After selecting your HDInsight Storm cluster in the Azure portal click the "Storm Dashboard" button at the bottom to open up the Storm Dashboard.


Now click "Storm UI" link at the top to open up Storm UI from Storm dashboard.


Once you are at the Storm UI page under the "Topology summary" click the topology for which you want to view or download the worker process logs.

It should take you to the details page for that topology. Now click a spout or bolt for which you want to view the logs under the "Spouts (All time)" and "Bolts (All time)" sections. It should take you to the details page for that spout or bolt.



 

Now under the "Executor (All time)" section you should see a line for each executor (thread) and a port number with the hyper link.


Click the hyperlinked port number and it should open up the log file for the worker process that was running on that port. In the above screenshot, if you click port 6703 on the first line it will open up the worker process log on workernode0 that was running on port 6703. You can download the whole file by clicking the download link.


Viewing and downloading Worker process logs from Visual Studio

There is a similar option to view and download the worker process logs from Visual Studio. If you haven't tried to create a Storm topology in Visual Studio yet, check this Azure document. In Visual Studio, connect to your Azure account in Server Explorer and then navigate to your Storm cluster under HDInsight. Right-click the cluster name and select "View Storm Topologies".


It should show you the list of the topologies that are running on your storm cluster.


Select any of the topologies for which you want to view or download the worker process logs. You should see the exact same view as in the Storm UI, and from there you should be able to navigate to the worker process logs to view or download them by following the same steps we detailed in the earlier section.


Conclusion

To collect the Storm worker process logs you do not need to connect to each worker node; rather, you can easily view or download them from the Storm UI or Visual Studio. However, for the Nimbus and supervisor logs you still need to connect or RDP to the respective nodes.

Hopefully this blog gave you some idea about the different logs that you can collect from a Storm cluster, which Storm component logs to which log file, where they reside, and how to collect them.


How to configure Hortonworks HDP to access Azure Windows Storage

 

Recently I was asked how to configure a Hortonworks HDP 2.3 cluster to access Azure Windows Storage. In this post we will go through the steps to accomplish this.

The first step is to create an Azure Storage account from the Azure portal. My storage account is named clouddatalake. I chose the "locally redundant" replication option while creating the storage account. Under the "Manage Access Keys" button at the bottom of the screen you can copy and/or regenerate your access keys. You will need the account name and access key to configure your HDP cluster in later steps.

 


 

Next I created a private container named mydata. That's all you need to do on the Azure side. Everything else is done on your Hortonworks HDP cluster.

 


 

Hortonworks HDP 2.3 comes with the azure-storage-2.2.0.jar which is located at C:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib. You need to add a property to your core-site.xml file which is located at C:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop. You need to modify the name and value to match your Azure storage account. Replace the clouddatalake below with your storage account name and the value with your access key which you can copy from the Azure portal under the "Manage Access Keys" button. Save the core-site.xml file.

<property>

<name>fs.azure.account.key.clouddatalake.blob.core.windows.net</name>

<value>n7GJ2twVyr+Ckpko7MkA4uRWJc/8A/eWFztZvUVPorF4ZiLNeAe0IabudXpuxfFtj9czt8GUFpyKgP4XRc6b7g==</value>

</property>

Next, restart your HDP services. This causes the namenode and resourcemanager services to re-read the core-site.xml file and pick up the configuration change. The URI syntax for Azure Storage is wasb://<container>@<storageaccountname>.blob.core.windows.net/<foldername>/<filename>. You can then use hadoop fs -ls wasb://mydata@clouddatalake.blob.core.windows.net/ to list the files in the container. I also used the -mkdir option to create a folder1 in the mydata container of the clouddatalake storage account, as shown below.
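For reference, the two commands look like this (using the container and storage account names from this example):

hadoop fs -ls wasb://mydata@clouddatalake.blob.core.windows.net/

hadoop fs -mkdir wasb://mydata@clouddatalake.blob.core.windows.net/folder1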

 


 

Now you can use hadoop distcp <src> <dst> to copy files between your local HDFS and Azure Storage. The command I used was hadoop distcp /prod/forex/ wasb://mydata@clouddatalake.blob.core.windows.net/folder1/. This runs a MapReduce job to copy the files.

 


 

You can see that the Hadoop job completed successfully from the Hadoop Yarn Status UI.

 


 

And there are the files in Azure Storage!

 


 

Using Azure Storage to create a data lake is a great feature! This simple configuration change allows your Hortonworks HDP cluster to access Azure Storage.

 

Bill

 

 

 

 

 


Dealing with RequestRateTooLarge errors in Azure DocumentDB and testing performance

In Azure DocumentDB support, one of the most common errors we have seen as reported by our customers is RequestRateTooLargeException or HTTP Status code 429. For example, from an application using DocumentDB .Net SDK, we may see an error like this –

System.AggregateException: One or more errors occurred. ---> Microsoft.Azure.Documents.DocumentClientException: Exception: Microsoft.Azure.Documents.RequestRateTooLargeException, message: {"Errors":["Request rate is large"]}, request URI: rntbd://xx.xx.xx.xx:xxxx/apps/1240113d-9858-49b9-90cb-1219f9e1df77/services/04a5d10f-f937-40b1-9c70-d12d7f30cd51/partitions/8557b0cc-e4b7-4fb0-8bdd-406bc987e4cb/replicas/130729572321838873p ActivityId: f6371eb8-ce57-4683-bdc4-97a16aa3fe35

What we have seen is that the error can cause some confusion; in other words, it may not always be obvious why you are running into it. In this blog, we will try to clarify the error and share some tips and steps you can take to deal with it in your application.

What does the error 'RequestRateTooLarge' mean?

The error is by design: it means that an application is sending requests to the DocumentDB service at a rate higher than the 'reserved throughput' level of its collection tier. You may remember from our documentation that Azure DocumentDB has 3 collection tiers – S1, S2 and S3 – each with its own 'Reserved Throughput' in Request Units per second. For example, an S3 collection has 2500 Request Units/sec and an S1 collection has 250 Request Units/sec as reserved throughput. So, if you have an application using an S1 collection and the application sends requests at a rate of more than 250 Request Units/sec, you will run into this error. This is explained nicely in this blog:

"When a client attempts to exceed the reserved throughput for an account, there will be no performance degradation at the server and no use of throughput capacity beyond the reserved level. The server will preemptively end the request with RequestRateTooLarge (HTTP status code 429) and return the x-ms-retry-after-ms header indicating the amount of time, in milliseconds, that the user must wait before reattempting the request.

HTTP Status 429,

Status Line: RequestRateTooLarge

x-ms-retry-after-ms :100

"

You can verify your collection Tier from the Azure Portal, like below -


Dealing with 'RequestRateTooLarge' errors:

Measure RequestCharge:

When troubleshooting RequestRateTooLarge errors, a good place to start is to measure the overhead of the operations (create, read, update or delete) that are likely resulting in the error, and examine the x-ms-request-charge header (or the equivalent RequestCharge property in ResourceResponse<T> or FeedResponse<T> in the .NET SDK) to measure the number of Request Units (RUs) consumed by these operations. Here is an example -
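The following is a minimal sketch of measuring the request charge with the .NET SDK (the Student class and the "dbs/mydb/colls/students" collection link below are placeholders for illustration, not the original sample):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public class RequestChargeSample
{
    // Placeholder document type for this sketch
    public class Student { public string id { get; set; } public string name { get; set; } }

    public static async Task MeasureInsertChargeAsync(DocumentClient client)
    {
        var student = new Student { id = Guid.NewGuid().ToString(), name = "Test Student" };

        // Create the document; the collection link is a placeholder
        ResourceResponse<Document> response =
            await client.CreateDocumentAsync("dbs/mydb/colls/students", student);

        // RequestCharge is the SDK's view of the x-ms-request-charge response header
        Console.WriteLine("Insert Operation, # of RUs: {0}", response.RequestCharge);
    }
}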

 In my case, I am using a Student Document and the output is something like this -

Insert Operation, # of RUs: 9.14

So, if I were using an S1 collection (with 250 Request Units/sec as reserved throughput), I can expect to insert roughly 27 (250/9.14) Student documents per second, and if my application starts sending insert requests at a higher rate, I would expect to run into the RequestRateTooLarge error. You can also leverage the Azure DocumentDB Studio tool to inspect the x-ms-request-charge response header and do other useful testing.

Retry Operations:

If you are using .Net SDK with LINQ, it automatically retries the operation internally when faced with an http 429 (default retry is set to 3, as of today), as explained in this blog. But there are scenarios where default retry behavior from the SDK may not be sufficient – in such cases, the application can handle the RequestRateTooLargeException (http status code 429) and retry the request based on the 'x-ms-retry-after-ms' header to improve resiliency of the application. Here is an example of retry using the .Net SDK-
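Since the original snippet is not reproduced here, below is a minimal sketch of the idea: a hypothetical InsertWithRetryAsync helper that assumes the SDK's DocumentClientException surfaces the suggested wait time via its RetryAfter property.

using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class RetryHelper
{
    // Retry a create on HTTP 429, waiting for the interval the service suggests
    public static async Task InsertWithRetryAsync(DocumentClient client, string collectionLink, object doc)
    {
        while (true)
        {
            try
            {
                await client.CreateDocumentAsync(collectionLink, doc);
                return;
            }
            catch (DocumentClientException ex)
            {
                if ((int?)ex.StatusCode != 429) throw;
                // Honor the x-ms-retry-after-ms value surfaced by the SDK before retrying
                await Task.Delay(ex.RetryAfter);
            }
        }
    }
}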

Here is a similar example using node.js

Move to higher Collection Tier:

After you have implemented a retry policy like the example above, monitor your application to see how frequently you are running into http 429 errors (being handled by your retry policy) and whether the retries are creating significant latency in your application. If your application is constantly exceeding the reserved throughput of a collection tier, resulting in a large number of http 429 errors and significant latency, you may want to consider a higher collection tier and test your application. For example, if your application is currently using the S1 collection tier, try an S2 collection tier and so on.

Cache Collection or Database Id/self-links:

In certain cases, the RequestRateTooLarge errors may not be coming from CRUD operations on the documents within a collection – it may happen while querying 'Master Resources' such as querying to find if a collection or Database exists (via APIs such as CreateDatabaseQuery or CreateDocumentCollectionQuery in the .Net SDK). Here is an example

We had a recent case, where a customer was using a third party node.js SDK and we noted that the application was querying to check the existence of the same database and collection over and over, thereby running into this error. To avoid such scenarios, we recommend that you cache the database and collection id or self-link and re-use.

Follow Performance best practices:

While dealing with this error, the question about performance and throughput always comes into picture. Please ensure that you have reviewed part1 and part2 of the blog on performance tuning by our Product team. For example, as recommended in part1, ensure that you are using Tcp protocol and Direct Mode in the ConnectionPolicy when throughput is a priority. 

Partition Data into multiple collections:

If you are using the highest collection tier (S3, as of today) and have followed all the best practices including retry, but you are still getting a large number of RequestRateTooLarge errors and are seeing significant latency as a result of retries, this indicates that your application's throughput requirement is higher than what a single S3 collection can handle, and you should consider partitioning your data into multiple S3 collections, with each collection getting a maximum of 2500 Request Units/sec. For guidance on using partitioning with DocumentDB, please review our documentation here and here. Ideally this should happen during the design, prototyping and testing phase of your application.

Testing DocumentDB Performance and Scalability:

One other relevant question/comment we have seen from customers is – "I am not reaching the RU levels that an S3 "promises" and am getting responses that indicate RequestRateTooLarge when it should not be" or "how do I get the maximum throughput promised from an S3 collection?". To take full advantage of the maximum throughput offered by a collection tier, we may need some tweaking in the client application code, such as sending requests from multiple threads instead of a single thread. The "trick" is to push hard enough to get a small rate of throttles (RequestRateTooLarge errors) consistently, then back off from there.

To show this, I have written a GitHub sample that tests the performance of various operations like Insert, Read, BulkInsert, etc. In the sample, I have tried to measure DocumentDB performance by sending a large number of inserts/reads/bulk-inserts (I have tested with 1000 or 10000 - we can go higher) and measuring 'Average Request Units/sec' and 'ops/second' (such as inserts/sec or reads/sec) while increasing the number of threads (thereby pushing DocumentDB harder) until I found an optimal number of threads that resulted in the best performance.

I'm pretty sure there are better ways of coding this and measuring the average Request Units/sec, etc. - I am just sharing some ideas. Please feel free to leverage/tweak the sample with your own documents. Here are some sample results I got while testing insert performance with an S1 collection and the Student document I showed above (with 9.14 Request Units for each insert) -

With 4 threads:

8/23/2015 12:23:13 AM Result for Collection: Students in Tier: S1
Result Summary:
******************************
Number of Threads used: 4
1000 Documents Inserted in 44.294 seconds
Operations per Second: 22.5764211857136
Average Request Units: 206.348489637424 per second
 

With 8 Threads:

8/22/2015 11:46:55 PM Result for Collection: Students in Tier: S1
Result Summary:
******************************
Number of Threads used: 8
1000 Documents Inserted in 34.907 seconds
Insert Operations per Second: 28.6475492021658
Average Request Units: 261.838599707797 per second
 

With 16 Threads:

8/23/2015 12:10:41 AM Result for Collection: Students in Tier: S1
Result Summary:
******************************
Number of Threads used: 16
1000 Documents Inserted in 77.278 seconds
Operations per Second: 12.9402934858563
Average Request Units: 118.274282460727 per second
 

As you can see, with inserts being done on 4 threads I still hadn't reached the maximum RU/sec (250) for the S1 collection; with 8 threads I got the best RU/sec (even though there were a few RequestRateTooLarge errors). With 16 threads, I got too many RequestRateTooLarge errors, which impacted the latency and resulted in poor performance. I wouldn't focus on the numbers I got from my test results - they will vary with the specific documents being used and other factors such as the setup of the tests (for example, the application running on a workstation vs. an Azure VM in the same datacenter as the DocumentDB account) - but rather on the approach I am taking to testing performance.

I hope you find it helpful! Please send us your comments, questions or feedback :-)

-Azim


Understanding Spark’s SparkConf, SparkContext, SQLContext and HiveContext

 

The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark's cluster manager. In order to create a SparkContext you should first create a SparkConf. The SparkConf stores configuration parameters that your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your Spark driver application, and some are used by Spark to allocate resources on the cluster, such as the number of executors and the memory size and cores used by the executors running on the worker nodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI. You can review the Spark 1.3.1 documentation for SparkConf to get a complete list of parameters: SparkConf documentation.

 

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

 

Now that we have a SparkConf we can pass it into SparkContext so our driver application knows how to access the cluster.

 

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)

 

You can review the documentation for Spark 1.3.1 for SparkContext to get a complete list of parameters. SparkContext documentation

Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's resourcemanager (headnode) and nodemanager (workernode) will work to allocate a container for the executors. If the resources are available on the cluster, the executors will allocate memory and cores based on your configuration parameters. If you are using Spark's cluster manager, the SparkMaster (headnode) and SparkSlave (workernode) will be used to allocate the executors. Below is a diagram showing the relationship between the driver applications, the cluster resource manager and the executors.

 

 


 

Each Spark driver application has its own executors on the cluster which remain running as long as the Spark driver application has a SparkContext. The executors run user code, run computations and can cache data for your application. The SparkContext will create a job that is broken into stages. The stages are broken into tasks which are scheduled by the SparkContext on an executor.

One of Spark's modules is SparkSQL. SparkSQL can be used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL implements dataframes and a SQL query engine. SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of the SQLContext. Hortonworks and the Spark community suggest using the HiveContext. You can see below that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. You can review the Spark 1.3.1 documentation for SQLContext and HiveContext at SQLContext documentation and HiveContext documentation.
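In your own driver application (outside spark-shell), a minimal sketch of creating the HiveContext yourself, in the Spark 1.3.1 style used above, could look like this (the SHOW TABLES query is just an illustration):

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("MySparkDriverApp")

val sc = new SparkContext(conf)

// HiveContext is a superset of SQLContext and can run both SQL and HiveQL

val sqlContext = new HiveContext(sc)

sqlContext.sql("SHOW TABLES").collect().foreach(println)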

 


 

HDInsight provides multiple end user experiences to interact with your cluster. It has a Spark Job Server (HDInsight spark job server) that allows you to remotely copy and submit your jar to the cluster. This submission will run your driver application. In your jar you have to create the SparkConf, SparkContext and HiveContext yourself. HDInsight also provides a notebook experience with Jupyter and Zeppelin. The Zeppelin notebook automatically creates the SparkContext and HiveContext for you; for Jupyter you must create them yourself. You can read more about Spark on HDInsight at HDInsight Spark Overview and HDInsight Spark Resource Management.

 

I hope understanding the relationship between SparkConf, SparkContext, SQLContext, and HiveContext and how a Spark driver application uses them will make your Spark on HDInsight project a successful one!

 


 

Bill

 

 

 


A KMeans example for Spark MLlib on HDInsight

 

Today we will take a look at Spark's MLlib module, its built-in machine learning library (Spark MLlib Guide). KMeans is a popular clustering method. Clustering methods are used when there is no class to be predicted but instances are to be divided into groups or clusters. The clusters hopefully represent some mechanism at play that draws the instances to a particular cluster. The instances assigned to a cluster should have a strong resemblance to each other. A typical use case for KMeans is segmentation of data. For example, suppose you are studying heart disease and you have a theory that individuals with heart disease are overweight. You have collected data from individuals with and without heart disease, along with measurements of their weight such as body mass index, waist-to-hip ratio, skinfold thickness, and actual weight. KMeans is used to cluster the data into groups for further analysis and to test the theory. You can find out more about KMeans on Wikipedia (Wikipedia KMeans).

 

The data that we are going to use in today's example is stock market data with the ConnorsRSI indicator. You can learn more about ConnorsRSI at ConnorsRSI. Below is a sample of the data. ConnorsRSI is a composite indicator made up from RSI_CLOSE_3, PERCENT_RANK_100, and RSI_STREAK_2. We will use these attributes as well as the actual ConnorsRSI (CRSI) and RSI2 to pass into our KMeans algorithm. The calculation of this data is already normalized from 0 to 100. The other columns like ID, LABEL, RTN5, FIVE_DAY_GL, and CLOSE we will use to do further analysis once we cluster the instances. They will not be passed into the KMeans algorithm.

 

Sample Data (CSV): 1988 instances of SPY

ID – Used to uniquely identify the instance (date:symbol).
LABEL – Whether the close was up or down from the previous day's close.
RTN5 – The return from the past 5 days.
FIVE_DAY_GL – The return from the next 5 days.
CLOSE – The closing price.
RSI2 – Relative Strength Index (2 days).
RSI_CLOSE_3 – Relative Strength Index (3 days).
RSI_STREAK_2 – Relative Strength Index (2 days) for streak durations based on the closing price.
PERCENT_RANK_100 – The percentage rank value over the last 100 days. This is a rank that compares today's return to the last 100 returns.
CRSI – The ConnorsRSI indicator: (RSI_CLOSE_3 + RSI_STREAK_2 + PERCENT_RANK) / 3.

 

ID,LABEL,RTN5,FIVE_DAY_GL,CLOSE,RSI2,RSI_CLOSE_3,PERCENT_RANK_100,RSI_STREAK_2,CRSI
2015-09-16:SPY,UP,2.76708,-3.28704,200.18,91.5775,81.572,84,73.2035,79.5918
2015-09-15:SPY,UP,0.521704,-2.29265,198.46,83.4467,72.9477,92,60.6273,75.1917
2015-09-14:SPY,DN,1.77579,0.22958,196.01,47.0239,51.3076,31,25.807,36.0382
2015-09-11:SPY,UP,0.60854,-0.65569,196.74,69.9559,61.0005,76,76.643,71.2145
2015-09-10:SPY,UP,0.225168,1.98111,195.85,57.2462,53.9258,79,65.2266,66.0508
2015-09-09:SPY,DN,1.5748,2.76708,194.79,42.8488,46.1728,7,31.9797,28.3842
2015-09-08:SPY,UP,-0.12141,0.521704,197.43,73.7949,64.0751,98,61.2696,74.4483
2015-09-04:SPY,DN,-3.35709,1.77579,192.59,22.4626,31.7166,6,28.549,22.0886

 

The KMeans algorithm needs to be told how many clusters (K) the instances should be grouped into. For our example let's start with two clusters to see if they have a relationship to the label, "UP" or "DN". The Apache Spark scala documentation has the details on all the methods for KMeans and KMeansModel at KMeansModel

 

Below is the scala code which you can run in a zeppelin notebook or spark-shell on your HDInsight cluster with Spark. HDInsight

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.clustering.KMeans

import org.apache.spark.sql.functions._

 

// load file and remove header

val data = sc.textFile("wasb:///data/spykmeans.csv")

val header = data.first

val rows = data.filter(l => l != header)

 

// define case class

case class CC1(ID: String, LABEL: String, RTN5: Double, FIVE_DAY_GL: Double, CLOSE: Double, RSI2: Double, RSI_CLOSE_3: Double, PERCENT_RANK_100: Double, RSI_STREAK_2: Double, CRSI: Double)

 

// comma separator split

val allSplit = rows.map(line => line.split(","))

 

// map parts to case class

val allData = allSplit.map( p => CC1( p(0).toString, p(1).toString, p(2).trim.toDouble, p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble, p(6).trim.toDouble, p(7).trim.toDouble, p(8).trim.toDouble, p(9).trim.toDouble))

 

// convert rdd to dataframe

val allDF = allData.toDF()

 

// convert back to rdd and cache the data

val rowsRDD = allDF.rdd.map(r => (r.getString(0), r.getString(1), r.getDouble(2), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))

rowsRDD.cache()

 

// convert data to RDD which will be passed to KMeans and cache the data. We are passing in RSI2, RSI_CLOSE_3, PERCENT_RANK_100, RSI_STREAK_2 and CRSI to KMeans. These are the attributes we want to use to assign the instance to a cluster

val vectors = allDF.rdd.map(r => Vectors.dense( r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))

vectors.cache()

 

//KMeans model with 2 clusters and 20 iterations

val kMeansModel = KMeans.train(vectors, 2, 20)

 

//Print the center of each cluster

kMeansModel.clusterCenters.foreach(println)

 

// Get the prediction from the model with the ID so we can link them back to other information

val predictions = rowsRDD.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._6, r._7, r._8, r._9, r._10) ))}

 

// convert the rdd to a dataframe

val predDF = predictions.toDF("ID", "CLUSTER")

 

The code imports some methods for Vector, KMeans and SQL that we need. It then loads the .csv file from disk and removes the header that has our column descriptions. We then define a case class, split the columns by comma and map the data into the case class. We then convert the RDD into a dataframe. Next we map the dataframe back to an RDD and cache the data. We then create an RDD for the 5 columns we want to pass to the KMeans algorithm and cache the data. We want the RDD cached because KMeans is a very iterative algorithm; the caching helps speed up performance. We then create the kMeansModel, passing in the vector RDD that has our attributes and specifying that we want two clusters and 20 iterations. We then print out the centers for all the clusters. Now that the model is created, we get our predictions for the clusters with an ID so that we can uniquely identify each instance with the cluster it was assigned to. We then convert this back to a dataframe to analyze.

 

Below is a subset of the allDF dataframe with our data.


 

Below is a subset of our predDF dataframe with the ID and the CLUSTER. We now have a unique identifier and the cluster the KMeans algorithm assigned it to. Also displayed are the means of the attributes passed into the KMeans algorithm for each cluster (Cluster 0 and Cluster 1). You can see that the means are very close within each cluster: for Cluster 0 they are around 27, and for Cluster 1 around 71.

 


 

Because the allDF and predDF dataframes have a common column we can join them and do more analysis.

 

// join the dataframes on ID (spark 1.4.1)

val t = allDF.join(predDF, "ID")

 

Now we have all of our data combined with the CLUSTER that the KMeans algorithm assigned each instance to and we can continue our investigation.


 

Let's display a subset of each cluster. It looks like cluster 0 is mostly DN labels and has attributes averaging around 27 like the centers of the clusters indicated. Cluster 1 is mostly UP labels and the attributes average is around 71.

 

// review a subset of each cluster

t.filter("CLUSTER = 0").show()

t.filter("CLUSTER = 1").show()

 


 


 

Let's get descriptive statistics on each of our clusters. This is for all the instances in each cluster and not just a subset. This gives us the count, mean, stddev, min, max for all numeric values in the dataframe. We filter each by CLUSTER.

 

// get descriptive statistics for each cluster

t.filter("CLUSTER = 0").describe().show()

t.filter("CLUSTER = 1").describe().show()

 


 

So what can we infer from the output of our KMeans clusters?

  • Cluster 0 has lower ConnorsRSI (CRSI), with a mean of 27. Cluster 1 has higher CRSI, with a mean of 71. Could these be areas to initiate buy and sells signals?
  • Cluster 0 has mostly DN labels, and Cluster 1 has mostly UP labels.
  • Cluster 0 has a mean of .28 % gain five days later, while cluster 1 has a loss of .03 five days later.
  • Cluster 0 has a mean loss of 1.22% five days before and cluster 1 has a gain of 1.15% five days before. Does this suggest markets revert to their mean?
  • Both clusters have min/max 5-day returns ranging from a gain of 19.40% to a loss of 19.79%.

     

This is just the tip of the iceberg with further questions, but it gives an example of using HDInsight and Spark to start your own KMeans analysis. Spark MLlib has many algorithms to explore, including SVMs, logistic regression, linear regression, naïve Bayes, decision trees, random forests, basic statistics, and more. The implementation of these algorithms in Spark MLlib is for distributed clusters, so you can do machine learning on big data. Next I think I'll run the analysis on all data for the AMEX, NASDAQ and NYSE stock exchanges and see if the pattern holds up!

 


 

Bill


Using Azure SDK for Python

 

Python is a great scripting tool with a large user base. In a recent support case I needed a way to constantly generate files with some random data in Windows Azure Storage (wasb) in order to process them with Spark on HDInsight. Python, the Azure SDK for Python and a few lines of code did the trick. You can install the SDK from Azure SDK for Python. There is also a helpful article on how to use Azure blob storage from Python at Azure Blob Storage from Python. You could even incorporate this into a bigger pipeline, like scheduling the Python script to run with Windows Task Scheduler or Linux cron to collect data from a server and upload it to Azure Storage for analysis with HDInsight.

 

The script has only a few basic sections.

Module-level code – Imports for packages, declares a blob_service object, and sets the container and subfolder variables.

run() – The main driver function. Sets configuration parameters such as how many iterations (files) to create, the interval to wait between creating files, and the local temporary directory in which to create the file that is uploaded to Azure Storage. Contains the main loop that calls createLocalFile(), uploadFileToWasb() and cleanupWasbFolder().

createLocalFile() – Creates a file, writing a line of comma-separated data for each record. This file will be uploaded to Azure Storage. You could modify this to create any file format you want.

uploadFileToWasb() – Uploads a file to a container and subfolder in Azure Storage using the BlobService.

cleanupWasbFolder() – Deletes files from Azure Storage so that only 10 files are kept in the folder.

uniform() – A simple random number generator, used to randomly generate a value from 0 to 100 which is placed in each record of the uploaded file.

main() – Entry point for the script.

 

The script.

 

import datetime

import time

import urllib2

import json

import random

from azure.storage.blob import BlobService

 

# replace with your storage account name and key

blob_service = BlobService(account_name='mystorage_account', account_key='mystorage_account_key')

 

# container and folder must already exist

container = "data"

subfolder = "/test"

 

def run():
    "Driver function"
    print("Running GenWasbData.py")
    now = datetime.datetime.now()
    print
    print "Starting Application: " + now.isoformat()

    # default configuration parameters
    iterations = 100
    interval = 5
    number_of_records = 200
    local_temp_dir = "c:\\Applications\\temp\\"

    # loop
    count = 1
    while (count <= iterations):
        print "Processing: " + str(count) + " of " + str(iterations)
        count = count + 1
        createLocalFile(local_temp_dir + "file.txt", number_of_records)
        uploadFileToWasb(local_temp_dir + "file.txt", count - 1)
        cleanupWasbFolder(count - 1)
        time.sleep(interval)

def createLocalFile(fn, number_of_records):
    "Create a local file to upload to wasb"
    now = datetime.datetime.now()
    filename = fn
    target = open(filename, 'w')  ## 'a' will append, 'w' will over-write
    count = 1
    while (count <= number_of_records):
        line = str(count) + "," + "Device" + str(count) + "," + str(now) + "," + str(uniform(0, 100))
        target.write(line)
        target.write("\n")
        count = count + 1
    target.close()

def uploadFileToWasb(fn, count):
    "Upload file to wasb"
    new_filename = "file" + str(count) + ".txt"
    blob_service.put_block_blob_from_path(container + subfolder, new_filename, fn)
    return

def cleanupWasbFolder(count):
    "Remove files from wasb"
    if (count >= 11):
        blob_service.delete_blob(container + subfolder, "file" + str(count - 10) + ".txt")
    return

def uniform(a, b):
    "Get a random number in the range (a, b)"
    return round(random.uniform(a, b), 4)  ## four decimal places

if __name__ == '__main__':
    run()

 

 

 

Hope this helps you incorporate Python and the Azure SDK for Python in your next project.

 


 

Bill


Multi-Stream support in SCP.NET Storm Topology

Streams are at the core of Apache Storm. In most cases topologies are based on a single input stream; however, there are situations where one may need to start the topology with two or more input streams.

SCP supports user code that emits to or receives from distinct streams at the same time. To support multiple streams, the Emit method of the Context object takes an optional stream ID parameter.

SCP.NET provides .NET C# programmability against Apache Storm on Azure HDInsight clusters. Since Microsoft.SCP.Net.SDK version 0.9.4.283, multi-stream is supported in SCP.NET. Two methods have been added to the SCP.NET Context object; they are used to emit a tuple or tuples to a specified StreamId. The StreamId is a string, and it needs to be consistent in both the C# code and the Topology Definition Spec. Emitting to a non-existent stream will cause runtime exceptions.

/* Emit tuple to the specific stream. */

public abstract void Emit(string streamId, List<object> values);

/* for non-transactional Spout only */

public abstract void Emit(string streamId, List<object> values, long seqId);
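For illustration, a spout's NextTuple might then emit to its named stream like this (a minimal sketch based on the signatures above; ctx is the Context instance handed to the spout, and the sentence value is just an example):

public void NextTuple(Dictionary<string, Object> parms)

{

    string sentence = "Hello from the sentence stream";

    // Emit the tuple to the stream declared for this spout in the topology definition

    this.ctx.Emit(SentenceGenerator.STREAM_ID, new List<object>() { sentence });

}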

 

There is a sample SCP.NET Storm topology that uses multi-stream on GitHub. You can download the hdinsight-storm-examples repository from GitHub and navigate to the HelloWorldHostModeMultiSpout sample under the SCPNetExamples folder.

It is pretty straightforward. Below I am copying the code from the Program.cs file of the sample, which shows how to define the topology using the TopologyBuilder when you have multiple spouts.

// Use TopologyBuilder to define a Non-Tx topology

// And define each spouts/bolts one by one

TopologyBuilder topologyBuilder = new TopologyBuilder("HelloWorldHostModeMultiSpout");

// Set a User customized config (SentenceGenerator.config) for the SentenceGenerator

topologyBuilder.SetSpout(

"SentenceGenerator",

SentenceGenerator.Get,

new Dictionary<string, List<string>>()

{

{SentenceGenerator.STREAM_ID, new List<string>(){"sentence"}}

},

1,

"SentenceGenerator.config");

 

topologyBuilder.SetSpout(

"PersonGenerator",

PersonGenerator.Get,

new Dictionary<string, List<string>>()

{

{PersonGenerator.STREAM_ID, new List<string>(){"person"}}

},

1);

 

topologyBuilder.SetBolt(

"displayer",

Displayer.Get,

new Dictionary<string, List<string>>(),

1)

.shuffleGrouping("SentenceGenerator", SentenceGenerator.STREAM_ID)

.shuffleGrouping("PersonGenerator", PersonGenerator.STREAM_ID);

To test the topology, open the HelloWorldHostModeMultiSpout.sln file in Visual Studio and then deploy. For this sample topology I didn't have to make any changes to deploy in my cluster. To deploy in Solution Explorer, right-click the project HelloWorldHostModeMultiSpout, and select Submit to Storm on HDInsight.



You will see a pop-up window with dropdown list. Select your Storm on HDInsight cluster from the Storm Cluster drop-down list, and then select Submit. You can monitor whether the submission is successful by using the Output window.


Once the topology has been successfully submitted you will see it listed under your storm cluster in the Server Explorer from where we can view more information about the running topologies.

For more information on how to develop and deploy storm topologies in Visual Studio using SCP.NET please check Develop C# topologies for Apache Storm on HDInsight using Hadoop tools for Visual Studio and Deploy and manage Apache Storm topologies on Windows-based HDInsight Azure Documentation articles.


How to allow Spark to access Microsoft SQL Server

 

Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed. On Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information or to download the driver you can start here: Microsoft SQL Server JDBC.

Spark needs to know the path to the sqljdbc4.jar. There are multiple ways to add the path to Spark's classpath. Spark has two runtime environment properties that can do this spark.driver.extraClassPath and spark.executor.extraClassPath. To review all the properties available, see Spark's Configuration - Spark 1.4.1 Documentation.

If you use spark-shell or spark-submit you can pass these properties with --conf. I like to add the properties to Spark's default configuration file at /etc/spark/conf/spark-defaults.conf. A third option is to include the sqljdbc4.jar in your assembly jar. This same technique works for other jars that your Spark application might need. Whichever technique you choose, Spark needs to know where to find the sqljdbc4.jar for both the driver application and the executors.
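For example, the two entries in /etc/spark/conf/spark-defaults.conf might look like the following (using the HDInsight Linux jar path mentioned above; adjust the path for your cluster):

spark.driver.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar

spark.executor.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar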

You can check the environment tab in the Spark Properties section to verify the properties are set.


 

Spark's API is very dynamic and changes are being made with each new release, especially around JDBC. If you are going to use Spark with JDBC I would suggest reviewing Spark's API documentation for the version of Spark you are using (Spark 1.4.1 API) to make sure the methods are still valid and the same behavior exists. Depending on the release there are a few places to look for methods involving JDBC, which include SQLContext, DataFrame, and JdbcRDD. Also notice that some methods are marked experimental and/or deprecated. Make sure you test your code.

Some issues to consider are:

  • Make sure the firewall is open for port 1433.
  • If using Microsoft Azure SQL Database, tables require a primary key. Some of the methods create the table, but Spark's code does not create a primary key, so the table creation fails.

 

Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SQLContext is used to load the data from the t2 table into a DataFrame. I added the spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-defaults.conf file.

//Spark 1.4.1

//Insert data from DataFrame

case class Conf(mykey: String, myvalue: String)

val data = sc.parallelize( Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina") ))

val df = data.toDF()

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"

val table = "t2"

df.insertIntoJDBC(url, table, true)

 

//Load from database using SqlContext

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"

val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";

val tbl = { sqlContext.load("jdbc", Map( "url" -> url, "driver" -> driver, "dbtable" -> "t2", "partitionColumn" -> "mykey", "lowerBound" -> "0", "upperBound" -> "100", "numPartitions" -> "1" ))}

tbl.show()

 


 

If you run a Microsoft SQL Server profiler trace while running the spark-shell you can see the table being created, data inserted and then data being read.


 

HDInsight and Spark make a great platform to process and analyze your data, but often the data resides in a relational database system like Microsoft SQL Server. Allowing Spark to read and write data from Microsoft SQL Server allows you to create a richer pipeline.

 

Hope this helps,

 


Bill


Incremental data load from Azure Table Storage to Azure SQL using Azure Data Factory

 

Azure Data Factory is a cloud-based data integration service. The service not only helps to move data between cloud services but also helps to move data from/to on-premises systems. For example, moving data from Azure blob storage to Azure SQL, etc. You can find the supported data stores here.

Many business scenarios start with an initial data load, followed by continual incremental data loads, either hourly, daily or monthly. The focus of this post is to explain ADF creation and how incremental data can be loaded.

Prerequisite

  • Azure Subscription
  • Azure Table Storage which hosts the source data
  • Azure SQL DB which holds the table to store the data
  • PowerShell/.Net SDK if you want to create the ADF solution from those platforms. In this blog we'll be using the Azure portal, which doesn't require them.

 

Setting up the initial scenario

Every day new data lands in an Azure Table and needs to be copied to an Azure SQL Database on some periodic basis - daily, hourly or monthly depending on the schedule desired. We don't want to upload the entire table every time; rather, we just want to add the new data at the destination. In this example I will focus on the Employee data defined in the next section.

 

 

 

Azure Table storage

Here is the sample data that will be our Copy Activity source. In this example we will be using Azure Storage Explorer.


Here is the schema defined for records.


PartitionKey, RowKey and Timestamp are automatically included for every entity. The PartitionKey and RowKey values are the user/developer's responsibility to fill in, whereas the Timestamp value is managed by the server and can't be changed or modified. So if the schema doesn't have any property that defines when a record was added, the Timestamp property can be used. For more information about the Table service data model please click here.

As shown in the screenshot above, Azure Storage Explorer 6.0.0 doesn't show the Timestamp property.

Azure SQL Server

Below is the table definition defined in Azure SQL Database that will be our Copy Activity Sink (destination).


Azure Data Factory Solution

In this section, we will be creating an Azure Data Factory.

Creating ADF Service


  • Enter Azure Data Factory Name
  • Select Subscription
  • Select/Create new Resource group name


  • Select Pin to Dashboard
  • Click Create
  • Once the ADF is created, a tile will be added to the home screen.


  • Click Author and deploy


 

Creating Data Sources

Azure Storage

    • Click New data store


    • Select Azure storage
    • Provide storage account name and account key


    • Click Deploy

Azure SQL

    • Click New data store
    • Click Azure SQL


    • Provide Server name
    • Provide Initial Catalog (database name)
    • Provide user id and password


    • Click Deploy

Creating Dataset

 

Azure Table

    • Click New dataset


    • Select Azure table
    • Refer to the Datasets article to understand properties like Structure, Published, type, typeProperties, etc.


    • Click Deploy

Azure SQL

    • Click New dataset


    • Select Azure SQL


    • Click Deploy

Creating Pipeline

  • Click New pipeline


  • Click Add activity and select Copy activity 


  • Add the code for the data transformation. It will look like below once done.


  • Click Deploy. The diagram will look like below.


Once this pipeline is deployed, it will start processing the data based on the start and end time. In this blog we define the scheduler frequency as Day and the interval as 1, which means it will run once daily. We also define a start and end time for this pipeline, and as per this definition it will run only for one day. For more information on pipeline execution please refer to Scheduling and Execution with Data Factory.

Go back to the Data Factory which we created in the web portal. Click Pipelines and select the pipeline which we created just now.

 


On the summary page click Datasets, Consumed. Notice that the Pipeline Activities details show it executed once.


Go back and click Datasets, Produced. Notice the slice is in the "Ready" state, meaning data has been transferred. Sometimes it will show Pending Execution, which means it is waiting to execute; In Progress means it is still running. You may also notice an Error state in case there was any error while designing or executing the pipeline.


Checking Execution

You can use SQL Server Management Studio to confirm whether the data arrived. Run a SELECT query on the destination table to confirm.

 

Incremental load

Now that we have all the data loaded at the destination, the next step is to move incremental data on a daily basis rather than deleting and inserting the whole set of data. In this example, we will add some additional records with different dates and will insert data for a specific date. Below is the sample data added in Azure Table storage.


As an example, for the next date we'll move data only for 9-January-2016. The first change we will make in the pipeline is to fetch only the specific data from the source. To do this, we'll add the line below to the azureTableSourceQuery JSON property of the Data Factory pipeline.

"azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)" 

The above query filters records based on the slice start time using the RecordAddedDate property. $$ is used to invoke Data Factory macro functions. Since RecordAddedDate is a date-time property, we need to add the datetime prefix to cast it. In this example we filter records where RecordAddedDate equals the SliceStart date. We can also use SliceEnd, WindowStart or WindowEnd as the parameter.
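For context, here is a sketch of how this fits into the copy activity's source section of the pipeline JSON (based on the standard Azure Table source / SQL sink copy activity format; treat it as illustrative rather than the exact pipeline used in this post):

"source": {
    "type": "AzureTableSource",
    "azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)"
},
"sink": {
    "type": "SqlSink"
}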

*Note: if the RecordAddedDate property is not defined in the table entity, we can use the Timestamp property instead. Please refer to the Azure Table Storage section at the beginning of this blog.

Add changes and deploy


Once the JSON is deployed, and the Produced dataset state is in Ready status, you can query the SQL Server table to see the second copy activity’s output. Notice it has only transferred data for 09-Jan-2016.


Generally, with such a scheduler definition, the ADF pipeline executes at 12:00:00 AM daily. If you need to run the pipeline at a different time instead of the default (12:00:00 AM), say 6:00:00 AM, then add the "offset": "06:00:00" property.

Thanks to my colleague Jason for reviewing this post.

Hope it’s helpful.


Encoding the Hive query file in Azure HDInsight

Today at Microsoft we were using Azure Data Factory to run Hive Activities in Azure HDInsight on a schedule. Things were working fine for a while, but then we got an error that was hard to understand. I've simplified the scenario to illustrate the key points. The key is that Hive did not like the Byte Order Mark (first 3 bytes) in the hive .hql file, and failed with an error. Be careful which text editor and text encoding you choose when saving your Hive Query Language (HQL) command into a text file.

We are using Azure Data Factory (Feb 2016) with a linked service to Microsoft Azure HDInsight Hadoop on Linux (version 3.2.1000.0) which is the distribution from Hortonworks Data Platform version 2.2.7.1-36 and includes Hive 0.14.0.

Error message:

Hive script failed with exit code '64'. See 'wasb://adfjobs@mycontainer.blob.core.windows.net/HiveQueryJobs/92994500-5dc2-4ba3-adb8-4fac51c4d959/05_02_2016_05_49_34_516/Status/stderr' for more details.

To drill into that stderr logging, in the Data Factory activity blade we see the error text and the log files below it. We click on the stderr file and read it in the log blade.

Stderr says:

WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in jar:file:/mnt/resource/hadoop/yarn/local/filecache/11/hive.tar.gz/hive/lib/hive-common-1.2.1.2.3.3.1-1.jar!/hive-log4j.properties

FAILED: ParseException line 1:0 character '' not supported here

That's strange, because the Hive query file looks fine to the naked eye.


 

If you use a binary editor though you can see these two text files are not the same.

I have Visual Studio handy, and it can show the hex representation of these files.

File > Open > File…


 

Next to the open button there is a drop down – choose "Open with…"


Choose Binary Editor.


 

Notice that file2.hql shown below has 3 extra bytes in the beginning. This is called a byte-order mark (BOM).


The Standards Controversy

There is some controversy about whether UTF-8 text files should have a byte order mark or not.

The standards folks say don't use it in most cases, but also don't remove it if you already have a BOM. We software developers sometimes forget that and are inconsistent - Microsoft Windows vs. Linux vs. Apache Hive.

Citing the crowdsourced Wikipedia information here for neutrality:

            https://en.wikipedia.org/wiki/Byte_order_mark

UTF-8 byte order mark is EF BB BF in hex.

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.

The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.

In Linux distributions since around 2004 the convention is to generate UTF-8 files without a BOM.

Microsoft compilers[10] and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and interpret as UTF-8 on reading only when the BOM is present.

 

The Litmus Test

So let's try the two .hql text files in Hive console directly outside of Azure Data Factory to see if they work in Hive alone. I am using SSH to connect to my Linux head node in HDInsight. I have uploaded the two files using SFTP (easy to do in MobaXTerm if you like that 3rd party tool – I can't endorse any specific tool, but seems nice). The source command here is running the query file from within the hive app.

1. First I launch Hive on my SSH session.

cd /bin 
hive 

 

2. Then I run the script from the first file. It works well and returns a list of my Hive tables from my metastore.

source /home/sshuser/file1.hql; 

 

OK 

 

3. Then I run the script from the second file. It fails with a parsing error

source /home/sshuser/file2.hql; 
FAILED: ParseException line 1:0 character '' not supported here 


 

 

The lesson learned

So the lesson learned is that Hive doesn't like the .hql text files when they are encoded in UTF-8 with a Byte-order mark up front in the hex/binary.

UTF-8 is the encoding of choice, but the byte-order mark is not desired.
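If you want to check for, or strip, a BOM from an existing .hql file on the Linux head node, something like the following should work (assumes GNU sed; back up the file first):

# show the first three bytes - EF BB BF indicates a UTF-8 BOM
head -c 3 file2.hql | xxd

# remove a leading UTF-8 BOM in place
sed -i '1s/^\xEF\xBB\xBF//' file2.hql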

Action Required!

Check your favorite text editor to see if it is marking the BOM or not.

1. Windows Notepad seems to encode UTF-8 files to include a Byte Order Mark, so avoid that one.


 

2. Visual Studio 2013 / 2015 gives you the option to save "With Encoding" on the drop down beside the Save button, then you can pick which Encoding you want.


 

First trial - I picked Unicode (UTF-8 with signature) – Codepage 65001. That "signature" means it adds a BOM, which is undesirable for Hive to interpret.


 

Second trial – I picked Unicode (UTF-8 without signature) - Codepage 65001. Looks better now. That's the one without the BOM in the first three bytes, so Hive should be OK to read this one.

Image may be NSFW.
Clik here to view.

Image may be NSFW.
Clik here to view.

3. Linux's common Vi text editor

Image may be NSFW.
Clik here to view.

The hex representation does not have a BOM, so we are good to use this with Hive.

Image may be NSFW.
Clik here to view.

 

Hive can run this one OK

Image may be NSFW.
Clik here to view.

4. Nano – another easy text editor on Linux

Image may be NSFW.
Clik here to view.

Looks good as well, no BOM

Image may be NSFW.
Clik here to view.

Hive can run this one OK too.

Image may be NSFW.
Clik here to view.

 

To be continued... There are many more text editors, so please let me know if you have trouble with encoding using another text editor when saving Hive queries to text.

 

Hope this helps someone out there. Let us know if it does, or if you still get stuck, post a comment below, or try the Azure forums for help.

Happy Hadooping! Jason 

Image may be NSFW.
Clik here to view.

Encoding 101 - Exporting from SQL Server into flat files, to create a Hive external table

Today in Microsoft Big Data Support we faced the issue of how to correctly move Unicode data from SQL Server into Hive via flat text files. The main issue faced was encoding special Unicode characters from the source database, such as the degree sign (Unicode 00B0) and other complex Unicode characters outside of A-Z 0-9.

The goal was to get Hive to read those same strings SQL Server had saved out to text files and represent them equally to the Hive consumer. We could have used Sqoop if there was a connection between Hadoop and the SQL Server, but that was not possible, as this was across company boundaries and shipping files was the easier approach.

It was tricky to do, but we found a couple of solutions. When the Unicode string values are exported from Microsoft SQL Server via SSIS or the Import Export wizard, they look fine to the naked eye, but run SELECT * FROM HiveTable; and the data looks different.

Microsoft SQL Server: zyx°°° Looks good

Hive: zyx��� Uh-oh! We've got trouble.

 

That Unicode string (NVARCHAR) value in SQL appears as zyx°°°. We export that data to a flat file using SSIS or the Import Export Wizard, then copy the files into Azure Blob Storage. Next, using Azure HDInsight, when a Hive table is created atop those files, the same characters look garbled with black question marks - zyx��� - as if the characters are unknown to Hive's interpretation.

Linux and Hive default to text files encoded to UTF-8 format. That differs from the SSIS Flat File Destination's Unicode output.

We found two ways to make them compatible.

  1. Change the export options in SQL Server SSIS Flat File Destination to uncheck the "Unicode" checkmark and select code page 65001 (UTF-8) instead.
  2. Keep the Unicode encoding as is in SSIS, but tell Hive to interpret the data differently using serdeproperties ('serialization.encoding'='ISO-8859-1');

 

Time for a little trial and error

1. Setup an example Database

These are the steps I used to see the issue. You can use SQL Server (any version from 2005-2016).

Using SQL Server Management Studio, I create a database TestHive and table T1 in SQL Server, and insert some data into the NVarchar column, including some special characters. NChar, NVarChar, and NVarChar(max) are the double-byte Unicode data types in SQL Server columns that are used for global language support.

CREATE DATABASE TestHive
GO
USE TestHive
GO
CREATE TABLE T1 (Col1 Int, Col2 NVarchar(255));
GO
INSERT INTO T1 VALUES
(1, N'abcdef'),
(2, N'zyx°°°'),
(3, N'123456')
GO
SELECT * FROM T1

 

2. Run the SQL Server Import Export Wizard from the Start menu to copy the rows into a text file.

 

On the Source – point to your SQL Server instance, and select the database and table you want to pull from.

On the Destination – choose Flat File Destination, point to a file path on the local disk, and Select the Unicode checkbox on the locale.

Image may be NSFW.
Clik here to view.

You could use the same Flat File Destination from within an Integration Services (SSIS) package design and run that if you prefer more control on how the data is transformed in the Data Flow Task, but the Import Export Wizard does the simple copy that we need here.

This is the output – so far looks good.

Image may be NSFW.
Clik here to view.

3. Connect to my Azure HDInsight cluster in the cloud to upload the file.

 

I used a Linux based HDInsight Hadoop cluster, so I will use SSH to connect to the head node.

Then I create a directory, and upload the first file into that location in blob store. This is much like saving the file from the local disk into HDFS for the purposes of Hadoop outside of Azure.

hadoop fs -mkdir -p /tutorials/usehive/import1/

hadoop fs -copyFromLocal tableexport_ssisunicode.txt wasb:///tutorials/usehive/import1/tableexport_ssisunicode.txt

You could also upload the files directly to Azure Blob Storage with tools such as Cloud Explorer for Visual Studio 2013/2015 or Microsoft Azure Storage Explorer.
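
If you prefer to script the upload, the BlobService class from the Azure SDK for Python (used again later in this blog) can push the exported file into the blob container behind the cluster's wasb:// storage. A rough sketch with placeholder account, container and path names:

# Sketch: upload the SSIS export straight to the blob container backing wasb:///
# The account name, key, container and local path below are placeholders.
from azure.storage.blob import BlobService

blob_service = BlobService(account_name='mystorageaccount', account_key='mystoragekey')
blob_service.put_block_blob_from_path(
    'mycontainer',                                           # container behind wasb:///
    'tutorials/usehive/import1/tableexport_ssisunicode.txt', # blob name = path Hive reads
    'C:\\exports\\tableexport_ssisunicode.txt')              # local file from the wizard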

 

4. Run Hive to read that first folder

Now that the text file is ready in Blob storage, I can run Hive and create a table, and query from that file saved into the import1 folder. From my SSH session I simply run hive.

cd /bin

hive

{

DROP TABLE Import1;

CREATE EXTERNAL TABLE Import1(col1 string, col2 string) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tutorials/usehive/import1/' tblproperties ("skip.header.line.count"="1");

SELECT * FROM Import1;

}

 

Notice the strange characters ��� where I expected my degree signs.

Image may be NSFW.
Clik here to view.

Just to be sure it's not something special with my SSH client (MobaXTerm here) I am trying from the Ambari web dashboard for my Azure HDInsight Cluster, and using the Hive View from the menu icon in the upper right.

Image may be NSFW.
Clik here to view.

 

5. OK let's try exporting again – this time changing the flat file encoding setting.

In the SSIS Import Export Wizard (SSIS Flat File Destination) choose code page UTF-8.

- Uncheck the "Unicode" checkmark.

- Choose code page 65001 (UTF-8)

Image may be NSFW.
Clik here to view.

 

6. Upload that second file to Linux (SFTP) and then copy into HDFS or Azure Blob Storage.

I made a new folder, so I could compare my trials side-by-side.

hadoop fs -mkdir -p /tutorials/usehive/import2/

hadoop fs -copyFromLocal tableexport_ssisunicode2.txt wasb:///tutorials/usehive/import2/tableexport_ssisunicode2.txt

 

7. Now test the Hive table again with the UTF-8 encoded file in the second folder

Run Hive

cd /bin

hive

{

DROP TABLE Import2;

CREATE EXTERNAL TABLE Import2(col1 string, col2 string) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tutorials/usehive/import2/' tblproperties ("skip.header.line.count"="1");

SELECT * FROM Import2;

QUIT;

}

My zyx°°° looks normal now! Success!

Image may be NSFW.
Clik here to view.

 

The Hive View in Ambari likes this data too.
Image may be NSFW.
Clik here to view.
 

8. An alternative - tell Hive to encode/decode the external files differently

Perhaps you don't want to change the file format to UTF-8 (most universal in Linux and Hadoop), or maybe you cannot change the format at all, because the files come from an outsider.

Starting with Hive 0.14, Hive has a simple way to change the serialization encoding (used, for example, to interpret the byte encoding of text files).

The change was explained here https://issues.apache.org/jira/browse/HIVE-7142

 

1. You can create the table from scratch with this code page serializer, overriding the serialization.encoding property to the code page that best matches your source data encoding.

CREATE TABLE person(id INT, name STRING, desc STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES("serialization.encoding"='ISO-8859-1');

 

2. Or if you have an existing table, this can be adjusted after the fact. You need to carefully match your encoding to whatever kind of files will be presented in the storage underneath this Hive table.

Choose one, or make your own…

ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='US-ASCII');
ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='ISO-8859-1');
ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='UTF-8');

 

To find out which tokens to list next to the equals sign, refer to the charset names listed in the Java documentation:
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

I am guessing the one below is the right match for SQL Server's "Unicode" export, but it needs further testing to be totally sure that ALL characters are interpreted as expected.

ISO-8859-1  

ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1

 

Other tips we didn't have time to try yet:

Someone made a tool to help convert files if it is not possible to change the format of SSIS or BCP exports to text files. https://code.msdn.microsoft.com/windowsdesktop/UTF8WithoutBOM-Converter-7a8218af
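
A do-it-yourself alternative is a few lines of Python that re-encode the SSIS "Unicode" (UTF-16) export as UTF-8 with no BOM. A minimal sketch; the file names are examples only:

# Re-encode an SSIS "Unicode" (UTF-16 with BOM) export as UTF-8 without a BOM.
# File names are examples only.
import codecs

with codecs.open('tableexport_ssisunicode.txt', 'r', encoding='utf-16') as src:
    text = src.read()

with codecs.open('tableexport_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)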

 

Hope this helps someone out there. Let us know if it does, or if you still get stuck, post a comment below, or try the Azure forums for help.

 

Happy Hadooping! Jason 

Image may be NSFW.
Clik here to view.

How to call an Azure Machine Learning Web Service from NodeJS

Azure machine learning allows data scientists and developers to embed predictive analytics into applications. To learn more about Azure machine learning visit Azure machine learning documentation . A simplified process flow for Azure machine learning is:

  • Create an Azure machine learning workspace that has an associated Azure storage account.
  • Log in to your Azure machine learning workspace and then log in to ML Studio. ML Studio is where you can create your machine learning experiments and publish them as web services. You need to create an experiment which produces the machine learning model, and then publish it as a web service.
  • Once you publish your web service, the portal has a page with information on how to call the web service and example code in C#, Python, and R.
  • Calling the web service allows you to embed your predictive analytics into applications. You will pass new input feature data to the web service. The new feature data will be run through the machine learning model and the web service will send back the prediction.

     

Recently I had a request on how to call an Azure machine learning web service from NodeJS. The portal did not have an example. In this post I will show the basic code in NodeJS to call an Azure machine learning web service.

Below is a simple experiment I created that tries to predict the percent gain ten days from now for the SP 500 Index. At the bottom is a button to set up your web service once you are comfortable with the model.

Image may be NSFW.
Clik here to view.

 

Below is the web service predictive experiment page after you have published it.

Image may be NSFW.
Clik here to view.

 

Below is the information for the published web service. You will need information from this page to call the web service from NodeJS. One piece of information that you will need is the API key. You can also test out your web service here. Under the REQUEST/RESPONSE link is more information that you will need like the POST Request URI.

Image may be NSFW.
Clik here to view.

 

Below you will see the POST Request URI. You will need this. Also on this page are code examples in C#, Python and R. Now that we have created an experiment, published it as a web service, and have the POST Request URI and API key, we can start to write the code in NodeJS to call the web service and get a prediction.

Image may be NSFW.
Clik here to view.

 

I am assuming you are somewhat familiar with NodeJS and have downloaded and installed it; here is the download link, Download NodeJS, and the NodeJS web site. You can create the maml-server.js file in Notepad or your favorite IDE.

From the POST Request URI you can get the information for the host and path variables. The headers will need the API key. The options variable has this information and is created in the getPred function. In the buildFeatureInput function, I just hard-coded the new feature data, but you could easily get this information from a web form or an RDBMS call. I wanted to keep the example simple.

 

//maml-server.js

var http = require("http");

var https = require("https");

var querystring = require("querystring");

var fs = require('fs');

 

function getPred(data) {

    console.log('===getPred()===');

    var dataString = JSON.stringify(data)

    var host = 'ussouthcentral.services.azureml.net'

    var path = '/workspaces/fda91d2e52b74ee2ae68b1aac4dba8b9/services/1b2f5e6f99574756a8fde751def19a0a/execute?api-version=2.0&details=true'

    var method = 'POST'

    var api_key = 'vKKR78dSdQeSc9qdMaDmu2Z5bcFqb4TfkZdNgSxzcIjGV9p5OP2uy4k1HfJes1T4Ws3St+EBgQTX/N8vqCs4zg=='

    var headers = {'Content-Type':'application/json', 'Authorization':'Bearer ' + api_key};

    

    var options = {

        host: host,

        port: 443,

        path: path,

        method: 'POST',

        headers: headers

    };

    

    console.log('data: ' + data);

    console.log('method: ' + method);

    console.log('api_key: ' + api_key);

    console.log('headers: ' + headers);

    console.log('options: ' + options);

        

    var reqPost = https.request(options, function (res) {

        console.log('===reqPost()===');

        console.log('StatusCode: ', res.statusCode);

        console.log('headers: ', res.headers);

 

        res.on('data', function(d) {

            process.stdout.write(d);

        });

    });

    

    // Would need more parsing out of prediction from the result

    reqPost.write(dataString);

    reqPost.end();

    reqPost.on('error', function(e){

        console.error(e);

    });

    

}

 

//Could build feature inputs from a web form or RDBMS. This is the new data that needs to be passed to the web service.

function buildFeatureInput(){

    console.log('===buildFeatureInput()===');

    var data = {

        "Inputs": {

            "input1": {

     "ColumnNames": ["gl10", "roc20", "uo", "ppo", "ppos", "macd", "macds", "sstok", "sstod", "pmo", "pmos", "wmpr"],

     "Values": [ [ "0", "-1.3351", "50.2268", "-0.2693", "-0.2831", "-5.5310", "-5.8120", "61.9220", "45.3998", "-0.0653", "-0.0659", "-30.3005" ], ]

             },

            },

    "GlobalParameters": {}

        }

    getPred(data);

}

 

 

function send404Response(response) {

    response.writeHead(404, {"Context-Type": "text/plain"});

    response.write("Error 404: Page not Found!");

    response.end();

}

 

function onRequest(request, response) {

    if(request.method == 'GET' && request.url == '/' ){

        response.writeHead(200, {"Context-Type": "text/plain"});

        fs.createReadStream("./index.html").pipe(response);

    }else {

        send404Response(response);

    }

}

 

http.createServer(onRequest).listen(8050);

console.log("Server is now running on port 8050");

buildFeatureInput();

 

 

Once you have saved the code to disk you can run it in NodeJS by issuing "node <path to maml-server.js>". I am also displaying the status code from the POST and the results from calling the web service. You would need to parse out the JSON to get the "Scored Labels" column that has the prediction. The model is predicting that the SP 500 will be down 4.71% ten days from now. The SP 500 closed yesterday (2016-02-17) at 1926.82. We can check back on 2016-03-04 and see how well the model did! Machine learning is a big topic but Azure machine learning makes it easy to embed analytics in your applications.

 

Image may be NSFW.
Clik here to view.

 

I hope this helps someone. I will have changed my API key by the time you read this, so if you try to run the example you might get an authorization error.

Bill

Image may be NSFW.
Clik here to view.

HDInsight Hive Metastore fails when the database name has dashes or hyphens

Working in Azure HDInsight support today, we see a failure when trying to run a Hive query on a freshly created HDInsight cluster. It's brand new and fails on the first try, so what could be wrong?

Our Hive client app fails with this kind of error.

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:445)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:619)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
 at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1483)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:63)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73)
 at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2743)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2762)
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:426)
 ... 8 more
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1481)
 ... 13 more
Caused by: javax.jdo.JDOUserException: Could not create "increment"/"table" value-generation container meta-store-database.dbo.SEQUENCE_TABLE since autoCreate flags do not allow it.
NestedThrowables:
org.datanucleus.exceptions.NucleusUserException: Could not create "increment"/"table" value-generation container meta-store-database.dbo.SEQUENCE_TABLE since autoCreate flags do not allow it.
 at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:549)
 at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)
 at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
 at org.apache.hadoop.hive.metastore.ObjectStore.createDatabase(ObjectStore.java:499)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:98)
 at com.sun.proxy.$Proxy9.createDatabase(Unknown Source)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:578)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:598)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:436)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
 at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5509)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:178)
 at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
 ... 18 more
Caused by: org.datanucleus.exceptions.NucleusUserException: Could not create "increment"/"table" value-generation container meta-store-database.dbo.SEQUENCE_TABLE since autoCreate flags do not allow it.
 at org.datanucleus.store.rdbms.valuegenerator.TableGenerator.createRepository(TableGenerator.java:261)
 at org.datanucleus.store.rdbms.valuegenerator.AbstractRDBMSGenerator.obtainGenerationBlock(AbstractRDBMSGenerator.java:162)
 at org.datanucleus.store.valuegenerator.AbstractGenerator.obtainGenerationBlock(AbstractGenerator.java:197)
 at org.datanucleus.store.valuegenerator.AbstractGenerator.next(AbstractGenerator.java:105)
 at org.datanucleus.store.rdbms.RDBMSStoreManager.getStrategyValueForGenerator(RDBMSStoreManager.java:2005)
 at org.datanucleus.store.AbstractStoreManager.getStrategyValue(AbstractStoreManager.java:1386)
 at org.datanucleus.ExecutionContextImpl.newObjectId(ExecutionContextImpl.java:3827)
 at org.datanucleus.state.JDOStateManager.setIdentity(JDOStateManager.java:2571)
 at org.datanucleus.state.JDOStateManager.initialiseForPersistentNew(JDOStateManager.java:513)
 at org.datanucleus.state.ObjectProviderFactoryImpl.newForPersistentNew(ObjectProviderFactoryImpl.java:232)
 at org.datanucleus.ExecutionContextImpl.newObjectProviderForPersistentNew(ExecutionContextImpl.java:1414)
 at org.datanucleus.ExecutionContextImpl.persistObjectInternal(ExecutionContextImpl.java:2218)
 at org.datanucleus.ExecutionContextImpl.persistObjectWork(ExecutionContextImpl.java:2065)
 at org.datanucleus.ExecutionContextImpl.persistObject(ExecutionContextImpl.java:1913)
 at org.datanucleus.ExecutionContextThreadedImpl.persistObject(ExecutionContextThreadedImpl.java:217)
 at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:727)
 ... 34 more

We found out the problem is very simple. The Hive metastore database in HDInsight usually lives in an Azure SQL Database, and the name of that database can be customized. In this case the database was named "hive-meta-store".

The error gives us a hint, but it isn't clear about what actually caused the failure.

Caused by: org.datanucleus.exceptions.NucleusUserException: Could not create "increment"/"table" value-generation container meta-store-database.dbo.SEQUENCE_TABLE since autoCreate flags do not allow it.

Apparently there is a bug in HiveServer2's metastore layer: the metastore database name is not always escaped with quotes ('hive-meta-store') or brackets ([hive-meta-store]) in the generated SQL, so SQL Server returns an incorrect-syntax error for the dash. On such clusters HiveServer2 won't start, because it cannot create or read the meta-store-database.dbo.SEQUENCE_TABLE table.

2016-01-16 02:14:10,228 DEBUG [main]: Datastore.Native (Log4JLogger.java:debug(58)) - SELECT NEXT_VAL FROM meta-store-database.SEQUENCE_TABLE WHERE SEQUENCE_NAME=<'org.apache.hadoop.hive.metastore.model.MDatabase'>

2016-01-16 02:14:10,233 INFO  [main]: DataNucleus.ValueGeneration (Log4JLogger.java:info(77)) - Error encountered allocating block of IDs : Couldnt obtain a new sequence (unique id) : Incorrect syntax near '-'.
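
That generated statement would only parse if the database name were escaped, for example with brackets as in the hand-written query below (this is an illustration, not output from the cluster); the bare dash is what SQL Server rejects:

SELECT NEXT_VAL FROM [meta-store-database].dbo.SEQUENCE_TABLE WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MDatabase'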

 

The known issue

The issue is being tracked here https://issues.apache.org/jira/browse/HIVE-6113 and in Hortonworks' internal Jira item HIVE-433.

Hortonworks is aware of the issue, and per HIVE-6113 it is already fixed in a later build, which upgrades the DataNucleus version to 4.x. That upgrade will be available as part of Hive 2.0, which is not out yet in Azure HDInsight.

Until then, avoid dashes and hyphens in your custom metastore database name, or HiveServer2 will not start.

Image may be NSFW.
Clik here to view.

A KMeans example for Spark MLlib on HDInsight

 

Today we will take a look at Spark's built-in machine learning library, MLlib (see the Spark MLlib Guide). KMeans is a popular clustering method. Clustering methods are used when there is no class to be predicted; instead, instances are divided into groups or clusters. The clusters hopefully represent some mechanism at play that draws an instance to a particular cluster, and the instances assigned to a cluster should have a strong resemblance to each other. A typical use case for KMeans is segmentation of data. For example, suppose you are studying heart disease and you have a theory that individuals with heart disease are overweight. You have collected data from individuals with and without heart disease and measurements of their weight such as body mass index, waist-to-hip ratio, skinfold thickness, and actual weight. KMeans is used to cluster the data into groups for further analysis and to test the theory. You can find out more about KMeans on Wikipedia.

 

The data that we are going to use in today’s example is stock market data with the ConnorsRSI indicator. You can learn more about ConnorsRSI at ConnorsRSI. Below is a sample of the data. ConnorsRSI is a composite indicator made up from RSI_CLOSE_3, PERCENT_RANK_100, and RSI_STREAK_2. We will use these attributes as well as the actual ConnorsRSI (CRSI) and RSI2 to pass into our KMeans algorithm. The calculation of this data is already normalized from 0 to 100. The other columns like ID, LABEL, RTN5, FIVE_DAY_GL, and CLOSE we will use to do further analysis once we cluster the instances. They will not be passed into the KMeans algorithm.

 

Sample Data (CSV): 1988 instances of SPY

Column - Description
ID - Used to uniquely identify the instance (date:symbol).
Label - Whether the close was up or down from the previous day's close.
RTN5 - The return over the past 5 days.
FIVE_DAY_GL - The return over the next 5 days.
Close - Closing price.
RSI2 - Relative Strength Index (2 days).
RSI_CLOSE_3 - Relative Strength Index (3 days).
RSI_STREAK_2 - Relative Strength Index (2 days) for streak durations based on the closing price.
PERCENT_RANK_100 - The percentage rank value over the last 100 days. This is a rank that compares today's return to the last 100 returns.
CRSI - The ConnorsRSI indicator: (RSI_CLOSE_3 + RSI_STREAK_2 + PERCENT_RANK) / 3.

 

ID LABEL RTN5 FIVE_DAY_GL CLOSE RSI2 RSI_CLOSE_3 PERCENT_RANK_100 RSI_STREAK_2 CRSI
2015-09-16:SPY UP 2.76708 -3.28704 200.18 91.5775 81.572 84 73.2035 79.5918
2015-09-15:SPY UP 0.521704 -2.29265 198.46 83.4467 72.9477 92 60.6273 75.1917
2015-09-14:SPY DN 1.77579 0.22958 196.01 47.0239 51.3076 31 25.807 36.0382
2015-09-11:SPY UP 0.60854 -0.65569 196.74 69.9559 61.0005 76 76.643 71.2145
2015-09-10:SPY UP 0.225168 1.98111 195.85 57.2462 53.9258 79 65.2266 66.0508
2015-09-09:SPY DN 1.5748 2.76708 194.79 42.8488 46.1728 7 31.9797 28.3842
2015-09-08:SPY UP -0.12141 0.521704 197.43 73.7949 64.0751 98 61.2696 74.4483
2015-09-04:SPY DN -3.35709 1.77579 192.59 22.4626 31.7166 6 28.549 22.0886

 

The KMeans algorithm needs to be told how many clusters (K) the instances should be grouped into. For our example let's start with two clusters to see if they have a relationship to the label, "UP" or "DN". The Apache Spark Scala documentation has the details on all the methods for KMeans and KMeansModel.
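
Since K has to be chosen up front, one quick sanity check is to compare the within-set sum of squared errors (WSSSE) for a few candidate values of K and look for an "elbow". Below is a minimal PySpark sketch of that idea; it is not part of the original walkthrough, and it assumes a SparkContext named sc and the same CSV layout described above:

# Sketch: compare WSSSE for a few values of K to sanity-check the cluster count.
# Assumes a SparkContext named sc; columns 5-9 of the CSV are RSI2, RSI_CLOSE_3,
# PERCENT_RANK_100, RSI_STREAK_2 and CRSI, as described above.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

data = sc.textFile("wasb:///data/spykmeans.csv")
header = data.first()
rows = data.filter(lambda l: l != header)
vectors = rows.map(lambda l: Vectors.dense([float(x) for x in l.split(",")[5:10]]))
vectors.cache()

for k in range(2, 7):
    model = KMeans.train(vectors, k, maxIterations=20)
    print k, model.computeCost(vectors)   # WSSSE for each candidate K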

 

Below is the Scala code, which you can run in a Zeppelin notebook or spark-shell on your HDInsight cluster with Spark.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.functions._

// load file and remove header
val data = sc.textFile("wasb:///data/spykmeans.csv")
val header = data.first
val rows = data.filter(l => l != header)

// define case class
case class CC1(ID: String, LABEL: String, RTN5: Double, FIVE_DAY_GL: Double, CLOSE: Double, RSI2: Double, RSI_CLOSE_3: Double, PERCENT_RANK_100: Double, RSI_STREAK_2: Double, CRSI: Double)

// comma separator split
val allSplit = rows.map(line => line.split(","))

// map parts to case class
val allData = allSplit.map( p => CC1( p(0).toString, p(1).toString, p(2).trim.toDouble, p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble, p(6).trim.toDouble, p(7).trim.toDouble, p(8).trim.toDouble, p(9).trim.toDouble))

// convert rdd to dataframe
val allDF = allData.toDF()

// convert back to rdd and cache the data
val rowsRDD = allDF.rdd.map(r => (r.getString(0), r.getString(1), r.getDouble(2), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))
rowsRDD.cache()

// convert data to an RDD of vectors for KMeans and cache the data. We are passing in RSI2, RSI_CLOSE_3, PERCENT_RANK_100, RSI_STREAK_2 and CRSI - the attributes we want to use to assign each instance to a cluster
val vectors = allDF.rdd.map(r => Vectors.dense( r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))
vectors.cache()

// KMeans model with 2 clusters and 20 iterations
val kMeansModel = KMeans.train(vectors, 2, 20)

// Print the center of each cluster
kMeansModel.clusterCenters.foreach(println)

// Get the prediction from the model with the ID so we can link them back to other information
val predictions = rowsRDD.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._6, r._7, r._8, r._9, r._10) ))}

// convert the rdd to a dataframe
val predDF = predictions.toDF("ID", "CLUSTER")

The code imports some methods for Vector, KMeans and SQL that we need. It then loads the .csv file from disk and removes the header that have our column descriptions. We then define a case class, split the columns by comma and map the data into the case class. We then convert the RDD into a dataframe. Next we map the dataframe back to an RDD and cache the data. We then create an RDD for the 5 columns we want to pass to the KMeans algorithm and cache the data. We want the RDD cached because KMeans is a very iterative algorithm. The caching helps speed up performance. We then create the kMeansModel passing in the vector RDD that has our attributes and specifying we want two clusters and 20 iterations. We then print out the centers for all the clusters. Now that the model is created, we get our predictions for the clusters with an ID so that we can uniquely identify each instance with the cluster it was assigned to. We then convert this back to a dataframe to analyze.

Below is a subset of the allDF dataframe with our data.

Image may be NSFW.
Clik here to view.

Below is a subset of our predDF dataframe with the ID and the CLUSTER. We now have a unique identifier and the cluster the KMeans algorithm assigned it to. Also displayed is the mean of each attribute passed into the KMeans algorithm, for Cluster 0 and Cluster 1. You can see that the means are very close within each cluster: for Cluster 0 they are around 27 and for Cluster 1 around 71.

Image may be NSFW.
Clik here to view.

Because the allDF and predDF dataframes have a common column we can join them and do more analysis.

// join the dataframes on ID (spark 1.4.1)

val t = allDF.join(predDF, "ID")

Now we have all of our data combined with the CLUSTER that the KMeans algorithm assigned each instance to and we can continue our investigation.

Image may be NSFW.
Clik here to view.

Let’s display a subset of each cluster. It looks like cluster 0 is mostly DN labels and has attributes averaging around 27 like the centers of the clusters indicated. Cluster 1 is mostly UP labels and the attributes average is around 71.

// review a subset of each cluster

t.filter("CLUSTER = 0").show()

t.filter("CLUSTER = 1").show()

Image may be NSFW.
Clik here to view.

Image may be NSFW.
Clik here to view.

Let’s get descriptive statistics on each of our clusters. This is for all the instances in each cluster and not just a subset. This gives us the count, mean, stddev, min, max for all numeric values in the dataframe. We filter each by CLUSTER.

// get descriptive statistics for each cluster

t.filter("CLUSTER = 0").describe().show()

t.filter("CLUSTER = 1").describe().show()

Image may be NSFW.
Clik here to view.

So what can we infer from the output of our KMeans clusters?

  • Cluster 0 has lower ConnorsRSI (CRSI), with a mean of 27. Cluster 1 has higher CRSI, with a mean of 71. Could these be areas to initiate buy and sell signals?
  • Cluster 0 has mostly DN labels, and Cluster 1 has mostly UP labels.
  • Cluster 0 has a mean gain of .28% five days later, while Cluster 1 has a mean loss of .03% five days later.
  • Cluster 0 has a mean loss of 1.22% five days before, and Cluster 1 has a mean gain of 1.15% five days before. Does this suggest markets revert to their mean?
  • Both clusters have min/max 5-day returns between a gain of 19.40% and a loss of 19.79%.

This is just the tip of the iceberg and raises further questions, but it gives an example of using HDInsight and Spark to start your own KMeans analysis. Spark MLlib has many algorithms to explore, including SVMs, logistic regression, linear regression, naïve Bayes, decision trees, random forests, basic statistics, and more. The implementation of these algorithms in Spark MLlib is for distributed clusters, so you can do machine learning on big data. Next I think I'll run the analysis on all data for the AMEX, NASDAQ and NYSE stock exchanges and see if the pattern holds up!

Image may be NSFW.
Clik here to view.

Bill

spykmeans.csv

Using Azure SDK for Python

 

Python is a great scripting tool with a large user base. In a recent support case I needed a way to constantly generate files with some random data in Windows Azure storage (wasb) in order to process them with Spark on HDInsight. Python, the Azure SDK for Python and a few lines of code did the trick. You can install the SDK from Azure SDK for Python. There is also a helpful article on how to use Azure blob storage from Python at Azure Blob Storage from Python. You could even incorporate this into a bigger pipeline, like scheduling the Python script to run with Windows Task Scheduler or Linux cron to collect data from a server and upload it to Azure storage for analysis with HDInsight.

The script has only a few basic sections.

Function Name - Description
(module level) - Imports for packages, declares a blob_service object, sets the container and subfolder variables.
run() - The main driver function. Sets configuration parameters like how many iterations (files) to create, the interval to wait between creating files, and the local temporary directory in which to create the file that is uploaded to Azure storage. Contains the main loop that calls createLocalFile(), uploadFileToWasb() and cleanupWasbFolder().
createLocalFile() - Creates a file, writing one line of comma-separated data per record. This file will be uploaded to Azure storage. You could modify this to create any file format you want.
uploadFileToWasb() - Uploads a file to a container and subfolder in Azure storage using the BlobService.
cleanupWasbFolder() - Deletes files from Azure storage so that only the 10 most recent files are kept in the folder.
uniform() - A simple random number generator. Used to randomly generate values from 0 to 100 which are placed in each record of the uploaded file.
main() - Entry point for the script (the __main__ guard that calls run()).

The script.

import datetime
import time
import urllib2
import json
import random

from azure.storage.blob import BlobService

# replace with your storage account name and key
blob_service = BlobService(account_name='mystorage_account', account_key='mystorage_account_key')

# container and folder must already exist
container = "data"
subfolder = "/test"

def run():
    "Driver function"
    print("Running GenWasbData.py")
    now = datetime.datetime.now()
    print
    print "Starting Application: " + now.isoformat()
    # default configuration parameters
    iterations = 100
    interval = 5
    number_of_records = 200
    local_temp_dir = "c:\\Applications\\temp\\"
    # loop
    count = 1
    while (count <= iterations):
        print "Processing: " + str(count) + " of " + str(iterations)
        count = count + 1
        createLocalFile(local_temp_dir + "file.txt", number_of_records)
        uploadFileToWasb(local_temp_dir + "file.txt", count-1)
        cleanupWasbFolder(count-1)
        time.sleep(interval)

def createLocalFile(fn, number_of_records):
    "Create a local file to upload to wasb"
    now = datetime.datetime.now()
    filename = fn
    target = open(filename, 'w')  ## a will append, w will over-write
    count = 1
    while (count <= number_of_records):
        line = str(count) + "," + "Device" + str(count) + "," + str(now) + "," + str(uniform(0,100))
        target.write(line)
        target.write("\n")
        count = count + 1
    target.close()

def uploadFileToWasb(fn, count):
    "Upload file to wasb"
    new_filename = "file" + str(count) + ".txt"
    blob_service.put_block_blob_from_path(container + subfolder, new_filename, fn)
    return

def cleanupWasbFolder(count):
    "Remove files from wasb"
    if (count >= 11):
        blob_service.delete_blob(container + subfolder, "file" + str(count-10) + ".txt")
    return

def uniform(a, b):
    "Get a random number in the range (a, b)"
    return round(random.uniform(a, b), 4)  ## four decimal places

if __name__ == '__main__':
    run()

Hope this helps you incorporate Python and the Azure SDK for Python in your next project.

Image may be NSFW.
Clik here to view.

Bill


Multi-Stream support in SCP.NET Storm Topology

Streams are at the core of Apache Storm. In most cases topologies are based on a single input stream; however, there are situations when one may need to start the topology with two or more input streams.

User code that emits to, or receives from, distinct streams at the same time is supported in SCP. To support multiple streams, the Emit method of the Context object takes an optional stream ID parameter.

SCP.NET provides .NET C# programmability against Apache Storm on Azure HDInsight clusters. Multi-stream support has been available in SCP.NET since Microsoft.SCP.Net.SDK version 0.9.4.283. Two methods were added to the SCP.NET Context object; they are used to emit a tuple or tuples to a specific StreamId. The StreamId is a string and it needs to be consistent in both the C# code and the topology definition spec. Emitting to a non-existing stream will cause runtime exceptions.

/* Emit tuple to the specific stream. */

public abstract void Emit(string streamId, List<object> values);

/* for non-transactional Spout only */

public abstract void Emit(string streamId, List<object> values, long seqId);

 

There is a sample SCP.NET storm topology that uses multi-stream in GitHub. You can download the hdinsight-storm-examples from github and navigate to the HelloWorldHostModeMultiSpout sample under SCPNetExamples folder.

It is pretty straightforward. Below I am copying the code from the program.cs file of the sample, which shows how to define the topology using the TopologyBuilder when you have multiple spouts.

// Use TopologyBuilder to define a Non-Tx topology
// And define each spouts/bolts one by one
TopologyBuilder topologyBuilder = new TopologyBuilder("HelloWorldHostModeMultiSpout");

// Set a User customized config (SentenceGenerator.config) for the SentenceGenerator
topologyBuilder.SetSpout(
    "SentenceGenerator",
    SentenceGenerator.Get,
    new Dictionary<string, List<string>>()
    {
        {SentenceGenerator.STREAM_ID, new List<string>(){"sentence"}}
    },
    1,
    "SentenceGenerator.config");

topologyBuilder.SetSpout(
    "PersonGenerator",
    PersonGenerator.Get,
    new Dictionary<string, List<string>>()
    {
        {PersonGenerator.STREAM_ID, new List<string>(){"person"}}
    },
    1);

topologyBuilder.SetBolt(
    "displayer",
    Displayer.Get,
    new Dictionary<string, List<string>>(),
    1)
    .shuffleGrouping("SentenceGenerator", SentenceGenerator.STREAM_ID)
    .shuffleGrouping("PersonGenerator", PersonGenerator.STREAM_ID);

To test the topology, open the HelloWorldHostModeMultiSpout.sln file in Visual Studio and then deploy it. For this sample topology I didn't have to make any changes to deploy it on my cluster. To deploy, in Solution Explorer right-click the HelloWorldHostModeMultiSpout project and select Submit to Storm on HDInsight.

Image may be NSFW.
Clik here to view.

You will see a pop-up window with a drop-down list. Select your Storm on HDInsight cluster from the Storm Cluster drop-down list, and then select Submit. You can monitor whether the submission is successful by using the Output window.

Image may be NSFW.
Clik here to view.

Once the topology has been successfully submitted, you will see it listed under your Storm cluster in Server Explorer, from where you can view more information about the running topologies.

For more information on how to develop and deploy storm topologies in Visual Studio using SCP.NET please check Develop C# topologies for Apache Storm on HDInsight using Hadoop tools for Visual Studio and Deploy and manage Apache Storm topologies on Windows-based HDInsight Azure Documentation articles.

How to allow Spark to access Microsoft SQL Server

 

Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed; on Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information, or need to download the driver, you can start here: Microsoft SQL Server JDBC.

Spark needs to know the path to the sqljdbc4.jar. There are multiple ways to add the path to Spark's classpath. Spark has two runtime environment properties that can do this: spark.driver.extraClassPath and spark.executor.extraClassPath. To review all the properties available, see Spark's Configuration – Spark 1.4.1 Documentation.

If you use spark-shell or spark-submit you can pass these properties with --conf. I like to add the properties to Spark's default configuration file at /etc/spark/conf/spark-defaults.conf. A third option is to include the sqljdbc4.jar in your assembly jar. This same technique works for other jars that your Spark application might need. Whichever technique you choose, Spark needs to know where to find the sqljdbc4.jar for both the driver application and the executors.
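
For example, the two spark-defaults.conf lines below (or the equivalent --conf switches on spark-shell / spark-submit) point both the driver and the executors at the jar. This is only a sketch; the jar path is the HDInsight location mentioned above and may differ on your cluster:

spark.driver.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar
spark.executor.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar

spark-shell --conf "spark.driver.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar" --conf "spark.executor.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar"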

You can check the environment tab in the Spark Properties section to verify the properties are set.

Image may be NSFW.
Clik here to view.

 

Spark’s API is very dynamic and changes are being made with each new release, especially around JDBC. If you are going to use Spark with JDBC I would suggest reviewing Spark’s API documentation for the version of Spark you are using Spark 1.4.1 API to make sure the methods are still valid and the same behavior exists. Depending on the release there are a few places to look for methods involving JDBC, which include SQLContext, DataFrame, and JdbcRDD. Also notice that some methods are marked experimental and or deprecated. Make sure you test your code.

Some issues to consider are:

  • Make sure firewall ports are open for port 1433.
  • If using Microsoft Azure SQL Server DB, tables require a primary key. Some of the methods create the table, but Spark’s code is not creating the primary key so the table creation fails.

 

Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SqlContext is used to load the data from the t2 table into a DataFrame. I added the spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-defaults.conf file.

//Spark 1.4.1

//Insert data from DataFrame
case class Conf(mykey: String, myvalue: String)
val data = sc.parallelize( Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina") ))
val df = data.toDF()
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val table = "t2"
df.insertIntoJDBC(url, table, true)

//Load from database using SqlContext
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
val tbl = { sqlContext.load("jdbc", Map( "url" -> url, "driver" -> driver, "dbtable" -> "t2", "partitionColumn" -> "mykey", "lowerBound" -> "0", "upperBound" -> "100", "numPartitions" -> "1" ))}
tbl.show()

 

Image may be NSFW.
Clik here to view.

 

If you run a Microsoft SQL Server profiler trace while running the spark-shell you can see the table being created, data inserted and then data being read.

Image may be NSFW.
Clik here to view.

 

HDInsight and Spark make a great platform to process and analyze your data, but often data resides in a relational database system like Microsoft SQL Server. Allowing Spark to read and write data from Microsoft SQL Server lets you create a richer pipeline.

 

Hope this helps,

 

Image may be NSFW.
Clik here to view.

Bill

Incremental data load from Azure Table Storage to Azure SQL using Azure Data Factory

 

Azure Data Factory is a cloud-based data integration service. The service not only helps to move data between cloud services but also helps to move data to and from on-premises systems - for example, moving data from Azure Blob storage to Azure SQL Database. You can find the supported data stores here.

Many business scenarios start with an initial data load, followed by continual incremental data loads, either hourly, daily or monthly. The focus of this post is to explain how to create an ADF solution and how incremental data can be loaded.

Prerequisite

  • Azure Subscription
  • Azure Table Storage which hosts the source data
  • Azure SQL DB which holds the table to store the data
  • PowerShell/.NET SDK if you want to create the ADF solution from those platforms. In this blog we'll be using the Azure portal, so they are not required.

 

Setting up the initial scenario

Every day new data lands in an Azure Table and needs to be loaded into an Azure SQL Database on some periodic basis - daily, hourly or monthly depending on the schedule desired. We don't want to upload the entire table every day; rather we just want to add the new data at the destination. In this example I will focus on the Employee data defined in the next section.

Azure Table storage

Here is the sample data that will be our Copy Activity source. In this example we will be using Azure Storage Explorer.

Image may be NSFW.
Clik here to view.
image

Here is the schema defined for records.

Image may be NSFW.
Clik here to view.
image

PartitionKey, RowKey and Timestamp are automatically included for every entity. The PartitionKey and RowKey values are the user's/developer's responsibility to fill in, whereas the Timestamp value is managed by the server and can't be changed or modified. So if the schema doesn't have any property that defines when a record was added, the Timestamp property can be used. For more information about the Table service data model please click here.

As shown in the above screenshot, Azure Storage Explorer 6.0.0 doesn't show the Timestamp property.

Azure SQL Server

Below is the table definition defined in Azure SQL Database that will be our Copy Activity Sink (destination).

Image may be NSFW.
Clik here to view.
image

Azure Data Factory Solution

In this section, we will be creating an Azure Data Factory.

Creating ADF Service

Image may be NSFW.
Clik here to view.
image

  • Enter Azure Data Factory Name
  • Select Subscription
  • Select/Create new Resource group name

Image may be NSFW.
Clik here to view.
image

  • Select Pin to Dashboard
  • Click Create
  • Once the ADF is created, a tile will be added to the home screen.

Image may be NSFW.
Clik here to view.
image

  • Click Author and deploy

Image may be NSFW.
Clik here to view.
image

 

Creating Data Sources

Azure Storage

    • Click New data store

             Image may be NSFW.
Clik here to view.
image

    • Select Azure storage
    • Provide storage account name and account key

             Image may be NSFW.
Clik here to view.
image

    • Click Deploy

Azure SQL

    • Click New data store
    • Click Azure SQL

             Image may be NSFW.
Clik here to view.
image

    • Provide Server name
    • Provide Initial Catalog (database name)
    • Provide user id and password

            Image may be NSFW.
Clik here to view.
image

    • Click Deploy

Creating Dataset

Azure Table

    • Click New dataset

             Image may be NSFW.
Clik here to view.
image

    • Select Azure table
    • Refer to the Datasets article to understand properties like structure, published, type, typeProperties, etc.

            Image may be NSFW.
Clik here to view.
image

    • Click Deploy

Azure SQL

    • Click New dataset

             Image may be NSFW.
Clik here to view.
image

    • Select Azure SQL

             Image may be NSFW.
Clik here to view.
image

    • Click Deploy

Creating Pipeline

  • Click New pipeline

Image may be NSFW.
Clik here to view.
image

  • Click Add activity and select Copy activity

Image may be NSFW.
Clik here to view.
image

  • Add the code for the data transformation. It will look like below once done.

Image may be NSFW.
Clik here to view.
image

  • Click Deploy. The diagram will look like below.

Image may be NSFW.
Clik here to view.
image

Once this pipeline is deployed, it will start processing the data based on the start and end time. In this blog we define the scheduler frequency as Day and the interval as 1, which means it will run once daily. We also define the start and end time for this pipeline, and per that definition it will run for only one day. For more information on pipeline execution please refer to Scheduling and Execution with Data Factory.

Go back to the Data Factory which we created in the web portal. Click Pipelines and select the pipeline which we created just now.

 

Image may be NSFW.
Clik here to view.
image

On the summary page click Datasets, Consumed. Notice the Pipeline Activities detail shows it executed once.

Image may be NSFW.
Clik here to view.
image

Go back and click Datasets, Produced. Notice the slice is in the "Ready" state, meaning data has been transferred. Sometimes it will show Pending Execution, which means it's waiting to execute; In Progress means it is still running. You may notice the Error state in case there is any error during the design or execution of the pipeline.

Image may be NSFW.
Clik here to view.
image

Checking Execution

You can use SQL Server Management Studio to confirm whether the data arrived. Do a select query on the destination table to confirm.

Image may be NSFW.
Clik here to view.
image

Incremental load

Now that we have all the data loaded at the destination, the next step is to move incremental data on a daily basis rather than deleting and inserting the whole set of data. In this example, we will add some additional records with different dates and will insert data for a specific date. Below is the sample data added in Azure Table storage.

Image may be NSFW.
Clik here to view.
image

As an example, for the next day, we'll move data only for 9-January-2016. The first change we will make in the pipeline is to fetch only specific data from the source. To do this we'll add the line below to the JSON property azureTableSourceQuery of the Data Factory pipeline.

"azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)"

The above query will filter records based on the slice start time using the RecordAddedDate property. $$ is used to invoke Data Factory macro functions. Since RecordAddedDate is a date-time property, we need to add the datetime prefix to cast it. In this example we filter records where RecordAddedDate equals the SliceStart date. We can also use SliceStart, SliceEnd, WindowStart or WindowEnd as the parameter.

*Note: if the RecordAddedDate property is not defined in the table entity, we can use the Timestamp property. Please refer to the Azure Table storage section at the beginning of this blog.
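
Before wiring the filter into the pipeline, you can sanity-check the same OData predicate directly against the source table with the Azure Storage SDK for Python (the same vintage SDK used in the Python post earlier in this blog; method and parameter names have changed in newer releases). The account, key and table name below are placeholders:

# Sketch: run the same OData filter directly against the source table.
from azure.storage.table import TableService

table_service = TableService(account_name='mystorage_account', account_key='mystorage_account_key')

# Same predicate the azureTableSourceQuery uses, with the slice date filled in by hand
odata_filter = "RecordAddedDate eq datetime'2016-01-09T00:00:00'"

for entity in table_service.query_entities('Employee', filter=odata_filter):
    print entity.PartitionKey, entity.RowKey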

Add changes and deploy

Image may be NSFW.
Clik here to view.
image

Once the JSON is deployed, and the Produced dataset state is in Ready status, you can query the SQL Server table to see the second copy activity’s output. Notice it has only transferred data for 09-Jan-2016.

Image may be NSFW.
Clik here to view.
image

Generally, with such a scheduler definition, the ADF pipeline executes at 12:00:00 AM daily. If you need to run the pipeline at a different time than the default (12:00:00 AM), say 6:00:00 AM, then add the "offset": "06:00:00" property.

Thanks to my colleague Jason for reviewing this post.

Hope it’s helpful.

Encoding the Hive query file in Azure HDInsight

Today at Microsoft we were using Azure Data Factory to run Hive Activities in Azure HDInsight on a schedule. Things were working fine for a while, but then we got an error that was hard to understand. I’ve simplified the scenario to illustrate the key points. The key is that Hive did not like the Byte Order Mark (first 3 bytes) in the hive .hql file, and failed with an error. Be careful which text editor and text encoding you choose when saving your Hive Query Language (HQL) command into a text file.

We are using Azure Data Factory (Feb 2016) with a linked service to Microsoft Azure HDInsight Hadoop on Linux (version 3.2.1000.0) which is the distribution from Hortonworks Data Platform version 2.2.7.1-36 and includes Hive 0.14.0.

Error message:

Hive script failed with exit code ’64’. See ‘wasb://adfjobs@mycontainer.blob.core.windows.net/HiveQueryJobs/92994500-5dc2-4ba3-adb8-4fac51c4d959/05_02_2016_05_49_34_516/Status/stderr’ for more details.

To drill into that stderror logging, In the Data Factory activity blade, we see the error text, and see the log files below. We click on the stderr file and read it in the log blade.

Stderr says:

WARNING: Use “yarn jar” to launch YARN applications.

Logging initialized using configuration in jar:file:/mnt/resource/hadoop/yarn/local/filecache/11/hive.tar.gz/hive/lib/hive-common-1.2.1.2.3.3.1-1.jar!/hive-log4j.properties

FAILED: ParseException line 1:0 character ” not supported here

That’s strange, because the Hive query file looks fine to the naked eye.

Image may be NSFW.
Clik here to view.

 

If you use a binary editor though you can see these two text files are not the same.

I have Visual Studio handy, and it can show the hex representation of these files.

File > Open > File…

Image may be NSFW.
Clik here to view.

 

Next to the open button there is a drop down – choose “Open with…”

Image may be NSFW.
Clik here to view.

Choose Binary Editor.

Image may be NSFW.
Clik here to view.

 

Notice that file2.hql shown below has 3 extra bytes in the beginning. This is called a byte-order mark (BOM).

Image may be NSFW.
Clik here to view.

The Standards Controversy

There is some controversy about whether UTF-8 text files should have a byte order mark or not.

The standards folks say don’t use it in most cases, but also don’t remove it if you already have a BOM. We software developers sometimes forget that and are inconsistent -Microsoft Windows vs. Linux vs. Apache Hive

Citing the crowdsourced Wikipedia information here for neutrality:

            https://en.wikipedia.org/wiki/Byte_order_mark

UTF-8 byte order mark is EF BB BF in hex.

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.

The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.

In Linux distributions since around 2004 the convention is to generate UTF-8 files without a BOM.

Microsoft compilers[10] and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and interpret as UTF-8 on reading only when the BOM is present.

 

The Litmus Test

So let’s try the two .hql text files in Hive console directly outside of Azure Data Factory to see if they work in Hive alone. I am using SSH to connect to my Linux head node in HDInsight. I have uploaded the two files using SFTP (easy to do in MobaXTerm if you like that 3rd party tool – I can’t endorse any specific tool, but seems nice). The source command here is running the query file from within the hive app.

1. First I launch Hive on my SSH session.

Cd /bin 
Hive 

 

2. Then I run the script from the first file. It works well and returns a list of my Hive tables from my metastore.

source /home/sshuser/file1.hql; 

OK 

 

3. Then I run the script from the second file. It fails with a parsing error

source /home/sshuser/file2.hql; 
FAILED: ParseException line 1:0 character '' not supported here 

Image may be NSFW.
Clik here to view.

 

 

The lesson learned

So the lesson learned is that Hive doesn’t like the .hql text files when they are encoded in UTF-8 with a Byte-order mark up front in the hex/binary.

UTF-8 is the encoding of choice, but the byte-order mark is not desired.

Action Required!

Check your favorite text editor to see if it is marking the BOM or not.

1. Windows Notepad seems to encode UTF-8 files to include a Byte Order Mark, so avoid that one.


 

2. Visual Studio 2013 / 2015 gives you the option to save “With Encoding” from the drop-down beside the Save button, and then you can pick which encoding you want.


 

First trial – I picked Unicode (UTF-8 with signature) – Codepage 65001. That “signature” means it adds a BOM, which Hive does not interpret correctly.


 

Second trial – I picked Unicode (UTF-8 without signature) – Codepage 65001. Looks better now. That’s the one without the BOM in the first three bytes, so Hive should be OK to read this one.


3. Linux’s common Vi text editor


The hex representation does not have a BOM, so we are good to use this with Hive.


 

Hive can run this one OK.


4. Nano – another easy text editor on Linux


Looks good as well – no BOM.


Hive can run this one OK too.


 

To be continued… There are many more text editors, so please let me know if you have trouble with encoding using another text editor when saving Hive queries to text.

Hope this helps someone out there. Let us know if it does, or if you still get stuck, post a comment below, or try the Azure forums for help.

Happy Hadooping! Jason 

Encoding 101 – Exporting from SQL Server into flat files, to create a Hive external table

Today in Microsoft Big Data Support we faced the issue of how to correctly move Unicode data from SQL Server into Hive via flat text files. The main issue was encoding special Unicode characters from the source database, such as the degree sign (Unicode 00B0) and other complex Unicode characters outside of the A-Z, 0-9 range.

The goal was to get Hive to read the same strings SQL Server had saved out to text files and present them identically to the Hive consumer. We could have used Sqoop if there were a connection between Hadoop and the SQL Server, but that was not possible here: the data crossed company boundaries, so shipping files was the easier approach.

It was tricky to do, but we found a couple of solutions. When the Unicode string values are exported from Microsoft SQL Server via SSIS or the Import Export Wizard, they look fine to the naked eye, but run SELECT * FROM HiveTable; and the data looks different.

Microsoft SQL Server: zyx°°° Looks good

Hive: zyx��� Uh-oh! We’ve got trouble.

 

That Unicode string (NVARCHAR) value in SQL appears as zyx°°°. We export that data to a flat file using SSIS or the Import Export Wizard, then copy the files into Azure Blob Storage. Next, in Azure HDInsight, when a Hive table is created atop those files, the same characters appear as garbled black question marks – zyx��� – as if the characters are unknown to Hive’s interpretation.

Linux and Hive default to text files encoded as UTF-8. That differs from the SSIS Flat File Destination’s Unicode output, which is UTF-16 little-endian.

We found two ways to make them compatible.

  1. Change the export options in SQL Server SSIS Flat File Destination to uncheck the “Unicode” checkmark and select code page 65001 (UTF-8) instead.
  2. Keep the Unicode encoding as is in SSIS, but tell Hive to interpret the data differently using serdeproperties ('serialization.encoding'='ISO-8859-1');

 

Time for a little trial and error

1. Set up an example database

These are the steps I used to see the issue. You can use SQL Server (any version from 2005-2016).

Using SQL Server Management Studio, I create a database TestHive and a table T1 in SQL Server, then insert some data into the NVarchar column, including some special characters. NChar, NVarChar, and NVarChar(max) are the double-byte Unicode data types in SQL Server columns, used for global language support.

CREATE DATABASE TestHive

GO

USE TestHive

GO

CREATE TABLE T1(Col1 Int, Col2 NVarchar(255));

GO

INSERT INTO T1 VALUES

(1,N'abcdef'),

(2,N'zyx°°°'),

(3,N'123456')

GO

SELECT * FROM T1

 

2. Run the SQL Server Import Export Wizard from the Start menu to copy the rows into a text file.

 

On the Source – point to your SQL Server instance, and select the database and table you want to pull from.

On the Destination – choose Flat File Destination, point to a file path on the local disk, and select the Unicode checkbox next to the locale.


You could use the same Flat File Destination from within an Integration Services (SSIS) package design and run that if you prefer more control over how the data is transformed in the Data Flow Task, but the Import Export Wizard does the simple copy that we need here.

This is the output – so far looks good.


3. Connect to my Azure HDInsight cluster in the cloud to upload the file.

 

I used a Linux-based HDInsight Hadoop cluster, so I will use SSH to connect to the head node.

Then I create a directory, and upload the first file into that location in blob store. This is much like saving the file from the local disk into HDFS for the purposes of Hadoop outside of Azure.

hadoop fs -mkdir -p /tutorials/usehive/import1/

hadoop fs -copyFromLocal tableexport_ssisunicode.txt wasb:///tutorials/usehive/import1/tableexport_ssisunicode.txt
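To confirm the file landed where the Hive table will look for it, a quick listing of the target folder helps (same path as above):

hadoop fs -ls wasb:///tutorials/usehive/import1/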

You could also upload the files directly to Azure Blob Storage with tools such as Visual Studio (the Azure tooling in VS 2013/2015, including Cloud Explorer) or Microsoft Azure Storage Explorer.

 

4. Run Hive to read that first folder

Now that the text file is ready in Blob storage, I can run Hive, create a table, and query the file saved in the import1 folder. From my SSH session I simply run hive.

cd /bin

hive

DROP TABLE Import1;

CREATE EXTERNAL TABLE Import1(col1 string, col2 string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tutorials/usehive/import1/' tblproperties ("skip.header.line.count"="1");

SELECT * FROM Import1;

 

Notice the strange characters ��� where I expected my degree signs.
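A quick hex dump of the exported file explains the garbage: the SSIS “Unicode” option writes UTF-16 little-endian, so the file starts with an FF FE byte-order mark and every Latin character is followed by a 00 byte, none of which Hive’s default UTF-8 reader expects. A sketch, assuming the file name from step 2:

xxd -g 1 tableexport_ssisunicode.txt | head -n 3   # expect ff fe at offset 0 and a 00 after each ASCII character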


Just to be sure it’s not something specific to my SSH client (MobaXTerm here), I also try the Ambari web dashboard for my Azure HDInsight cluster, using the Hive View from the menu icon in the upper right.


 

5. OK let’s try exporting again – this time changing the flat file encoding setting.

In the SSIS Import Export Wizard (SSIS Flat File Destination) choose code page UTF-8.

- Uncheck the “Unicode” checkmark.

- Choose code page 65001 (UTF-8)


 

6. Upload that second file to Linux (SFTP) and then copy into HDFS or Azure Blob Storage.

I made a new folder, so I could compare my trials side-by-side.

hadoop fs -mkdir -p /tutorials/usehive/import2/

hadoop fs -copyFromLocal tableexport_ssisunicode2.txt wasb:///tutorials/usehive/import2/tableexport_ssisunicode2.txt

 

7. Now test the Hive table again with the UTF-8 encoded file in the second folder

Run Hive

cd /bin

hive

DROP TABLE Import2;

CREATE EXTERNAL TABLE Import2(col1 string, col2 string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tutorials/usehive/import2/' tblproperties ("skip.header.line.count"="1");

SELECT * FROM Import2;

quit;

My zyx°°° looks normal now! Success!


 

The Hive View in Ambari likes this data too.

8. An alternative – tell Hive to encode/decode the external files differently

Perhaps you don’t want to change the file format to UTF-8 (the most universal format in Linux and Hadoop), or maybe you cannot change the format at all because the files come from an outside party.

Starting with Hive 0.14, Hive has a simple way to change the serialization encoding (for example, for interpreting the byte encoding of text files).

The change is explained here: https://issues.apache.org/jira/browse/HIVE-7142

 

1. You can create the table from scratch with this code page serializer and override the serialization.encoding property with a code page that best matches your source data encoding.

CREATE TABLE person(id INT, name STRING, desc STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES('serialization.encoding'='ISO-8859-1');
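For the scenario in this post the table also needs to be EXTERNAL and pointed at the blob folder, so here is a sketch that combines the two, run from the SSH session. The table name Import1_latin1 is just illustrative, and the encoding token has to match whatever bytes are really in your files – ISO-8859-1 is only the guess discussed further down:

hive -e "CREATE EXTERNAL TABLE Import1_latin1(col1 string, col2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.encoding'='ISO-8859-1')
STORED AS TEXTFILE LOCATION '/tutorials/usehive/import1/'
TBLPROPERTIES ('skip.header.line.count'='1');"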

 

2. Or if you have an existing table, this can be adjusted after the fact. You need to carefully match your encoding to whatever kind of files will be presented in the storage underneath this Hive table.

Choose one, or make your own…

ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='US-ASCII');
ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='ISO-8859-1');
ALTER TABLE Import1 SET serdeproperties ('serialization.encoding'='UTF-8');
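Either way, you can confirm the setting took effect by describing the table; the serialization.encoding value should appear under the Storage Desc Params section:

hive -e "DESCRIBE FORMATTED Import1;"   # look for serialization.encoding in the output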

 

To find out which tokens to use to the right of the equals sign, refer to the charset names listed in the Java documentation:
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

I am guessing the right one to match SQL’s “Unicode” export is the following, but it needs to be tested further to be totally sure that ALL characters are interpreted as expected.

ISO-8859-1   ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1

 

Other tips we didn’t have time to try yet:

Someone made a tool to help convert files if it is not possible to change the format of SSIS or BCP exports to text files: https://code.msdn.microsoft.com/windowsdesktop/UTF8WithoutBOM-Converter-7a8218af
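If scripting on the Linux side is an option, a similar conversion can also be done with standard tools. A sketch, assuming the UTF-16LE file exported in step 2; the sed step drops a UTF-8 BOM in case the conversion carries one over, and the output file name is just illustrative:

iconv -f UTF-16LE -t UTF-8 tableexport_ssisunicode.txt | sed '1s/^\xEF\xBB\xBF//' > tableexport_utf8.txt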

 

Hope this helps someone out there. Let us know if it does, or if you still get stuck, post a comment below, or try the Azure forums for help.

 

Happy Hadooping! Jason
