If you ran HBase on the same cluster as Hadoop, you would significantly cut into the memory available for MapReduce jobs, and you don't really need HBase's random read/update capability for an OLAP system. You can load your data into the Hadoop cluster using Flume or manually. Equipment monitoring data lends itself to partitioning by time, for example by calendar date. Once the data is loaded into a directory structure that maps to a partitioned Hive table, you can query it with HiveQL; for trickier analyses you can either write MapReduce jobs in Java or use Pig.
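As a rough illustration of that setup, here is a minimal HiveQL sketch. The table name, columns, and directory layout are my own assumptions for the example, not anything from the original question.

    -- Assumed layout: one directory per calendar date, e.g. /data/monitoring/dt=2012-06-01/
    CREATE EXTERNAL TABLE IF NOT EXISTS equipment_metrics (
      equipment_id  STRING,
      metric_name   STRING,
      metric_value  DOUBLE,
      event_time    TIMESTAMP
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/monitoring';

    -- Register a newly loaded day's directory as a partition.
    ALTER TABLE equipment_metrics ADD IF NOT EXISTS PARTITION (dt='2012-06-01');

    -- Ad-hoc query; the partition predicate limits the scan to a single day.
    SELECT equipment_id, AVG(metric_value) AS avg_value
    FROM equipment_metrics
    WHERE dt = '2012-06-01' AND metric_name = 'temperature'
    GROUP BY equipment_id;

The partition column keeps queries over a single day or date range from scanning the whole data set.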
The problem is that responses will not come back instantaneously. That is fine for ad hoc analysis, but it can be frustrating if you are trying to look at commonly used, pre-determined metrics. In the latter case you should consider precalculating those metrics and loading the results into a memory cache or even a relational database. I have seen such frequently used results cached in HBase, but I just cannot get over wasting half of the available RAM on a cluster for that purpose.
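The precalculation step could itself be a scheduled Hive job that writes a small summary table, which you then export to whatever cache or relational database serves the reports. Again, this is only a sketch built on the hypothetical equipment_metrics table above.

    -- Precompute a commonly requested daily metric into a compact summary table.
    CREATE TABLE IF NOT EXISTS daily_equipment_summary (
      equipment_id  STRING,
      metric_name   STRING,
      avg_value     DOUBLE,
      max_value     DOUBLE
    )
    PARTITIONED BY (dt STRING);

    INSERT OVERWRITE TABLE daily_equipment_summary PARTITION (dt='2012-06-01')
    SELECT equipment_id, metric_name, AVG(metric_value), MAX(metric_value)
    FROM equipment_metrics
    WHERE dt = '2012-06-01'
    GROUP BY equipment_id, metric_name;

The resulting partition is small enough to copy into a cache or relational store, so the interactive reporting layer never has to wait on a MapReduce job.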
I want to write an application that can generate reports and enable interactive, OLAP-like data analysis on monitoring data from a large production system. (I know there are some problematic trade-off decisions ahead, but let's set them aside for now.)
I identified the following possibilities for the basic tech stack:
Based on my research, I tend to believe that Hadoop/HBase/Hive would be the most common combination, but this impression rests only on a number of forum questions and product presentations.
Can someone share their general opinion on the subject?
Or, to be more specific, answer the following questions: