Big data analysis: choosing a technology stack

Answers (2)

manpreet · Best Answer · 2 years ago

 

I want to write an application that can generate reports and support interactive, OLAP-like data analysis over monitoring data from a large production system. (I know there are some problematic trade-off decisions ahead, but let's set them aside for now.)
I identified the following possibilities for the basic tech stack:

  • Hadoop: for the distributed file system and MapReduce framework
  • Database: HBase or Cassandra to enable random reads
  • Analysis: Hive or Pig for advanced analysis

Based on my research, I tend to believe that Hadoop/HBase/Hive would be the most common combination, but that impression rests only on a number of forum questions and product presentations.
Can someone share their general opinion on the subject?
Or, to be more specific, answer the following questions:

  • Is HBase in general a more suitable store for big data analysis than Cassandra (write vs. read performance)?
  • Is it worth using a database at all, or should I build my analysis layer directly on Hadoop?
  • Which database/analysis tool combinations are the most "natural"?
  • Did I miss any cool stuff?
manpreet · 2 years ago

If you run HBase on the same cluster as Hadoop, you will significantly cut into the memory available for MapReduce jobs, and you don't really need HBase's random read/update capability for an OLAP-style system. You can load your data into the Hadoop cluster with Flume or manually. Equipment monitoring data lends itself to partitioning by time, for example by calendar date. Once the data is loaded into a directory structure that can be mapped to a partitioned Hive table, you can query it with HiveQL. For the trickiest analyses you can either write MapReduce jobs in Java or use Pig.
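As a minimal sketch of that layout, assuming the monitoring records arrive as tab-delimited files under date-based directories such as /data/monitoring/dt=2022-08-15 (the paths, table name, and columns below are hypothetical, for illustration only):

    -- External table mapped onto date-partitioned directories
    CREATE EXTERNAL TABLE monitoring_events (
      host        STRING,
      metric      STRING,
      value       DOUBLE,
      event_time  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/monitoring';

    -- Register a freshly loaded day of data as a new partition
    ALTER TABLE monitoring_events ADD PARTITION (dt='2022-08-15')
      LOCATION '/data/monitoring/dt=2022-08-15';

    -- Ad hoc query; the partition predicate prunes the scan to a single day
    SELECT host, metric, AVG(value) AS avg_value
    FROM monitoring_events
    WHERE dt = '2022-08-15'
    GROUP BY host, metric;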

The problem is that responses will not come instantaneously. This is OK for ad hoc analysis, but it might be frustrating if you are trying to look at some commonly used, pre-determined metrics. In the latter case you should consider precalculating such metrics and loading the results into a memory cache or even a relational database. I have seen such frequently used results cached in HBase; I just cannot get over wasting half of the available RAM on a cluster for that purpose.
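To make the precalculation idea concrete, a rough sketch (reusing the hypothetical monitoring_events table from above) is a periodic Hive job that rolls the common metrics up into a small summary table, whose contents can then be exported to a relational database or cache, e.g. with Sqoop or a short script:

    -- Small summary table holding the frequently requested daily metrics
    CREATE TABLE IF NOT EXISTS daily_metric_summary (
      host       STRING,
      metric     STRING,
      avg_value  DOUBLE,
      max_value  DOUBLE
    )
    PARTITIONED BY (dt STRING);

    -- Recompute one day's aggregates; run after each daily load
    INSERT OVERWRITE TABLE daily_metric_summary PARTITION (dt='2022-08-15')
    SELECT host, metric, AVG(value), MAX(value)
    FROM monitoring_events
    WHERE dt = '2022-08-15'
    GROUP BY host, metric;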

 

