Big data analysis: choosing a technology stack

Answers (2)

manpreet · Best Answer · 2 years ago

 

I want to write an application that can generate reports and support interactive, OLAP-like data analysis over monitoring data from a large production system. (I know there are some problematic trade-off decisions ahead, but let's set them aside for now.)
I identified the following possibilities for the basic tech stack:

  • Hadoop: for the distributed file system and MapReduce framework
  • Database: HBase or Cassandra to enable random reads
  • Analysis: Hive or Pig for advanced analysis

Based on my research, I tend to believe that Hadoop/HBase/Hive would be the most common combination, but that impression rests only on a number of forum questions and product presentations.
Can someone share their general opinion on the subject?
Or, to be more specific, answer the following questions:

  • Is HBase in general a more suitable store for big data analysis than Cassandra (write vs. read performance)?
  • Is it worth using a database at all, or should I build my analysis layer directly on Hadoop?
  • Which database/analysis tool combinations are the most "natural"?
  • Did I miss any cool stuff?
manpreet · 2 years ago

If you run HBase on the same cluster as Hadoop, you will significantly cut into the memory available for MapReduce jobs, and you don't really need HBase's random read/update capability for an OLAP-style system. You can load your data into the Hadoop cluster with Flume or manually. Equipment monitoring data lends itself to partitioning by time, for example by calendar date. Once the data is loaded into a directory structure that can be mapped to a partitioned Hive table, you can query it with HiveQL. For the trickiest analyses you can either write MapReduce jobs in Java or use Pig.
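As a minimal sketch of that layout, assuming the monitoring records arrive as tab-delimited files under date-based directories such as /data/monitoring/dt=2022-08-15 (the paths, table name, and columns below are hypothetical, for illustration only):

    -- External table mapped onto date-partitioned directories
    CREATE EXTERNAL TABLE monitoring_events (
      host        STRING,
      metric      STRING,
      value       DOUBLE,
      event_time  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/monitoring';

    -- Register a freshly loaded day of data as a new partition
    ALTER TABLE monitoring_events ADD PARTITION (dt='2022-08-15')
      LOCATION '/data/monitoring/dt=2022-08-15';

    -- Ad hoc query; the partition predicate prunes the scan to a single day
    SELECT host, metric, AVG(value) AS avg_value
    FROM monitoring_events
    WHERE dt = '2022-08-15'
    GROUP BY host, metric;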

The problem is that responses will not come instantaneously. This is OK for ad hoc analysis, but it might be frustrating if you are trying to look at some commonly used, pre-determined metrics. In the latter case you should consider precalculating such metrics and loading the results into a memory cache or even a relational database. I have seen such frequently used results cached in HBase; I just cannot get over wasting half of the available RAM on a cluster for that purpose.
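To make the precalculation idea concrete, a rough sketch (reusing the hypothetical monitoring_events table from above) is a periodic Hive job that rolls the common metrics up into a small summary table, whose contents can then be exported to a relational database or cache, e.g. with Sqoop or a short script:

    -- Small summary table holding the frequently requested daily metrics
    CREATE TABLE IF NOT EXISTS daily_metric_summary (
      host       STRING,
      metric     STRING,
      avg_value  DOUBLE,
      max_value  DOUBLE
    )
    PARTITIONED BY (dt STRING);

    -- Recompute one day's aggregates; run after each daily load
    INSERT OVERWRITE TABLE daily_metric_summary PARTITION (dt='2022-08-15')
    SELECT host, metric, AVG(value), MAX(value)
    FROM monitoring_events
    WHERE dt = '2022-08-15'
    GROUP BY host, metric;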

 

