Technology stack for linear regression on (not so) large dataset

Answers (1)

manpreet · Best Answer · 2 years ago
While taking Coursera's Machine Learning course, I realized that I could use a database from the company I work for (~50 million records) to run some linear regression experiments.

But one of the steps in proposing this experiment is defining the technology stack required for the task.

From my understanding, the following tasks need to be covered:

  1. Read the raw data and store it in a non-production database
  2. Transform the data into a "regression-friendly" format
  3. Store the transformed data in an intermediate database
  4. Compute the actual regression

For #1 I can take a few paths, such as writing a custom .NET or Java program, or using an ETL process (the goal here is mostly to copy the data somewhere else so I don't touch the production database).
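To make #1 concrete, here is a rough sketch of the kind of copy job I have in mind, using Python/pandas as one possible tool. The connection strings, table and column names are made up for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical production database (read-only access) and a local staging store
prod = create_engine("mssql+pyodbc://readonly_user:secret@prod-server/sales"
                     "?driver=ODBC+Driver+17+for+SQL+Server")
staging = create_engine("sqlite:///staging.db")

# hypothetical raw table; chunksize keeps memory bounded for ~50M rows
for chunk in pd.read_sql("SELECT * FROM dbo.Listings", prod, chunksize=500_000):
    chunk.to_sql("listings_raw", staging, if_exists="append", index=False)
```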

With #2 the fun part begins: should I consider a specialized tool for a dataset of fewer than 100 million records? If so, what would you suggest for transforming the data into a matrix-like representation?
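For illustration, this is roughly how I picture the transformation step with pandas as one candidate tool. The column names are invented; the real table has ~30 columns:

```python
import pandas as pd
from sqlalchemy import create_engine

staging = create_engine("sqlite:///staging.db")

# work on a sample first; the full ~50M rows could be processed the same way in chunks
df = pd.read_sql("SELECT * FROM listings_raw LIMIT 1000000", staging)

target = "price"                                     # hypothetical column names below
numeric_cols = ["area_sqm", "rooms", "age_years"]
categorical_cols = ["neighborhood", "property_type"]

df = df.dropna(subset=[target])                      # drop rows without a target value
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# one-hot encode categoricals so every feature is numeric (matrix-like)
X = pd.get_dummies(df[numeric_cols + categorical_cols], columns=categorical_cols)
y = df[target]
print(X.shape)                                       # (rows, features)
```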

I believe #3 depends on #4: I see lots of examples (e.g. in R or Matlab/Octave) based on text or CSV files. Are these the standard input formats for these computations, or should I read directly from a database?
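Continuing from the sketch above (so `X`, `y` and `staging` are the objects built there), the two storage options I am weighing for #3 would look something like this:

```python
# option A: flat CSV file, the format most R/Octave examples read
X.assign(price=y).to_csv("listings_matrix.csv", index=False)

# option B: an intermediate database table, to be read later from R or Python
X.assign(price=y).to_sql("listings_matrix", staging, if_exists="replace", index=False)
```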

For #4, from what I understand, R is the way to go, right?
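If R turns out not to be the only option, I assume the equivalent fit in Python with scikit-learn would be only a few lines, something like:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("listings_matrix.csv")            # the intermediate file from step #3
X = data.drop(columns=["price"])
y = data["price"]

# hold out 20% of the rows to check how well the fit generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```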

Finally, should I consider a multi-gigabyte, multi-processor server, or, given that this is an experiment where spending a few hours on computation is not a big issue, will a 4 GB machine do the job?
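A quick back-of-envelope check of the memory footprint, assuming a dense float64 matrix of ~50 million rows by 30 columns:

```python
# dense float64 matrix: rows * columns * 8 bytes
rows, cols, bytes_per_value = 50_000_000, 30, 8
print(rows * cols * bytes_per_value / 1e9, "GB")      # -> 12.0 GB, well over 4 GB of RAM
```

So the full matrix would not fit in memory on a 4 GB machine; chunked processing, sampling, or float32 would be needed there.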

I am aware that this question may be considered too broad, but I would really like to hear what you think I should consider, and whether I am missing something (or heading down the wrong path entirely).

Regarding the data, you can think of it like the Boston house pricing dataset: it has 30 features (columns), and the goal is to predict the value of one of those columns.

(question originally posted on Stack Overflow)