Best technology to compare 2 large sets of data [closed]

General Tech Technology & Software 3 years ago

Manpreet Singh

Answers (1)

manpreet Best Answer 3 years ago

Problem

Every day we receive a new set of data files from our back-office application. The application cannot produce an incremental changeset; all it can do is dump everything to a large file.

Currently, every morning we drop our old MySQL tables and load the new data into our database.

One problem with this approach is that we cannot act on specific changes in the data. We also use CQRS, which would benefit considerably from an incremental list of changes.

The file format is currently CSV.
Data size per file is up to 10 GB.
The number of rows per file is up to 40 million.
There are approximately 30 data files.
On average, less than 1% of rows change each day.
Most files have either no primary key or a composite primary key. For many, the full row is the only thing that makes a record unique.
The order of the data is not fixed. Rows may switch positions.
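
For the files without any key, where the full row is the only identity and the order is not fixed, the comparison amounts to a multiset difference over whole rows. The following is only a minimal sketch of that idea, not a recommendation for a specific technology. The function names `count_rows` and `diff_unkeyed` are hypothetical, and the sketch assumes the row counts fit in memory, which would not hold for a 10 GB file without first splitting the data (for example by a hash of the row).

```python
import csv
from collections import Counter


def count_rows(lines):
    """Count how often each full row occurs; position in the file is ignored."""
    return Counter(tuple(row) for row in csv.reader(lines))


def diff_unkeyed(old_lines, new_lines):
    """Return (added, removed) as Counters mapping row -> number of copies.

    Counter subtraction keeps only positive counts, so duplicate rows
    are handled correctly. Assumption: both Counters fit in memory.
    """
    old_counts = count_rows(old_lines)
    new_counts = count_rows(new_lines)
    added = new_counts - old_counts
    removed = old_counts - new_counts
    return added, removed


# Tiny in-memory stand-ins for yesterday's and today's dump.
old_dump = ["a,1", "b,2", "b,2", "c,3"]
new_dump = ["c,3", "b,2", "a,1", "d,4"]
added, removed = diff_unkeyed(old_dump, new_dump)
```

Without a key, a modified row can only be reported as one removal plus one addition, which matches the "changed (if a row identifier exists)" condition in the desired situation.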

Desired situation

When we receive the new data, we calculate the difference and push a message into Kafka for each changed (if a row identifier exists), added, or removed row.
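
For files that do have a (composite) key, the desired output could look roughly like the sketch below. It is an illustration under assumptions: `diff_keyed` and the event tuple layout are hypothetical, rows are already parsed into lists of strings, both snapshots fit in memory, and keys are unique within a file. Publishing to Kafka is deliberately left out; each event would be handed to whatever producer is in use.

```python
def diff_keyed(old_rows, new_rows, key_columns):
    """Classify rows as added, changed, or removed based on a composite key.

    key_columns lists the column positions that form the key.
    Returns a list of (event_type, key, row) tuples.
    """
    def index(rows):
        # Map each composite key to its full row.
        return {tuple(row[i] for i in key_columns): row for row in rows}

    old_index = index(old_rows)
    new_index = index(new_rows)

    events = []
    for key, row in new_index.items():
        if key not in old_index:
            events.append(("added", key, row))
        elif old_index[key] != row:
            events.append(("changed", key, row))
    for key, row in old_index.items():
        if key not in new_index:
            events.append(("removed", key, row))
    return events


# Columns 0 and 1 together form the key in this made-up example.
old_rows = [["1", "x", "a"], ["1", "y", "b"], ["2", "x", "c"]]
new_rows = [["1", "y", "B"], ["2", "x", "c"], ["3", "z", "d"]]
events = diff_keyed(old_rows, new_rows, key_columns=(0, 1))
```

Each tuple in `events` corresponds to one message that would be pushed into Kafka.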

Technology

We use AWS and can use any technology AWS offers.
We are not limited to a certain amount of hardware. We can simply start up new servers in AWS.
Cost is only a minor concern. We have a fairly large budget, and having an incremental set offers us a lot of value.
We have a running Kubernetes cluster.

Question

So the main question is: what would be the best way to compare these two large files and create an incremental set? We need it to be fast, preferably within an hour or close to that.

Are there database types that support this natively, or are there technologies that can do this for us?
