I have been asked to build a reconciliation tool that can compare two large datasets (assume the input sources are two Excel files).
Each row contains 40-50 columns, and records must be compared at each column level. Each file contains close to 3 million records, or roughly 4-5 GB of data. [The data may not be sorted.]
I would appreciate any hints.
Could the following technologies be a good fit?
Apache Spark
Apache Spark + Ignite [assuming real time reconciliation in between time frames]
Apache Ignite + Apache Hadoop
Any suggestions for building an in-house tool are also welcome.
I have also been working on the same problem.
You can load the CSV files into temporary tables using PySpark/Scala and query on top of the temp tables.
First a Warning:
Writing a reconciliation tool involves lots of small annoyances and edge cases: date formats, number formats (commas in numbers, scientific notation, etc.), compound keys, thresholds, ignored columns, ignored headers/footers, and so on.
If you only have one file to reconcile, with well-defined inputs, then consider doing it yourself.
However, if you are likely to try to extend it to be more generic then pay for an existing solution if you can because it will be cheaper in the long run.
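As an illustration of one of those annoyances, here is a minimal sketch (assuming US-style digit grouping; the class and method names are my own) of normalizing number formats so that "1,234.50" and "1.2345E3" reconcile as equal:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

public class NumberNormalizer {
    // Parse values like "1,234.50" or "1.2345E3" into a canonical BigDecimal
    // so that differently formatted but numerically equal values reconcile.
    static BigDecimal normalize(String raw) {
        String s = raw.trim();
        try {
            // Plain and scientific-notation numbers parse directly.
            return new BigDecimal(s);
        } catch (NumberFormatException e) {
            // Fall back to locale-aware parsing for grouped numbers like "1,234.50".
            DecimalFormat df = (DecimalFormat) NumberFormat.getInstance(Locale.US);
            df.setParseBigDecimal(true);
            ParsePosition pos = new ParsePosition(0);
            BigDecimal parsed = (BigDecimal) df.parse(s, pos);
            if (parsed == null || pos.getIndex() != s.length()) {
                throw new IllegalArgumentException("Unparseable number: " + raw);
            }
            return parsed;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("1,234.50").compareTo(normalize("1.2345E3")) == 0); // true
    }
}
```

Dates, thresholds, and per-locale grouping need similar treatment; each one is small, but together they are most of the work.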
Potential Solution:
The difficulty with a distributed process is matching the keys across unsorted files.
The issue with running it all in a single process is memory.
The approach I took for a commercial rec tool was to save the CSV to tables in H2 and use SQL to do the diff.
H2 is much faster than Oracle for something like this.
If your data is well structured, you can take advantage of H2's ability to load directly from CSV, and if you save the result in a table you can write the output to CSV as well, or use other frameworks to produce a more structured output or stream the result to a web page.
If your format is xls(x) rather than CSV, do a performance test of the various libraries for reading the file, as there are huge differences when dealing with files of that size.
I have been working on the above problem, and here is my solution:
https://github.com/tharun026/SparkDataReconciler
The prerequisites as of now are:
Both datasets should have the same number of columns
For now, the solution accepts only Parquet files.
The tool gives you a match percentage for each column, so you can see which transformation went wrong.
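For context, the per-column match percentage idea can be sketched in plain Java (without Spark); the data and names here are illustrative, not taken from the repository:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnMatchStats {
    // Given two tables keyed by record id (key -> column values),
    // compute the percentage of matching values for each column index,
    // over the keys present on both sides.
    static double[] matchPercentages(Map<String, String[]> left,
                                     Map<String, String[]> right,
                                     int columns) {
        double[] matches = new double[columns];
        int compared = 0;
        for (Map.Entry<String, String[]> e : left.entrySet()) {
            String[] r = right.get(e.getKey());
            if (r == null) continue;              // key missing on one side
            compared++;
            String[] l = e.getValue();
            for (int c = 0; c < columns; c++) {
                if (l[c].equals(r[c])) matches[c]++;
            }
        }
        for (int c = 0; c < columns; c++) {
            matches[c] = compared == 0 ? 0.0 : 100.0 * matches[c] / compared;
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String[]> a = new LinkedHashMap<>();
        Map<String, String[]> b = new LinkedHashMap<>();
        a.put("k1", new String[]{"x", "1"});
        a.put("k2", new String[]{"y", "2"});
        b.put("k1", new String[]{"x", "9"});
        b.put("k2", new String[]{"y", "2"});
        double[] pct = matchPercentages(a, b, 2);
        System.out.println(pct[0] + " " + pct[1]); // 100.0 50.0
    }
}
```

A column stuck at a low percentage then points directly at the transformation that produced it.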
Related
Part of my project is to index the s-p-o triples in N-Triples files, and I need some help figuring out how exactly to do this in Java (or another language if possible).
Problem statement:
We have around 10 files with the extension “.ntriple”, each containing at least 10k triples. The format of each file is multiple RDF triples:
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to do is index each of these subjects, predicates, and objects so that we can search and retrieve quickly for queries like “Give me all subjects and objects for predicate1_uri” and so on.
I gave it a try using this example, but I saw that it was doing a full-text search. This doesn't seem efficient, as the N-Triples files could be as large as 50 MB per file.
Then I thought of not doing a full-text search, and instead storing each s-p-o as an index Document, with each of s, p, and o as a Document Field, plus another Field as an Id (the offset of the s-p-o in the corresponding N-Triples file).
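That offset-index idea can also be sketched without Lucene at all. Here is a minimal plain-Java version (the names are hypothetical, and it assumes simple triples whose terms are all URIs) that maps each predicate to the byte offsets of the lines where it occurs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TripleOffsetIndex {
    // Matches a simple N-Triples line of the form: <s> <p> <o> .
    private static final Pattern TRIPLE =
        Pattern.compile("<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.?");

    // predicate URI -> byte offsets of the lines where it occurs
    final Map<String, List<Long>> byPredicate = new HashMap<>();

    void index(String fileContent) {
        long offset = 0;
        for (String line : fileContent.split("\n", -1)) {
            Matcher m = TRIPLE.matcher(line);
            if (m.matches()) {
                byPredicate.computeIfAbsent(m.group(2), k -> new ArrayList<>())
                           .add(offset);
            }
            offset += line.getBytes().length + 1; // +1 for the newline
        }
    }

    public static void main(String[] args) {
        TripleOffsetIndex idx = new TripleOffsetIndex();
        idx.index("<s1> <p1> <o1> .\n<s2> <p1> <o2> .\n<s2> <p2> <o3> .");
        System.out.println(idx.byPredicate.get("p1")); // [0, 17]
    }
}
```

With the offsets in hand, a query like "all subjects and objects for predicate1_uri" becomes a few seeks into the original file rather than a scan; for real N-Triples you would also need to handle literals and blank nodes, which this sketch does not.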
I have two questions:
Is Lucene the only option for what I am trying to achieve?
Would the index files themselves be larger than half the size of the data itself?
Any and all help really appreciated.
To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.
If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.
Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an NTRIPLE file, but it's all replications of the same triples, then the index will be relatively small. Typically, however, RDF databases make multiple indexes of the data and you will see a size increase.
The primary benefit of this indexing is that you can ask more generic questions than “Give me all subjects and objects for predicate1_uri”. That question can be answered by linearly processing all of the NTRIPLE files without even knowing you're using RDF. The following SPARQL-like query shows an example of a more difficult search facilitated by these data stores:
SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing .
  ?thing rdf:type/rdfs:subClassOf* :Automobile .
  ?thing :hasColor "red"@en .
}
In the preceding query, we locate owners of something that is an automobile, or any more specific subclass of automobile, so long as the color of that thing is "red" (tagged as English).
I'm writing Parquet files using the ParquetDatasetStoreWriter class, and the performance I get is really bad.
Normally the flow followed is this:
// First write
dataStoreWriter.write(entity #1);
dataStoreWriter.write(entity #2);
...
dataStoreWriter.write(entity #N);
// Then close
dataStoreWriter.close()
The problem is, as you might know, that my dataStoreWriter is a facade; the real writing work is done by a taskExecutor and a taskScheduler. You can see this work in messages like these printed to standard output:
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 685B for [localId] BINARY: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 75B for [factTime] INT64: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [period] INT32: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 6,304B for [objectType] BINARY: 300,000 values, ...
As you can see, I am writing 300K objects per Parquet file, which results in files of around 700 KB on disk. Nothing really big...
However, after one or two writes, I get fewer and fewer of these messages, and the process stalls...
Any idea about what could be happening? Everything is green in Cloudera...
Versions used:
Cloudera 5.11
Java 8
Spring Integration 4.3.12.RELEASE
Spring Data Hadoop 2.2.0.RELEASE
Edit: Actually, I isolated the writing of the Parquet files using the Kite Dataset CLI tool, and the problem is the performance of the SDK itself. Using the csv-import command and loading the data from a CSV, I see that we are writing at a rate of 400,000 records per minute, which is far below the roughly 15,000,000 records per minute coming in, hence the stalling...
Can you recommend any way of improving this writing rate? Thanks!
I am working with a database that is divided into a few dozen text files, each containing two columns and 200 lines.
Currently, I only load one of the text files and read its data into two arrays. I could simply go through the handful of text files and load the data one after the other, but I wanted to know what the approach would be for managing a "database" of this size, and what the "standard" format would be if it were included in the end application.
I could simply have a single text file holding all the data, which would end up 250 000 lines long. While this would work, I don't know whether it is professional or practical. A much better approach would be a single file where, via code, I could specify which table (the sub-text files are basically two-column tables, hence the few dozen of them) to read into the two arrays.
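For what it's worth, if you stay with plain files, the "single file with named tables selectable from code" idea can be sketched like this; the `[name]` section-header convention here is my own assumption, not a standard format:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SectionedTableFile {
    // Parses a single text file whose tables are separated by [name] headers,
    // each followed by two-column numeric lines, e.g.:
    //   [table1]
    //   1.0 2.0
    //   ...
    static Map<String, List<double[]>> parse(String content) {
        Map<String, List<double[]>> tables = new HashMap<>();
        List<double[]> current = null;
        for (String line : content.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith("[") && line.endsWith("]")) {
                current = new ArrayList<>();
                tables.put(line.substring(1, line.length() - 1), current);
            } else if (current != null) {
                String[] parts = line.split("\\s+");
                current.add(new double[]{Double.parseDouble(parts[0]),
                                         Double.parseDouble(parts[1])});
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        Map<String, List<double[]>> t =
            parse("[alpha]\n1 2\n3 4\n[beta]\n5 6");
        System.out.println(t.get("alpha").size() + " " + t.get("beta").get(0)[1]);
        // prints "2 6.0"
    }
}
```

That said, the answers below are right that at this size an embedded database is the more standard choice.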
Why not use a real database?
You could use an in-memory database.
I need to sort a huge CSV file (10+ million records) with several algorithms in Java, but I'm running into memory problems.
Basically I have a huge csv file where every record has 4 fields, with different type (String, int, double).
I need to load this CSV into some structure and then sort it by each field.
My idea was to write a Record class (with its own fields), read the CSV file line by line, create a new Record object for every line, and put them all into an ArrayList. Then call my sorting algorithms on each field.
It doesn't work: I get an OutOfMemoryError when I try to load all the Record objects into my ArrayList.
This way I create tons of objects, and I don't think that is a good idea.
What should I do with this huge amount of data? Which method/data structure would be less expensive in terms of memory usage?
My point is just to use the sorting algorithms and see how they behave with a big data set; saving the sorted result to a file is not important.
I know there are libraries for CSV, but I have to implement this without external libs.
Thank you very much! :D
Cut your file into pieces (depending on the size of the file) and look into merge sort. That way you can sort even big files without using a lot of memory; it's what databases use when they have to do huge sorts.
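Here is a minimal sketch of that chunk-and-merge (external merge sort) approach using only the JDK, sorting plain text lines; for your Record objects you would serialize the fields per line and compare on the chosen field instead of the whole line:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    // Sort the lines of a file with bounded memory: read chunks of at most
    // maxLines, sort each chunk in memory, spill it to a temp file, then
    // k-way merge the sorted runs into the output file.
    static void sort(Path input, Path output, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxLines) runs.add(spill(chunk));
            }
            if (!chunk.isEmpty()) runs.add(spill(chunk));
        }
        merge(runs, output);
    }

    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);                 // in-memory sort of one chunk
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    private static void merge(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        // Heap entries are {current line, reader index}, ordered by line.
        PriorityQueue<Object[]> heap =
            new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            readers.add(r);
            String head = r.readLine();
            if (head != null) heap.add(new Object[]{head, readers.size() - 1});
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Object[] e = heap.poll();
                out.write((String) e[0]);
                out.newLine();
                String next = readers.get((int) e[1]).readLine();
                if (next != null) heap.add(new Object[]{next, e[1]});
            }
        }
        for (BufferedReader r : readers) r.close();
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        Files.write(in, Arrays.asList("d", "b", "e", "a", "c"));
        sort(in, out, 2);
        System.out.println(Files.readAllLines(out)); // [a, b, c, d, e]
    }
}
```

Memory usage is bounded by maxLines plus one buffered line per run, regardless of how large the input file is.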
I would use an in-memory database such as H2 in in-memory mode (jdbc:h2:mem:), so everything stays in RAM and isn't flushed to disk (provided you have enough RAM; if not, you might want to use the file-based URL). Create your table there and write every row from the CSV. Provided you set up the indexes properly, sorting and grouping will be a breeze with standard SQL.
I have two large XML files (3 GB, 80,000 records). One is an updated version of the other. I want to identify which records changed (were added/updated/deleted). There are some timestamps in the files, but I am not sure they can be trusted. The same goes for the order of records within the files.
The files are too large to load into memory as XML (even one, never mind both).
The way I was thinking about it is to do some sort of parsing/indexing of content offsets within the first file at record level, with an in-memory map of IDs, then stream the second file and use random access to compare the records that exist in both. This would probably take 2 or 3 passes, but that's fine. But I cannot find an easy library/approach that would let me do it. vtd-xml with VTDNavHuge looks interesting, but I cannot tell from the documentation whether it supports revisiting and loading records via random access from pre-saved locations.
Java library/solution is preferred, but C# is acceptable too.
Just parse both documents simultaneously using SAX or StAX until you encounter a difference, then exit. This doesn't keep the documents in memory, and any standard XML library supports S(t)AX. The only problem would be if you consider a different order of elements to be insignificant...
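A minimal sketch of that lockstep walk using the JDK's built-in StAX API; it compares element names and text and stops at the first divergence (the comparison rules are deliberately simplified, e.g. attributes are ignored):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlDiff {
    // Walks two documents with StAX cursors in lockstep and reports whether
    // they diverge; memory use stays constant regardless of document size.
    static boolean sameDocuments(String xmlA, String xmlB) throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE); // whole text nodes
        XMLStreamReader a = f.createXMLStreamReader(new StringReader(xmlA));
        XMLStreamReader b = f.createXMLStreamReader(new StringReader(xmlB));
        while (a.hasNext() && b.hasNext()) {
            int ea = a.next(), eb = b.next();
            if (ea != eb) return false;                       // different event types
            if (ea == XMLStreamReader.START_ELEMENT
                    && !a.getLocalName().equals(b.getLocalName())) return false;
            if (ea == XMLStreamReader.CHARACTERS
                    && !a.getText().equals(b.getText())) return false;
        }
        return a.hasNext() == b.hasNext();                    // same length
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(sameDocuments("<r><a>1</a></r>", "<r><a>1</a></r>")); // true
        System.out.println(sameDocuments("<r><a>1</a></r>", "<r><a>2</a></r>")); // false
    }
}
```

For record-level change detection (rather than a single yes/no), you would compare record by record at each record boundary and report the diverging record's ID instead of returning a boolean.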