Hadoop ParquetDatasetStoreWriter bad write performance - Java

I'm writing Parquet files using the ParquetDatasetStoreWriter class and the performance I get is really bad.
Normally the flow followed is this:
// First write
dataStoreWriter.write(entity1);
dataStoreWriter.write(entity2);
...
dataStoreWriter.write(entityN);
// Then close
dataStoreWriter.close();
The problem is, as you might know, that my dataStoreWriter is just a facade and the real writing work is done by a taskExecutor and a taskScheduler. This work can be seen in these messages printed to standard output:
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 685B for [localId] BINARY: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 75B for [factTime] INT64: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [period] INT32: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 6,304B for [objectType] BINARY: 300,000 values, ...
As you can see, I am writing 300K objects per Parquet file, which results in files of around 700K on disk. Nothing really big...
However, after one or two writes, I get fewer and fewer messages like these and the process stalls...
Any idea about what could be happening? Everything is green in Cloudera...
Versions used:
Cloudera 5.11
Java 8
Spring Integration 4.3.12.RELEASE
Spring Data Hadoop 2.2.0.RELEASE
Edit: Actually, I isolated the writing of the Parquet files using the Kite Dataset CLI tool and the problem is the performance of the SDK itself. Using the csv-import command and loading the data from a CSV, I see that we are writing at a rate of 400,000 records per minute, which is way below the 15,000,000 records per minute that we need to write, hence the stalling...
Can you recommend any way of improving this writing rate? Thanks!
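For reference, this is a minimal sketch of how the raw Kite write path could be exercised on its own, without the Spring facade and its taskExecutor/taskScheduler, to see whether Kite/Parquet itself or the scheduling around it is the bottleneck (the Event POJO and the local dataset URI are placeholders, not my real schema):
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.Formats;

public class RawKiteWriteBenchmark {

    // Illustrative POJO; Kite derives an Avro schema from it via reflection.
    public static class Event {
        public String localId;
        public long factTime;
        public int period;
        public String objectType;
    }

    public static void main(String[] args) {
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
                .schema(Event.class)
                .format(Formats.PARQUET)
                .build();

        // Placeholder URI; point this at HDFS for a realistic test.
        Dataset<Event> dataset =
                Datasets.create("dataset:file:/tmp/kite-bench/events", descriptor, Event.class);

        long start = System.currentTimeMillis();
        DatasetWriter<Event> writer = dataset.newWriter();
        try {
            for (int i = 0; i < 300_000; i++) {
                Event e = new Event();
                e.localId = "id-" + i;
                e.factTime = System.currentTimeMillis();
                e.period = 201801;
                e.objectType = "measurement";
                writer.write(e);
            }
        } finally {
            writer.close(); // Parquet buffers in memory; close() is where the row group hits disk
        }
        System.out.println("Wrote 300,000 records in " + (System.currentTimeMillis() - start) + " ms");
    }
}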

Related

J2EE file download issues if the same file is used in the backend?

The webapp in my project provides CSV file download functionality based on a search by the end user. It does the following:
A file named "download.csv" is opened (not using File.createTempFile(String prefix, String suffix, File directory), but always just "download.csv"), rows of data from a SQL recordset are written to it, and then FileUtils is used to copy that file's contents to the servlet's OutputStream.
The recordset is based on search criteria, like 1st Jan to 30th March.
Can this lead to a case where the file contains the contents of two users who search with different date ranges/other filters and submit at the same time, so the JVM processes the requests concurrently?
Right now we are in dev and there is very little data.
I know we can write automated tests to test this, but wanted to know the theory.
I suggested using the OutputStream of the HTTP response (pass that to the service layer as a vanilla OutputStream and write directly to it, or wrap it in a BufferedWriter and then write to that).
The only downside is that the data will be written more slowly than with the file copy, since the more data there is in the recordset, the longer it takes to iterate through it. But the total request time should be less? (The time to write the rows would be the same, and the extra step of copying from the file to the servlet output stream goes away.)
Anyone done testing around this and have test cases or solutions to share?
Well, that is a tricky question if you really want to go into the depths of both parts.
Concurrency
As you wrote, this "same name" thing could lead to a race condition if you are working on a multi-threaded system (almost all systems are nowadays). I have seen code written like this and it can cause a lot of trouble. The resulting file could contain not only lines from both searches but interleaved characters as well.
Examples:
Thread 1 wants to write: 123456789\n
Thread 2 wants to write: abcdefghi\n
The output could vary in ways like these:
1st case:
123456789
abcdefghi
2nd case:
1234abcd56789
efghi
I would definitely use at least unique (UUID.randomUUID()) names to "hot-fix" the problem.
Disk IO
Disk IO is a tricky thing if you go in-depth. The speeds can vary over a wide range. In the JVM you can have blocking and non-blocking IO as well. A blocking write may wait until the data is really on the disk, while the other will do some "magic" and flush the file later. There is a good read on this here.
TL;DR: As a rule of thumb it is better to keep things in memory (if they fit) and not bother with the disk. If you use per-request memory for that purpose, you also avoid the concurrency problem. So in your case it would be better to rewrite the given part to use memory only and write straight to the output.
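A minimal sketch of that idea, assuming the standard javax.servlet API (the servlet and CsvService names are made up; the point is that each request writes straight to its own response stream and no shared "download.csv" exists on disk):
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CsvDownloadServlet extends HttpServlet {

    // Hypothetical collaborator; in the real app this would be the service layer.
    private final CsvService csvService = new CsvService();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/csv");
        resp.setHeader("Content-Disposition", "attachment; filename=\"download.csv\"");

        // Each request gets its own writer, so concurrent searches cannot interleave rows.
        try (PrintWriter out = new PrintWriter(resp.getOutputStream())) {
            out.println("id,name,amount");
            csvService.writeRows(req.getParameter("from"), req.getParameter("to"), out);
        }
    }

    static class CsvService {
        void writeRows(String from, String to, PrintWriter out) {
            // Here you would iterate the SQL recordset and print one CSV line per row.
            out.println("1,example," + from + "-" + to);
        }
    }
}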

Handling large XML files with relational data - Java

There are tons of questions around this topic, but none of them seem to have a clear-cut answer to this specific problem.
I get a large dump of XML files from a data provider, and put together these files add up to a few GB (ranging between 6 and 20 GB). There is a master XML file with a bunch of references to other files, which in turn can reference other files as well.
Example:
master.xml
...
<region>123</region>
...
region.xml
...
<region>
<id>123</id>
<zone>345</zone>
...
</region>
...
zone.xml
...
<zone>
<id>345</id>
<name>Zone 1</name>
<top_trade>ABC123</top_trade>
...
</zone>
...
trade.xml
...
<trade>
<id>12334</id>
<alias>ABC123</alias>
<name>Insurance</name>
...
</trade>
...
Final output:
<region>
<id>123</id>
<zone>Zone 1</zone>
<top_trade>Insurance</top_trade>
</region>
Now to answer the obvious question - why not use an RDBMS to query and spit out the required data? There are a few reasons:
The DB will not be used beyond the initial transformation and I'd like to avoid introducing transient components in the architecture
The input is text and the output is text (I'll be exporting this to a JSON file which will be used to seed a few systems), and going through the protocol change, adding a DB engine on top and running queries seems like overkill on an efficient file system like Linux's (I'm not considering running this application on Windows)
Though the example provided looks perfectly relational, the data is not as clean. For instance, there may be multi-valued tags for a field, each of which references one of the other files, introducing a many-to-many mapping, and that means introducing additional tables (on top of the already bloated set) to support the data structure
The most recommended solution in Java (for similar problems) is using a HashMap and constructing each row in incremental loops, but that seems pretty inefficient IMHO (there are about 8-10 large files), and given the size of the files, the HashMap could get fairly large.
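To make that concrete, the kind of solution I mean is roughly the following StAX sketch, which streams one reference file and keeps only an id -> name map in memory (element names follow the zone.xml example above; the real files are messier):
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ZoneIndexBuilder {

    // Streams zone.xml once and returns a zone-id -> zone-name lookup map.
    public static Map<String, String> indexZones(String zoneFile) throws Exception {
        Map<String, String> zoneNamesById = new HashMap<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(zoneFile));
        try {
            String currentId = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String tag = reader.getLocalName();
                    if ("id".equals(tag)) {
                        currentId = reader.getElementText();
                    } else if ("name".equals(tag) && currentId != null) {
                        zoneNamesById.put(currentId, reader.getElementText());
                        currentId = null;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return zoneNamesById;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> zones = indexZones("zone.xml");
        System.out.println("Indexed " + zones.size() + " zones, e.g. 345 -> " + zones.get("345"));
    }
}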
Questions:
Is there any other way to handle this problem efficiently in Java?
If not, is it better to handle this using Linux tools like sed or awk?
If sed or awk (or a similar Linux tool) is my best bet, given that this execution needs to be event-driven (REST or AMQP), is it good practice to use runtime exec in my Java code?

Reconciliation tool [comparing two large data set of records]

I have been asked to build a reconciliation tool which could compare two large datasets (we may assume the input sources are two Excel files).
Each row in the Excel file contains 40-50 columns, and records are to be compared at each column level. Each file contains close to 3 million records, or roughly 4-5 GB of data. [The data may not be sorted.]
I would appreciate it if I could get some hints.
Could the following technologies be a good fit?
Apache Spark
Apache Spark + Ignite [assuming real time reconciliation in between time frames]
Apache Ignite + Apache Hadoop
Any suggestions for building an in-house tool are also welcome.
I have also been working on the same.
You can load the CSV files into temporary tables using PySpark/Scala and query on top of the temp tables created.
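For example, a rough sketch with Spark's Java API (the file paths and the record_id/amount column names are placeholders for your real key and compared columns):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvReconciliation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-reconciliation")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> left = spark.read().option("header", "true").csv("source_a.csv");
        Dataset<Row> right = spark.read().option("header", "true").csv("source_b.csv");

        left.createOrReplaceTempView("source_a");
        right.createOrReplaceTempView("source_b");

        // A full outer join on the business key surfaces rows missing on either side
        // as well as rows present in both whose values differ.
        Dataset<Row> diff = spark.sql(
                "SELECT a.record_id, a.amount AS amount_a, b.amount AS amount_b " +
                "FROM source_a a FULL OUTER JOIN source_b b ON a.record_id = b.record_id " +
                "WHERE a.record_id IS NULL OR b.record_id IS NULL OR a.amount <> b.amount");

        diff.show(20, false);
        spark.stop();
    }
}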
First a Warning:
Writing a reconciliation tool involves lots of small annoyances and edge cases: date formats, number formats (commas in numbers, scientific notation, etc.), compound keys, thresholds, ignoring columns, ignoring headers/footers, and so on.
If you only have one file to rec, with well-defined inputs, then consider doing it yourself.
However, if you are likely to try to extend it to be more generic then pay for an existing solution if you can because it will be cheaper in the long run.
Potential Solution:
The difficulty with a distributed process is how you match the keys in unsorted files.
The issue with running it all in a single process is memory.
The approach I took for a commercial rec tool was to save the CSV to tables in h2 and use SQL to do the diff.
H2 is much faster than Oracle for something like this.
If your data is well structured you can take advantage of H2's ability to load directly from CSV. If you save the result in a table you can also write the output back to CSV, or you can use other frameworks to produce a more structured output or stream the result to a web page.
If your format is xls(x) and not CSV, you should performance-test the various libraries for reading the file, as there are huge differences when dealing with files of that size.
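For reference, a minimal sketch of the H2 + CSVREAD approach described above, using plain JDBC (the file names and the RECORD_ID/AMOUNT columns are assumptions; CSVREAD reads every column as VARCHAR unless you specify types):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2CsvDiff {
    public static void main(String[] args) throws Exception {
        // In-memory database keeps the example self-contained; use a file URL for millions of rows.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:rec");
             Statement st = conn.createStatement()) {

            // H2 loads a CSV straight into a table with CSVREAD.
            st.execute("CREATE TABLE side_a AS SELECT * FROM CSVREAD('a.csv')");
            st.execute("CREATE TABLE side_b AS SELECT * FROM CSVREAD('b.csv')");

            // Keys present in both files but with different amounts.
            ResultSet mismatches = st.executeQuery(
                "SELECT a.RECORD_ID, a.AMOUNT, b.AMOUNT FROM side_a a " +
                "JOIN side_b b ON a.RECORD_ID = b.RECORD_ID WHERE a.AMOUNT <> b.AMOUNT");
            while (mismatches.next()) {
                System.out.println(mismatches.getString(1) + ": "
                        + mismatches.getString(2) + " vs " + mismatches.getString(3));
            }

            // Keys present only in side A (swap the tables for the other direction).
            ResultSet missing = st.executeQuery(
                "SELECT a.RECORD_ID FROM side_a a LEFT JOIN side_b b " +
                "ON a.RECORD_ID = b.RECORD_ID WHERE b.RECORD_ID IS NULL");
            while (missing.next()) {
                System.out.println("Missing in B: " + missing.getString(1));
            }
        }
    }
}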
I have been working on the above problem and here is the solution.
https://github.com/tharun026/SparkDataReconciler
The prerequisites as of now are:
Both datasets should have the same number of columns
For now, the solution accepts only Parquet files.
The tool gives you match percentage for each column, so you could understand which transformation went wrong.

How can I improve the write performance of Apache Drill

I'm using a CTAS statement to create a Parquet file from a CSV in Apache Drill.
I've tried multiple experiments changing various configuration parameters, even trying to write to tmpfs.
My tests always take the same amount of time. I'm not IO bound. I may be CPU bound; consistently, one Java thread is at 100% most of the time.
Experiments tried:
store.parquet.compression=none
store.parquet.page-size=8192
planner.slice_target=10000
store.parquet.block-size=104857600
store.text.estimated_row_size_bytes=4k
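(A sketch of how such options can be set and the CTAS timed from Java through Drill's JDBC driver; the connection string, workspace and file paths below are placeholders, and some options may need ALTER SYSTEM rather than ALTER SESSION:)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DrillCtasExperiment {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; use jdbc:drill:zk=... for a cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement st = conn.createStatement()) {

            // The options listed above, applied per session (values are examples only).
            st.execute("ALTER SESSION SET `store.parquet.compression` = 'none'");
            st.execute("ALTER SESSION SET `store.parquet.block-size` = 104857600");
            st.execute("ALTER SESSION SET `planner.slice_target` = 10000");

            long start = System.currentTimeMillis();
            // The dfs.tmp workspace and input path are placeholders.
            st.execute("CREATE TABLE dfs.tmp.`out_parquet` AS "
                     + "SELECT * FROM dfs.`/data/input_with_header.csv`");
            System.out.println("CTAS took " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}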
I've reached the conclusion that perhaps Drill is single threaded for writing, can anybody confirm this?
With a 12 core server I have plenty of headroom available that is not being utilised.
Is it possible to run multiple drillbits on a single server?
Update:
It appears that the performance is the same whether the CTAS output format is csv or parquet, so the limitation appears to be in Drill's ability to write data in general.
Update 2:
Switching from using a CSV file without a header as input to the CTAS statement, with a statement of the form:
CREATE TABLE (col1, col2, col3, ...) AS SELECT columns[0], columns[1], columns[2] FROM filename;
to using a CSV file with a header, i.e. changing the statement to something like:
CREATE TABLE (name1, name2, name3, ...) AS SELECT name1, name2, name3 FROM filename;
where name1, name2, etc. are defined in the header line, made a significant difference in performance: the overall process went from a consistent 13 minutes down to 9 minutes.
You cannot run multiple drillbits on a single server.
Yes, in my observation Drill also uses a lot of processing power; many times the CPU usage goes to 300-400% when we're computing on a large data set, and I think it uses a single thread for writing the Parquet file.

How to globally read in an auxiliary data file for a MapReduce application?

I've written a MapReduce application that checks whether a very large set of test points (~3000 sets of x,y,z coordinates) falls within a set of polygons. The input files are formatted as follows:
{Polygon_1 Coords} {TestPointSet_1 Coords}
{Polygon_2 Coords} {TestPointSet_1 Coords}
...
{Polygon_1 Coords} {TestPointSet_2 Coords}
{Polygon_2 Coords} {TestPointSet_2 Coords}
...
There is only 1 input file per MR job, and each file ends up being about 500 MB in size. My code works great and the jobs run within seconds. However, there is a major bottleneck - the time it takes to transfer hundreds of these input files to my Hadoop cluster. I could cut down on the file size significantly if I could figure out a way to read in an auxiliary data file that contains one copy of each TestPointSet and then designate which set to use in my input files.
Is there a way to read in this extra data file and store it globally so that it can be accessed across multiple mapper calls?
This is my first time writing code in MR or Java, so I'm probably unaware of a very simple solution. Thanks in advance!
This can be achieved using Hadoop's DistributedCache feature. DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications. Google it and you can find code examples.
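A rough sketch of that approach with the newer Job.addCacheFile API (the HDFS path, the symlink name and the point parsing are placeholders):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class PointInPolygonJob {

    public static class PolygonMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<String> testPointSets = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The file was added with a "#testpoints" fragment, so it is symlinked
            // into the task's working directory under that name.
            try (BufferedReader reader = new BufferedReader(new FileReader("testpoints"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    testPointSets.add(line); // parse into real point objects in practice
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // value now only needs to carry the polygon plus a reference (index/id)
            // into testPointSets, instead of repeating the coordinates themselves.
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "point-in-polygon");
        job.setJarByClass(PointInPolygonJob.class);
        job.setMapperClass(PolygonMapper.class);
        // Placeholder HDFS path; the fragment names the local symlink.
        job.addCacheFile(new URI("hdfs:///aux/testpoints.txt#testpoints"));
        // ... set input/output paths and formats as in the existing job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}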
