I'm using a CTAS statement to create a Parquet file from a CSV file in Apache Drill.
I've tried multiple experiments changing various configuration parameters, even trying to write to tmpfs.
My tests always take the same amount of time. I'm not IO bound. I may be CPU bound: one Java thread is consistently at 100% most of the time.
Experiments tried:
store.parquet.compression=none
store.parquet.page-size=8192
planner.slice_target=10000
store.parquet.block-size=104857600
store.text.estimated_row_size_bytes=4k
I've reached the conclusion that perhaps Drill is single-threaded for writing. Can anybody confirm this?
With a 12 core server I have plenty of headroom available that is not being utilised.
Is it possible to run multiple drillbits on a single server?
Update:
It appears that the performance is the same whether the CTAS output format is CSV or Parquet, so the limitation appears to be in Drill's ability to write data in general.
Update 2:
Switching from a CSV input file without a header, with a CTAS statement of the form:
CREATE TABLE (col1, col2, col3, ...) AS SELECT columns[0], columns[1], columns[2] from filename;
to a CSV file with a header, i.e. changing the statement to something like:
CREATE TABLE (name1, name2, name3, ...) AS SELECT name1, name2, name3 from filename;
where name1, name2, etc. are defined in the header line, made a significant difference in performance: the overall process went from a consistent 13 minutes to 9 minutes.
You cannot run multiple drillbits on a single server.
Yes, in my observation Drill also uses a lot of processing power; CPU usage often goes to 300-400% when we're computing on large data sets, and I think it uses a single thread for writing the Parquet file.
The webapp in my project provides CSV download functionality based on a search by the end user. It does the following:
A file named "download.csv" is opened (not using File.createTempFile(String prefix, String suffix, File directory), but always just "download.csv"), rows of data from an SQL recordset are written to it, and then FileUtils is used to copy that file's content to the servlet's OutputStream.
The recordset is based on a search criteria, like 1st Jan to 30th March.
Can this lead to a case where the file has the contents of two users who use different date ranges or other filters and submit at the same time, so that the JVM processes the requests concurrently?
Right now we are in dev and there is very little data.
I know we can write automated tests to test this, but wanted to know the theory.
I suggested using the OutputStream of the HTTP response (pass it to the service layer as a vanilla OutputStream and write to it directly, or wrap it in a BufferedWriter and then write to it).
The only downside is that the data will be written more slowly than with the file copy,
since with more data in the recordset it will take time to iterate through it. But the total time of the request should be less? (The time to write to the file's output stream will be the same, plus the time to copy from the file to the servlet output stream.)
Anyone done testing around this and have test cases or solutions to share?
Well that is a tricky question if you really would like to go into the depth of both parts.
Concurrency
As you wrote, this "same name" approach could lead to a race condition if you are working on a multithreaded system (almost all systems are like that nowadays). I have seen code written like this and it can cause a lot of trouble. The result file could contain not only lines from both searches but interleaved characters as well.
Examples:
Thread 1 wants to write: 123456789\n
Thread 2 wants to write: abcdefghi\n
Outputs could vary in the mentioned ways:
1st case:
123456789
abcdefghi
2nd case:
1234abcd56789
efghi
I would definitely use at least unique (UUID.randomUUID()) names to "hot-fix" the problem.
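A minimal sketch of that hot-fix, assuming the report is still staged on disk before it is copied to the response. Files.createTempFile picks a unique name per call, so concurrent requests never share a file. The class and method names are hypothetical, and the rows are assumed to be already formatted as CSV lines.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: stages each download in its own temp file
// instead of a shared "download.csv".
final class PerRequestCsvFile {

    // Writes the lines to a uniquely named temp file, copies that file
    // to the target stream, then deletes the temp file.
    static void export(List<String> csvLines, OutputStream target) {
        Path tmp = null;
        try {
            // A unique name per call means two requests cannot collide.
            tmp = Files.createTempFile("download-", ".csv");
            Files.write(tmp, csvLines, StandardCharsets.UTF_8);
            Files.copy(tmp, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Best-effort cleanup; a leftover temp file is harmless here.
                }
            }
        }
    }
}
```

In a servlet the target would be response.getOutputStream(); the sketch takes a plain OutputStream so it can be exercised without a container.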
IO
Disk IO is a tricky thing if you go in depth. The speeds can vary over a wide range. In the JVM you can have both blocking and non-blocking IO. The blocking kind can wait until the data is really on the disk, and the other will do some "magic" to flush the file later. There is a good read in here.
TL;DR: As a rule of thumb, it is better to keep things in memory (if they fit) and not bother with the disk. If you use per-thread memory for that purpose, you avoid the concurrency problem as well. So in your case it could be better to rewrite the given part to use memory only and write directly to the output.
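A rough sketch of the "write straight to the output" variant, assuming the rows arrive as an iterator of string arrays. All state lives in local variables, so two concurrent requests cannot interleave their data. The names are hypothetical, and CSV quoting and escaping are deliberately left out.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// Hypothetical helper: streams rows to the response without a shared file.
final class DirectCsvStreamer {

    // Writes one CSV line per row and returns the number of rows written.
    // No quoting/escaping is done; this is an assumption of the sketch.
    static long stream(Iterator<String[]> rows, OutputStream out) {
        long written = 0;
        try {
            BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            while (rows.hasNext()) {
                writer.write(String.join(",", rows.next()));
                writer.write("\r\n");
                written++;
            }
            // Flush our buffer, but leave closing the stream to its owner
            // (in a servlet, the container).
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

Only one buffer's worth of data is held in memory at a time, regardless of how many rows the search returns.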
I have been asked to build a reconciliation tool that can compare two large datasets (we may assume the input sources are two Excel files).
Each row contains 40-50 columns, and records have to be compared column by column. Each file contains close to 3 million records, or roughly 4-5 GB of data. The data may not be sorted.
I would appreciate any hints.
Would the following technologies be a good fit?
Apache Spark
Apache Spark + Ignite [assuming real time reconciliation in between time frames]
Apache Ignite + Apache Hadoop
Any suggestions for building an in-house tool?
I have also been working on the same problem.
You can load the csv files to temporary tables using Pyspark/Scala and query on top of the temp tables created.
First a Warning:
Writing a reconciliation tool involves lots of small annoyances and edge cases, such as date formats, number formats (commas in numbers, scientific notation, etc.), compound keys, thresholds, ignored columns, ignored headers/footers, and so on.
If you only have one file to rec with well defined inputs then consider doing it yourself.
However, if you are likely to try to extend it to be more generic then pay for an existing solution if you can because it will be cheaper in the long run.
Potential Solution:
The difficulty with a distributed process is how you match the keys in unsorted files.
The issue with running it all in a single process is memory.
The approach I took for a commercial rec tool was to save the CSV to tables in H2 and use SQL to do the diff.
H2 is much faster than Oracle for something like this.
If your data is well structured, you can take advantage of H2's ability to load directly from CSV. If you save the result in a table, you can also write the output to CSV, use other frameworks to write a more structured output, or stream the result to a web page.
If your format is xls(x) and not CSV you should do a performance test of the various libraries to read the file as there are huge differences when dealing with that size.
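To illustrate what "matching keys in unsorted files" means, here is a small in-memory sketch. It is not the H2/SQL approach described above; it swaps in a plain hash lookup, which only works when one side fits in memory. The class name is hypothetical, and column 0 is assumed to be a unique key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical in-memory diff; column 0 is assumed to be a unique key.
final class KeyedRowDiff {

    // Returns one message per missing row or differing column.
    static List<String> diff(List<String[]> left, List<String[]> right) {
        // Index the right side by key so row order in the files does not matter.
        Map<String, String[]> rightByKey = new HashMap<>();
        for (String[] row : right) {
            rightByKey.put(row[0], row);
        }

        List<String> report = new ArrayList<>();
        for (String[] l : left) {
            String[] r = rightByKey.remove(l[0]);
            if (r == null) {
                report.add(l[0] + ": missing on right");
                continue;
            }
            int columns = Math.max(l.length, r.length);
            for (int c = 1; c < columns; c++) {
                String lv = c < l.length ? l[c] : null;
                String rv = c < r.length ? r[c] : null;
                if (!Objects.equals(lv, rv)) {
                    report.add(l[0] + ": column " + c + " differs (" + lv + " vs " + rv + ")");
                }
            }
        }
        // Whatever is left was never matched by a left-hand row.
        for (String key : new TreeSet<>(rightByKey.keySet())) {
            report.add(key + ": missing on left");
        }
        return report;
    }
}
```

With 3 million rows of 40-50 columns this would need several GB of heap, which is exactly the memory issue mentioned above; a database-backed diff avoids that.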
I have been working on the above problem and here is the solution.
https://github.com/tharun026/SparkDataReconciler
The prerequisites as of now are
Both datasets should have the same number of columns
For now, the solution accepts only Parquet files.
The tool gives you match percentage for each column, so you could understand which transformation went wrong.
I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read file(s) one by one with a BufferedReader with a custom Parser Class that does some heavy computing on the input data. The input files are of 1 to maximum 2.5 GB size each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory datastore as JSON. These JSON files are smaller in size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the best performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages could also appear in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that makes sure that all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output but I'll throw some ideas that I had into this post as well:
Define a manual Split algorithm for the file input? This will check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50, that will do the job for sure) from the last split to the next split. Duplicated data will be handled correctly by the Parser class and would not introduce any issues.
The second way might be easier, but I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines containing the 0x033, 0x034, 0x035 and 0x036 so Spark will process them separately? While actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blogpost, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case however, your "record" is actually spread over multiple lines. So yes, you can use a custom fileinputformat. I'm not sure this will be the easiest solution however.
You can try to solve this using a custom fileinputformat that does the following: instead of returning line by line like the default fileinputformat does, you parse the file and keep track of encountered records (0x033, 0x034 etc). Meanwhile you may filter out records like 0x0FE (not sure if you want to use them elsewhere). The result is that Spark gets all these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
There are certainly other options. Whichever you choose depends on your use case.
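The record-tracking idea can be sketched independently of Spark or any Hadoop input format API. This is plain Java showing only the grouping logic, assuming lines of the form {timestamp}#{id}#{data} and using the IDs from the question. The class name is hypothetical, and other messages are emitted as single-line records rather than filtered out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: turns physical lines into logical records.
final class CompositionAssembler {

    // IDs taken from the example in the question.
    private static final Set<String> PART_IDS =
        new HashSet<>(Arrays.asList("0x033", "0x034", "0x035"));
    private static final String FINAL_ID = "0x036";

    // Each returned list is one logical record: a single ordinary line,
    // or the part lines followed by the closing 0x036 line.
    static List<List<String>> assemble(List<String> lines) {
        List<List<String>> records = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("#", 3);
            String id = fields.length > 1 ? fields[1] : "";
            if (PART_IDS.contains(id)) {
                pending.add(line);             // wait for the closing line
            } else if (FINAL_ID.equals(id)) {
                pending.add(line);
                records.add(pending);          // composition is complete
                pending = new ArrayList<>();
            } else {
                // Unrelated message in between: emit it on its own.
                records.add(Collections.singletonList(line));
            }
        }
        // Parts without a closing 0x036 line are dropped in this sketch.
        return records;
    }
}
```

The same loop could live inside a custom record reader or run over a whole partition; where the state is kept depends on which of the options above you choose.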
Friends,
In my application, I came across a scenario where the user may request a report download as a flat file, which may have a maximum of 17 lakh (1.7 million) records (around 650 MB) of data. During this request, either my application server stops serving other threads or an out-of-memory exception occurs.
As of now I am iterating through the result set and printing it to the file.
When I googled this, I came across an API named OpenCSV. I tried that too, but I didn't see any improvement in performance.
Please help me out on this.
Thanks for the quick response, guys. Here is my code snippet:
try {
    response.setContentType("application/csv");
    PrintWriter dout = response.getWriter();
    while (rs.next()) {
        dout.print(dataRow); // Here I am printing my ResultSet tuples into the flat file.
        dout.print("\r\n");
        dout.flush();
    }
OpenCSV will cleanly deal with the eccentricities of the CSV format, but a large report is still a large report. Take a look at the specific memory error; it sounds like you need to increase the heap or Max Perm Gen space (which one depends on the error). Without any tuning, the JVM will only use a fixed amount of RAM (in my experience this number is 64 MB).
If you only stream the data from resultset to file without using big buffers this should be possible, but maybe you are first collecting the data in a growing list before sending to file? So you should investigate this issue.
Please specify your question more otherwise we have to speculate.
CSV files aren't limited by memory anymore. Well, maybe only while preparing the data for the CSV, but this can be done efficiently as well, for example by querying subsets of rows from the DB using LIMIT/OFFSET and immediately writing them to the file instead of hauling the entire DB table contents into Java's memory before writing any line. The Excel limit on the number of rows in one "sheet" will increase to about one million.
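A minimal sketch of the paging loop, assuming a hypothetical fetchPage function that stands in for a "SELECT ... ORDER BY ... LIMIT ? OFFSET ?" query and returns already formatted lines. The names are illustrative, not part of any library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical helper: writes a large result page by page.
final class PagedCsvExport {

    // fetchPage.apply(offset, limit) stands in for a LIMIT/OFFSET query.
    // Returns the total number of lines written.
    static int export(BiFunction<Integer, Integer, List<String>> fetchPage,
                      int pageSize, Writer out) {
        int total = 0;
        try {
            while (true) {
                List<String> page = fetchPage.apply(total, pageSize);
                for (String line : page) {
                    out.write(line);
                    out.write("\r\n");
                }
                total += page.size();
                out.flush();                // one flush per page, not per row
                if (page.size() < pageSize) {
                    break;                  // a short page means no more rows
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Only one page is held in memory at a time. The query behind fetchPage needs a stable ORDER BY, otherwise pages may overlap or skip rows.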
Most decent DBs have an export-to-CSV function which can undoubtedly do this task much more efficiently. In the case of MySQL, for example, you can use SELECT ... INTO OUTFILE for this (LOAD DATA INFILE is the import counterpart).
The situation is that:
I have a csv file with records (usually 10k but up to 1m records)
I will process each record (very basic arithmetic with 5 basic select queries to the DB for every record)
Each record (now processed) will then be written to a file BUT not the same file every time. A record CAN be written to another file instead.
Basically I have 1 input file but several possible output files (around 1-100 possible output files).
The process itself is basic so I am focusing on how I should handle the records.
Which option is appropriate for this situation?
Store several Lists, one per possible output file, and then write each List one by one at the end?
To avoid several very large Lists, immediately write each record to its respective output file after processing it. But this will require that I have several streams open at a time.
Please enlighten me on this. Thanks.
The second option is ok: create the file output streams on demand, and keep them open as long as it takes (track them in a Map for example).
The operating system may have a restriction on how many open file handles it allows, but those limits are usually well beyond a couple of hundred files.
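A minimal sketch of the on-demand approach, assuming each processed record already knows the name of its output file. The class name is hypothetical; writers are created on first use, tracked in a Map, and closed together at the end.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: one lazily opened writer per output file.
final class RecordRouter implements AutoCloseable {

    private final Path outputDir;
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    RecordRouter(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Opens the writer for fileName on first use, then reuses it.
    void write(String fileName, String record) {
        try {
            BufferedWriter writer = writers.get(fileName);
            if (writer == null) {
                writer = Files.newBufferedWriter(
                    outputDir.resolve(fileName), StandardCharsets.UTF_8);
                writers.put(fileName, writer);
            }
            writer.write(record);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes every writer, even if one of them fails to close.
    @Override
    public void close() {
        IOException first = null;
        for (BufferedWriter writer : writers.values()) {
            try {
                writer.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        writers.clear();
        if (first != null) {
            throw new UncheckedIOException(first);
        }
    }
}
```

Used in a try-with-resources block around the processing loop, this keeps at most one open handle per output file, so 1-100 files stay far below typical limits.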
A third option:
You could also just append to files, FileOutputStream allows that option in the constructor:
new FileOutputStream(File file, boolean append)
This is less performant than keeping the FileOutputStreams open, but works as well.
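For comparison, a short sketch of the append variant, with a hypothetical helper name. The file is opened in append mode, one record is written, and the stream is closed again, so no handles stay open between records.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: open, append one record, close.
final class AppendingRecordWriter {

    static void append(File file, String record) {
        // The second constructor argument (true) selects append mode.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
            writer.write(record);
            writer.write(System.lineSeparator());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The open/close per record is where the extra cost comes from, which is why keeping the streams open is usually faster for up to 1m records.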