Java- Writing huge data to csv

Java- Writing huge data to csv - java

I am just trying to write huge data which is fetching from mysql db to CSV by using supercsv. How simply I can manage the performance issue. Does super csv write with some limits?

Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways how you can structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes in to your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like conveyor belt construction in a factory: You have multiple steps in your process of assembling some widget. What you don't want to do is have station 1 process all widgets and have stations 2 and 3 sit idle meanwhile, and then pass the whole container of widgets to station 2 to begin work, while stations 1 and 3 sit idle and so forth. Instead, station 1 needs to send small batches (1 at a time or 10 at a time or so) of widgets that are done to station 2 immediately so that they can start working on it as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing less well (as measured in processing time needed for one record) than mysql in this task. But this might be something that is worth verifying...

Related

How to speed up excel reading/writing

I am using Apache POI to read/write to an excel file for my company as an intern here. My program goes through the excel file which is a big square with top rows computer names and left column user names. 240 computers and 342 users. the sheet[computer][user] is 0 in all spaces and the program calls PSLoggedon for each computer and takes the username(s) currently logged on and increments their 0 so after running it after a month, it shows who is logged in the most to each computer. So far it runs in about 25 minutes since I used a socket to check socket.connect before actually calling PSLoggedon.
Without reading or writing at all to the excel file, just calling all the PSLoggedon calls to each computer, takes about 9 minutes. So, the reading and writing apparently takes 10-15 minutes. The thing is, I am calling PSLoggedon on the computer, then opening the excel to find the [x][y] spot of the [computer][user] and then writing to it a +=1 then closing it. So the reason it is taking this long I suppose is because it opens and closes the file so much? I could be completely wrong. But I can't think of a way to make this faster by opening and reading/writing all at once and only opening and closing the file once. Any ideas?

Normally Apache-POI is very fast, if you are running into some issue then you might need to check below points:
POI's logging might be on, you need to turn them off:
You can add one of these –D to your JVM settings to do this:
-Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger
You may be setting your VM heap to low value, try to increase.
Prefer XLS over XLSX.

Get HSQLDB (or another in-process database, but this is what I've used in the past). Add it to your build.
You can now create either a file-based or in-memory database (I would use file-based, as it lets you persist state between runs) simply by using JDBC. Create a table with the columns User, Computer, Count
In your reading thread(s), INSERT or UPDATE your table whenever you find a user with PSLoggedon
Once your data collection is complete, you can SELECT Computer, User, Count from Data ORDER BY Computer, User (or switch the order depending on your excel file layout), loop through the ResultSet and write the results directly.

This is an old question, but from what I see:
Since you are sampling and using Excel, is it safe to assume that consistency and atomicity isn't critical? You're just estimating fractional usage and don't care if a user logged in and logged out between observations.
Is the Excel file stored over a slow network link? Opening and closing a file 240 times could bring significant overhead. How about the following:
You need to open the Excel file once to get the list of computers. At that time, just snapshot the entire contents of the matrix into a Map<ComputerName, Map<UserName, Count>>. Also get a List<ComputerName> and List<UserName> to remember the row/column headings. The entire spreadsheet has less than 90,000 integers --- no need to bring in heavy database machinery.
9 minutes for 240 computers, single-threaded, is roughly 2.25 seconds per computer. Is that the expected throughput of PSLoggedOn? Can you create a thread pool and query all 240 computers at once or in a small number of rounds?
Then, parse the results, increment your map and dump it back to the Excel file. Is there a possibility that you might see new users that were not previously in the Excel? Those will need to be added to the Map and List<UserName>.

measuring statistics in java simulation

I have a group of nodes who send measurements to a bootstrap server. In the end I want the bootstrap server to sum all the measurements and write it to a file. One way to do that is to over-write the data to the file each time a measurement message is received(after summing up the current measurements). But this would be very inefficient. I want to store the measurement data and write it to file only once after the simulation is completed.
But the problem is that the simulator code that I am using is not under my control, its a library that I am using. So, I cant tell when exactly the simulation is going to end (and hence I cant tell which measurement message will be the last one).
I naively tried to store the measurement data in a static class but this data is not accessible when the simulation terminates. Is there any other way that I can do this ?
Thanks,

I would find the last message using a timeout.
Write to disk if you have new data but you haven't got anything for a while e.g. a second.

If you cannot store the data you need in the process (which it seems you can't, since the static class failed), you need to persist the data some other way. To an on-disk file is one option, and another common one would be to a database.

How to reduce the number of file writes when there are multiple threads?

here's the situation.
In a Java Web App i was assigned to mantain, i've been asked to improve the general response time for the stress tests during QA. This web app doesn't use a database, since it was supposed to be light and simple. (And i can't change that decision)
To persist configuration, i've found that everytime you make a change to it, a general object containing lists of config objects is serialized to a file.
Using Jmeter i've found that in the given test case, there are 2 requests taking up the most of the time. Both these requests add or change some configuration objects. Since the access to the file must be sinchronized, when many users are changing config, the file must be fully written several times in a few seconds, and requests are waiting for the file writing to happen.
I have thought that all these serializations are not necessary at all, since we are rewriting the most of the objects again and again, the changes in every request are to one single object, but the file is written as a whole every time.
So, is there a way to reduce the number of real file writes but still guarantee that all changes are eventually serialized?
Any suggestions appreciated

One option is to do changes in memory and keep one thread on the background, running at given intervals and flushing the changes to the disk. Keep in mind, that in the case of crash you'll lost data that wasn't flushed.
The background thread could be scheduled with a ScheduledExecutorService.
IMO, it would be better idea to use a DB. Can't you use an embedded DB like Java DB, H2 or HSQLDB? These databases support concurrent access and can also guarantee the consistency of data in case of crash.

If you absolutely cannot use a database, the obvious solution is to break your single file into multiple files, one file for each of config objects. It would speedup serialization and output process as well as reduce lock contention (requests that change different config objects may write their files simultaneously, though it may become IO-bound).

One way is to to do what Lucene does and not actually overwrite the old file at all, but to write a new file that only contains the "updates". This relies on your updates being associative but that is usually the case anyway.
The idea is that if your old file contains "8" and you have 3 updates you write "3" to the new file, and the new state is "11", next you write "-2" and you now have "9". Periodically you can aggregate the old and the updates. Any physical file you write is never updated, but may be deleted once it is no longer used.
To make this idea a bit more relevant consider if the numbers above are records of some kind. "3" could translate to "Add three new records" and "-2" to "Delete these two records".
Lucene is an example of a project that uses this style of additive update strategy very successfully.

User matching with current data

I have a database full of two different types of users (Mentors and Mentees), whereby I want the second group (Mentees) to be able to "search" for people in the first group (Mentors) who match their profile. Mentors and Mentees can both go in and change items in their profile at any point in time.
Currently, I am using Apache Mahout for the user matching (recommender.mostSimilarIDs()). The problem I'm running into is that I have to reload the user data every single time anyone searches. By itself, this doesn't take that long, but when Mahout processes the data it seems to take a very long time (14 minutes for 3000 Mentors and 3000 Mentees). After processing, matching takes mere seconds. I also get the same INFO message over and over again while it's processing ("Processed 2248 users"), while looking at the code shows that the message should only be outputted every 10000 users.
I'm using the GenericUserBasedRecommender and the GenericDataModel, along with the NearestNUserNeighborhood, AveragingPreferenceInferrer and PearsonCorrelationSimilarity. I load mentors from the database, add the mentee to the list of POJOs and convert them to a FastByIDMap to give to the DataModel.
Is there a better way to be doing this? The product owner needs the data to be current for every search.

(I'm the author.)
You shouldn't need to ask it to reload the data every time, why's that?
14 minutes sounds way, way too long to load such a small amount of data too, something's wrong. You might follow up with more info at user#mahout.apache.org.
You are seeing log messages from a DataModel, which you can disable in your logging system of choice. It prints one final count. This is nothing to worry about.
I would advise you against using a PreferenceInferrer unless you absolutely know you want it. Do you actually have ratings here? I might suggest LogLikelihoodSimilarity if not.

Designing a process

I challenge you :)
I have a process that someone already implemented. I will try to describe the requirements, and I was hoping I could get some input to the "best way" to do this.
It's for a financial institution.
I have a routing framework that will allow me to recieve files and send requests to other systems. I have a database I can use as I wish but it is only me and my software that has access to this database.
The facts
Via the routing framework I recieve a file.
Each line in this file follows a fixed length format with the identification of a person and an amount (+ lots of other stuff).
This file is 99% of the time im below 100MB ( around 800bytes per line, ie 2,2mb = 2600lines)
Once a year we have 1-3 gb of data instead.
Running on an "appserver"
I can fork subprocesses as I like. (within reason)
I can not ensure consistency when running for more than two days. subprocesses may die, connection to db/framework might be lost, files might move
I can NOT send reliable messages via the framework. The call is synchronus, so I must wait for the answer.
It's possible/likely that sending these getPerson request will crash my "process" when sending LOTS.
We're using java.
Requirements
I must return a file with all the data + I must add some more info for somelines. (about 25-50% of the lines : 25.000 at least)
This info I can only get by doing a getPerson request via the framework to another system. One per person. Takes between 200 and 400msec.
It must be able to complete within two days
Nice to have
Checkpointing. If im going to run for a long time I sure would like to be able to restart the process without starting from the top.
...
How would you design this?
I will later add the current "hack" and my brief idea
========== Current solution ================
It's running on BEA/Oracle Weblogic Integration, not by choice but by definition
When the file is received each line is read into a database with
id, line, status,batchfilename and status 'Needs processing'
When all lines is in the database the rows are seperated by mod 4 and a process is started per each quarter of the rows and each line that needs it is enriched by the getPerson call and status is set to 'Processed'. (38.0000 in the current batch).
When all 4 quaters of the rows has been Processed a writer process startes by select 100 rows from that database, writing them to file and updating their status to 'Written'.
When all is done the new file is handed back to the routing framework, and a "im done" email is sent to the operations crew.
The 4 processing processes can/will fail so its possible to restart them with a http get to a servlet on WLI.

Simplify as much as possible.
The batches (trying to process them as units, and their various sizes) appear to be discardable in terms of the simplest process. It sounds like the rows are atomic, not the batches.
Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing failures). Then you can deal with the problems strictly on an exception basis. (A queue table in your database can probably work.)
Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.

When you receive the file, parse it and put the information in the database.
Make one table with a record per line that will need a getPerson request.
Have one or more threads get records from this table, perform the request and put the completed record back in the table.
Once all records are processed, generate the complete file and return it.

if the processing of the file takes 2 days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one. If for some reason the whole processing should be interrupted, then you will not have to start all over again.
By splitting the larger file into smaller files then you could also use more servers to process the files.
You could also use some mass loader(Oracles SQL Loader for example) to take the large amount of data form the file into the table, again adding a column to mark if the line has been processed, so you can pick up where you left off if the process should crash.
The return value could be many small files which at the end would be combined into large single file. If the database approach is chosen you could also save the results in a table which could then be extracted to a csv file.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.