I'm dealing with a program that reads items from a .csv file and writes them to a remote database. I'm trying to multithread the program, and to that end I have created two threads, each with its own connection. The .csv file is read into a BufferedReader, and the contents of the buffered reader are processed. However, the threads keep duplicating the data (writing two copies of every tuple into the database).
I've been trying to figure out how to mutex a buffer in Java, and the closest thing I could come up with is a priority queue.
My question is: can you use a buffered reader to read a file into a priority queue line by line? I.e.:
public void readFile(Connection connection) {
    BufferedReader bufReader = null;
    try {
        bufReader = new BufferedReader(new FileReader(RECS_FILE));
        bufReader.readLine(); // skip header line
        String line;
        while ((line = bufReader.readLine()) != null) {
            // extract fields from each line of the RECS_FILE
            Pattern pattern = Pattern.compile("\"([^\"]+)\",\"([^\"]+)\",\"([^\"]+)\",\"([^\"]+)\"");
            Matcher matcher = pattern.matcher(line);
            if (!matcher.matches()) {
                System.err.println("Unexpected line in " + RECS_FILE + ": \"" + line + "\"");
                continue;
            }
            String stockSymbol = matcher.group(1);
            String recDateStr = matcher.group(2);
            String direction = matcher.group(3);
            String completeUrl = matcher.group(4);
            // create a Recommendation object to populate the required fields
            // and insert it into the database
            System.out.println("Inserting to DB!");
            Recommendation rec = new Recommendation(stockSymbol, recDateStr, direction, completeUrl);
            rec.insertToDb(connection);
        }
    } catch (IOException e) {
        System.err.println("Unable to read " + RECS_FILE);
        e.printStackTrace();
    } finally {
        if (bufReader != null) {
            try {
                bufReader.close();
            } catch (IOException e) {
            }
        }
    }
}
You'll see that a buffered reader is used to read in the .csv file. Is there a way to set up a priority queue outside the function so that the buffered reader puts tuples into the priority queue, and each program thread then pulls from it?
Buffered readers, or indeed any reader or stream, are by their nature meant for single-threaded use only. Priority queues are a completely separate structure which, depending on the actual implementation, may or may not be usable by multiple threads. So the short answer is: no, they're two completely unrelated concepts.
To address your original problem: you can't use streamed file access with multiple threads. You can use RandomAccessFile in theory, except that your lines aren't fixed width and therefore you can't seek() to the beginning of a line without reading everything in the file up to that point. Moreover, even if your data consisted of fixed-width records, it might be impractical to read a file with two different threads.
The only thing you can parallelise is the database insert, with the obvious caveat that you lose transactionality, as you have to use separate transactions for each thread. (If you don't, you have to synchronise your database operations, which once again means that you haven't won anything.)
So a solution can be to read the lines from one thread and pass on the strings to a processing method invoked via an ExecutorService. That would scale well, but again there is a caveat: the increased overhead of database locking will probably nullify the advantage of using multiple threads.
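A minimal sketch of that single-reader / ExecutorService idea, assuming a hypothetical insertLine() helper and a placeholder file name; each worker would still need its own connection and transaction, as noted above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CsvLoader {

    // Hypothetical worker method: parses one CSV line and inserts it into the database.
    // Each worker should obtain (or reuse) its own java.sql.Connection internally.
    static void insertLine(String line) {
        // parse the fields and perform the INSERT here
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2); // the single reader feeds two workers
        try (BufferedReader reader = new BufferedReader(new FileReader("recs.csv"))) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                pool.submit(() -> insertLine(current)); // hand the raw line to the pool
            }
        }
        pool.shutdown();                          // no more tasks will be submitted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queued inserts to finish
    }
}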
The ultimate lesson is probably not to overcomplicate things: try the simple way and only look for a more complex solution if the simple one didn't work. The other lesson is perhaps that multithreading doesn't help I/O-bound programs.
@Biziclop's answer is spot on (+1) but I thought I'd add something about bulk database inserts.
In case you didn't know, turning off database auto-commit in most SQL databases is a big win during bulk inserts. Typically, after each SQL statement the database commits it to disk storage, which updates indexes and makes all of the changes to the disk structures. By turning off auto-commit, the database only has to make these changes when you call commit at the end. Typically you would do something like:
conn.setAutoCommit(false);
for (Recommendation rec : toBeInsertedList) {
    rec.insertToDb(conn);
}
conn.commit();
conn.setAutoCommit(true);
In addition, if auto-commit is not supported by your database, often wrapping the inserts in a transaction accomplishes the same thing.
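Not part of the original answer, but closely related: standard JDBC statement batching combined with a single commit usually helps too. A minimal sketch, assuming a hypothetical recommendations table and column names:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInsert {

    // Hypothetical table and column names; adjust to the real schema.
    private static final String SQL =
            "INSERT INTO recommendations (symbol, rec_date, direction, url) VALUES (?, ?, ?, ?)";

    static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit at the end instead of one per row
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
            }
            ps.executeBatch(); // send the queued inserts together (driver permitting)
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // undo the partial batch on failure
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}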
Here are some other answers that may help:
Slow bulk insert for table with many indexes
Clarification of Java/SQLite batch and auto-commit
I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON.
It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each.
I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide a mapping with the CSV headers.
Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines.
To improve processing speed, I tried to implement multithreading with Callables, but I am not familiar with that kind of concept, so the implementation might be wrong.
On top of that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings performed in the Callables are responsible for consuming a large amount of heap memory.
So I actually have several questions :
#1. How could I improve the speed of the CSV reading ?
#2. Is the multithread implementation with Callable correct ?
#3. How could I reduce the amount of heap memory used in the process ?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here is the CSV method:
public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        // log "Error reading csv file"
    }
    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
            rowList.add(row); // collect the parsed row in the result list
        } catch (ExecutionException | InterruptedException e) {
            // log "Error CSV processing interrupted during execution"
        }
    }
    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {

    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
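For example, the caller could consume the stream lazily like this (a sketch; "data.csv" and the JSON-writing step are placeholders, and readCsv is the method above):

try (InputStream in = Files.newInputStream(Paths.get("data.csv"))) {
    readCsv(in).forEach(row -> {
        // convert this row to JSON and write it to the output here,
        // so that rows already processed can be garbage collected
    });
}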
Instead of trying out a different approach, try to run with a profiler first and see where time is actually being spent. And use this information to change the approach.
Async-profiler is a very solid profiler (and free!) and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see the ratio of CPU utilization caused by garbage collection. It also has the ability to do allocation profiling to figure out which objects are being created (and where).
For a tutorial see the following link.
Try using Spring Batch and see if it helps your scenario.
Ref : https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/
I am developing a system which loads a huge CSV file (with more than 1 million lines) and saves it into a database. Every line also has more than one thousand fields. A CSV file is considered as one batch and each line is considered as its child object. While adding objects, every object is saved into the single batch's List, and at some point I run out of memory because the List ends up holding more than 1 million objects. I cannot split the file into N parts since there are dependencies between lines that are not in serial order (any line can depend on other lines).
Following is the general logic:
Batch batch = new Batch();
while (csvLine != null) {
    String[] values = csvLine.split(",", -1);
    Transaction txn = new Transaction();
    txn.setType(values[0]);
    txn.setAmount(values[1]);
    /*
     * There are more than one thousand transaction fields in one line
     */
    batch.addTransaction(txn);
    // read the next csvLine here (omitted in this sketch)
}
batch.save();
Is there any way we can handle this type of condition with the server having low memory?
In the old times, we used to process large quantities of data stored on sequential tapes with little memory and disk. But it took a loooong time!
Basically, you build a buffer of lines that can fit in your memory, scan the whole file to resolve their dependencies, and fully process those lines. Then you iterate on the next buffer until you have processed the whole file. This requires a full read of the file for each buffer, but it saves memory.
There may be another problem here, because you want to store all records in a single batch. The batch will require enough memory to store all the records, so here again you risk exhausting memory. But you can again use the good old method and save many smaller batches.
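A rough skeleton of that buffer-and-rescan approach; the chunk size, file path, and the dependency-resolution and batch-saving steps are placeholders, since they depend on your data model:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChunkedLoader {

    private static final int CHUNK_SIZE = 50_000; // lines held in memory per pass; tune to your heap

    public static void process(String csvPath) throws IOException {
        long processed = 0;   // number of lines fully handled so far
        boolean more = true;
        while (more) {
            // Re-read the file, skip the lines already handled, and fill the next chunk.
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
                String line;
                long index = 0;
                while (chunk.size() < CHUNK_SIZE && (line = reader.readLine()) != null) {
                    if (index++ >= processed) {
                        chunk.add(line);
                    }
                }
            }
            // A further pass over the file could go here to resolve cross-line dependencies
            // for the lines in this chunk, before building and saving one smaller batch.
            // saveBatch(chunk);  // hypothetical
            processed += chunk.size();
            more = chunk.size() == CHUNK_SIZE; // a partially filled chunk means we reached EOF
        }
    }
}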
If you want to make sure that everything will be either fully inserted in database or everything will be rejected, you can simply use a transaction:
declare transaction at the beginning of your job
save all your batches inside this single transaction
commit the transaction when everything is done
Professional-grade databases (MySQL, PostgreSQL, Oracle, etc.) can use rollback segments on disk to process one transaction without exhausting memory. Of course it is far slower than in-memory operations (not to mention the case where, for any reason, you have to roll back such a transaction!), but at least it works unless you exhaust the available physical disk...
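A minimal JDBC sketch of that single-transaction approach, assuming a hypothetical saveBatch(Connection, Batch) that performs the inserts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

public class TransactionalImport {

    public static void importAll(String jdbcUrl, List<Batch> batches) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);           // declare the transaction at the start of the job
            try {
                for (Batch batch : batches) {
                    saveBatch(conn, batch);      // save every (smaller) batch inside the same transaction
                }
                conn.commit();                   // commit only when everything is done
            } catch (SQLException e) {
                conn.rollback();                 // everything is rejected on failure
                throw e;
            }
        }
    }

    // Hypothetical helpers standing in for the real Batch model and insert logic.
    static void saveBatch(Connection conn, Batch batch) throws SQLException { /* inserts here */ }
    static class Batch { }
}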
Dedicate a separate database table just for the CSV import. Maybe with additional fields for those cross-references you mentioned.
If you need to analyze CSV fields in Java, limit the number of value instances by caching:
public class SharedStrings {

    private Map<String, String> sharedStrings = new HashMap<>();

    public String share(String s) {
        if (s.length() <= 15) {
            String t = sharedStrings.putIfAbsent(s, s); // Since Java 8
            if (t != null) {
                s = t;
            }
            /*
            // Older Java:
            String t = sharedStrings.get(s);
            if (t == null) {
                sharedStrings.put(s, s);
            } else {
                s = t;
            }
            */
        }
        return s;
    }
}
In your case, with long records, it might even make sense to compress each line you read, as bytes, into a shorter byte array with a GZIPOutputStream.
But then a database seems more logical.
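For what it's worth, a minimal sketch of compressing a line to bytes that way (and reading it back) with the standard java.util.zip classes; note that very short lines can actually get bigger because of the gzip header:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class LineCompressor {

    // Compress one CSV line into a (usually shorter) byte array.
    public static byte[] compress(String line) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(line.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Restore the original line when it is needed again (readAllBytes requires Java 9+).
    public static String decompress(byte[] data) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}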
The following will possibly not apply if you are using all fields of a csvLine.
String#split uses String#substring, which (in Java versions prior to 7u6) does not create a new string but keeps the original string in memory and references the respective portion.
So this line would keep the original string in memory:
String a = "...very long and comma separated";
String[] split = a.split(",");
String b = split[1];
a = null;
So if you are not using all of the data in the csvLine, you should wrap every entry of values in a new String, i.e. in the above example you would do
String b = new String(split[1]);
otherwise the GC is unable to free the string a.
I ran into this while I was extracting one column of a CSV file with millions of lines.
I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
String seq = "";
String line;
try {
    FileReader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);
    while ((line = br.readLine()) != null) {
        seq += line;
    }
    br.close();
}
catch (FileNotFoundException e) { e.printStackTrace(); }
catch (IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, then you have to remember that strings are immutable, so what you are actually doing is creating a new String every time you append a line. Please use a StringBuilder instead. Also, if possible, you don't want to build the whole string first and then process it; that way you do the work twice. Ideally you want to process as you read, but I don't know your situation.
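A minimal sketch of the StringBuilder version of that loop (the file-reading part stays the same as in the question, with file coming from the surrounding code; only the accumulation changes):

// Accumulate the sequence in a StringBuilder and convert it to a String once at the end.
StringBuilder seqBuilder = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        seqBuilder.append(line); // cheap append instead of copying the whole string each time
    }
} catch (IOException e) {
    e.printStackTrace();
}
String seq = seqBuilder.toString();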
The main thing slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes around concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small; however, once the Strings become very long, e.g. hundreds of thousands of characters, memory must be reallocated for the new String, which takes quite a bit of time. StringBuilder will allow you to do these concatenations in place, meaning you will not be creating a new object every single time.
Your problem is not that the reading takes too much time; it is the concatenating that takes too much time. To verify this I ran your code (it didn't finish) and then simply commented out the seq += line statement, and it ran in under a second. You could try using seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it still didn't finish within 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice); the ArrayList example worked in about 2-3 seconds with the same input file (so the content of your while loop would be list.add(line);). If you really, really want to store your entire file in a string, you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
This works in a matter of seconds as well. I should mention that "\Z" is the end-of-input delimiter, which is why it reads the whole thing in one go.
Consider that I have a data file storing rules in the following format:
//some header info
//more header info
//Rule: some_uuid_1234
rule "name"
data
data
data
end
//Rule: some_uuid_5678
rule "name2"
data
data
data
end
Now, what I would like is to be able to either read(id) or delete(id) a rule given the ID number. My question therefore is: how could I select a rule (perhaps using a regular expression) and then delete that specific rule from the file, without altering anything else?
Simply replace <some_id> in your select/delete function with the actual ID number.
//Rule: <some_id>.+?rule.+?end
NOTE: Don't forget the single-line (DOTALL) option, so that . also matches line breaks.
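In Java that could look roughly like this (a sketch; reading the whole file into memory and the exact file layout are assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleDeleter {

    public static void deleteRule(String id, Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        // Pattern.DOTALL is the "single line" option: '.' also matches line terminators.
        Pattern rule = Pattern.compile("//Rule: " + Pattern.quote(id) + ".+?rule.+?end", Pattern.DOTALL);
        Matcher matcher = rule.matcher(content);
        String updated = matcher.replaceFirst(""); // drop the matched rule block
        Files.write(file, updated.getBytes(StandardCharsets.UTF_8));
    }
}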
There are 2 solutions I can think of and they have varied performance, so you can choose the one that suits you best.
Index the file
You could write an inverted index for this rule file and keep it updated for any operation that modifies the file. Of course your word index will be limited to one file and the only words in it will be the unique UUIDs. You can use a RandomAccessFile to quickly read() from a given offset. The delete() operation can overwrite the target rule until it encounters the word 'end'. This solution requires more work, but you can retrieve values instantly.
Use a regex
You can alternatively read each line in the file and match it with a regex pattern that matches your rule UUID. Keep reading until you hit the 'end' of the rule and return it. A delete will involve over-writing the rule once you know the desired index. This solution is easy to write but the performance will suck. There is a lot of IO and it could become a bottleneck. (You could also load the entire file into memory and run a regex on the whole string, depending on how large the file / string is expected to be. This can get ugly real quick though.)
Whichever solution you choose you might also want to think about file level locks and how that affects CRUD operations. If this design has not been implemented yet, please consider moving the rules to a database.
I wouldn't use regular expressions to solve this particular problem - it would require loading the whole file in memory, processing it and rewriting it. That's not inherently bad, but if you have large enough files, a stream-based solution is probably better.
What you'd do is process your input file one line at a time and maintain a boolean value that:
becomes true when you find a line that matches the desired rule's declaration header.
becomes false when it's true and you find a line that matches end.
For each line, if your boolean value is true, discard it; otherwise, write it to a temporary output file (created, for example, with File#createTempFile).
At the end of the process, overwrite your input file with your temporary output file using File#renameTo.
Note that this solution has the added advantage of being atomic: there is no risk for your input file to be partially written should an error occur in the middle of processing. It will either be overwritten entirely or not at all, which protects you against unexpected IOExceptions.
The following code demonstrates how you could implement that. It's not necessarily a perfect implementation, but it should illustrate the algorithm - lost somewhere in the middle of all that boilerplate code.
public void deleteFrom(String id, File file) throws IOException {
    BufferedReader reader;
    String line;
    boolean inRule;
    File temp;
    PrintWriter writer;

    reader = null;
    writer = null;
    try {
        // Streams initialisation.
        temp = File.createTempFile("delete", "rule");
        writer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp), "utf-8")));
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
        inRule = false;

        // For each line in the file...
        while ((line = reader.readLine()) != null) {
            // If we're parsing the rule to delete, we're only interested in knowing when we're done.
            if (inRule) {
                if (line.trim().equals("end"))
                    inRule = false;
            }
            // Otherwise, look for the beginning of the targetted rule.
            else if (line.trim().equals("rule \"" + id + "\""))
                inRule = true;
            // Normal line, we want to keep it.
            else
                writer.println(line);
        }
    }
    // Stream cleanup.
    finally {
        if (reader != null)
            reader.close();
        if (writer != null)
            writer.close();
    }

    // We're done, copy the new file over the old one.
    temp.renameTo(file);
}
I have a program which writes some 8 million lines of data to a flat file. As of now, the program calls bufferedwriter.write for each record, and I was planning to write in bulk with the following strategy:
Keep a data structure (I used an array) to hold a specific number of records.
Write the details to the file using the array. Here is the code snippet (array is the name of the array which stores the records, and thresholdCount triggers the writing process):
if (array.length == thresholdCount) {
    writeBulk(array);
}

public void writeBulk(String[] inpArray) {
    for (String line : inpArray) {
        if (line != null) {
            try {
                writer.write(line + "\n");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
However, I am not seeing much performance improvement. I want to know if there is a way to determine the optimal threshold count.
I was also planning to further tune the code so as to store each element in the array as a concatenation of some n records and then call the bulk method. For example, an array with length 5000 would actually contain 50,000 records, with each index in the array holding 10 records. However, before doing so, I need an expert opinion.
Writes to files are already buffered in a similar fashion before they are pushed to disk (unless you flush -- which actually doesn't always do exactly that either). Thus pre-buffering the writes will not speed up the overall process. Note that some IO classes try to do immediate writes by inserting flush requests after each write; for those special cases pre-buffering can sometimes help, but usually you just use a buffered version of the class in the first place rather than buffering manually yourself.
If you were writing somewhere other than the end of the file, then you might see an improvement, as writing into the middle of a file wouldn't need to copy the contents of the already-flushed entries sitting on your hard disk.
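To illustrate the "just use the buffered class" point, a minimal sketch that relies on BufferedWriter's own buffering and flushes once at the end (the file name is arbitrary):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BulkWriteDemo {

    public static void writeAll(List<String> records) throws IOException {
        // BufferedWriter batches the writes internally; the data is flushed on close().
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("out.txt"), StandardCharsets.UTF_8)) {
            for (String record : records) {
                writer.write(record);
                writer.newLine();
            }
        } // close() flushes the remaining buffer exactly once
    }
}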