Migrate list of huge XML files in parallel

I have the code:
final int numOfThreads = Runtime.getRuntime().availableProcessors() + 1;
final ExecutorService exec = Executors.newFixedThreadPool( numOfThreads );
final int numOfFiles = listOfAllFiles.size();
final BlockingQueue<File> queue = new ArrayBlockingQueue<File>( numOfFiles, false, listOfAllFiles );
for ( int i = 0; i < numOfThreads; i++ ) {
exec.execute( () -> {
File file = null;
while ( (file = queue.poll()) != null ) {
migrate( file );
}
} );
}
The fixed size ExecutorService polls (XML) files from a BlockingQueue in order to migrate them. Since most of the files are pretty big (multiple GBs), each thread is doing a lot of I/O.
Is the queue even necessary? Can't I just do:
final int numOfThreads = Runtime.getRuntime().availableProcessors() + 1;
final ExecutorService exec = Executors.newFixedThreadPool( numOfThreads );
for ( final File file : listOfAllFiles ) {
exec.execute( () -> migrate( file ) );
}
I am also wondering if the fixed thread pool is the ideal choice?

An ExecutorService with a fixed sized thread pool is a good choice. However, I think you should make the pool size a tuning parameter.
The problem is that we don't know whether migrate is CPU intensive, I/O intensive, memory (heap size) intensive, or some combination. The optimal thread count will depend on this. The best strategy is to do some experiments.
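A minimal sketch of the pool size as a tuning parameter, combined with the queue-free submission from the question. The system property name `migration.threads` and the `Consumer<File>` stand-in for `migrate` are assumptions made for illustration; they are not part of the original code.

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class TunableMigration {

    // Hypothetical property name; falls back to the question's formula when unset.
    static int poolSize() {
        int fallback = Runtime.getRuntime().availableProcessors() + 1;
        return Math.max(1, Integer.getInteger("migration.threads", fallback));
    }

    // Submits one task per file and returns the elapsed wall-clock time in milliseconds.
    // The executor's internal work queue takes the place of the explicit BlockingQueue.
    static long migrateAll(List<File> files, int threads, Consumer<File> migrate) {
        long start = System.nanoTime();
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        for (File file : files) {
            exec.execute(() -> migrate.accept(file));
        }
        exec.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            if (!exec.awaitTermination(7, TimeUnit.DAYS)) {
                throw new IllegalStateException("migration did not finish in time");
            }
        } catch (InterruptedException e) {
            exec.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

Running the same file list with, say, `-Dmigration.threads=2`, `4` and `8` and comparing the returned times is one simple way to do the experiments suggested above.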

Given that the number of files is fixed, the queue does not provide any benefit. You don't need it. You would need it if new files kept coming in.
The number of threads looks valid, too. But the best number really depends on the OS and JVM version, so you should do some experiments to be sure.

In your case the queue in between serves no purpose. Before jumping into coding, analyze the real bottleneck in processing large files.
Processing a file involves reading from the disk, processing (e.g. parsing the XML and transforming it), and writing back to the disk. So it is a trade-off between better I/O, better CPU usage, and better memory usage. To find out which one dominates, it is important to profile and monitor CPU usage, memory usage, and I/O efficiency.
Reading the data from the disk can be I/O-heavy.
Storing the read data in the Java heap memory to process it can be memory-heavy.
Parsing & transforming the data can be CPU-heavy.
Writing the processed data back to the disk can be I/O-heavy.

Related

Performance issues reading CSV files in a Java (Spring Boot) application

I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON.
It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each.
I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide a mapping to the CSV headers.
Currently the API controller calls a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines.
To improve processing speed, I tried to implement multithreading with Callable(s), but I am not familiar with that kind of concept, so the implementation might be wrong.
Other than that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callable(s) are responsible for consuming a large amount of heap memory.
So I actually have several questions:
#1. How could I improve the speed of the CSV reading?
#2. Is the multithreaded implementation with Callable correct?
#3. How could I reduce the amount of heap memory used in the process?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here is the CSV method:
public static final int NUMBER_OF_THREADS = 10;
public static List<List<String>> readCsv(InputStream inputStream) {
List<List<String>> rowList = new ArrayList<>();
ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
List<Future<List<String>>> listOfFutures = new ArrayList<>();
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
String line = null;
while ((line = reader.readLine()) != null) {
CallableLineReader callableLineReader = new CallableLineReader(line);
Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
listOfFutures.add(futureCounterResult);
}
reader.close();
pool.shutdown();
} catch (Exception e) {
//log Error reading csv file
}
for (Future<List<String>> future : listOfFutures) {
try {
List<String> row = future.get();
rowList.add(row); // collect the parsed row; without this the method returns an empty list
}
catch ( ExecutionException | InterruptedException e) {
//log Error CSV processing interrupted during execution
}
}
return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {
private final String line;
public CallableLineReader(String line) {
this.line = line;
}
@Override
public List<String> call() throws Exception {
return Arrays.asList(line.replace("\"", "").split(","));
}
}

I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
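As a rough illustration of consuming such a stream row by row, here is a sketch that uses only the JDK. The `copyRows` helper and the pipe-delimited output are placeholders invented for this example; a real implementation would write JSON through a streaming generator instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

final class CsvStreaming {

    // Same idea as the method above: rows are produced lazily, one per input line.
    // Closing the stream also closes the underlying reader.
    static Stream<List<String>> readCsv(InputStream inputStream) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        return reader.lines()
                .map(line -> Arrays.asList(line.replace("\"", "").split(",")))
                .onClose(() -> {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
    }

    // Hypothetical consumer: writes each row as soon as it is parsed, so no list
    // of all rows is ever built. Returns the number of rows written.
    static long copyRows(InputStream in, Writer out) {
        long[] count = {0};
        try (Stream<List<String>> rows = readCsv(in)) {
            rows.forEach(row -> {
                try {
                    out.write(String.join("|", row)); // placeholder for real JSON output
                    out.write('\n');
                    count[0]++;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return count[0];
    }
}
```

Because each row is written before the next one is read, heap usage stays roughly proportional to the size of one line rather than the whole file.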

Instead of trying out a different approach, try to run with a profiler first and see where time is actually being spent. And use this information to change the approach.
Async-profiler is a very solid profiler (and free!) and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see how much of the CPU utilization is caused by garbage collection. It also has the ability to do allocation profiling to figure out which objects are being created (and where).
For a tutorial see the following link.

Try using Spring batch and see if it helps your scenario.
Ref : https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/

Groovy ASTBuilder bad performance with multiple threads

I'm using Groovy's ASTBuilder (version 2.5.5) in a project. It's being used to parse and analyze groovy expressions received via a REST API. This REST service receives thousands of requests, and the analysis is done on the fly.
I'm noticing some serious performance issues in a multithreaded environment. Below is a simulation, running 100 threads in parallel:
int numthreads = 100;
final Callable<Void> task = () -> {
long initial = System.currentTimeMillis();
// Simple rule
new AstBuilder().buildFromString("a+b");
System.out.print(String.format("\n\nThread took %s ms.",
System.currentTimeMillis() - initial));
return null;
};
final ExecutorService executorService = Executors.newFixedThreadPool(numthreads);
final List<Callable<Void>> tasks = new ArrayList<>();
while (numthreads-- > 0) {
tasks.add(task);
}
for (Future<Void> future : executorService.invokeAll(tasks)) {
future.get();
}
I'm trying with different thread loads. The greater the number, the slower.
100 threads => ~1800ms
200 threads => ~2500ms
300 threads => ~4000ms
However, if I serialize the threads (e.g. by setting the pool size to 1), I get much better results, around 10 ms per thread. Can someone please help me understand why this is happening?

When running multithreaded code, the computer shares the threads among the physical CPU cores. That means the more the number of threads exceeds the number of cores, the less benefit you get from each thread. In your example the number of threads increases with the number of tasks, so as the task count grows, every CPU core is forced to process more and more threads. At the same time you may notice that the difference between numthreads = 1 and numthreads = 4 is very small, because in this case every core processes only a few threads (or even just one). Don't set the number of threads much higher than the number of hardware threads the CPU provides, because it doesn't make a lot of sense.
Additionally, your example compares different numbers of threads running different numbers of tasks. To see the efficiency of multithreaded code, you have to compare different numbers of threads running the same number of tasks. I would change the example as follows:
int threadNumber = 16;
int taskNumber = 200;
//...task method
final ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);
final List<Callable<Void>> tasks = new ArrayList<>();
while (taskNumber-- > 0) {
tasks.add(task);
}
long start = System.currentTimeMillis();
for (Future<Void> future : executorService.invokeAll(tasks)) {
future.get();
}
long end = System.currentTimeMillis() - start;
System.out.println(end);
executorService.shutdown();
Try this code for threadNumber=1 and, let's say, threadNumber=16 and you'll see the difference.

Dynamic evaluation of expressions involves a lot of resources including class loading, security manager, compilation and execution. It is not designed for high performance. If you just need to evaluate an expression for its value, you could try groovy.util.Eval. It may not consume as many resources as AstBuilder. However, it is probably not going to be that much different, so don't expect too much.
If you want to get the AST only and not any extra information like types, you could call the parser more directly. This would involve a lot fewer resources. See org.codehaus.groovy.control.ParserPluginFactory for more direct access to the source parser.

WordNetSimalarity in large dataset of synsets

I use the WordNet similarity Java API to measure the similarity between two synsets, as follows:
public class WordNetSimalarity {
private static ILexicalDatabase db = new NictWordNet();
private static RelatednessCalculator[] rcs = {
new HirstStOnge(db), new LeacockChodorow(db), new Lesk(db), new WuPalmer(db),
new Resnik(db), new JiangConrath(db), new Lin(db), new Path(db)
};
public static double computeSimilarity( String word1, String word2 ) {
WS4JConfiguration.getInstance().setMFS(true);
double s=0;
for ( RelatednessCalculator rc : rcs ) {
s = rc.calcRelatednessOfWords(word1, word2);
// System.out.println( rc.getClass().getName()+"\t"+s );
}
return s;
}
}
Main class
public static void main(String[] args) {
long t0 = System.currentTimeMillis();
File source = new File ("TagsFiltered.txt");
File target = new File ("fich4.txt");
ArrayList<String> sList= new ArrayList<>();
try {
if (!target.exists()) target.createNewFile();
Scanner scanner = new Scanner(source);
PrintStream psStream= new PrintStream(target);
while (scanner.hasNext()) {
sList.add(scanner.nextLine());
}
for (int i = 0; i < sList.size(); i++) {
for (int j = i+1; j < sList.size(); j++) {
psStream.println(sList.get(i)+" "+sList.get(j)+" "+WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
}
}
psStream.close();
} catch (Exception e) {e.printStackTrace();
}
long t1 = System.currentTimeMillis();
System.out.println( "Done in "+(t1-t0)+" msec." );
}
My database contains 595 synsets, which means the method computeSimilarity will be called 595*594/2 = 176,715 times.
Computing the similarity between two words takes more than 5000 ms!
So I need at least one week to finish my task!
My question is how to reduce this time.
How can I improve the performance?

I don't think language is your issue.
You can help yourself with parallelism. I think this would be a good candidate for map reduce and Hadoop.

Have you tried the MatrixCalculator?

I don't know if it is possible to optimize this algorithm-wise.
But you can definitely run this much faster. On my machine this operation takes half as long, so if you have eight i7 cores, you'd need about 15 hours to process everything (if you process the loop in parallel).
You can get virtual machines at Amazon Web Services. So if you get several machines and run multithreaded processing for different chunks of data on each machine - you will complete in several hours.
Technically it is possible to use Hadoop for this, but if you need to run this just once - making computation parallel and launching on several machines will be simpler in my opinion.
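A possible single-machine sketch of the parallel loop, using Java's parallel streams. The `similarity` parameter stands in for `WordNetSimalarity.computeSimilarity`; whether that method and the underlying calculators are safe to call from several threads at once is an assumption that would need to be checked before relying on this.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PairwiseParallel {

    // Produces one output line per unordered pair (i < j). The outer loop is
    // spread over the common fork-join pool; the resulting list keeps loop order.
    static List<String> allPairs(List<String> words,
                                 ToDoubleBiFunction<String, String> similarity) {
        return IntStream.range(0, words.size())
                .parallel()
                .boxed()
                .flatMap(i -> IntStream.range(i + 1, words.size())
                        .mapToObj(j -> words.get(i) + " " + words.get(j) + " "
                                + similarity.applyAsDouble(words.get(i), words.get(j))))
                .collect(Collectors.toList());
    }
}
```

The returned lines can then be written to the output file from a single thread, which avoids sharing the PrintStream between workers.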

Perl is different from a lot of other languages when it comes to threading/forking.
One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with, you don't have to worry about thread safety of libraries or most of your code, just the threaded bit. However it can be a performance drag and memory hungry as Perl must put a copy of the interpreter and all loaded modules into each thread.
When it comes to forking I will only be talking about Unix. Perl emulates fork on Windows using threads, it works but it can be slow and buggy.
Forking Advantages
Very fast to create a fork
Very robust
Forking Disadvantages
Communicating between the processes can be slow and awkward
Thread Advantages
Thread coordination and data interchange is fairly easy
Threads are fairly easy to use
Thread Disadvantages
Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (better the more recent your perl)
Database connections are not shared across threads
In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.
For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.
For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
Also, the number of threads is going to be dependent on the number of cores that your computer has. If you have four cores, creating more than 3 threads isn't really going to be very helpful (3 + 1 for your main program).
To be honest, though, threads/forking may not be the way to go. In fact, in many situations they can even slow stuff down due to overhead. If you really need the speed, the best way to get it is going to be through distributed computing. I would suggest that you look into some sort of distributed computing platform to make your runtime better. If you can reduce the dimensionality of your search/compareTo space to less than n^2, then map reduce or Hadoop may be a good option; otherwise, you'll just have a bunch of overhead and no use of the real scalability that Hadoop offers (@Thomas Jungblut).

Java- FixedThreadPool with known pool size but unknown workers

So I think I sort of understand how fixed thread pools work (using Executors.newFixedThreadPool built into Java), but from what I can see, there's usually a set number of jobs you want done and you know how many when you start the program. For example:
int numWorkers = Integer.parseInt(args[0]);
int threadPoolSize = Integer.parseInt(args[1]);
ExecutorService tpes =
Executors.newFixedThreadPool(threadPoolSize);
WorkerThread[] workers = new WorkerThread[numWorkers];
for (int i = 0; i < numWorkers; i++) {
workers[i] = new WorkerThread(i);
tpes.execute(workers[i]);
}
Where each WorkerThread does something really simple; that part is arbitrary. What I want to know is, what if you have a fixed pool size (say 8 max) but you don't know how many workers you'll need to finish the task until runtime?
The specific example is: If I have a pool size of 8 and I'm reading from standard input. As I read, I split the input into blocks of a set size. Each one of these blocks is given to a thread (along with some other information) so that they can compress it. As such, I don't know how many threads I'll need to create as I need to keep going until I reach the end of the input. I also have to somehow ensure that the data stays in the same order. If thread 2 finishes before thread 1 and just submits its work, my data will be out of order!
Would a thread pool be the wrong approach in this situation then? It seems like it'd be great (since I can't use more than 8 threads at a time).
Basically, I want to do something like this:
ExecutorService tpes = Executors.newFixedThreadPool(threadPoolSize);
BufferedInputStream inBytes = new BufferedInputStream(System.in);
byte[] buff = new byte[BLOCK_SIZE];
byte[] dict = new byte[DICT_SIZE];
WorkerThread worker;
int bytesRead = 0;
while((bytesRead = inBytes.read(buff)) != -1) {
System.arraycopy(buff, BLOCK_SIZE-DICT_SIZE, dict, 0, DICT_SIZE);
worker = new WorkerThread(buff, dict);
tpes.execute(worker);
}
This is not working code, I know, but I'm just trying to illustrate what I want.
I left out a bit, but see how buff and dict have changing values and that I don't know how long the input is. I don't think I can actually do this though, because worker already exists after the first call! I can't just say worker = new WorkerThread a bunch of times, since isn't it already pointing to an existing thread (true, a thread that might be dead)? And obviously, in this implementation, even if it did work I wouldn't be running in parallel. But my point is, I want to keep creating threads until I hit the max pool size, wait until a thread is done, then keep creating threads until I hit the end of the input.
I also need to keep stuff in order, which is the part that's really annoying.

Your solution is completely fine (the only point is that parallelism is perhaps not necessary if the workload of your WorkerThreads is very small).
With a thread pool, the number of submitted tasks is not relevant. There may be fewer or more tasks than threads in the pool; the thread pool takes care of that.
However, and this is important: You rely on some kind of order of the results of your WorkerThreads, but when using parallelism, this order is not guaranteed! It doesn't matter whether you use a thread pool, or how many worker threads you have, etc.; it will always be possible for your results to finish in an arbitrary order!
To keep the order right, give each WorkerThread the number of the current item in its constructor, and let them put their results in the right order after they are finished:
int noOfWorkItem = 0;
while((bytesRead = inBytes.read(buff)) != -1) {
System.arraycopy(buff, BLOCK_SIZE-DICT_SIZE, dict, 0, DICT_SIZE);
worker = new WorkerThread(buff, dict, noOfWorkItem++);
tpes.execute(worker);
}
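One way the numbering idea could look in practice is sketched below. The class name, the `work` function (a stand-in for the compression step) and the use of a ConcurrentSkipListMap to collect results are all assumptions for illustration. Note that each task gets its own copy of the block, because the read loop keeps overwriting the shared buffer while earlier tasks may still be running.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

final class IndexedBlocks {

    // Reads blocks until end of input, processes them in parallel, and
    // reassembles the results in block-number order.
    static byte[] process(InputStream in, int blockSize, int threads,
                          UnaryOperator<byte[]> work) {
        ConcurrentSkipListMap<Integer, byte[]> results = new ConcurrentSkipListMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            byte[] buff = new byte[blockSize];
            int index = 0;
            int bytesRead;
            while ((bytesRead = in.read(buff)) != -1) {
                final int blockNo = index++;
                final byte[] block = Arrays.copyOf(buff, bytesRead); // private copy per task
                pool.execute(() -> results.put(blockNo, work.apply(block)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            pool.shutdown();
        }
        try {
            pool.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        // The sorted map iterates by ascending block number, whatever the finish order was.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : results.values()) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }
}
```

This version holds all results until the end; for very long inputs, writing out completed blocks as soon as the next expected number is available would keep memory bounded.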

As @ignis points out, parallel execution may not be the best answer for your situation.
However, to answer the more general question, there are several other Executor implementations to consider beyond FixedThreadPool, some of which may have the characteristics that you desire.
As far as keeping things in order, typically you would submit tasks to the executor, and for each submission, you get a Future (which is an object that promises to give you a result later, when the task finishes). So, you can keep track of the Futures in the order that you submitted tasks, and then when all tasks are done, invoke get() on each Future in order, to get the results.
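The Future-based ordering described above might be sketched like this. The generic `mapInOrder` helper and its `work` parameter are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class OrderedFutures {

    // Submits one task per input and returns the results in submission order,
    // regardless of the order in which the tasks actually finish.
    static <T, R> List<R> mapInOrder(Iterable<T> inputs, int threads, Function<T, R> work) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                futures.add(pool.submit(() -> work.apply(input)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until this particular task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Since get() is called on the futures in the order they were created, a slow early task simply delays the collection of later results; it never reorders them.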

java while loop memory leak

I used a while loop to fetch messages from Amazon SQS. Partial code is as follows:
ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
while (true) {
List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
if (messages.size() > 0) {
MemcachedClient c = new MemcachedClient(new BinaryConnectionFactory(), AddrUtil.getAddresses(memAddress));
for (Message message : messages) {
// get message from aws sqs
String messageid = message.getBody();
String messageRecieptHandle = message.getReceiptHandle();
sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageRecieptHandle));
// get details info from memcache
String result = null;
String key = null;
key = "message-"+messageid;
result = (String) c.get(key); // get() returns Object, so a cast is needed
}
c.shutdown();
}
}
Will it cause a memory leak in this case?
I checked using "ps aux". What I found is that the RSS (resident set size, the non-swapped physical memory that a task used) is growing slowly.

You can't evaluate whether your Java application has a memory leak simply based on the RSS of the process. Most JVMs are pretty greedy; they would rather take more memory from the OS than spend a lot of work on garbage collection.
That said your while loop doesn't seem like it has any obvious memory "leaks" either, but that depends on what some of the method calls do (which isn't included above). If you are storing things in static variables, that can be a cause of concern but if the only references are within the scope of the loop you're probably fine.
The simplest way to know if you have a memory leak in a certain area of code is to rigorously exercise that code within a single run of your application (potentially set with a relatively low maximum heap size). If you get an OutOfMemoryError, you probably have a memory leak.

Sorry, but I don't see code here that removes messages from the message queue. Did you clear the message list? If DeleteMessageRequest removes the message from the queue, then you are trying to modify the message list that you are iterating over.
Also, you can get better memory usage statistics with the VisualVM tool, which was bundled with the JDK in versions 6 through 8 and is now a separate download.
