Heap size issue - Memory management using Java

I have the following code in my application which does two things:
Parse a file that contains 'n' records.
For each record in the file, make two web service calls.
public static List<String> parseFile(String fileName) {
    List<String> idList = new ArrayList<String>();
    try {
        BufferedReader cfgFile = new BufferedReader(new FileReader(new File(fileName)));
        String line = null;
        cfgFile.readLine();
        while ((line = cfgFile.readLine()) != null) {
            if (!line.trim().equals("")) {
                String[] fields = line.split("\\|");
                idList.add(fields[0]);
            }
        }
        cfgFile.close();
    } catch (IOException e) {
        System.out.println(e + " Unexpected File IO Error.");
    }
    return idList;
}
When I try to parse a file with 1 million lines of records, the Java process fails after processing a certain amount of data with java.lang.OutOfMemoryError: Java heap space. I can partly see that the process stops because of the sheer volume of data. Please suggest how I should handle data of this size.
EDIT: Will this part of the code, new BufferedReader(new FileReader(new File(fileName)));, read the whole file, so that its memory use depends on the size of the file?

The problem is that you are accumulating all the data in the list. The best approach is to do it in a streaming fashion: do not accumulate all the ids in the list, but call your web service for each row, or accumulate a smaller buffer and then make the call.
Opening the file and creating the BufferedReader has no impact on memory consumption, since the bytes of the file are read (more or less) line by line. The problem is at this point in the code, idList.add(fields[0]);: the list grows as large as the file because you keep accumulating all of the file's data in it.
Your code should do something like this:
while ((line = cfgFile.readLine()) != null) {
    if (!line.trim().equals("")) {
        String[] fields = line.split("\\|");
        callToRemoteWebService(fields[0]);
    }
}

Increase your Java heap size using the -Xms and -Xmx options. If not set explicitly, the JVM picks ergonomic defaults for the heap size, which in your case are not enough. Read this paper to find out more about tuning JVM memory: http://www.oracle.com/technetwork/java/javase/tech/memorymanagement-whitepaper-1-150020.pdf
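For example, the flags are passed when launching the JVM; the class name and the 2 GB figure below are purely illustrative, so size the heap to your data volume and available RAM:
java -Xms512m -Xmx2g com.example.FileParser input.txt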
EDIT: An alternative is to do this in a producer-consumer fashion to exploit parallel processing. The general idea is to have one producer thread that reads the file and queues tasks for processing, and n consumer threads that consume them. A very general sketch (for illustrative purposes) is the following:
// blocking queue holding the tasks to be executed
final SynchronousQueue<Callable<Void>> queue = new SynchronousQueue<Callable<Void>>();

// reads the file and submits tasks for processing
final Runnable producer = new Runnable() {
    public void run() {
        BufferedReader in = null;
        try {
            in = new BufferedReader(new FileReader(new File(fileName)));
            String line = null;
            while ((line = in.readLine()) != null) {
                if (!line.trim().equals("")) {
                    final String[] fields = line.split("\\|");
                    // this will block if no consumer thread is available to process it...
                    queue.put(new Callable<Void>() {
                        public Void call() {
                            process(fields);
                            return null;
                        }
                    });
                }
            }
        } catch (IOException e) {
            // handle the read failure here...
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // close the buffered reader here...
        }
    }
};

// Consumes the tasks submitted by the producer. Consumers can be pooled
// for parallel processing.
final Runnable consumer = new Runnable() {
    public void run() {
        try {
            while (true) {
                // take() blocks if there are no items left for processing in the queue...
                Callable<Void> task = queue.take();
                task.call();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // handle a failed task here...
        }
    }
};
Of course you have to write code that manages the lifecycle of the producer and consumer threads. The right way to do this would be to implement it using an Executor.
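As a minimal sketch of that Executor-based approach (assuming a process(String[] fields) method that performs the two web service calls; the method name, pool size, and permit count are illustrative, not prescriptive):
// Needs java.io.* and java.util.concurrent.*
static void parseAndCall(String fileName) throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);    // tune to your CPU and the service's limits
    final Semaphore permits = new Semaphore(100);              // bounds the number of in-flight lines
    BufferedReader in = new BufferedReader(new FileReader(fileName));
    try {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().equals("")) continue;
            final String[] fields = line.split("\\|");
            permits.acquire();                                 // blocks while too many tasks are pending
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        process(fields);                       // the two web service calls
                    } finally {
                        permits.release();
                    }
                }
            });
        }
    } finally {
        in.close();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
The semaphore plays the role of the SynchronousQueue above: it stops the reader from racing ahead of the workers, so only a bounded number of lines is ever held in memory.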

When you want to work with big data, you have two choices:
Use a big enough heap to fit all the data. This will "work" for a while, but if your data size is unbounded it will eventually fail.
Work with the data incrementally, keeping only a bounded part of the data in memory at any one time. This is the ideal solution, as it scales to any amount of data (see the sketch after this list).
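On Java 8+ the incremental approach can be written very compactly as a stream over the file; handleId here is a placeholder for whatever work (for example the two web service calls) you do per record:
// Needs java.nio.file.Files, java.nio.file.Paths and java.util.stream.Stream
try (Stream<String> lines = Files.lines(Paths.get(fileName))) {
    lines.skip(1)                                   // skip the header line, as the original code does
         .filter(line -> !line.trim().isEmpty())
         .map(line -> line.split("\\|")[0])
         .forEach(id -> handleId(id));              // placeholder for the per-record work
}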

Related

Using a thread pool to add into a list

I am trying to read a file and add each line to a list.
Simple drawing explaining the goal
Main class -
public class SimpleTreadPoolMain {
    public static void main(String[] args) {
        ReadFile reader = new ReadFile();
        File file = new File("C:\\myFile.csv");
        try {
            reader.readFile(file);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Reader class -
public class ReadFile {
    ExecutorService executor = Executors.newFixedThreadPool(5); // creating a pool of 5 threads
    List<String> list = new ArrayList<>();

    void readFile(File file) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                Runnable saver = new SaveToList<>(line, list);
                executor.execute(saver); // calling execute method of ExecutorService
            }
        }
        executor.shutdown();
        while (!executor.isTerminated()) { } // busy-wait until all tasks have finished
    }
}
Saver class -
public class SaveToList<E> implements Runnable {
    List<E> myList;
    E line;

    public SaveToList(E line, List<E> list) {
        this.line = line;
        this.myList = list;
    }

    public void run() {
        // modify the line
        myList.add(line);
    }
}
I tried to have many saver threads adding to the same list, instead of one saver adding to the list one by one. I want to use threads because I need to modify the data before adding it to the list, and I assume that modification takes some time, so parallelizing it should reduce the total time, right?
But this doesn't work. I am unable to get back a single global list that contains all the values from the file. I want only one global list of values from the file, so the code clearly has to change. If someone can guide me it would be greatly appreciated.
Even though adding one by one in a single thread would work, using a thread pool would make it faster, right?
Using multiple threads won't speed anything up here.
You are:
Reading a line from a file, serially.
Creating a runnable and submitting it into a thread pool
The runnable then adds things into a list
Given that you're using an ArrayList, you need to synchronize access to it, because you're mutating it from multiple threads. So, you are adding things into the list serially.
But even without the synchronization, the time taken for the I/O will far exceed the time taken to add the string to the list. And adding multithreading is just going to slow it down further, because of the work to construct the runnable, submit it to the thread pool, schedule it, and so on.
It's simpler just to skip the whole middle step:
Read a line from the file, serially.
Add the line to the list, serially.
So:
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        list.add(line);
    }
}
You should in fact test whether multithreading is worth it in your application: compare how long it takes to read the whole file without doing any processing on the rows against how long it takes to process the whole file serially.
If your per-row processing is not too complex, my guess is that multithreading is not worth it.
If you find that the processing takes much longer, you can think about using one or more threads for the computations.
If so, you could use Futures to process batches of input strings, or you could use a thread-safe queue to hand strings off to another worker.
private static final int BATCH_SIZE = 1000;

public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("big_file.csv"), "utf-8"));
    ExecutorService pool = Executors.newFixedThreadPool(8);
    String line;
    List<String> batch = new ArrayList<>(BATCH_SIZE);
    List<Future<List<Object>>> results = new LinkedList<>();
    while ((line = reader.readLine()) != null) {
        batch.add(line);
        if (batch.size() >= BATCH_SIZE) {
            results.add(noWaitExec(batch, pool));
            batch = new ArrayList<>(BATCH_SIZE);
        }
    }
    results.add(noWaitExec(batch, pool)); // submit the last, partial batch
    for (Future<List<Object>> future : results) {
        try {
            List<Object> batchResults = future.get();
            // Use your results here
        } catch (Exception e) {
            // Manage this....
        }
    }
    pool.shutdown();
    reader.close();
}

private static Future<List<Object>> noWaitExec(final List<String> batch, ExecutorService pool) {
    return pool.submit(new Callable<List<Object>>() {
        public List<Object> call() throws Exception {
            List<Object> result = new ArrayList<>(batch.size());
            for (String string : batch) {
                result.add(process(string));
            }
            return result;
        }
    });
}

private static Object process(String string) {
    // Your process ....
    return null;
}
There are many other possible solutions (Observables, parallel streams, pipes, CompletableFutures... you name it), but I still think most of the time is spent reading the file; just using a BufferedInputStream with a big enough buffer to read the file could cut your times more than parallel computing would.
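For example, the buffer size can be passed explicitly; the 1 MB value below is only an illustration, so benchmark on your own hardware:
// 1 MB read buffer instead of the default 8 KB
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new BufferedInputStream(new FileInputStream("big_file.csv"), 1 << 20),
                "utf-8"));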

Exec'ing multiple processes from Java: outputs mixed up?

I have a servlet which creates a new Action object inside doGet(), and this object uses exec() to run external processes. There may be several requests to the servlet at the same time, so I might have several Action objects where each one is running an external process at the same time. Occasionally when this happens I find that the output from one process gets mixed up with the output from one of the others.
Each Action creates a unique temporary directory and runs the process with this as its current directory. Separate threads then read the output streams from the process into a string buffer. The code that the Action object executes looks basically like this:
Process proc = null;
ReaderThread stdout = null;
ReaderThread stderr = null;
StringBuffer buff = new StringBuffer();
int exit = -1;
try {
    //
    // Run the process
    //
    proc = Runtime.getRuntime().exec(command, null, directory);
    //
    // Read the output from the process
    //
    stdout = new ReaderThread(proc.getInputStream(), buff);
    stderr = new ReaderThread(proc.getErrorStream(), buff);
    stdout.start();
    stderr.start();
    //
    // Get the exit code
    //
    exit = proc.waitFor();
    //
    // Wait for all the output to be read
    //
    stdout.join();
    stderr.join();
}
catch (InterruptedException e) {
    if (proc != null) {
        proc.destroy();
    }
}
catch (Exception e) {
    buff.append(e.getClass() + ": " + e.getMessage());
    if (proc != null) {
        proc.destroy();
    }
}
So each request uses a separate Action object to run a process, and this has its own StringBuffer "buff" that the output of the process is accumulated into by the two ReaderThreads. But what I find is that, when two requests are running two processes at the same time, the output of one will sometimes end up in the StringBuffer of the thread that is running the other one, and one of the two servlet requests will see output intended for the other one. It basically behaves as if Runtime.exec() provides a single global pipe to which the output streams of all the processes are connected.
The ReaderThread looks like this:
public class ReaderThread extends Thread {
    private BufferedReader reader;
    private StringBuffer buffer;

    public ReaderThread(InputStream stream, StringBuffer buffer) {
        this.reader = new BufferedReader(new InputStreamReader(stream));
        this.buffer = buffer;
    }

    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                synchronized (buffer) {
                    buffer.append(line + "\n");
                }
            }
        }
        catch (IOException e) {
            synchronized (buffer) {
                buffer.append(e.getMessage() + "\n");
            }
        }
    }
}
Can anyone suggest what I can do to fix this?
Use a ThreadLocal variable to store the output of each thread.
When and how should I use a ThreadLocal variable?
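A rough sketch of that suggestion (Java 8+; the class and method names are made up for illustration), in which each thread appends to its own buffer instead of a shared one:
public class PerThreadOutput {
    private static final ThreadLocal<StringBuilder> OUTPUT =
            ThreadLocal.withInitial(StringBuilder::new);

    static void append(String line) {
        OUTPUT.get().append(line).append('\n');   // touches only the calling thread's buffer
    }

    static String drain() {
        String text = OUTPUT.get().toString();
        OUTPUT.remove();                          // avoid leaks when threads are pooled
        return text;
    }
}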
Here's the explanation, partly as a cautionary tale:
the output was being added to an ArrayList in the XML node that accessed it
the XML nodes were created by cloning prototypes
the ArrayList was initialised in the declaration, not in the initialise() method of the class, so every instance ended up referring to the same object.
Duh.
Another two days of my life down the drain!
Happy new year...

Writing to one file from two threads produces unexpected result

When I run this test
public class Test extends Thread {
    String str;

    Test(String s) {
        this.str = s;
    }

    @Override
    public void run() {
        try {
            FileWriter fw = new FileWriter("1.txt", true);
            for (char c : str.toCharArray()) {
                System.out.print(c);
                fw.write(c);
            }
            fw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        new File("1.txt").delete();
        new Test("11111111111111111111").start();
        new Test("22222222222222222222").start();
    }
}
it shows exactly how it writes characters to 1.txt
2222222222222222111211111211111121211111
but in 1.txt I see a different result
2222222222222222222211111111111111111111
why is that?
Intermediate buffers. Modern operating systems usually buffer file writes so they can write full sectors at once, avoid too many hard-drive head seeks, allow the use of DMA techniques, and so on.
This may be an example of asynchronous I/O writes.
The kernel updates each process's pages in the page cache and marks them dirty (they still need to be written to the disk). Control then returns quickly to each of the two processes, which keep running and printing to the console in whatever order the scheduler runs them. The data is flushed to the disk later, at a more optimal time (low CPU load) and in a more optimal way (sequentially bunched writes), which is why each process's writes appear as one sequential block in the file.

Java: Pause thread and get position in file

I'm writing an application in Java with multithreading which I want to pause and resume.
The thread reads a file line by line, looking for lines that match a pattern. It has to continue at the place where I paused it. To read the file I use a BufferedReader in combination with an InputStreamReader and FileInputStream.
fip = new FileInputStream(new File(*file*));
fileBuffer = new BufferedReader(new InputStreamReader(fip));
I use this FileInputStream because I need the file pointer for the position in the file.
When processing the lines, the thread writes the matching lines to a MySQL database. To share MySQL connections between the threads I use a connection pool, which makes sure each connection is used by just one thread at a time.
The problem is that when I pause the threads and resume them, a few matching lines just disappear. I also tried subtracting the buffer size from the offset, but it still has the same problem.
What is a decent way to solve this problem or what am I doing wrong?
Some more details:
The loop
// Regex engine
RunAutomaton ra = new RunAutomaton(this.conf.getAuto(), true);
lw = new LogWriter();
while ((line = fileBuffer.readLine()) != null) {
    if (line.length() > 0) {
        if (ra.run(line)) {
            // Write to LogWriter
            lw.write(line, this.file.getName());
            lw.execute();
        }
    }
}
// Loop when paused.
while (pause) { }
}
Calculating place in file
// Get the position in the file
public long getFilePosition() throws IOException {
long position = fip.getChannel().position() - bufferSize + fileBuffer.getNextChar();
return position;
}
Putting it into the database
// Get the connector
ConnectionPoolManager cpl = ConnectionPoolManager.getManager();
Connector con = null;
while(con == null)
con = cpl.getConnectionFromPool();
// Insert the query
con.executeUpdate(this.sql.toString());
cpl.returnConnectionToPool(con);
Here's an example of what I believe you're looking for. You didn't show much of your implementation so it's hard to debug what might be causing gaps for you. Note that the position of the FileInputStream is going to be a multiple of 8192 because the BufferedReader is using a buffer of that size. If you want to use multiple threads to read the same file you might find this answer helpful.
public class ReaderThread extends Thread {
    private final FileInputStream fip;
    private final BufferedReader fileBuffer;
    private volatile boolean paused;

    public ReaderThread(File file) throws FileNotFoundException {
        fip = new FileInputStream(file);
        fileBuffer = new BufferedReader(new InputStreamReader(fip));
    }

    public void setPaused(boolean paused) {
        this.paused = paused;
    }

    public long getFilePos() throws IOException {
        return fip.getChannel().position();
    }

    public void run() {
        try {
            String line;
            while ((line = fileBuffer.readLine()) != null) {
                // process your line here
                System.out.println(line);
                while (paused) {
                    sleep(10);
                }
            }
        } catch (IOException e) {
            // handle I/O errors
        } catch (InterruptedException e) {
            // handle interrupt
        }
    }
}
I think the root of the problem is that you shouldn't be subtracting bufferSize. Rather you should be subtracting the number of unread characters in the buffer. And I don't think there's a way to get this.
The easiest solution I can think of is to create a custom subclass of FilterReader that keeps track of the number of characters read. Then stack the streams as follows:
FileReader
< BufferedReader
< custom filter reader
< BufferedReader(sz == 1)
The final BufferedReader is there so that you can use readLine ... but you need to set the buffer size to 1 so that the character count from your filter matches the position that the application has reached.
Alternatively, you could implement your own readLine() method in the custom filter reader.
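A minimal sketch of such a counting filter (the class name is made up for illustration); it assumes every character the application consumes flows through one of the two read methods, which is the case when a BufferedReader sits on top of it:
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class CountingReader extends FilterReader {
    private long charsRead = 0;

    public CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) charsRead++;
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n != -1) charsRead += n;
        return n;
    }

    public long getCharsRead() {
        return charsRead;
    }
}
Note that this counts characters, not bytes, so it only maps directly onto a file offset for single-byte encodings.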
After a few days of searching I found that subtracting the buffer size and adding the position in the buffer was indeed not the right way to do it. The position was never right and I was always missing some lines.
While searching for a new approach I didn't go down the character-counting route, because counting that many characters would hurt performance a lot. But I found something else: software engineer Mark S. Kolich created a class, JumpToLine, which uses the Apache IO library to jump to a given line. It can also report the last line it has read, which is exactly what I need.
There are some examples on his homepage for those interested.

Threading in Java for files

I want to read the contents of files in Java. I have about 8000 files whose contents I need to read and store in a HashMap of (path, contents). I think using threads would be an option to speed this up.
From what I know, reading all 8000 files in different threads at once is not possible (we probably want to limit the number of threads). Any comments on that? I am also new to threading in Java; can anyone help me get started?
So far I have this pseudocode:
public class ThreadingTest extends Thread {
    public HashMap<String, String> contents = new HashMap<String, String>();

    public ThreadingTest(ArrayList<String> paths) throws IOException {
        for (String s : paths) {
            // paths is paths to files.
            // Have threading here for each path going to get contents from a
            // file
            // Not sure how to limit and start threads here
            readFile(s);
            Thread t = new Thread();
            t.start();
        }
    }

    public String readFile(String path) throws IOException {
        FileReader reader = new FileReader(path);
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString();
    }
}
Any help in completing the threading part would be appreciated. Thanks.
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService which can read your files in parallel. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
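A rough sketch of that measurement, assuming the files are small enough to load whole with Files.readAllBytes (the method name readAll and the timing output are illustrative only):
// Needs java.nio.file.*, java.nio.charset.StandardCharsets and java.util.concurrent.*
static Map<String, String> readAll(List<Path> paths, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    final Map<String, String> contents = new ConcurrentHashMap<>();
    List<Future<?>> pending = new ArrayList<>();
    long start = System.nanoTime();
    for (final Path p : paths) {
        pending.add(pool.submit(new Runnable() {
            public void run() {
                try {
                    contents.put(p.toString(), new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                } catch (IOException e) {
                    // record the failure for this file
                }
            }
        }));
    }
    for (Future<?> f : pending) {
        f.get();                        // wait for every read to finish
    }
    pool.shutdown();
    System.out.println(threads + " threads: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    return contents;
}
Running it with different thread counts (1, 2, 4, ...) quickly shows whether extra threads help or just make the disk seek more.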
This is a typical task for thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

public class PooledFileProcessing {
    private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());

    // Integer.MAX_VALUE items max
    private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();

    private ExecutorService executor = new ThreadPoolExecutor(
        5, // five workers by default
        20, // up to twenty workers
        1, TimeUnit.MINUTES, // idle thread dies in one minute
        workQueue
    );

    public void process(final String basePath) {
        visit(new File(basePath));
        System.out.println(workQueue.size() + " jobs still in queue");
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            System.out.println("interrupted while awaiting termination");
        }
        System.out.println(contents.size() + " files indexed");
    }

    public void visit(final File file) {
        if (!file.exists()) {
            return;
        }
        if (file.isFile()) { // skip the dirs
            executor.submit(new RunnablePullFile(file));
        }
        // traverse children
        if (file.isDirectory()) {
            final File[] children = file.listFiles();
            if (children != null && children.length > 0) {
                for (File child : children) {
                    visit(child);
                }
            }
        }
    }

    public static void main(String[] args) {
        new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
    }

    protected class RunnablePullFile implements Runnable {
        private final File file;

        public RunnablePullFile(File file) {
            this.file = file;
        }

        public void run() {
            BufferedReader reader = null;
            try {
                reader = new BufferedReader(new FileReader(file));
                StringBuilder sb = new StringBuilder();
                String line;
                while (
                    (line = reader.readLine()) != null &&
                    sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
                ) {
                    sb.append(line);
                }
                contents.put(file.getPath(), sb.toString());
            } catch (IOException e) {
                System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
            } finally {
                // close the reader on success as well as on failure
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e1) {
                        // ignore that one
                    }
                }
            }
        }
    }
}
From my experience, threading helps: use a thread pool and experiment with values around 1 to 2 threads per core.
Just take care with the hash map: consider putting data into the map only via a synchronized method. I remember once having some ugly issues in a similar project, and they were related to concurrent modifications of a central hash map.
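For instance, either of these avoids concurrent modification of a plain HashMap (the field names are illustrative):
// Option 1: every call on the wrapper is synchronized
Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());
// Option 2: a ConcurrentHashMap is designed for many concurrent writers
Map<String, String> contents2 = new ConcurrentHashMap<String, String>();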
Just some quick tips.
First of all, to get started with threads you should look at the Runnable interface or the Thread class. To make a thread you either implement that interface in a class or extend Thread with another class. You can also make anonymous threads, but I dislike their readability unless it's something SUPER simple.
Next, some notes on processing text with multiple threads, because it just so happens I have experience with exactly this. Keep in mind that if the files are large and take a noticeably long time to process, you will want to monitor your CPU. In my case the processing did lots of calculations and lookups, which added hugely to the load, so in the end I could only run as many threads as I had processors because each thread was so labor intensive. So keep that in mind: monitor the effect each thread has on the processor.
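A common starting point for CPU-bound processing like that is to size the pool by the number of available processors; this is only a heuristic, so measure and adjust:
// For CPU-bound work, one thread per core is a reasonable first guess.
int cores = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(cores);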
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.
