I am writing a lot of data to standard output, and I noticed that the program's execution time varies with the output console: the program is slower in the NetBeans console than in the Windows cmd console, for example.
So I think writing to stdout fills a buffer, and writing becomes blocking when this buffer is full (the console output doesn't consume it fast enough).
I reproduced this behavior with a Java program.
Here is a program that outputs data:
public class Output {

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10_000; i++) {
            System.out.println("Awesome Output");
        }
        System.out.println("Total time : " + (System.currentTimeMillis() - start));
    }
}
And here is a program that consumes the output of the program above:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public class StdoutBlocking {

    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder processBuilder = new ProcessBuilder("java", "stackoverflowDemo.StdoutBlocking.Output");
        processBuilder.directory(new File("C:/Users/me/Documents/NetBeansProjects/tmp/build/classes"));
        Process process = processBuilder.start();
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = null;
        String lastLine = null;
        while ((line = reader.readLine()) != null) {
            Thread.sleep(10);
            lastLine = line;
        }
        System.out.println("done : " + lastLine);
    }
}
I launch the second one, which starts Output and consumes its output.
Without the sleep, the Output program finishes really quickly; with it, it's very slow.
So what is the size of this buffer?
This is just out of curiosity; I'm not trying to achieve anything in particular.
The default buffer size for System.out and System.err is 8192 bytes. You can see this by looking at the source code of methods like println(String): they use a BufferedWriter with the default buffer size.
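If you want to experiment with the effect of the buffer size yourself, you can replace System.out with a PrintStream over a larger buffer. This is just a sketch for curiosity's sake; the 64 KB size is an arbitrary example, not a recommendation:

import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class BigBufferOutput {

    public static void main(String[] args) {
        // Wrap the raw stdout file descriptor in a 64 KB buffer instead of the default 8192.
        PrintStream bigOut = new PrintStream(
                new BufferedOutputStream(new FileOutputStream(FileDescriptor.out), 64 * 1024),
                false); // autoFlush off, so println() no longer flushes every line
        System.setOut(bigOut);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10_000; i++) {
            System.out.println("Awesome Output");
        }
        System.out.flush(); // push whatever is still sitting in the buffer
        System.err.println("Total time : " + (System.currentTimeMillis() - start));
    }
}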
"Without sleep the program is really quick, with it it's very slow" - this has no relation to the buffer size. Your program sleeps at least 10 milliseconds for each line of output (10,000 lines). So with Thread.sleep(10) your program does nothing for at least 100 seconds. That is why it is slow with Thread.sleep().
Regarding performance of System.out.println() check this question: Why is System.out.println so slow?
The underlying OS operation (displaying chars in a console window) is slow because:
1. The bytes have to be sent to the console application (should be quite fast).
2. Each char has to be rendered using (usually) a TrueType font (that's pretty slow; switching off anti-aliasing could improve performance, btw).
3. The displayed area may have to be scrolled in order to append a new line to the visible area (best case: a bit-block transfer operation; worst case: re-rendering of the complete text area).
You can also find this in the Javadoc for the Process class:
Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock.
For UNIX you could check this link: https://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer
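To avoid the deadlock the Javadoc warns about, the usual pattern is to drain the child's output on its own thread (or merge stderr into stdout with redirectErrorStream). A minimal sketch, using a hypothetical child command:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DrainChildOutput {

    public static void main(String[] args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("java", "-version")
                .redirectErrorStream(true) // merge stderr into stdout so one drain thread suffices
                .start();
        // Consume the child's output promptly so it never blocks on a full pipe buffer.
        Thread drainer = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // handle (or discard) the line here
                }
            } catch (IOException ignored) {
            }
        });
        drainer.start();
        int exitCode = process.waitFor();
        drainer.join();
        System.out.println("child exited with " + exitCode);
    }
}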
I am making a Java program that reads data from a binary stream (using a DataInputStream).
Sometimes during this process I need to read a data chunk, but the method that reads it (which I cannot modify) will stop before reaching the end of the chunk (this is its normal behavior; apparently it just doesn't need the last bytes, but I can't do anything about the fact that they are there). This is not a problem in itself, because I know exactly how long the chunk is, i.e. how many bytes it contains, so I can skip bytes (with the skipBytes(int) method) until the end of the chunk. The problem is: I don't actually know how many bytes the method read (or left over), so I don't know how many bytes I need to skip to reach the end of the chunk.
Is there any way to:
know how many bytes were read from a stream since a certain point in time?
know how many bytes were read from a stream since it was opened?
any other way I could know how many bytes my chunk-reading method just read (since it won't directly tell me)?
Just in case, I made a small diagram.
Thanks in advance
ImageInputStream can do what you want. It implements DataInput and has most of the methods of InputStream. And it has getStreamPosition, seek, and skipBytes methods.
However, as you correctly point out, ImageIO.read(ImageInputStream) would close the stream, preventing you from reading more than one image.
The solution is to avoid using ImageIO.read, and instead obtain an ImageReader explicitly, using ImageIO.getImageReaders. Then you can invoke an ImageReader’s read method, which does not close the stream.
Here’s how I implemented it:
public void readImages(InputStream source,
Consumer<? super BufferedImage> imageHandler)
throws IOException {
// Every image is at a byte index which is a multiple of this number.
int boundary = 5000;
try (ImageInputStream stream = ImageIO.createImageInputStream(source)) {
while (true) {
long pos = stream.getStreamPosition();
Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
if (!readers.hasNext()) {
break;
}
ImageReader reader = readers.next();
reader.setInput(stream);
BufferedImage image = reader.read(0);
imageHandler.accept(image);
pos = stream.getStreamPosition();
long bytesToSkip = boundary - (pos % boundary);
if (bytesToSkip < boundary) {
stream.skipBytes(bytesToSkip);
}
}
}
}
And here’s how I tested it:
try (InputStream source = new BufferedInputStream(
        Files.newInputStream(Path.of(filename)))) {
    reader.readImages(source, img -> EventQueue.invokeLater(() -> {
        JOptionPane.showMessageDialog(null, new ImageIcon(img));
    }));
}
All the buffered read methods return the actual number of bytes read.
Quoting documentation for InputStream#read(byte[] b):
Returns:
the total number of bytes read into the buffer, or -1 if there is no more data because the end of the stream has been reached.
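If you cannot rely on a method's return value (as in the question, where the chunk reader doesn't report what it consumed), another option is to wrap the stream in a small byte-counting filter and diff the count around the call. This is a sketch of the idea, not a standard JDK class (Apache Commons IO ships a similar CountingInputStream):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Counts every byte that passes through the stream, including skipped ones.
class CountingInputStream extends FilterInputStream {

    private long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    long getCount() {
        return count;
    }
}

Insert it between the raw stream and the DataInputStream (new DataInputStream(countingStream)), record getCount() before calling the chunk-reading method, and afterwards skip chunkLength minus the difference between the two counts.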
I need to read a file one character at a time and I'm using the read() method from BufferedReader. *
I found that read() is about 10x slower than readLine(). Is this expected? Or am I doing something wrong?
Here's a benchmark with Java 7. The input test file has about 5 million lines and 254 million characters (~242 MB) **:
The read() method takes about 7000 ms to read all the characters:
@Test
public void testRead() throws IOException, UnindexableFastaFileException {
    BufferedReader fa = new BufferedReader(new FileReader(new File("chr1.fa")));
    long t0 = System.currentTimeMillis();
    int c;
    while ((c = fa.read()) != -1) {
        //
    }
    long t1 = System.currentTimeMillis();
    System.err.println(t1 - t0); // ~ 7000 ms
}
The readLine() method takes only ~700 ms:
@Test
public void testReadLine() throws IOException {
    BufferedReader fa = new BufferedReader(new FileReader(new File("chr1.fa")));
    String line;
    long t0 = System.currentTimeMillis();
    while ((line = fa.readLine()) != null) {
        //
    }
    long t1 = System.currentTimeMillis();
    System.err.println(t1 - t0); // ~ 700 ms
}
* Practical purpose: I need to know the length of each line, including the newline characters (\n or \r\n), AND the line length after stripping them. I also need to know whether a line starts with the > character. For a given file this is done only once, at the start of the program. Since EOL chars are not returned by BufferedReader.readLine(), I'm resorting to the read() method. If there are better ways of doing this, please say so.
** The gzipped file is here http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz. For those who may be wondering, I'm writing a class to index fasta files.
The important thing when analyzing performance is to have a valid benchmark before you start. So let's start with a simple JMH benchmark that shows what our expected performance after warmup would be.
One thing we have to consider is that since modern operating systems like to cache file data that is accessed regularly, we need some way to clear the caches between tests. On Windows there's a small utility that does just this; on Linux you should be able to do it by writing to some pseudo file somewhere.
The code then looks as follows:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
public class IoPerformanceBenchmark {

    private static final String FILE_PATH = "test.fa";

    @Benchmark
    public int readTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            int value;
            while ((value = reader.read()) != -1) {
                result += value;
            }
        }
        return result;
    }

    @Benchmark
    public int readLineTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result += line.chars().sum();
            }
        }
        return result;
    }

    private void clearFileCaches() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
and if we run it with
chcp 65001 # set codepage to utf-8
mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar
we get the following results (about 2 seconds are needed to clear the caches for me, and I'm running this on an HDD, so that's why it's a good deal slower than for you):
Benchmark Mode Cnt Score Error Units
IoPerformanceBenchmark.readLineTest avgt 20 3.749 ± 0.039 s/op
IoPerformanceBenchmark.readTest avgt 20 3.745 ± 0.023 s/op
Surprise! As expected, there's no performance difference here at all after the JVM has settled into a stable mode. But there is one outlier in the readTest method:
# Warmup Iteration 1: 6.186 s/op
# Warmup Iteration 2: 3.744 s/op
which is exactly the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here, or that the JIT is simply running too late to make a difference on the first iteration.
Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).
Solving such a problem is not easy and there are no general solutions, although there are ways to handle this. One easy test to see whether we're on the right track is to run the code with the -Xcomp option, which forces HotSpot to compile every method on its first invocation. And indeed, doing so causes the large delay at the first invocation to disappear:
# Warmup Iteration 1: 3.965 s/op
# Warmup Iteration 2: 3.753 s/op
Possible solution
Now that we have a good idea what the actual problem is (my guess is still all those locks neither being coalesced nor using the efficient biased-locking implementation), the solution is rather straightforward and simple: reduce the number of function calls (so yes, we could have arrived at this solution without everything above, but it's always nice to have a good grip on the problem, and there might have been a solution that didn't involve changing much code).
The following code runs consistently faster than either of the other two - you can play with the array size, but it's surprisingly unimportant (presumably because, contrary to the other methods, read(char[]) does not have to acquire a lock, so the cost per call is lower to begin with).
private static final int BUFFER_SIZE = 256;
private char[] arr = new char[BUFFER_SIZE];

@Benchmark
public int readArrayTest() throws IOException, InterruptedException {
    clearFileCaches();
    int result = 0;
    try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
        int charsRead;
        while ((charsRead = reader.read(arr)) != -1) {
            for (int i = 0; i < charsRead; i++) {
                result += arr[i];
            }
        }
    }
    return result;
}
This is most likely good enough performance-wise, but if you wanted to improve performance even further, using a file mapping might help (I wouldn't count on too large an improvement in a case such as this, but if you know that your text is always ASCII, you could make some further optimizations).
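For what it's worth, a sketch of the file-mapping variant mentioned above (assuming the file fits in a single mapping, i.e. is under 2 GB, and that one byte equals one character, i.e. ASCII):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReadSketch {

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("test.fa"), StandardOpenOption.READ)) {
            // Map the whole file read-only; a single mapping is limited to 2 GB.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get(); // one byte per character under the ASCII assumption
            }
            System.out.println("checksum: " + sum);
        }
    }
}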
So this is the practical answer to my own question: don't use BufferedReader.read(), use FileChannel instead. (Obviously I'm not answering the WHY I put in the title.) Here's the quick and dirty benchmark; hopefully others will find it useful:
@Test
public void testFileChannel() throws IOException {
    FileChannel fileChannel = FileChannel.open(Paths.get("chr1.fa"));
    long n = 0;
    int noOfBytesRead = 0;
    long t0 = System.nanoTime();
    while (noOfBytesRead != -1) {
        ByteBuffer buffer = ByteBuffer.allocate(10000);
        noOfBytesRead = fileChannel.read(buffer);
        buffer.flip();
        while (buffer.hasRemaining()) {
            char x = (char) buffer.get();
            n++;
        }
    }
    long t1 = System.nanoTime();
    System.err.println((float) (t1 - t0) / 1e6); // ~ 250 ms
    System.err.println("nchars: " + n); // 254235640 chars read
}
With ~250 ms to read the whole file char by char, this strategy is considerably faster than BufferedReader.readLine() (~700 ms), let alone read(). Adding if statements in the loop to check for x == '\n' and x == '>' makes little difference. Also, adding a StringBuilder to reconstruct lines doesn't affect the timing much. So this is plenty good for me (at least for now).
Thanks to @Marco13 for mentioning FileChannel.
Java JIT optimizes away empty loop bodies, so your loops actually look like this:
while((c = fa.read()) != -1);
and
while((line = fa.readLine()) != null);
I suggest you read up on benchmarking here and the optimization of the loops here.
As to why the time taken differs:
Reason one (this only applies if the bodies of the loops contain code): in the first example, you're doing one operation per character; in the second, one per line. This adds up the more lines/characters you have.
while ((c = fa.read()) != -1) {
    // One operation per character.
}

while ((line = fa.readLine()) != null) {
    // One operation per line.
}
Reason two: in the class BufferedReader, the method readLine() doesn't use read() behind the scenes - it uses its own code. The method readLine() performs fewer operations per character to read a line than it would take to read a line with the read() method - this is why readLine() is faster at reading an entire file.
Reason three: it takes more iterations to read each character than it does to read each line (unless each character is on a new line); read() is called more times than readLine().
Thanks @Voo for the correction. What I wrote below is correct from a FileReader#read() vs. BufferedReader#readLine() point of view, BUT not from a BufferedReader#read() vs. BufferedReader#readLine() point of view, so I have struck out the answer.
Using the read() method on a BufferedReader is not a good idea; it won't cause you any harm, but it certainly wastes the purpose of the class.
The whole purpose in life of a BufferedReader is to reduce I/O by buffering the content. You can read about it here in the Java tutorials. You may also notice that the read() method in BufferedReader is actually inherited from Reader, while readLine() is BufferedReader's own method.
If you want to use the read() method, then I would say you'd better use a FileReader, which is meant for that purpose. You can read about it here in the Java tutorials.
So, I think the answer to your question is very simple (without going into benchmarking and all those explanations):
Each read() is handled by the underlying OS and triggers disk access, network activity, or some other operation that is relatively expensive.
When you use readLine(), you save all these overheads, so readLine() will always be faster than read() - maybe not substantially for small data, but faster.
It is not surprising to see this difference if you think about it. One test is iterating over the lines of a text file, while the other is iterating over its characters.
Unless each line contains one character, it is expected that readLine() is way faster than the read() method (although, as pointed out by the comments above, it is arguable, since a BufferedReader buffers the input, while the physical file reading might not be the only performance-relevant operation).
If you really want to test the difference between the two, I would suggest a setup where you iterate over each character in both tests. E.g. something like:
void readTest(BufferedReader r) throws IOException {
    int c;
    StringBuilder b = new StringBuilder();
    while ((c = r.read()) != -1)
        b.append((char) c);
}

void readLineTest(BufferedReader r) throws IOException {
    String line;
    StringBuilder b = new StringBuilder();
    while ((line = r.readLine()) != null)
        for (int i = 0; i < line.length(); i++)
            b.append(line.charAt(i));
}
Besides the above, please use a Java performance diagnostic tool to benchmark your code. Also, read up on how to microbenchmark Java code.
According to the documentation:
Every read() method call makes an expensive system call.
Every readLine() method call still makes an expensive system call, however, for more bytes at once, so there are fewer calls.
A similar situation happens when we issue a database update command for each record we want to update, versus a batch update, where we make one call for all the records.
I'm running a multithreaded, minimalistic HTTP(S) server (not a web server, though) that accepts connections on three server sockets: local, internet, and internet-ssl.
Each socket has an SO timeout of 1000 ms (which might be lowered in the future).
The worker threads read requests like this:
byte[] reqBuffer = new byte[512];
theSocket.getInputStream().read(reqBuffer);
The problem now is that with the newly implemented SSL socket, the problem with the 1/n-1 record-splitting technique arises. Also, some clients split in other strange ways when using SSL (4/n-4 etc.), so I thought I might just perform multiple reads like this:
byte[] reqBuffer = new byte[512];
InputStream is = theSocket.getInputStream();
int read = is.read(reqBuffer, 0, 128); // initial read - with x/n-x this is very small
int pos = 0;
if (read > 0) {
    pos = read;
}
int i = 0;
do {
    read = is.read(reqBuffer, pos, 128);
    if (read > 0) {
        pos += read;
    }
    i++;
} while (read == 128 && i < 3); // max. 3 more reads (4 total = 512 bytes) or until fewer than 128 bytes are read (request should be completely read)
This works with browsers like Firefox or Chrome and other clients using that technique.
Now my problem is that the new method is much slower. Requests to the local socket are so slow that a script with a 2-second timeout times out while requesting (I have no idea why). Maybe I have some logical problem in my code?
Is there a better way to read from an SSL socket? There are up to hundreds or even a thousand requests per second, and the new read method slows down even the plain HTTP requests.
Note: the SSL socket is not in use at the moment and will not be used until I can fix this problem.
I have also tried reading line by line using a buffered reader, since we are talking about HTTP here, but the server exploded, running out of file descriptors (the limit is 20,000). That might have been because of my implementation, though.
I'm thankful for every suggestion regarding this problem. If you need more information about the code, just tell me and I will post it asap.
EDIT:
I actually put a bit more thought into what I am trying to do, and I realized that it comes down to reading HTTP headers. So the best solution would be to actually read the request line by line (or character by character) and stop reading after x lines, or when an empty line (marking the end of the header) is reached.
My current approach would be to put a BufferedInputStream around the socket's InputStream and read it with an InputStreamReader which is "read" by a BufferedReader (question: does it make sense to use a BufferedInputStream when I'm using a BufferedReader?).
This BufferedReader reads the request character by character, detects line endings (\r\n), and continues to read until either a line longer than 64 characters is reached, a maximum of 8 lines has been read, or an empty line is reached (marking the end of the HTTP header). I will test my implementation tomorrow and edit this edit accordingly.
EDIT:
I almost forgot to write my results here: It works. On every socket, even faster than the previously working way. Thanks everyone for pointing me in the right direction. I ended up implementing it like this:
List<String> requestLines = new ArrayList<String>(6);
InputStream is = this.cSocket.getInputStream();
BufferedInputStream bis = new BufferedInputStream(is, 1024);
InputStreamReader isr = new InputStreamReader(bis, Config.REQUEST_ENCODING);
BufferedReader br = new BufferedReader(isr);

/* read input character by character
 * maximum line size is 768 characters
 * maximum number of lines is 6
 * lines are defined as char sequences ending with \r\n
 * read lines are added to a list
 * reading stops at the first empty line => HTTP header end
 */
int readChar; // the last read character
int characterCount = 0; // the character count in the line that is currently being read
int lineCount = 0; // the overall line count
char[] charBuffer = new char[768]; // create a character buffer with space for 768 characters (max line size)

// read as long as the stream is not closed / EOF, the character count in the current line is below 768, and the number of lines read is below 6
while ((readChar = br.read()) != -1 && characterCount < 768 && lineCount < 6) {
    charBuffer[characterCount] = (char) readChar; // fill the char buffer with the read character
    if (readChar == '\n' && characterCount > 0 && charBuffer[characterCount - 1] == '\r') { // if end of line is detected (\r\n)
        if (characterCount == 1) { // if empty line
            break; // stop reading after an empty line (HTTP header ended)
        }
        requestLines.add(new String(charBuffer, 0, characterCount - 1)); // add the read line to the requestLines list (and leave out the \r)
        // charBuffer = new char[768]; // clear the buffer - not required
        characterCount = 0; // reset character count for next line
        lineCount++; // increase read line count
    } else {
        characterCount++; // if not end of line, increase read character count
    }
}
This is most likely slower because you are waiting for the other end to send more data, possibly data it is never going to send.
A better approach is to give it a larger buffer, like 32 KB (128 bytes is small), and only read the data that is available. If this data needs to be re-assembled into messages of some sort, you shouldn't be using timeouts or a fixed number of loops, as read() is only guaranteed to return at least one byte.
You should certainly wrap a BufferedInputStream around the SSLSocket's input stream.
Your technique of reading 128 bytes at a time and advancing the offset is completely pointless. Just read as much as you can at a time and deal with it. Or read one byte at a time from the buffered stream.
Similarly you should certainly wrap the SSLSocket's output stream in a BufferedOutputStream.
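Putting those suggestions together, a per-connection setup might look like the following sketch (the SslConnection class and the 32 KB size are made up for illustration):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.net.ssl.SSLSocket;

class SslConnection {

    private final InputStream in;
    private final OutputStream out;

    // Wrap the SSL socket's streams once per connection and reuse them.
    SslConnection(SSLSocket socket) throws IOException {
        this.in = new BufferedInputStream(socket.getInputStream(), 32 * 1024);
        this.out = new BufferedOutputStream(socket.getOutputStream(), 32 * 1024);
    }

    // One read returns whatever has arrived (at least one byte unless EOF),
    // so there is no need for a fixed number of 128-byte rounds.
    int readAvailable(byte[] buffer) throws IOException {
        return in.read(buffer, 0, buffer.length); // -1 on EOF
    }

    OutputStream output() {
        return out; // remember to flush() after writing a response
    }
}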
I have a very simple Java program that prints out 1 million random numbers. On Linux, I observed the %CPU that this program takes during its lifespan: it starts off at 98%, then gradually decreases to 2%, causing the program to be very slow. What are some of the factors that might cause the program to gradually get less CPU time?
I've tried running it with nice -20, but I still see the same results.
EDIT: running the program with /usr/bin/time -v, I'm seeing an unusual number of involuntary context switches (588 voluntary vs. 16478 involuntary), which suggests that the OS is letting some other, higher-priority process run.
It boils down to two things:
I/O is expensive, and
Depending on how you're storing the numbers as you go along, that can have an adverse effect on performance as well.
If you're mainly doing System.out.println(randInt) in a loop a million times, then that can get expensive. I/O isn't one of those things that comes for free, and writing to any output stream costs resources.
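To illustrate how much of that cost is per-call flushing overhead, one common mitigation (a sketch, not from the original answer) is to wrap stdout in a PrintWriter over a BufferedWriter and flush once at the end:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.Random;

public class BufferedPrintDemo {

    public static void main(String[] args) {
        Random random = new Random();
        // Buffer the writes so most println calls hit memory, not the console.
        try (PrintWriter out = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(System.out), 64 * 1024))) {
            for (int i = 0; i < 1_000_000; i++) {
                out.println(random.nextInt());
            }
        } // try-with-resources flushes and closes the writer here
    }
}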
I would start by profiling via JConsole or VisualVM to see what it's actually doing when it has a low CPU %. As mentioned in the comments, there's a high chance it's blocking, e.g. waiting for I/O (user input, an SQL query taking a long time, etc.).
If your application is I/O bound - for example waiting for responses from network calls, or for disk reads/writes - it will spend most of its time blocked rather than on the CPU, which shows up as low CPU usage.
If you want to try and balance everything, you should create a queue to hold the numbers to print, then have one thread generate them (the producer) and another read and print them (the consumer). This can easily be done with a LinkedBlockingQueue.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PrintQueueExample {

    private static BlockingQueue<Integer> printQueue = new LinkedBlockingQueue<Integer>();

    public static void main(String[] args) throws InterruptedException {
        PrinterThread thread = new PrinterThread();
        thread.start();
        for (int i = 0; i < 1000000; i++) {
            int toPrint = ...(i); // generate the number to print here (elided in the original)
            printQueue.put(Integer.valueOf(toPrint));
        }
        thread.interrupt();
        thread.join();
        System.out.println("Complete");
    }

    private static class PrinterThread extends Thread {
        @Override
        public void run() {
            try {
                while (true) {
                    Integer toPrint = printQueue.take();
                    System.out.println(toPrint);
                }
            } catch (InterruptedException e) {
                // Interruption comes from main, means processing numbers has stopped
                // Finish remaining numbers and stop thread
                List<Integer> remainingNumbers = new ArrayList<Integer>();
                printQueue.drainTo(remainingNumbers);
                for (Integer toPrint : remainingNumbers)
                    System.out.println(toPrint);
            }
        }
    }
}
There may be a few problems with this code, but this is the gist of it.
I need advice from someone who knows Java very well, including its memory issues.
I have a large file (something like 1.5 GB) and I need to cut it into many smaller files (100 small files, for example).
I know generally how to do it (using a BufferedReader), but I would like to know whether you have any advice regarding memory, or tips on how to do it faster.
My file contains text; it is not binary, and I have about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign it to variables outside the loop). Just process the output immediately as the input comes in.
It really doesn't matter whether you're using a BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. At most it will shave a few % off performance. The same applies to using NIO. It will only improve scalability, not memory use. It will only become interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to the other file as you read it in, count the lines, and if the count reaches 100, switch to the next file, et cetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        if (count++ % maxlines == 0) {
            close(writer);
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

// Null-safe helper used above to close the current resource quietly.
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
            // best effort
        }
    }
}
First, if your file contains binary data, then using a BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along line breaks, then using a BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1 MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannels.
They are generally a lot faster for large files. There are performance trade-offs that could make them slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
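As a sketch of that direction, FileChannel.transferTo can copy byte ranges between channels without pulling the data through a Java-side buffer. Note that this splits at byte offsets, so it would cut lines in half; it illustrates the channel API rather than the exact line-based task (the file names and the 16 MB chunk size are made up):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelSplit {

    public static void main(String[] args) throws IOException {
        long chunkSize = 16L * 1024 * 1024; // hypothetical 16 MB per output file
        try (FileChannel in = FileChannel.open(Paths.get("bigfile.txt"), StandardOpenOption.READ)) {
            long size = in.size();
            int part = 0;
            for (long pos = 0; pos < size; pos += chunkSize, part++) {
                try (FileChannel out = FileChannel.open(Paths.get("smallfile" + part + ".txt"),
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                    // Let the OS copy up to chunkSize bytes starting at pos.
                    in.transferTo(pos, Math.min(chunkSize, size - pos), out);
                }
            }
        }
    }
}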
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce disk access, you can use a large buffer. The article describes various approaches.
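For instance, a large-buffer read loop might look like this sketch (the 1 MB buffer size is just an illustration of the article's advice, not a tuned value):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LargeBufferRead {

    public static void main(String[] args) throws IOException {
        // A 1 MB buffer keeps the underlying disk reads large and mostly sequential.
        try (BufferedInputStream in = new BufferedInputStream(
                new FileInputStream("bigfile.txt"), 1024 * 1024)) {
            byte[] chunk = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n; // process whole chunks here instead of individual bytes
            }
            System.out.println("bytes read: " + total);
        }
    }
}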
Does it have to be done in Java? I.e., does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted to, you could execute this command via your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
I also think that using read() with arguments, i.e. read(char[] buffer, int offset, int length), is a better way to read such a large file
(e.g.: read(buffer, 0, buffer.length)).
I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream. So using a BufferedInputStream is much better in a case like this.
package all.is.well;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 *
 * The following implementation helps to deal with extra large files in Java.
 * This program is tested for dealing with a 2 GB input file.
 * There are some points where extra logic can be added in the future.
 * Please note: if we want to deal with a binary input file, then instead of reading lines, we need to read bytes from the file object.
 * It uses RandomAccessFile, which is almost like a streaming API.
 *
 * ****************************************
 * Notes regarding the executor framework and its readings.
 *
 * Please note: ExecutorService executor = Executors.newFixedThreadPool(10);
 *
 * For 10 threads: total time required for reading and writing the text: 349.317 seconds
 * For 100: total time required for reading and writing the text: 464.042 seconds
 * For 1000: total time required for reading and writing the text: 466.538 seconds
 * For 10000: total time required for reading and writing: 479.701 seconds
 */
public class DealWithHugeRecordsinFile extends TestCase {

    static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
    static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
    static volatile RandomAccessFile fileToWrite;
    static volatile RandomAccessFile file;
    static volatile String fileContentsIter;
    static volatile int position = 0;

    public static void main(String[] args) throws IOException, InterruptedException {
        long currentTimeMillis = System.currentTimeMillis();
        try {
            fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw"); // for random write, independent of thread obstacles
            file = new RandomAccessFile(FILEPATH, "r"); // for random read, independent of thread obstacles
            seriouslyReadProcessAndWriteAsynch();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Thread currentThread = Thread.currentThread();
        System.out.println(currentThread.getName());
        long currentTimeMillis2 = System.currentTimeMillis();
        double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
        System.out.println("Total time required for reading the text in seconds " + time_seconds);
    }

    /**
     * @throws IOException
     * Something asynchronously serious
     */
    public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
        ExecutorService executor = Executors.newFixedThreadPool(10); // see the explanation in the notes above
        while (true) {
            String readLine = file.readLine();
            if (readLine == null) {
                break;
            }
            Runnable genuineWorker = new Runnable() {
                @Override
                public void run() {
                    // do hard processing here in this thread; I have consumed
                    // some time and ignore some exceptions in the write method.
                    writeToFile(FILEPATH_WRITE, readLine);
                    // System.out.println(" :" + Thread.currentThread().getName());
                }
            };
            executor.execute(genuineWorker);
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
            // busy-wait until all submitted tasks have finished
        }
        System.out.println("Finished all threads");
        file.close();
        fileToWrite.close();
    }

    /**
     * @param filePath
     * @param data
     */
    private static void writeToFile(String filePath, String data) {
        try {
            // fileToWrite.seek(position);
            data = "\n" + data;
            if (!data.contains("Randomization")) {
                return;
            }
            System.out.println("Let us do something time consuming to make this thread busy" + (position++) + " :" + data);
            System.out.println("Lets consume through this loop");
            int i = 1000;
            while (i > 0) {
                i--;
            }
            fileToWrite.write(data.getBytes());
            throw new Exception();
        } catch (Exception exception) {
            System.out.println("exception was thrown but still we are able to proceed further"
                    + " \n This can be used for marking failure of the records");
            // exception.printStackTrace();
        }
    }
}
Don't use read() without arguments;
it's very slow.
Better to read into a buffer and move it to a file quickly.
Use BufferedInputStream because it supports binary reading.
And that's all.
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines and write it to 100 different files, one line in each, and make the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.