Efficiency of NLineInputFormat's InputSplit calculations

Efficiency of NLineInputFormat's InputSplit calculations - java

I looked into getSplitsForFile() fn of NLineInputFormat. I found that a InputStream is created for the input file & then its iterated and splits are created every n lines.
Is it efficient? Particularly when this read operation is happening on 1 node before launching a mapper task. What if 1 have 5gb of file. Basically it means file data is seeked twice, once during the split creation & once during read from the mapper tasks.
If this is a bottleneck how does hadoop job overrides this?
public static List<FileSplit> getSplitsForFile(FileStatus status,
Configuration conf, int numLinesPerSplit) throws IOException {
List<FileSplit> splits = new ArrayList<FileSplit> ();
Path fileName = status.getPath();
if (status.isDirectory()) {
throw new IOException("Not a file: " + fileName);
}
FileSystem fs = fileName.getFileSystem(conf);
LineReader lr = null;
try {
FSDataInputStream in = fs.open(fileName);
lr = new LineReader(in, conf);
Text line = new Text();
int numLines = 0;
long begin = 0;
long length = 0;
int num = -1;
<!-- my part of concern start -->
while ((num = lr.readLine(line)) > 0) {
numLines++;
length += num;
if (numLines == numLinesPerSplit) {
splits.add(createFileSplit(fileName, begin, length));
begin += length;
length = 0;
numLines = 0;
}
}
<!-- my part of concern end -->
if (numLines != 0) {
splits.add(createFileSplit(fileName, begin, length));
}
} finally {
if (lr != null) {
lr.close();
}
}
return splits;
}
Editing to provide my usecase to clément-mathieu
My data sets are big input files 2gb approx each. Each line in the files represent a record that needs to be inserted into the database's table (in my case cassandra)
I want to limit the bulk transactions to my database to every n-lines.
I have succeeded to do this using nlineinputformat. My only concern is if there is a hidden performance bottleneck that might show up in production.

Basically it means file data is seeked twice, once during the split creation & once during read from the mapper tasks.
Yes.
The purpose of this InputFormat is to create a split for every N-lines. The only way to compute the split boundaries is to read this file and find the new line characters. This operation can be costly, but you cannot avoid it if this is what you need.
If this is a bottleneck how does hadoop job overrides this?
Not sure to understand the question.
NLineInputFormat is not the default InputFormat and very few use cases require it. If you read the javadoc of the class you will see that this class mainly exists to feed the parameters to embarrassingly parallel jobs (= "small" input files).
Most of the InputFormat do no need to read the file to compute the splits. They usually use hard rules like a split should be 128MB or one split for each HDFS block and the RecordReaders will take care of the real start/end-of-split offset.
If the cost of NLineInputFormat.getSplitsForFile is an issue I would really review why I need to use this InputFormat. What you want to do is to limit the batch size of a business process in your mapper. With NLineInputFormat a mapper is created for every N lines, it means that a mapper will never do more than one bulk transaction. You don't seems to need this feature, you only want to limit the size of a bulk transaction but don't care if a mapper does several of them sequentially. So you are paying the cost of the code you spotted for nothing in return.
I would use TextInputFormat and create the batch in the mapper. In pseudo code:
setup() {
buffer = new Buffer<String>(1_000_000);
}
map(LongWritable key, Text value) {
buffer.append(value.toString())
if (buffer.isFull()) {
new Transaction(buffer).doIt()
buffer.clear()
}
}
cleanup() {
new Transaction(buffer).doIt()
buffer.clear()
}
By default a mapper is created per HDFS block. If you think this is too much or little, mapred.(max|min).split.size variables allow to increase or decrease the parallelism.
Basically, while convenient NLineInputFormat is too fine grained for what you need. You can achieve almost the same thing using TextInputFormat and playing with *.split.size which does not involve reading the files to create the splits.

Related

Why is BufferedReader read() much slower than readLine()?

I need to read a file one character at a time and I'm using the read() method from BufferedReader. *
I found that read() is about 10x slower than readLine(). Is this expected? Or am I doing something wrong?
Here's a benchmark with Java 7. The input test file has about 5 million lines and 254 million characters (~242 MB) **:
The read() method takes about 7000 ms to read all the characters:
#Test
public void testRead() throws IOException, UnindexableFastaFileException{
BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));
long t0= System.currentTimeMillis();
int c;
while( (c = fa.read()) != -1 ){
//
}
long t1= System.currentTimeMillis();
System.err.println(t1-t0); // ~ 7000 ms
}
The readLine() method takes only ~700 ms:
#Test
public void testReadLine() throws IOException{
BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));
String line;
long t0= System.currentTimeMillis();
while( (line = fa.readLine()) != null ){
//
}
long t1= System.currentTimeMillis();
System.err.println(t1-t0); // ~ 700 ms
}
* Practical purpose: I need to know the length of each line, including the newline characters (\n or \r\n) AND the line length after stripping them. I also need to know if a line starts with the > character. For a given file this is done only once at the start of the program. Since EOL chars are not returned by BufferedReader.readLine() I'm resorting on the read() method. If there are better ways of doing this, please say.
** The gzipped file is here http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz. For those who may be wondering, I'm writing a class to index fasta files.

The important thing when analyzing performance is to have a valid benchmark before you start. So let's start with a simple JMH benchmark that shows what our expected performance after warmup would be.
One thing we have to consider is that since modern operating systems like to cache file data that is accessed regularly we need some way to clear the caches between tests. On Windows there's a small little utility that does just this - on Linux you should be able to do it by writing to some pseudo file somewhere.
The code then looks as follows:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
#BenchmarkMode(Mode.AverageTime)
#Fork(1)
public class IoPerformanceBenchmark {
private static final String FILE_PATH = "test.fa";
#Benchmark
public int readTest() throws IOException, InterruptedException {
clearFileCaches();
int result = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
int value;
while ((value = reader.read()) != -1) {
result += value;
}
}
return result;
}
#Benchmark
public int readLineTest() throws IOException, InterruptedException {
clearFileCaches();
int result = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
String line;
while ((line = reader.readLine()) != null) {
result += line.chars().sum();
}
}
return result;
}
private void clearFileCaches() throws IOException, InterruptedException {
ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
pb.inheritIO();
pb.start().waitFor();
}
}
and if we run it with
chcp 65001 # set codepage to utf-8
mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar
we get the following results (about 2 seconds are needed to clear the caches for me and I'm running this on a HDD so that's why it's a good deal slower than for you):
Benchmark Mode Cnt Score Error Units
IoPerformanceBenchmark.readLineTest avgt 20 3.749 ± 0.039 s/op
IoPerformanceBenchmark.readTest avgt 20 3.745 ± 0.023 s/op
Surprise! As expected there's no performance difference here at all after the JVM has settled into a stable mode. But there is one outlier in the readCharTest method:
# Warmup Iteration 1: 6.186 s/op
# Warmup Iteration 2: 3.744 s/op
which is exaclty the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here or that the JIT is only running too late to make a difference on the first iteration.
Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).
Solving such a problem is not easy and there are no general solutions, although there are ways to handle this. One easy test to see if we're on the right track is to run the code with the -Xcomp option which forces HotSpot to compile every method on the first invocation. And indeed doing so, causes the large delay at the first invocation to disappear:
# Warmup Iteration 1: 3.965 s/op
# Warmup Iteration 2: 3.753 s/op
Possible solution
Now that we have a good idea what the actual problem is (my guess is still all those locks neither being coalesced nor using the efficient biased locks implementation), the solution is rather straight forward and simple: Reduce the number of function calls (so yes we could've arrived at this solution without everything above, but it's always nice to have a good grip on the problem and there might have been a solution that didn't involve changing much code).
The following code runs consistently faster than either of the other two - you can play with the array size but it's surprisingly unimportant (presumably because contrary to the other methods read(char[]) does not have to acquire a lock so the cost per call is lower to begin with).
private static final int BUFFER_SIZE = 256;
private char[] arr = new char[BUFFER_SIZE];
#Benchmark
public int readArrayTest() throws IOException, InterruptedException {
clearFileCaches();
int result = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
int charsRead;
while ((charsRead = reader.read(arr)) != -1) {
for (int i = 0; i < charsRead; i++) {
result += arr[i];
}
}
}
return result;
}
This is most likely good enough performance wise, but if you wanted to improve performance even further using a file mapping might (wouldn't count on too large an improvement in a case such as this, but if you know that your text is always ASCII, you could make some further optimizations) further help performance.

So this is the practical answer to my own question: Don't use BufferedReader.read() use FileChannel instead. (Obviously I'm not answering the WHY I put in the title). Here's the quick and dirty benchmark, hopefully others will find it useful:
#Test
public void testFileChannel() throws IOException{
FileChannel fileChannel = FileChannel.open(Paths.get("chr1.fa"));
long n= 0;
int noOfBytesRead = 0;
long t0= System.nanoTime();
while(noOfBytesRead != -1){
ByteBuffer buffer = ByteBuffer.allocate(10000);
noOfBytesRead = fileChannel.read(buffer);
buffer.flip();
while ( buffer.hasRemaining() ) {
char x= (char)buffer.get();
n++;
}
}
long t1= System.nanoTime();
System.err.println((float)(t1-t0) / 1e6); // ~ 250 ms
System.err.println("nchars: " + n); // 254235640 chars read
}
With ~250 ms to read the whole file char by char, this strategy is considerably faster than BufferedReader.readLine() (~700 ms), let alone read(). Adding if statements in the loop to check for x == '\n' and x == '>' makes little difference. Also putting a StringBuilder to reconstruct lines doesn't affect the timing too much. So this is plenty good for me (at least for now).
Thanks to #Marco13 for mentioning FileChannel.

Java JIT optimizes away empty loop bodies, so your loops actually look like this:
while((c = fa.read()) != -1);
and
while((line = fa.readLine()) != null);
I suggest you read up on benchmarking here and the optimization of the loops here.
As to why the time taken differs:
Reason one (This only applies if the bodies of the loops contain code): In the first example, you're doing one operation per line, in the second, you're doing one per character. This this adds up the more lines/characters you have.
while((c = fa.read()) != -1){
//One operation per character.
}
while((line = fa.readLine()) != null){
//One operation per line.
}
Reason two: In the class BufferedReader, the method readLine() doesn't use read() behind the scenes - it uses its own code. The method readLine() does less operations per character to read a line, than it would take to read a line with the read() method - this is why readLine() is faster at reading an entire file.
Reason three: It takes more iterations to read each character, than it does to read each line (unless each character is on a new line); read() is called more times than readLine().

Thanks #Voo for the correction. What I mentioned below is correct from FileReader#read() v/s BufferedReader#readLine() point of view BUT not correct from BufferedReader#read() v/s BufferedReader#readLine() point of view, so I have striked-out the answer.
Using read() method on BufferedReader is not a good idea, it wouldn't cause you any harm but it certainly wastes the purpose of class.
Whole purpose in life of BufferedReader is to reduce the i/o by buffering the content. You can read here in Java tutorials. You may also notice that read() method in BufferedReader is actually inherited from Reader while readLine() is BufferedReader's own method.
If you want to use read() method then I would say you better use FileReader, which is meant for that purpose. You can read here in Java tutorials.
So, I think answer to your question is very simple (without going into bench-marking and all that explainations) -
Each read() is handled by underlying OS and triggers disk access, network activity, or some other operation that is relatively expensive.
When you use readLine() then you save all these overheads, so readLine() will always be faster than read(), may not be substantially for small data but faster.

It is not surprising to see this difference if you think about it. One test is iterating the lines in a text file, while the other is iterating characters.
Unless each line contains one character, it is expected that the readLine() is way faster than the read() method.(although as pointed out by the comments above, it is arguable since a BufferedReader buffers the input, while the physical file reading might not be the only performance taking operation)
If you really want to test the difference between the 2 I would suggest a setup where you iterate over each character in both tests. E.g. something like:
void readTest(BufferedReader r)
{
int c;
StringBuilder b = new StringBuilder();
while((c = r.read()) != -1)
b.append((char)c);
}
void readLineTest(BufferedReader r)
{
String line;
StringBuilder b = new StringBuilder();
while((line = b.readLine())!= null)
for(int i = 0; i< line.length; i++)
b.append(line.charAt(i));
}
Besides the above, please use a "Java performance diagnostic tool" to benchmark your code. Also, readup on how to microbenchmark java code.

According to the documentation:
Every read() method call makes an expensive system call.
Every readLine() method call still makes an expensive system call, however, for more bytes at once, so there are fewer calls.
Similar situation happens when we make database update command for each record we want to update, versus a batch update, where we make one call for all the records.

Best method for parallel log aggregation

My program needs to analyze a bunch of log files daily which are generated on a hourly basis from each application server.
So if I have 2 app servers I will be processing 48 files (24 files * 2 app servers).
file sizes range 100-300 mb. Each line in every file is a log entry which is of the format
[identifier]-[number of pieces]-[piece]-[part of log]
for example
xxx-3-1-ABC
xxx-3-2-ABC
xxx-3-3-ABC
These can be distributed over the 48 files which I mentioned, I need to merge these logs like so
xxx-PAIR-ABCABCABC
My implementation uses a thread pool to read through files in parallel and then aggregate them using a ConcurrentHashMap
I define a class LogEvent.scala
class LogEvent (val id: String, val total: Int, var piece: Int, val json: String) {
var additions: Long = 0
val pieces = new Array[String](total)
addPiece(json)
private def addPiece (json: String): Unit = {
pieces(piece) = json
additions += 1
}
def isDone: Boolean = {
additions == total
}
def add (slot: Int, json: String): Unit = {
piece = slot
addPiece(json)
}
The main processing happens over multiple threads and the code is something on the lines of
//For each file
val logEventMap = new ConcurrentHashMap[String, LogEvent]().asScala
Future {
Source.fromInputStream(gis(file)).getLines().foreach {
line =>
//Extract the id part of the line
val idPart: String = IDPartExtractor(line)
//Split line on '-'
val split: Array[String] = idPart.split("-")
val id: String = split(0) + "-" + split(1)
val logpart: String = JsonPartExtractor(line)
val total = split(2) toInt
val piece = split(3) toInt
def slot: Int = {
piece match {
case x if x - 1 < 0 => 0
case _ => piece - 1
}
}
def writeLogEvent (logEvent: LogEvent): Unit = {
if (logEvent.isDone) {
//write to buffer
val toWrite = id + "-PAIR-" + logEvent.pieces.mkString("")
logEventMap.remove(logEvent.id)
writer.writeLine(toWrite)
}
}
//The LOCK
appendLock {
if (!logEventMap.contains(id)) {
val logEvent = new LogEvent(id, total, slot, jsonPart)
logEventMap.put(id, logEvent)
//writeLogEventToFile()
}
else {
val logEvent = logEventMap.get(id).get
logEvent.add(slot, jsonPart)
writeLogEvent(logEvent)
}
}
}
}
The main thread blocks till all the futures complete
Using this approach I have been able to cut the processing time from an hour+ to around 7-8 minutes.
My questions are as follows -
Can this be done in a better way, I am reading multiple files using different threads and I need to lock at the block where the aggregation happens, are there better ways of doing this?
The Map grows very fast in memory, any suggestions for off heap storage for such a use case
Any other feedback.
Thanks

A common way to do this is to sort each file and then merge the sorted files. The result is a single file that has the individual items in the order that you want them. Your program then just needs to do a single pass through the file, combining adjacent matching items.
This has some very attractive benefits:
The sort/merge is done by standard tools that you don't have to write
Your aggregator program is very simple. Or, there might even be a standard tool that will do it.
Memory requirements are lessened. The sort/merge programs know how to manage memory, and your aggregation program's memory requirements are minimal.
There are, of course some drawbacks. You'll use more disk space and the process will be somewhat slower due to the I/O cost.
When I'm faced with something like this, I almost always go with using the standard tools and a simple aggregator program. The increased performance I get from a custom program just doesn't justify the time it takes to develop the thing.

For this sort of thing, if you can, use Splunk, if not, copy what it does which is index the log files for aggregation on demand at a later point.
For off heap storage, look at distributed caches - Hazelcast or Coherence. Both support provide java.util.Map implementations that are stored over multiple JVMs.

file comparison with memory consideration

I want to compare two files, one is in file system and the other is being downloaded from a HTTP URL.
We have tried to compare by byte[] arrays (we used HTTPRequestBuilder by Apache), but the concern is that the files may be too large and they may exhaust the memory. Do we have any good alternates.

You can compare the contents from two InputStream objects by reading just a buffer at a time. You'll need to read data as and when you "run out" from each stream, noting that you when you call read you may not end up actually reading a full buffer.
The two streams are equal if each byte-by-byte comparison from the buffers is equal and the streams run out of data at the same time. I suspect the code may be slightly fiddly, but it shouldn't be too bad.
In fact, for simpler code, if you wrap each InputStream in a BufferedInputStream, you could probably just compare byte-by-byte (calling the parameterless read() method on each iteration) without losing too much performance:
public boolean equals(InputStream x, InputStream y)
{
// TODO: Only wrap them if they're not already buffered
x = new BufferedInputStream(x);
y = new BufferedInputStream(y);
while (true)
{
int xValue = x.read();
int yValue = y.read();
if (xValue != yValue)
{
return false;
}
if (xValue == -1)
{
// Reached the end of both streams at the same time
return true;
}
}
}

Read large files in Java

I need the advice from someone who knows Java very well and the memory issues.
I have a large file (something like 1.5GB) and I need to cut this file in many (100 small files for example) smaller files.
I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding the memory, or tips how to do it faster.
My file contains text, it is not binary and I have about 20 character per line.

To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly much more memory as some implicitly seem to suggest. It will at highest only hit a few % from performance. The same applies on using NIO. It will only improve scalability, not memory use. It will only become interesting when you've hundreds of threads running on the same file.
Just loop through the file, write every line immediately to other file as you read in, count the lines and if it reaches 100, then switch to next file, etcetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
close(writer);
close(reader);
}

First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io,

You can consider using memory-mapped files, via FileChannels .
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness

This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.

Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command via your java program. While I haven't tested, I imagine it perform faster than whatever Java IO implementation you could come up with.

You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html

Yes.
I also think that using read() with arguments like read(Char[], int init, int end) is a better way to read a such a large file
(Eg : read(buffer,0,buffer.length))
And I also experienced the problem of missing values of using the BufferedReader instead of BufferedInputStreamReader for a binary data input stream. So, using the BufferedInputStreamReader is a much better in this like case.

package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* #author Naresh Bhabat
*
Following implementation helps to deal with extra large files in java.
This program is tested for dealing with 2GB input file.
There are some points where extra logic can be added in future.
Pleasenote: if we want to deal with binary input file, then instead of reading line,we need to read bytes from read file object.
It uses random access file,which is almost like streaming API.
* ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);
* for 10 threads:Total time required for reading and writing the text in
* :seconds 349.317
*
* For 100:Total time required for reading the text and writing : seconds 464.042
*
* For 1000 : Total time required for reading and writing text :466.538
* For 10000 Total time required for reading and writing in seconds 479.701
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* #throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
#Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* #param filePath
* #param data
* #param position
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}

Don't use read without arguments.
It's very slow.
Better read it to buffer and move it to file quickly.
Use bufferedInputStream because it supports binary reading.
And it's all.

Unless you accidentally read in the whole input file instead of reading it line by line, then your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines and write it to 100 different files one line in each and make the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.

Read file at a certain rate in Java

Is there an article/algorithm on how I can read a long file at a certain rate?
Say I do not want to pass 10 KB/sec while issuing reads.

A simple solution, by creating a ThrottledInputStream.
This should be used like this:
final InputStream slowIS = new ThrottledInputStream(new BufferedInputStream(new FileInputStream("c:\\file.txt"),8000),300);
300 is the number of kilobytes per second. 8000 is the block size for BufferedInputStream.
This should of course be generalized by implementing read(byte b[], int off, int len), which will spare you a ton of System.currentTimeMillis() calls. System.currentTimeMillis() is called once for each byte read, which can cause a bit of an overhead. It should also be possible to store the number of bytes that can savely be read without calling System.currentTimeMillis().
Be sure to put a BufferedInputStream in between, otherwise the FileInputStream will be polled in single bytes rather than blocks. This will reduce the CPU load form 10% to almost 0. You will risk to exceed the data rate by the number of bytes in the block size.
import java.io.InputStream;
import java.io.IOException;
public class ThrottledInputStream extends InputStream {
private final InputStream rawStream;
private long totalBytesRead;
private long startTimeMillis;
private static final int BYTES_PER_KILOBYTE = 1024;
private static final int MILLIS_PER_SECOND = 1000;
private final int ratePerMillis;
public ThrottledInputStream(InputStream rawStream, int kBytesPersecond) {
this.rawStream = rawStream;
ratePerMillis = kBytesPersecond * BYTES_PER_KILOBYTE / MILLIS_PER_SECOND;
}
#Override
public int read() throws IOException {
if (startTimeMillis == 0) {
startTimeMillis = System.currentTimeMillis();
}
long now = System.currentTimeMillis();
long interval = now - startTimeMillis;
//see if we are too fast..
if (interval * ratePerMillis < totalBytesRead + 1) { //+1 because we are reading 1 byte
try {
final long sleepTime = ratePerMillis / (totalBytesRead + 1) - interval; // will most likely only be relevant on the first few passes
Thread.sleep(Math.max(1, sleepTime));
} catch (InterruptedException e) {//never realized what that is good for :)
}
}
totalBytesRead += 1;
return rawStream.read();
}
}

The crude solution is just to read a chunk at a time and then sleep eg 10k then sleep a second. But the first question I have to ask is: why? There are a couple of likely answers:
You don't want to create work faster than it can be done; or
You don't want to create too great a load on the system.
My suggestion is not to control it at the read level. That's kind of messy and inaccurate. Instead control it at the work end. Java has lots of great concurrency tools to deal with this. There are a few alternative ways of doing this.
I tend to like using a producer consumer pattern for soling this kind of problem. It gives you great options on being able to monitor progress by having a reporting thread and so on and it can be a really clean solution.
Something like an ArrayBlockingQueue can be used for the kind of throttling needed for both (1) and (2). With a limited capacity the reader will eventually block when the queue is full so won't fill up too fast. The workers (consumers) can be controlled to only work so fast to also throttle the rate covering (2).

while !EOF
store System.currentTimeMillis() + 1000 (1 sec) in a long variable
read a 10K buffer
check if stored time has passed
if it isn't, Thread.sleep() for stored time - current time
Creating ThrottledInputStream that takes another InputStream as suggested would be a nice solution.

If you have used Java I/O then you should be familiar with decorating streams. I suggest an InputStream subclass that takes another InputStream and throttles the flow rate. (You could subclass FileInputStream but that approach is highly error-prone and inflexible.)
Your exact implementation will depend upon your exact requirements. Generally you will want to note the time your last read returned (System.nanoTime). On the current read, after the underlying read, wait until sufficient time has passed for the amount of data transferred. A more sophisticated implementation may buffer and return (almost) immediately with only as much data as rate dictates (be careful that you should only return a read length of 0 if the buffer is of zero length).

You can use a RateLimiter. And make your own implementation of the read in InputStream. An example of this can be seen bellow
public class InputStreamFlow extends InputStream {
private final InputStream inputStream;
private final RateLimiter maxBytesPerSecond;
public InputStreamFlow(InputStream inputStream, RateLimiter limiter) {
this.inputStream = inputStream;
this.maxBytesPerSecond = limiter;
}
#Override
public int read() throws IOException {
maxBytesPerSecond.acquire(1);
return (inputStream.read());
}
#Override
public int read(byte[] b) throws IOException {
maxBytesPerSecond.acquire(b.length);
return (inputStream.read(b));
}
#Override
public int read(byte[] b, int off, int len) throws IOException {
maxBytesPerSecond.acquire(len);
return (inputStream.read(b,off, len));
}
}
if you want to limit the flow by 1 MB/s you can get the input stream like this:
final RateLimiter limiter = RateLimiter.create(RateLimiter.ONE_MB);
final InputStreamFlow inputStreamFlow = new InputStreamFlow(originalInputStream, limiter);

It depends a little on whether you mean "don't exceed a certain rate" or "stay close to a certain rate."
If you mean "don't exceed", you can guarantee that with a simple loop:
while not EOF do
read a buffer
Thread.wait(time)
write the buffer
od
The amount of time to wait is a simple function of the size of the buffer; if the buffer size is 10K bytes, you want to wait a second between reads.
If you want to get closer than that, you probably need to use a timer.
create a Runnable to do the reading
create a Timer with a TimerTask to do the reading
schedule the TimerTask n times a second.
If you're concerned about the speed at which you're passing the data on to something else, instead of controlling the read, put the data into a data structure like a queue or circular buffer, and control the other end; send data periodically. You need to be careful with that, though, depending on the data set size and such, because you can run into memory limitations if the reader is very much faster than the writer.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Efficiency of NLineInputFormat's InputSplit calculations - java

Related

Why is BufferedReader read() much slower than readLine()?

Best method for parallel log aggregation

file comparison with memory consideration

Read large files in Java

Read file at a certain rate in Java

Categories

Resources