I found some surprising behavior with Java parallel streams. I made my own Spliterator, and the resulting parallel stream gets divided up until each stream has only one element in it. That seems way too small, and I wonder what I'm doing wrong. I'm hoping there are some characteristics I can set to correct this.
Here's my test code. The Float here is just a dummy payload; my real stream class is somewhat more complicated.
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 10 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total: " + total );
}
This code will continually split this stream until each Spliterator has only one element. That seems far too fine-grained to be efficient.
Output:
run:
Split on count: 10
Split on count: 5
Split on count: 3
Split on count: 5
Split on count: 2
Split on count: 2
Split on count: 3
Split on count: 2
Split on count: 2
Total: 5.164293184876442
BUILD SUCCESSFUL (total time: 0 seconds)
Here's the code of the Spliterator. My main concern is what characteristics I should be using, but perhaps there's a problem somewhere else?
public class TestingSpliterator implements Spliterator<Float> {
int count;
int splits;
public TestingSpliterator( int count ) {
this.count = count;
}
@Override
public boolean tryAdvance( Consumer<? super Float> cnsmr ) {
if( count > 0 ) {
cnsmr.accept( (float)Math.random() );
count--;
return true;
} else
return false;
}
@Override
public Spliterator<Float> trySplit() {
System.err.println( "Split on count: " + count );
if( count > 1 ) {
splits++;
int half = count / 2;
TestingSpliterator newSplit = new TestingSpliterator( count - half );
count = half;
return newSplit;
} else
return null;
}
@Override
public long estimateSize() {
return count;
}
@Override
public int characteristics() {
return IMMUTABLE | SIZED;
}
}
So how can I get the stream to be split into much larger chunks? I was hoping something in the neighborhood of 10,000 to 50,000 elements would be better.
I know I can return null from the trySplit() method, but that seems like a backwards way of doing it. It seems like the system should have some notion of number of cores, current load, and how complex the code is that uses the stream, and adjust itself accordingly. In other words, I want the stream chunk size to be externally configured, not internally fixed by the stream itself.
EDIT: re. Holger's answer below, when I increase the number of elements in the original stream, the stream splits are somewhat fewer, so StreamSupport does stop splitting eventually.
At an initial stream size of 100 elements, StreamSupport stops splitting when it reaches a stream size of 2 (the last line I see on my screen is Split on count: 4).
And for an initial stream size of 1000 elements, the final size of the individual stream chunks is about 32 elements.
Edit part deux: After looking at the output of the above, I changed my code to list out the individual Spliterators created. Here are the changes:
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 100 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total Spliterators: " + testers.size() );
for( TestingSpliterator t : testers ) {
System.out.println( "Splits: " + t.splits );
}
}
And to the TestingSpliterator's ctor:
static Queue<TestingSpliterator> testers = new ConcurrentLinkedQueue<>();
public TestingSpliterator( int count ) {
this.count = count;
testers.add( this ); // OUCH! 'this' escape
}
The result of this code is that the first Spliterator gets split 5 times. The next Spliterator gets split 4 times. The next set of Spliterators gets split 3 times, etc. The result is that 36 Spliterators get made and the stream is split into that many parts. On a typical desktop system this seems to be the number of chunks the API thinks is best for parallel operations.
I'm going to accept Holger's answer below, which is essentially that the StreamSupport class is doing the right thing, don't worry, be happy. Part of the issue for me was that I was doing my early testing on very small stream sizes and I was surprised at the number of splits. Don't make the same mistake yourself.
You are looking at it from the wrong angle. The implementation did not split “until each spliterator has one element”; rather, it split “until having ten spliterators”.
A single spliterator instance can only be processed by one thread. A spliterator is not required to support splitting after its traversal has been started. Therefore any splitting opportunity that has not been used beforehand may lead to limited parallel processing capabilities afterwards.
It’s important to keep in mind that the Stream implementation received a ToDoubleFunction with an unknown workload¹. It doesn’t know that it is as simple as Float::doubleValue in your case. It could be a function taking a minute to evaluate, and then having a spliterator per CPU core would be just right. Even having more spliterators than CPU cores is a valid strategy to handle the possibility that some evaluations take significantly longer than others.
A typical number of initial spliterators will be “number of CPU cores” × 4, though there might be more split operations later, once more knowledge about the actual workload exists. When your input data has fewer elements than that number, it’s not surprising that it gets split down until one element per spliterator is left.
You may try with new TestingSpliterator( 10000 ) or 1000 or 100 to see that the number of splits will not change significantly, once the implementation assumes it has enough chunks to keep all CPU cores busy.
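For instance, a quick sketch of such an experiment (not part of the original answer; it assumes it lives in the same package as TestingSpliterator and reuses the static testers queue from the question's second edit):

import java.util.stream.StreamSupport;

public class SplitCountDemo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int size : new int[] { 10, 100, 1_000, 10_000, 100_000 }) {
            TestingSpliterator.testers.clear();   // reset the bookkeeping queue between runs
            double total = StreamSupport
                    .stream(new TestingSpliterator(size), true)
                    .mapToDouble(Float::doubleValue)
                    .sum();
            System.out.printf("size=%d, cores=%d, spliterators=%d (sum=%f)%n",
                    size, cores, TestingSpliterator.testers.size(), total);
        }
    }
}

The number of spliterators should climb quickly for tiny inputs and then level off around a small multiple of the core count, regardless of how much larger the input gets.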
Since your spliterator does not know anything about the per-element workload of the consuming stream either, you shouldn’t be concerned about this. If you can smoothly support splitting down to single elements, just do that.
¹ It doesn’t have special optimizations for the case that no operations have been chained, though.
Unless I am missing the obvious, you could always pass a bufferSize in the constructor and use that for your trySplit:
@Override
public Spliterator<Float> trySplit() {
if( count > 1 ) {
splits++;
if(count > bufferSize) {
count = count - bufferSize;
return new TestingSpliterator( bufferSize, bufferSize);
}
}
return null;
}
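The two-argument constructor used above is not shown in the answer; presumably TestingSpliterator just gains a field for the chunk size, roughly like this (a sketch, the field name is an assumption):

// Added to TestingSpliterator (sketch): the chunk size that trySplit() hands out
int bufferSize;

public TestingSpliterator(int count, int bufferSize) {
    this.count = count;
    this.bufferSize = bufferSize;
}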
And with this:
TestingSpliterator splits = new TestingSpliterator(12, 5);
Stream<Float> test = StreamSupport.stream(splits, true);
test.map(x -> new AbstractMap.SimpleEntry<>(
x.doubleValue(),
Thread.currentThread().getName()))
.collect(Collectors.groupingBy(
Map.Entry::getValue,
Collectors.mapping(
Map.Entry::getKey,
Collectors.toList())))
.forEach((x, y) -> System.out.println("Thread : " + x + " processed : " + y));
You would see that there are 3 threads. Two of them process 5 elements each and one processes 2.
I have a simple program:
public class Test {
public static void main(String[] args) {
for (int i = 0; i < 1_000_000; i++) {
System.out.print(1);
}
}
}
And launched it under a profiler. Here are the results:
I assume that memory grows because of this method calls:
public void print(int i) {
write(String.valueOf(i));
}
Is there a way to print int values to the console without this memory growth?
On my local machine I tried adding if (i % 10000 == 0) System.gc(); to the loop and memory consumption evened out. But the system that checks the solution does not accept it. I tried changing the step value, but it still fails either on memory (it should use less than 20 MB) or on time (< 1 sec).
EDIT: I tried this
String str = "hz";
for (int i = 0; i < 1_000_0000; i++) {
System.out.print(str);
}
But the result is the same:
EDIT 2: if I write this code
public class Test {
public static void main(String[] args) {
byte[] bytes = "hz".getBytes();
for (int i = 0; i < 1_000_0000; i++) {
System.out.write(bytes, 0, bytes.length);
}
}
}
I get this:
Therefore, I do not believe the allocations come from the JVM doing its own background work; otherwise they would show up in both cases.
You need to convert the int into characters without generating a new String each time you do it. This could be done in a couple of ways:
Write a custom "int to characters" method that converts to ASCII bytes in a byte[] (see @AndyTurner's example code). Then write the byte[]. And repeat.
Use ByteBuffer, fill it directly using a custom "int to characters" converter method, and use a Channel to output the bytes when the buffer is full. And repeat.
If done correctly, you should be able to output the numbers without generating any garbage ... other than your once-off buffers.
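A minimal sketch of the second option (the class name and the simple digit-conversion helper are mine, not from the original answer; the helper only handles non-negative values, which is all the question's loop needs):

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class NoGarbagePrinter {
    public static void main(String[] args) throws IOException {
        FileChannel out = new FileOutputStream(FileDescriptor.out).getChannel();
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // once-off buffer

        for (int i = 0; i < 1_000_000; i++) {
            if (buf.remaining() < 12) {   // worst case for an int: 11 digits plus sign
                flush(out, buf);
            }
            appendInt(buf, 1);            // same payload as in the question
        }
        flush(out, buf);
    }

    // Append the decimal ASCII representation of a non-negative int, no String created.
    private static void appendInt(ByteBuffer buf, int value) {
        if (value >= 10) {
            appendInt(buf, value / 10);
        }
        buf.put((byte) ('0' + value % 10));
    }

    private static void flush(FileChannel out, ByteBuffer buf) throws IOException {
        buf.flip();
        while (buf.hasRemaining()) {
            out.write(buf);
        }
        buf.clear();
    }
}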
Note that System.out is a PrintStream wrapping a BufferedOutputStream wrapping a FileOutputStream. And, when you output a String directly or indirectly using one of the print methods, that actually goes through a BufferedWriter that is internal to the PrintStream. It is complicated ... and apparently the print(String) method generates garbage somewhere in that complexity.
Concerning your EDIT 1: when you repeatedly print out a constant string, you are still apparently generating garbage. I was surprised by this, but I guess it is happening in the BufferedWriter.
Concerning your EDIT 2: when you repeatedly write from a byte[], the garbage generation all but disappears. This confirms that at least one of my suggestions will work.
However, since you are monitoring the JVM with an external profiler, your JVM is also running an agent that is periodically sending updates to your profiler. That agent will most likely be generating a small amount of garbage. And there could be other sources of garbage in the JVM; e.g. if you have JVM GC logging enabled.
Since you have discovered that printing a byte[] keeps memory allocation within the required bounds, you can use this fact:
Allocate a byte array the length of the ASCII representation of Integer.MIN_VALUE (11 characters, the longest an int can be). Then you can fill the array backwards to convert a number i:
int p = buffer.length;
if (i == Integer.MIN_VALUE) {
    // -Integer.MIN_VALUE overflows, so peel off the last digit while i is still negative
    buffer[--p] = (byte) ('0' - i % 10);
    i /= 10;
}
boolean neg = i < 0;
if (neg) i = -i;
do {
    buffer[--p] = (byte) ('0' + i % 10);
    i /= 10;
} while (i != 0);
if (neg) buffer[--p] = '-';
Then write this to your stream:
out.write(buffer, p, buffer.length - p);
You can reuse the same buffer to write as many numbers as you wish.
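Put together, that might look roughly like this (a sketch wrapping the snippet above in a reusable class; the IntWriter/writeInt names and the buffered stream setup are mine, not part of the original answer):

import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class IntWriter {
    // Longest possible int: "-2147483648" is 11 bytes.
    private final byte[] buffer = new byte[11];
    private final OutputStream out;

    IntWriter(OutputStream out) {
        this.out = out;
    }

    void writeInt(int i) throws IOException {
        int p = buffer.length;
        if (i == Integer.MIN_VALUE) {          // -MIN_VALUE overflows, peel one digit first
            buffer[--p] = (byte) ('0' - i % 10);
            i /= 10;
        }
        boolean neg = i < 0;
        if (neg) i = -i;
        do {
            buffer[--p] = (byte) ('0' + i % 10);
            i /= 10;
        } while (i != 0);
        if (neg) buffer[--p] = '-';
        out.write(buffer, p, buffer.length - p);
    }

    public static void main(String[] args) throws IOException {
        OutputStream out = new BufferedOutputStream(new FileOutputStream(FileDescriptor.out));
        IntWriter w = new IntWriter(out);
        for (int i = 0; i < 1_000_000; i++) {
            w.writeInt(1);                     // same workload as the question
        }
        out.flush();
    }
}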
The pattern of memory usage is typical for Java. Your code is irrelevant. To control Java memory usage you need to use some -X parameters; for example "-Xms512m -Xmx512m" will set both the minimum and maximum heap size to 512 MB. BTW, in order to minimize the sawtooth-like memory graph, it is recommended to set the min and max size to the same value. Those params can be given to java on the command line when you run your program, for example:
java -Xms512m -Xmx512m myProgram
There are other ways as well. Here is one link where you can read more about it: Oracle docs. There are other params that control stack size and some other things. The code itself, if written without memory usage considerations, may influence memory usage as well, but in your case it's too trivial a piece of code to do anything. Most memory concerns are addressed by configuring the JVM memory usage params.
I am playing around with RxJava (RxKotlin to be precise). Here I have the following Observables:
fun metronome(ms: Int) = observable<Int> {
var i = 0;
while (true) {
if (ms > 0) {
Thread.sleep(ms.toLong())
}
if (it.isUnsubscribed()) {
break
}
it.onNext(++i)
}
}
And I'd like to have a few of them merged and running concurrently. They ignore backpressure so the backpressure operators have to be applied to them.
Then I create
val cores = Runtime.getRuntime().availableProcessors()
val threads = Executors.newFixedThreadPool(cores)
val scheduler = Schedulers.from(threads)
And then I merge the metronomes:
val o = Observable.merge(listOf(metronome(0),
metronome(1000).map { "---------" })
.map { it.onBackpressureBlock().subscribeOn(scheduler) })
.take(5000, TimeUnit.MILLISECONDS)
The first one is supposed to emit items incessantly.
When I run this, in the last 3 seconds of the run I get the following output:
...
[RxComputationThreadPool-5]: 369255
[RxComputationThreadPool-5]: 369256
[RxComputationThreadPool-5]: 369257
[RxComputationThreadPool-5]: ---------
[RxComputationThreadPool-5]: ---------
[RxComputationThreadPool-5]: ---------
It seems that the Observables are subscribed on the same single thread, and the first Observable is blocked for 3+ seconds.
But when I swap onBackpressureBlock() and subscribeOn(scheduler) calls the output becomes what I expected, the output gets merged during the whole execution.
It's obvious to me that calls order matters in RxJava, but I don't quite understand what happens in this particular situation.
So what happens when the onBackpressureBlock operator is applied before subscribeOn, and what happens when it is applied after?
The onBackpressureBlock operator is a failed experiment; it requires care about where you apply it. For example, subscribeOn().onBackpressureBlock() works, but not the other way around.
RxJava has a non-blocking periodic timer called interval, so you don't need to roll your own.
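For reference, a minimal sketch of that interval-based alternative, written against RxJava 1.x in plain Java rather than Kotlin (the class name and the merged payloads are mine, not from the original answer):

import java.util.concurrent.TimeUnit;

import rx.Observable;

public class IntervalDemo {
    public static void main(String[] args) throws InterruptedException {
        // interval() emits on the computation scheduler, so there is no
        // hand-rolled while/Thread.sleep loop blocking a thread.
        Observable<String> fast = Observable.interval(1, TimeUnit.MILLISECONDS)
                .map(i -> "fast " + i);
        Observable<String> slow = Observable.interval(1, TimeUnit.SECONDS)
                .map(i -> "---------");

        Observable.merge(fast, slow)
                .take(5, TimeUnit.SECONDS)
                .subscribe(s -> System.out.println(
                        "[" + Thread.currentThread().getName() + "]: " + s));

        Thread.sleep(5500); // keep the JVM alive until take() completes
    }
}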
Update
I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.
Update #2
I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files; no solution outside that scope is possible. I have modified the sample data a little bit. I would like to point out the following key variables.
Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the logs.
Script - Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.
File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful run of the above example would look like this:
------START------
Thread(10) starting myscript.script on myfile.file
Thread(10) finished myscript.script on myfile.file
------END-------
An unsuccessful run of the above example would look like this:
------START------
Thread(10) starting myscript.script on myfile.file
------END------
Before addressing my query I'll give a rundown of the code used and the desired behavior.
Summary
I'm currently parsing large log files (on average 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request, which seemed to work on paper but not so much in code (I must've missed something blatantly obvious). I would like to state in advance that the code is not in any shape or form optimized; right now I simply want to get it to work.
In this log file you can see that certain threads hang: they start but never finish. The number of possible thread IDs varies. Here is some pseudocode:
REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" // in Java
Set started, finished
for (int i = log.size() - 1; i >= 0; i--) {
    if (group(2).contains("starting"))
        started.add(log.get(i))
    else if (group(2).contains("finished"))
        finished.add(log.get(i))
}
started.removeAll(finished);
Search Hung Threads
Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread"))
JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);
}
started.removeAll(finished);
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) {
JLogger.appendLineToConsole(s);
commonTextArea.append(s+"\n");
}
Has Thread Hung
public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();
    if (hasHung) {  // group() may only be called after a successful find()
        if (m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);
    }
    System.out.println("Started size: " + started.size());
    System.out.println("Finished size: " + finished.size());
    return hasHung;
}
Example Data
ScriptExecThread(1) started on afile.xyz
ScriptExecThread(2) started on bfile.abc
ScriptExecThread(3) started on cfile.zyx
ScriptExecThread(4) started on dfile.zxy
ScriptExecThread(5) started on efile.yzx
ScriptExecThread(1) finished on afile.xyz
ScriptExecThread(2) finished on bfile.abc
ScriptExecThread(3) finished on cfile.zyx
ScriptExecThread(4) finished on dfile.zxy
ScriptExecThread(5) finished on efile.yzy
ScriptExecThread(1) started on bfile.abc
ScriptExecThread(2) started on dfile.zxy
ScriptExecThread(3) started on afile.xyz
ScriptExecThread(1) finished on bfile.abc
END OF LOG
If you examine this, you'll notice that threads 2 and 3 started but failed to finish (the reason is not necessary, I simply need to get the line).
Sample Data
09.08 15:06.53, ScriptExecThread(7),Info,########### starting
09.08 15:06.54, ScriptExecThread(18),Info,###################### starting
09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########
09.08 15:06.54, ScriptExecThread(13),Info,########## starting
09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########
09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########
09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########
09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(8),Info,####### starting
09.08 15:06.55, ScriptExecThread(0),Info,##########starting
09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####
09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(0),Info,########### starting
09.08 15:06.55, ScriptExecThread(9),Info,########## starting
09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########
09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######
09.08 15:06.56, ScriptExecThread(17),Info,###################### starting
09.08 15:06.56, ScriptExecThread(1),Info,########## starting
Currently the code simply displays every line of the log file that contains "starting", which does somewhat make sense when I review the code.
I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.
If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.
If I were to do this, I would use this pseudocode:
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("started")) started.add(n);
else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished
This requires a single regex pass over each file and an O(size-of-finished) set subtraction; it should be 20x faster than your current approach. The regex uses alternation (|) to look for both alternatives at once, reducing the number of passes.
Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.
Edit 2: implemented pseudocode, works for me
Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class T {
public static Collection<String> findHung(Iterable<String> data) {
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|starting)");
HashMap<Integer, String> started = new HashMap<Integer, String>();
Set<Integer> finished = new HashSet<Integer>();
for (String d : data) {
Matcher m = p.matcher(d);
if (m.find()) {
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("starting")) started.put(n, d);
else finished.add(n);
}
}
for (int f : finished) {
started.remove(f);
}
return started.values();
}
static Iterable<String> readFile(String path, String encoding) throws IOException {
final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
return new Iterable<String>() {
public Iterator<String> iterator() { return sc; }
};
}
public static void main(String[] args) throws Exception {
for (String fileName : args) {
for (String s : findHung(readFile(fileName, "UTF-8"))) {
System.out.println(fileName + ": '" + s + "' unfinished");
}
}
}
}
Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt). Output:
data.txt: ' 09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: ' 09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: ' 09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: ' 09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
Keeping two separate sets of started and finished threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.
Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started and finished sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.
Preamble out of the way, here is a working version of @tucuxi's code:
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class T {
public static Collection<String> findHung(Iterable<String> data) {
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|starting)");
HashMap<Integer, String> running = new HashMap<Integer, String>();
for (String d : data) {
Matcher m = p.matcher(d);
if (m.find()) {
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("starting"))
running.put(n, d);
else
running.remove(n);
}
}
return running.values();
}
static Iterable<String> readFile(String path, String encoding) throws IOException {
final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
return new Iterable<String>() {
public Iterator<String> iterator() { return sc; }
};
}
public static void main(String[] args) throws Exception {
for (String fileName : args) {
for (String s : findHung(readFile(fileName, "UTF-8"))) {
System.out.println(fileName + ": '" + s + "' unfinished");
}
}
}
}
Note that I've dropped the finished set, and the HashMap is now called running. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)
Here's a Python program I used to generate huge test cases:
#!/usr/bin/python
from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }
target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)

def new_thread():
    if len(all_threads) > 0:
        for t in range(0, 2 + max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names()) + ".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running) > 0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(victim, "%s finished" % (filenames[victim],))  # log the finished thread, not the last started one
The removeAll will never work: hasThreadHung is storing the entire line, so the values in started will never be matched by values in finished.
You want to do something like:
class ARecord {
    // Proper encapsulation of the members omitted for brevity
    String thread;
    String line;

    public ARecord(String thread, String line) {
        this.thread = thread;
        this.line = line;
    }

    @Override
    public int hashCode() {
        return thread.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        // Two records are equal if they refer to the same thread
        return o instanceof ARecord && thread.equals(((ARecord) o).thread);
    }
}
Then in hasThreadHung, create an ARecord and add it to the sets.
Ex:
started.add(new ARecord(m.group(1), str)); // group(1) is the thread id
In searchHungThreads you will retrieve the ARecords from started and output them as:
for(ARecord rec : started) {
JLogger.appendLineToConsole(rec.line);
commonTextArea.append(rec.line+"\n");
}
Why not solve the problem in another way? If all you want is to find hung threads, you can take a thread dump programmatically. You could use an external tool as well, but doing it inside your own JVM is, I presume, easiest. Then expose that as an API, or regularly save a date-time stamped file with the thread dump. Another program just needs to analyze the thread dumps: if the same thread is at the same spot (the same stack trace, or cycling over the same 3-5 functions) across several dumps, there's a good chance it's hung.
There are tools that help you analyze https://www.google.com/search?q=java+thread+dump+tool+open+source
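A minimal sketch of the programmatic route, using the standard ThreadMXBean (the class name and file naming scheme here are assumptions, not from the original answer):

import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadDumper {

    // Write a date-time stamped thread dump; call this periodically from a scheduled task.
    public static void dumpToFile() throws IOException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString());   // includes (a portion of) each thread's stack trace
        }
        String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
        Files.write(Paths.get("threaddump-" + stamp + ".txt"),
                sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        dumpToFile();
    }
}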
I looked into the getSplitsForFile() function of NLineInputFormat. I found that an InputStream is created for the input file and then iterated, and splits are created every n lines.
Is it efficient? Particularly when this read operation happens on one node before launching a mapper task. What if I have a 5 GB file? Basically it means the file data is read twice, once during split creation and once during the read from the mapper tasks.
If this is a bottleneck, how does a Hadoop job get around it?
public static List<FileSplit> getSplitsForFile(FileStatus status,
Configuration conf, int numLinesPerSplit) throws IOException {
List<FileSplit> splits = new ArrayList<FileSplit> ();
Path fileName = status.getPath();
if (status.isDirectory()) {
throw new IOException("Not a file: " + fileName);
}
FileSystem fs = fileName.getFileSystem(conf);
LineReader lr = null;
try {
FSDataInputStream in = fs.open(fileName);
lr = new LineReader(in, conf);
Text line = new Text();
int numLines = 0;
long begin = 0;
long length = 0;
int num = -1;
// ---- my part of concern: start ----
while ((num = lr.readLine(line)) > 0) {
numLines++;
length += num;
if (numLines == numLinesPerSplit) {
splits.add(createFileSplit(fileName, begin, length));
begin += length;
length = 0;
numLines = 0;
}
}
// ---- my part of concern: end ----
if (numLines != 0) {
splits.add(createFileSplit(fileName, begin, length));
}
} finally {
if (lr != null) {
lr.close();
}
}
return splits;
}
Editing to provide my use case to clément-mathieu
My data sets are big input files, approximately 2 GB each. Each line in the files represents a record that needs to be inserted into a database table (in my case Cassandra).
I want to limit the bulk transactions to my database to every n lines.
I have succeeded in doing this using NLineInputFormat. My only concern is whether there is a hidden performance bottleneck that might show up in production.
Basically it means the file data is read twice, once during split creation and once during the read from the mapper tasks.
Yes.
The purpose of this InputFormat is to create a split for every N lines. The only way to compute the split boundaries is to read the file and find the newline characters. This operation can be costly, but you cannot avoid it if this is what you need.
If this is a bottleneck, how does a Hadoop job get around it?
I'm not sure I understand the question.
NLineInputFormat is not the default InputFormat and very few use cases require it. If you read the Javadoc of the class, you will see that it mainly exists to feed parameters to embarrassingly parallel jobs (i.e. "small" input files).
Most InputFormats do not need to read the file to compute the splits. They usually use hard rules, such as a split should be 128 MB or one split for each HDFS block, and the RecordReaders take care of the real start/end-of-split offsets.
If the cost of NLineInputFormat.getSplitsForFile is an issue, I would really review why I need to use this InputFormat. What you want to do is limit the batch size of a business process in your mapper. With NLineInputFormat a mapper is created for every N lines, which means a mapper will never do more than one bulk transaction. You don't seem to need this feature; you only want to limit the size of a bulk transaction, but you don't care whether a mapper does several of them sequentially. So you are paying the cost of the code you spotted for nothing in return.
I would use TextInputFormat and create the batch in the mapper. In pseudo code:
setup() {
buffer = new Buffer<String>(1_000_000);
}
map(LongWritable key, Text value) {
buffer.append(value.toString())
if (buffer.isFull()) {
new Transaction(buffer).doIt()
buffer.clear()
}
}
cleanup() {
new Transaction(buffer).doIt()
buffer.clear()
}
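In real Hadoop code that might look roughly like the following sketch (BatchingMapper, BATCH_SIZE and executeBulkTransaction are hypothetical names I chose, not part of the original answer; the bulk-insert call is left as a placeholder):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private static final int BATCH_SIZE = 1_000_000; // tune to your database
    private List<String> buffer;

    @Override
    protected void setup(Context context) {
        buffer = new ArrayList<>(BATCH_SIZE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException {
        buffer.add(value.toString());
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        flush(); // write whatever is left over at the end of the split
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        executeBulkTransaction(buffer); // placeholder for your Cassandra batch insert
        buffer.clear();
    }

    private void executeBulkTransaction(List<String> rows) throws IOException {
        // ... issue the bulk insert here ...
    }
}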
By default a mapper is created per HDFS block. If you think this is too much or too little, the mapred.(max|min).split.size variables allow you to increase or decrease the parallelism.
Basically, while convenient, NLineInputFormat is too fine-grained for what you need. You can achieve almost the same thing using TextInputFormat and playing with *.split.size, which does not involve reading the files to create the splits.