Java Spliterator Continually Splits Parallel Stream

I found some surprising behavior with Java parallel streams. I made my own Spliterator, and the resulting parallel stream gets divided up until each stream has only one element in it. That seems way too small and I wonder what I'm doing wrong. I'm hoping there are some characteristics I can set to correct this.
Here's my test code. The Float here is just a dummy payload, my real stream class is somewhat more complicated.
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 10 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total: " + total );
}
This code will continually split this stream until each Spliterator has only one element. That seems way too much to be efficient.
Output:
run:
Split on count: 10
Split on count: 5
Split on count: 3
Split on count: 5
Split on count: 2
Split on count: 2
Split on count: 3
Split on count: 2
Split on count: 2
Total: 5.164293184876442
BUILD SUCCESSFUL (total time: 0 seconds)
Here's the code of the Spliterator. My main concern is what characteristics I should be using, but perhaps there's a problem somewhere else?
public class TestingSpliterator implements Spliterator<Float> {
int count;
int splits;
public TestingSpliterator( int count ) {
this.count = count;
}
@Override
public boolean tryAdvance( Consumer<? super Float> cnsmr ) {
if( count > 0 ) {
cnsmr.accept( (float)Math.random() );
count--;
return true;
} else
return false;
}
@Override
public Spliterator<Float> trySplit() {
System.err.println( "Split on count: " + count );
if( count > 1 ) {
splits++;
int half = count / 2;
TestingSpliterator newSplit = new TestingSpliterator( count - half );
count = half;
return newSplit;
} else
return null;
}
@Override
public long estimateSize() {
return count;
}
@Override
public int characteristics() {
return IMMUTABLE | SIZED;
}
}
So how can I get the stream to be split into much larger chunks? I was hoping that chunks in the neighborhood of 10,000 to 50,000 elements would be better.
I know I can return null from the trySplit() method, but that seems like a backwards way of doing it. It seems like the system should have some notion of number of cores, current load, and how complex the code is that uses the stream, and adjust itself accordingly. In other words, I want the stream chunk size to be externally configured, not internally fixed by the stream itself.
EDIT: re. Holger's answer below, when I increase the number of elements in the original stream, the stream splits are somewhat less, so StreamSupport does stop splitting eventually.
At an initial stream size of 100 elements, StreamSupport stops splitting when it reaches a stream size of 2 (the last line I see on my screen is Split on count: 4).
And for an initial stream size of 1000 elements, the final size of the individual stream chunks is about 32 elements.
Edit part deux: After looking at the output of the above, I changed my code to list out the individual Spliterators created. Here are the changes:
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 100 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total Spliterators: " + testers.size() );
for( TestingSpliterator t : testers ) {
System.out.println( "Splits: " + t.splits );
}
}
And to the TestingSpliterator's ctor:
static Queue<TestingSpliterator> testers = new ConcurrentLinkedQueue<>();
public TestingSpliterator( int count ) {
this.count = count;
testers.add( this ); // OUCH! 'this' escape
}
The result of this code is that the first Spliterator gets split 5 times. The next Spliterator gets split 4 times. The next set of Spliterators gets split 3 times, and so on. In total, 36 Spliterators are created and the stream is split into as many parts. On typical desktop systems this seems to be what the API considers best for parallel operations.
I'm going to accept Holger's answer below, which is essentially that the StreamSupport class is doing the right thing, don't worry, be happy. Part of the issue for me was that I was doing my early testing on very small stream sizes and I was surprised at the number of splits. Don't make the same mistake yourself.

You are looking at it from the wrong angle. The implementation did not split “until each spliterator has one element”; rather, it split “until having ten spliterators”.
A single spliterator instance can only be processed by one thread. A spliterator is not required to support splitting after its traversal has been started. Therefore any splitting opportunity that has not been used beforehand may lead to limited parallel processing capabilities afterwards.
It’s important to keep in mind that the Stream implementation received a ToDoubleFunction with an unknown workload¹. It doesn’t know that it is as simple as Float::doubleValue in your case. It could be a function taking a minute to evaluate, and then having a spliterator per CPU core would be exactly right. Even having more spliterators than CPU cores is a valid strategy to handle the possibility that some evaluations take significantly longer than others.
A typical number of initial spliterators will be “number of CPU cores” × 4, though there might be more split operations later, when more knowledge about actual workloads exists. When your input data has fewer elements than that number, it’s not surprising that it gets split down until one element per spliterator is left.
You may try with new TestingSpliterator( 10000 ) or 1000 or 100 to see that the number of splits will not change significantly, once the implementation assumes to have enough chunks to keep all CPU cores busy.
Since your spliterator does not know anything about the per-element workload of the consuming stream either, you shouldn’t be concerned about this. If you can smoothly support splitting down to single elements, just do that.
¹ It doesn’t have special optimizations for the case that no operations have been chained, though.
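To see this for yourself, here is a minimal sketch (not the asker's class) that counts successful splits for several input sizes. The names SplitCountDemo and RangeSpliterator are made up for illustration, and the actual split counts depend on the JDK version and the number of cores, so none are shown here.

```java
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

class SplitCountDemo {

    /** Hypothetical spliterator over the ints [from, to) that records every successful split. */
    static final class RangeSpliterator implements Spliterator<Integer> {
        private int from;                    // next value to emit
        private final int to;                // exclusive upper bound
        private final AtomicInteger splits;  // shared counter, updated from worker threads

        RangeSpliterator(int from, int to, AtomicInteger splits) {
            this.from = from;
            this.to = to;
            this.splits = splits;
        }

        @Override
        public boolean tryAdvance(Consumer<? super Integer> action) {
            if (from >= to) {
                return false;
            }
            action.accept(from++);
            return true;
        }

        @Override
        public Spliterator<Integer> trySplit() {
            int remaining = to - from;
            if (remaining < 2) {
                return null;
            }
            // Hand the first half to a new spliterator and keep the second half.
            int mid = from + remaining / 2;
            Spliterator<Integer> prefix = new RangeSpliterator(from, mid, splits);
            from = mid;
            splits.incrementAndGet();
            return prefix;
        }

        @Override
        public long estimateSize() {
            return to - from;
        }

        @Override
        public int characteristics() {
            return SIZED | SUBSIZED | IMMUTABLE;
        }
    }

    /** Sums 0..n-1 with a parallel stream; the number of splits ends up in the counter. */
    static long sum(int n, AtomicInteger splits) {
        return StreamSupport.stream(new RangeSpliterator(0, n, splits), true)
                .mapToLong(Integer::longValue)
                .sum();
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 10_000, 1_000_000}) {
            AtomicInteger splits = new AtomicInteger();
            long total = sum(n, splits);
            System.out.println(n + " elements, sum " + total + ", splits " + splits.get());
        }
    }
}
```

If the reasoning above holds on your machine, the split count should level off as n grows instead of growing with it.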

Unless I am missing the obvious, you could always pass a bufferSize in the constructor and use that for your trySplit:
@Override
public Spliterator<Float> trySplit() {
if( count > 1 ) {
splits++;
if(count > bufferSize) {
count = count - bufferSize;
return new TestingSpliterator( bufferSize, bufferSize);
}
}
return null;
}
And with this:
TestingSpliterator splits = new TestingSpliterator(12, 5);
Stream<Float> test = StreamSupport.stream(splits, true);
test.map(x -> new AbstractMap.SimpleEntry<>(
x.doubleValue(),
Thread.currentThread().getName()))
.collect(Collectors.groupingBy(
Map.Entry::getValue,
Collectors.mapping(
Map.Entry::getKey,
Collectors.toList())))
.forEach((x, y) -> System.out.println("Thread : " + x + " processed : " + y));
You would see that there are 3 threads: two of them process 5 elements each and one processes 2.

Related

Multithreading Usage

I am iterating through a HashMap with roughly 20 million entries. In each iteration I iterate again through another HashMap with roughly 20 million entries.
HashMap<String, BitSet> data_1 = new HashMap<String, BitSet>();
HashMap<String, BitSet> data_2 = new HashMap<String, BitSet>();
I am dividing data_1 into chunks based on the number of threads (threads = cores; I have a four-core processor).
My code takes more than 20 hours to execute, and that excludes storing the results in a file.
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
2) How can I make the following much faster?
3) How can I create the chunks dynamically, based on the number of cores?
int cores = Runtime.getRuntime().availableProcessors();
int threads = cores;
//Number of threads
int Chunks = data_1.size() / threads;
// I don't trust the chunks created by the line below, which is why I created chunk1, chunk2, chunk3 and chunk4 separately and validated them.
Map<Integer, BitSet>[] Chunk= (Map<Integer, BitSet>[]) new HashMap<?,?>[threads];
4) How can I create the threads in a for loop? Is what I am doing correct?
ClassName thread1 = new ClassName(data2, chunk1);
ClassName thread2 = new ClassName(data2, chunk2);
ClassName thread3 = new ClassName(data2, chunk3);
ClassName thread4 = new ClassName(data2, chunk4);
thread1.start();
thread2.start();
thread3.start();
thread4.start();
thread1.join();
thread2.join();
thread3.join();
thread4.join();
Representation of My Code
public class ClassName extends Thread {
Integer nSimilarEntities = 30;
public void run() {
for (String kNonRepeater : data_1.keySet()) {
// Extract the feature vector
BitSet vFeaturesNonRepeater = data_1.get(kNonRepeater);
// Calculate the sum of 1s (L2 norm is the sqrt of this)
double nNormNonRepeater = Math.sqrt(vFeaturesNonRepeater.cardinality());
// Loop through the repeater set
double nMinSimilarity = 100;
int nMinSimIndex = 0;
// Maintain the list of top similar repeaters and the similarity values
long dpind = 0;
ArrayList<String> vSimilarKeys = new ArrayList<String>();
ArrayList<Double> vSimilarValues = new ArrayList<Double>();
for (String kRepeater : data_2.keySet()) {
// Status output at regular intervals
dpind++;
if (Math.floorMod(dpind, pct) == 0) {
System.out.println(dpind + " dot products (" + Math.round(dpind / pct) + "%) out of "
+ nNumSimilaritiesToCompute + " completed!");
}
// Calculate the norm of repeater, and the dot product
BitSet vFeaturesRepeater = data_2.get(kRepeater);
double nNormRepeater = Math.sqrt(vFeaturesRepeater.cardinality());
BitSet vTemp = (BitSet) vFeaturesNonRepeater.clone();
vTemp.and(vFeaturesRepeater);
double nCosineDistance = vTemp.cardinality() / (nNormNonRepeater * nNormRepeater);
// queue.add(new MyClass(kRepeater,kNonRepeater,nCosineDistance));
// if(queue.size() > YOUR_LIMIT)
// queue.remove();
// Don't bother if the similarity is 0, obviously
if ((vSimilarKeys.size() < nSimilarEntities) && (nCosineDistance > 0)) {
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
} else { // If there are more, keep only the best
// If this is better than the smallest distance, then remove the smallest
if (nCosineDistance > nMinSimilarity) {
// Remove the lowest similarity value
vSimilarKeys.remove(nMinSimIndex);
vSimilarValues.remove(nMinSimIndex);
// Add this one
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
// Refresh the index of lowest similarity value
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
}
} // End loop for maintaining list of similar entries
}// End iteration through repeaters
for (int i = 0; i < vSimilarValues.size(); i++) {
System.out.println(Thread.currentThread().getName() + kNonRepeater + "|" + vSimilarKeys.get(i) + "|" + vSimilarValues.get(i));
}
}
}
}
Finally, if not multithreading, are there any other approaches in Java to reduce the running time?

The computer works similarly to what you have to do by hand (it processes more digits/bits at a time, but the problem is the same).
If you do addition, the time is proportional to the size of the number.
If you do multiplication or division, it's proportional to the square of the size of the number.
For the computer the size is based on multiples of 32 or 64 significant bits depending on the implementation.

I'd say this task is suitable for parallel streams. Take a look at this concept if you have time. Parallel streams use multithreading for you without explicit thread management.
The top-level processing will look like this:
data_1.entrySet()
.parallelStream()
.flatMap(nonRepeaterEntry -> processOne(nonRepeaterEntry.getKey(), nonRepeaterEntry.getValue(), data2))
.forEach(System.out::println);
You should provide a processOne function with a prototype like this:
Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBitSet, Map<String, BitSet> data2);
It returns the prepared strings, i.e. what you currently print, as a stream.
To build the stream inside, you can first prepare a List<String> and then turn it into a stream in the return statement:
return list.stream();
Even though the inner loop can be processed with streams as well, parallel streaming inside is discouraged: you already have enough parallelism.
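A minimal sketch of how the pieces could fit together, under some assumptions: the class name ParallelSimilarity and the method similarities are made up, the top-30 filtering from the question is left out for brevity, and results are collected into a list instead of being printed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelSimilarity {

    /** Compares one data_1 entry against every data_2 entry and returns the formatted result lines. */
    static Stream<String> processOne(String key, BitSet features, Map<String, BitSet> data2) {
        List<String> lines = new ArrayList<>();
        double norm = Math.sqrt(features.cardinality());
        for (Map.Entry<String, BitSet> other : data2.entrySet()) {
            // Intersect on a copy so the shared input BitSet is never modified.
            BitSet common = (BitSet) features.clone();
            common.and(other.getValue());
            int shared = common.cardinality();
            if (shared > 0) {
                double otherNorm = Math.sqrt(other.getValue().cardinality());
                lines.add(key + "|" + other.getKey() + "|" + shared / (norm * otherNorm));
            }
        }
        return lines.stream();
    }

    /** Outer loop in parallel, inner loop sequential (hypothetical helper). */
    static List<String> similarities(Map<String, BitSet> data1, Map<String, BitSet> data2) {
        return data1.entrySet()
                .parallelStream()
                .flatMap(e -> processOne(e.getKey(), e.getValue(), data2))
                .collect(Collectors.toList());
    }
}
```

Both maps are only read here, which is what makes sharing them across worker threads safe in this sketch.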
For your questions:
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
Any logging framework (logback, log4j) can deal with it. Parallel streams can deal with it. You can also store the prepared lines in a queue or array and print them from a separate thread. That takes a bit of care, though; ready-made solutions are easier and effectively do the same thing.
2) How can I make the following much faster?
Optimize and parallelize. Under normal conditions you can expect a speedup between number_of_threads/1.5 and number_of_threads, assuming hyperthreading is in play, but it depends on how much of your work is not parallel and on the underlying implementations.
3) How can I create the chunks dynamically, based on the number of cores?
You don't have to. Make a list of tasks (1 task per data_1 entry) and feed them to an executor service; that is already a big enough task size. You can use a FixedThreadPool with the number of threads as its parameter, and it will distribute the tasks evenly.
Note that you should create a task class, get a Future for each task from threadpool.submit, and at the end run a loop calling .get on each Future. This implicitly throttles the main thread down to the executor's processing speed, giving fork-join-like behaviour.
4) Creating threads directly is an outdated technique. It's recommended to use an executor service of some sort, parallel streams, etc. To do it with loops, you need to create a list of chunks, then in one loop create a thread per chunk and add it to a list of threads, and in another loop join each thread in the list.
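The executor approach from points 3 and 4 could be sketched roughly like this. ExecutorSketch and processAll are hypothetical names, and the per-item work is passed in as a function so the sketch stays independent of the similarity code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ExecutorSketch {

    /** Submits one task per item to a fixed pool and collects the results in submission order. */
    static <T, R> List<R> processAll(List<T> items, Function<T, R> work) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T item : items) {
                Callable<R> task = () -> work.apply(item);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until this task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the results come back in submission order, the main thread can write them to a single file without any interleaving between threads.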
Ad hoc optimizations:
1) Make a Repeater class that stores the key, the bitset and its cardinality. Preprocess your hash maps, turning them into Repeater instances and calculating the cardinality once (i.e. not on every inner loop run). It will save you 20mil*(20mil-1) calls of .cardinality(). You still need to call it on the intersection (vTemp).
2) Replace similarKeys and similarValues with a size-limited PriorityQueue of combined entries. It works faster for 30 elements.
Take a look at this question for info about PriorityQueue:
Java PriorityQueue with fixed size
3) You can skip processing of a nonRepeater if its cardinality is already 0: a BitSet and() will never increase the resulting cardinality, and you'll filter out all 0-distance values.
4) You can skip (remove from the temporary list you create in optimization 1) every Repeater with zero cardinality. As in optimization 3, it will never produce anything fruitful.
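As a rough illustration of optimization 2, a size-limited PriorityQueue can be kept as a min-heap so that the weakest retained entry is always at the head. The class TopSimilar and its nested Scored type are made-up names for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopSimilar {

    /** A key together with its similarity score. */
    static final class Scored {
        final String key;
        final double similarity;

        Scored(String key, double similarity) {
            this.key = key;
            this.similarity = similarity;
        }
    }

    private final int limit;
    // Min-heap: the entry with the lowest similarity is at the head.
    private final PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.similarity));

    TopSimilar(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Keeps the entry only if it belongs to the best 'limit' entries seen so far. */
    void offer(String key, double similarity) {
        if (similarity <= 0) {
            return; // zero similarity is never interesting
        }
        if (heap.size() < limit) {
            heap.add(new Scored(key, similarity));
        } else if (similarity > heap.peek().similarity) {
            heap.poll(); // drop the current weakest entry
            heap.add(new Scored(key, similarity));
        }
    }

    /** Returns the retained keys, best first. */
    List<String> keysBestFirst() {
        List<Scored> sorted = new ArrayList<>(heap);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.similarity).reversed());
        List<String> keys = new ArrayList<>();
        for (Scored s : sorted) {
            keys.add(s.key);
        }
        return keys;
    }
}
```

This replaces the linear rescan for the minimum after every insertion with heap operations that cost O(log limit).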

Unexpected parallelstream performance in Java 8

I experienced a performance issue when using a stream created from the spliterator() of an Iterable, i.e. StreamSupport.stream(integerList.spliterator(), true). I wanted to compare this against a normal collection. Please see some benchmark results below.
Question:
Why is the parallel stream created from an Iterable much slower than the stream created from an ArrayList or an IntStream?
From a range
public void testParallelFromIntRange() {
long start = System.nanoTime();
IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
System.out.println("Is Parallel: "+stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from range Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from range Takes : 490 milli seconds
From an Iterable
public void testParallelFromIterable() {
Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
long start = System.nanoTime();
Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
System.out.println("Is Parallel: " + stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from Iterable Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from Iterable Takes : 12517 milli seconds
And the so trivial calculate method.
public static Integer calculate(Integer input) {
return input + 2;
}

Not all spliterators are created equally. One of the tasks of a spliterator is to decompose the source into two parts, that can be processed in parallel. A good spliterator will divide the source roughly in half (and will be able to continue to do so recursively.)
Now, imagine you are writing a spliterator for a source that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is divide the source into "first" and "rest". That's about as bad as it gets. The result is a computation tree that is very "right-heavy".
The spliterator that you get from a data structure has more to work with; it knows the layout of the data, and can use that to give better splits, and therefore better parallel performance. The spliterator for ArrayList can always divide in half, and retains knowledge of exactly how much data is in each half. That's really good. The spliterator from a balanced tree can get good distribution (since each half of the tree has roughly half the elements), but isn't quite as good as the ArrayList spliterator because it doesn't know the exact sizes. The spliterator for a LinkedList is about as bad as it gets; all it can do is (first, rest). And the same for deriving a spliterator from an iterator.
Now, all is not necessarily lost; if the work per element is high, you can overcome bad splitting. But if you're doing a small amount of work per element, you'll be limited by the quality of splits from your spliterator.
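The difference in split quality can be observed directly by splitting once and looking at the size estimates. This is only a sketch with made-up names (SplitQuality, splitOnce); what the iterator-based spliterator reports is an implementation detail of the JDK in use, so no output is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

class SplitQuality {

    static List<Integer> numbers(int n) {
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    /** Splits once and returns {estimated size of the split-off part, estimated size of the rest}. */
    static long[] splitOnce(Spliterator<Integer> source) {
        Spliterator<Integer> prefix = source.trySplit();
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] {prefixSize, source.estimateSize()};
    }

    public static void main(String[] args) {
        List<Integer> list = numbers(10_000);

        // ArrayList knows its layout: an exact split down the middle.
        long[] fromList = splitOnce(list.spliterator());
        System.out.println("ArrayList: " + fromList[0] + " / " + fromList[1]);

        // An iterator only offers "some prefix" and "the rest", with no known size.
        long[] fromIterator = splitOnce(
                Spliterators.spliteratorUnknownSize(list.iterator(), Spliterator.ORDERED));
        System.out.println("Iterator:  " + fromIterator[0] + " / " + fromIterator[1]);
    }
}
```

The ArrayList case reports two equal halves, while the iterator-based case typically reports a small prefix and a remainder whose size it cannot know.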

There are several problems with your benchmark.
Stream<Integer> cannot be compared to IntStream because of boxing overhead.
You aren't doing anything with the result of the calculation, which makes it hard to know whether the code is actually being run.
You are benchmarking with System.nanoTime instead of using a proper benchmarking tool.
Here's a JMH-based benchmark:
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class Ranges {
final static int SIZE = 10_000_000;
@Benchmark
public long intStream() {
Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();
return st.parallel().mapToInt(x -> x).sum();
}
@Benchmark
public long contiguousSet() {
ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
Stream<Integer> st = cs.stream();
return st.parallel().mapToInt(x -> x).sum();
}
public static void main(String[] args) throws RunnerException {
new Runner(
new OptionsBuilder()
.include(".*Ranges.*")
.forks(1)
.warmupIterations(5)
.measurementIterations(5)
.build()
).run();
}
}
And the output:
Benchmark Mode Samples Score Score error Units
b.Ranges.contiguousSet thrpt 5 13.540 0.924 ops/s
b.Ranges.intStream thrpt 5 27.047 5.119 ops/s
So IntStream.range is about twice as fast as ContiguousSet, which is perfectly reasonable, given that ContiguousSet doesn't implement its own Spliterator and uses the default from Set.

Best method for parallel log aggregation

My program needs to analyze a bunch of log files daily, which are generated on an hourly basis by each application server.
So if I have 2 app servers I will be processing 48 files (24 files * 2 app servers).
File sizes range from 100 to 300 MB. Each line in every file is a log entry in the format
[identifier]-[number of pieces]-[piece]-[part of log]
for example
xxx-3-1-ABC
xxx-3-2-ABC
xxx-3-3-ABC
These can be distributed over the 48 files which I mentioned, I need to merge these logs like so
xxx-PAIR-ABCABCABC
My implementation uses a thread pool to read through files in parallel and then aggregate them using a ConcurrentHashMap
I define a class LogEvent.scala
class LogEvent (val id: String, val total: Int, var piece: Int, val json: String) {
var additions: Long = 0
val pieces = new Array[String](total)
addPiece(json)
private def addPiece (json: String): Unit = {
pieces(piece) = json
additions += 1
}
def isDone: Boolean = {
additions == total
}
def add (slot: Int, json: String): Unit = {
piece = slot
addPiece(json)
}
}
The main processing happens over multiple threads and the code is something on the lines of
//For each file
val logEventMap = new ConcurrentHashMap[String, LogEvent]().asScala
Future {
Source.fromInputStream(gis(file)).getLines().foreach {
line =>
//Extract the id part of the line
val idPart: String = IDPartExtractor(line)
//Split line on '-'
val split: Array[String] = idPart.split("-")
val id: String = split(0) + "-" + split(1)
val jsonPart: String = JsonPartExtractor(line)
val total = split(2) toInt
val piece = split(3) toInt
def slot: Int = {
piece match {
case x if x - 1 < 0 => 0
case _ => piece - 1
}
}
def writeLogEvent (logEvent: LogEvent): Unit = {
if (logEvent.isDone) {
//write to buffer
val toWrite = id + "-PAIR-" + logEvent.pieces.mkString("")
logEventMap.remove(logEvent.id)
writer.writeLine(toWrite)
}
}
//The LOCK
appendLock {
if (!logEventMap.contains(id)) {
val logEvent = new LogEvent(id, total, slot, jsonPart)
logEventMap.put(id, logEvent)
//writeLogEventToFile()
}
else {
val logEvent = logEventMap.get(id).get
logEvent.add(slot, jsonPart)
writeLogEvent(logEvent)
}
}
}
}
The main thread blocks till all the futures complete
Using this approach I have been able to cut the processing time from an hour+ to around 7-8 minutes.
My questions are as follows -
Can this be done in a better way? I am reading multiple files using different threads and I need to lock the block where the aggregation happens. Are there better ways of doing this?
The Map grows very fast in memory. Any suggestions for off-heap storage for such a use case?
Any other feedback?
Thanks

A common way to do this is to sort each file and then merge the sorted files. The result is a single file that has the individual items in the order that you want them. Your program then just needs to do a single pass through the file, combining adjacent matching items.
This has some very attractive benefits:
The sort/merge is done by standard tools that you don't have to write
Your aggregator program is very simple. Or, there might even be a standard tool that will do it.
Memory requirements are lessened. The sort/merge programs know how to manage memory, and your aggregation program's memory requirements are minimal.
There are, of course some drawbacks. You'll use more disk space and the process will be somewhat slower due to the I/O cost.
When I'm faced with something like this, I almost always go with using the standard tools and a simple aggregator program. The increased performance I get from a custom program just doesn't justify the time it takes to develop the thing.
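The "simple aggregator" over sorted input could look roughly like the following single-pass sketch. It assumes lines in the id-total-piece-payload format from the question, that the identifier itself contains no hyphen, and that the input is already sorted so all lines of one identifier are adjacent. SortedLogAggregator is a made-up name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedLogAggregator {

    /** Merges adjacent lines that share an id into "id-PAIR-payloads". */
    static List<String> aggregate(List<String> sortedLines) {
        List<String> merged = new ArrayList<>();
        String currentId = null;
        String[] pieces = null;
        for (String line : sortedLines) {
            // Limit the split to 4 fields because the payload may contain '-' itself.
            String[] fields = line.split("-", 4);
            String id = fields[0];
            int total = Integer.parseInt(fields[1]);
            int piece = Integer.parseInt(fields[2]);
            if (!id.equals(currentId)) {
                flush(currentId, pieces, merged);
                currentId = id;
                pieces = new String[total];
                Arrays.fill(pieces, "");
            }
            // Place by piece number, so the order of pieces within a group does not matter.
            pieces[piece - 1] = fields[3];
        }
        flush(currentId, pieces, merged);
        return merged;
    }

    private static void flush(String id, String[] pieces, List<String> out) {
        if (id != null) {
            out.add(id + "-PAIR-" + String.join("", pieces));
        }
    }
}
```

Only one group is held in memory at a time, which is where the reduced memory requirement comes from. A group with missing pieces is still emitted here, with the gaps left empty.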

For this sort of thing, if you can, use Splunk, if not, copy what it does which is index the log files for aggregation on demand at a later point.
For off-heap storage, look at distributed caches such as Hazelcast or Coherence. Both provide java.util.Map implementations that are stored over multiple JVMs.

High Level Java Optimization

There are many questions and answers and opinions about how to do low level Java optimization, with for, while, and do-while loops, and whether it's even necessary.
My question is more of a High Level based optimization in design. Let's assume I have to do the following:
For a given string input, count the occurrences of each letter in the string.
This is not a major problem when the string is a few sentences, but what if we instead want to count the occurrences of each word in a 900,000 word file? Building loops just wastes time.
So what is the high level design pattern that can be applied to this type of problem.
I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.
thanks in advance
Sam
p.s. If possible can you produce some pseudo code for solving the 900,000 word file problem, I tend to understand code better than I can understand English, which I assume is the same for most visitors of this site

The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.
I'll give you some thoughts on it anyway.
First, 900000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:
h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
h[w] = w in h ? h[w] + 1 : 1
}
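In actual Java, the same in-memory approach might look like this sketch. WordCount is a made-up name, and tokenizing on whitespace with Scanner is just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class WordCount {

    /** Counts lower-cased, whitespace-separated words from the given source. */
    static Map<String, Integer> count(Readable source) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner in = new Scanner(source)) {
            while (in.hasNext()) {
                // merge() inserts 1 for a new word, otherwise adds 1 to the old count.
                counts.merge(in.next().toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For a file, a java.io.FileReader (or any other Readable) can be passed in.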
Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:
Tokenize into words writing each word to a single line in a file
Use the Unix sort command to produce the next file
Count as you traverse the sorted file
These three steps go in a Unix pipeline. Let the OS do the work for you here.
Now, as you get even more data, you want to bring in map-reduce frameworks like hadoop to do the word counting on clusters of machines.
Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment does not help anymore, because the transmission time overwhelms the counting time. In your case of word counting, everything has to "be put back together anyway", so you then have to use some very sophisticated techniques that I suspect you can find in research papers.
ADDENDUM
The OP asked for an example of tokenizing the input in Java. Here is the easiest way:
import java.util.Scanner;
public class WordGenerator {
/**
* Tokenizes standard input into words, writing each word to standard output,
* one per line. Because it reads from standard input and writes to standard
* output, it can easily be used in a pipeline combined with sort, uniq, and
* any other such application.
*/
public static void main(String[] args) {
Scanner input = new Scanner(System.in);
while (input.hasNext()) {
System.out.println(input.next().toLowerCase());
}
}
}
Now here is an example of using it:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
This outputs
hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.
You can combine this tokenizer with sort and uniq like so:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq
Yielding
hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo
Now if you only want to keep letters and throw away all punctuation, digits and other characters, change your scanner definition line to the following (and import java.util.regex.Pattern):
Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));
And now
echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq
Yields
hey
moe
nyuk
soitenly
why
woo
There is a blank line in the output; I'll let you figure out how to whack it. :)

The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character and update its count in a HashMap accordingly. At the end the HashMap contains all the characters that occurred and a count of their occurrences.
Some pseudo-code (may not compile)
HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
char c = str.charAt(i);
if (map.containsKey(c)) map.put(c, map.get(c) + 1);
else map.put(c, 1);
}

It's hard for you to get much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process the units of work with different processors (using threads, for example, if you have a multiprocessor computer).

You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)
However, for counting characters, using an int[] will be much faster. There are only 65,536 possible characters.
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();
long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE + 1];
for (int i = 0; i < text.length(); i++)
charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());
prints
Took 111 ms to count 139,715,647 characters
Even 11 times the number of words takes a fraction of a second.
A much longer parallel version is a little faster.
public static void main(String... args) throws InterruptedException, ExecutionException {
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
words.append(Long.toString(rand.nextLong(), 36)).append(' ');
final String text = words.toString();
long start = System.nanoTime();
// start a thread pool to generate 4 tasks to count sections of the text.
final int nThreads = 4;
ExecutorService es = Executors.newFixedThreadPool(nThreads);
List<Future<int[]>> results = new ArrayList<Future<int[]>>();
int blockSize = (text.length() + nThreads - 1) / nThreads;
for (int i = 0; i < nThreads; i++) {
final int min = i * blockSize;
final int max = Math.min(min + blockSize, text.length());
results.add(es.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
int[] charCount = new int[Character.MAX_VALUE + 1];
for (int j = min; j < max; j++)
charCount[text.charAt(j)]++;
return charCount;
}
}));
}
es.shutdown();
// combine the results.
int[] charCount = new int[Character.MAX_VALUE + 1];
for (Future<int[]> resultFuture : results) {
int[] result = resultFuture.get();
for (int i = 0, resultLength = result.length; i < resultLength; i++) {
charCount[i] += result[i];
}
}
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}
prints
Took 45 ms to count 139,715,537 characters
But for a String with fewer than a million words it's not likely to be worth it.
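For comparison, the same per-section counting and combining can be written with a parallel IntStream and a three-argument collect. This is a sketch with a made-up class name, and it has not been benchmarked against the executor version.

```java
class ParallelCharCount {

    /** Counts how often each char value occurs in the text, using a parallel stream. */
    static int[] count(String text) {
        return text.chars()
                .parallel()
                .collect(
                        // One private counting array per chunk of the text.
                        () -> new int[Character.MAX_VALUE + 1],
                        (counts, c) -> counts[c]++,
                        // Combine two partial results by adding them element-wise.
                        (left, right) -> {
                            for (int i = 0; i < left.length; i++) {
                                left[i] += right[i];
                            }
                        });
    }
}
```

Each combine step walks a full 65,536-element array, so for short strings this is likely slower than the plain loop.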

As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible.
If that means putting in a faster algorithm, do so, but at first, keep it simple.
For a small program like this, it won't be too hard.
The essential skill in performance tuning is not guessing.
Instead, let the program itself tell you what to fix.
This is my method.
For more involved programs, like this one, experience will show you how to avoid the over-thinking that ends up causing a lot of the poor performance it is trying to avoid.

You have to use divide and conquer approach and avoid race for resources. There are different approaches and/or implementations for that. The idea is the same - split the work and parallelize the processing.
On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads means more context switching, so for throughput it is IMHO better to have a smaller number of them and keep them busy.
You can split the processing into stages and use SEDA or something similar, and with really big data you go for map-reduce; just account for the expense of distributing the data across the cluster.
I'd be glad if somebody pointed to another widely used API.

Can this code be more efficient?

This Program should do this
N 10*N 100*N 1000*N
1 10 100 1000
2 20 200 2000
3 30 300 3000
4 40 400 4000
5 50 500 5000
So here's my code:
public class ex_4_21 {
public static void main( String Args[] ){
int process = 1;
int process2 = 1;
int process22 = 1;
int process3 = 1;
int process33 = 2;
System.out.println("N 10*N 100*N 1000*N");
while(process<=5){
while(process2<=3){
System.out.printf("%d ",process2);
while(process22<=3){
process2 = process2 * 10;
System.out.printf("%d ",process2);
process22++;
}
process2++;
}
process++;
}
}
}
Can my code be more efficient? I am currently learning while loops, and this is what I have so far. Can anyone make this more efficient, or give me ideas on how to make my code more efficient?
This is not homework; I am self-studying Java.

You can use a single variable n to do this.
while(n is less than the maximum value that you wish n to be)
print n and a tab
print n * 10 and a tab
print n * 100 and a tab
print n * 1000 and a new line
n++
If the power of 10 is variable then you can try this:
while(n is less than the maximum value that you wish n to be)
set i to 0
while(i is at most the max power of ten)
print n * 10^i and a tab
i++
print a newline
n++

If you must use a while loop
public class ex_4_21 {
public static void main( String Args[] ){
int process = 1;
System.out.println("N 10*N 100*N 1000*N");
while(process<=5){
System.out.println(process + " " + 10*process + " " + 100*process + " " + 1000*process);
process++;
}
}
}

You have one too many while loops (your "process2" while loop is unnecessary). You also appear to have some bugs related to the fact that the variables you are looping on in the inner loops are not re-initialized with each iteration.
I would also recommend against while loops for this; Your example fits a for loop much better; I understand you are trying to learn the looping mechanism, but part of learning should also be in deciding when to use which construct. This really isn't a performance recommendation, more an approach recommendation.
I don't have any further performance improvement suggestions, for what you are trying to do; You could obviously remove loops (dropping down to a single or even no loops), but two loops makes sense for what you are doing (allows you to easily add another row or column to the output with minimal changes).
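A possible two-loop version with for loops, as a sketch: MultiplesTable and row are made-up names, and the columns are separated by tabs here.

```java
class MultiplesTable {

    /** Builds one row: n, 10*n, 100*n, ... with the given number of columns. */
    static String row(int n, int columns) {
        StringBuilder sb = new StringBuilder();
        int factor = 1;
        for (int col = 0; col < columns; col++) {
            if (col > 0) {
                sb.append('\t');
            }
            sb.append(n * factor);
            factor *= 10; // the next column is ten times larger
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("N\t10*N\t100*N\t1000*N");
        for (int n = 1; n <= 5; n++) {
            System.out.println(row(n, 4));
        }
    }
}
```

Adding a row or a column only means changing one loop bound (and the header line).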

You can try loop unrolling, similar to @Vincent Ramdhanie's answer.
However, loop unrolling and threading won't produce a significant performance improvement for such a small sample. The overhead involved in creating and launching threads (processes) takes more time than a simple while loop. The overhead in I/O will take more time than the unrolled version saves. A complex program is harder to debug and maintain than a simple one.
What you're thinking of is called micro-optimization. Save the optimizations for larger programs, and only apply them when the requirements cannot be met or the customer demands it.
