Find first index of matching character from two strings using parallel streams - java

Trying to figure out whether it is possible to find the first index of a character in one string that also appears in another string. So for example:
String test = "test";
String other = "123er";
int value = get(test, other);
// method would return 1: the first character of "123er" that also
// appears in "test" is 'e', which is at index 1 of "test"
So I'm trying to accomplish this using parallel streams. I know I can find whether there is a matching character fairly simply like such:
test.chars().parallel().anyMatch(c -> other.indexOf(c) >= 0);
How would I use this to find the exact index?

If you really care for performance, you should try to avoid the O(n × m) time complexity of iterating over one string for every character of the other. So, first iterate over one string to get a data structure supporting efficient (O(1)) lookup, then iterate over the other utilizing this.
BitSet encountered = new BitSet();
test.chars().forEach(encountered::set);
int index = IntStream.range(0, other.length())
        .filter(ix -> encountered.get(other.charAt(ix)))
        .findFirst().orElse(-1);
If the strings are sufficiently large, the O(n + m) time complexity of this solution will translate into much shorter execution times. For smaller strings, it’s irrelevant anyway.
If you really think the strings are large enough to benefit from parallel processing (which is very unlikely), you can perform both operations in parallel, with small adaptations:
BitSet encountered = CharBuffer.wrap(test).chars().parallel()
        .collect(BitSet::new, BitSet::set, BitSet::or);
int index = IntStream.range(0, other.length()).parallel()
        .filter(ix -> encountered.get(other.charAt(ix)))
        .findFirst().orElse(-1);
The first operation now uses the slightly more complicated, parallel-compatible collect, and it contains a not-so-obvious change in the Stream creation.
The problem is described in bug report JDK-8071477. Simply said, the stream returned by String.chars() has a poor splitting capability, hence a poor parallel performance. The code above wraps the string in a CharBuffer, whose chars() method returns a different implementation, having the same semantics, but a good parallel performance. This work-around should become obsolete with Java 9.
Alternatively, you could use IntStream.range(0, test.length()).map(test::charAt) to create a stream with a good parallel performance. The second operation already works that way.
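For illustration, the first operation could then look like this (a minimal sketch using that alternative stream source, not taken from the original answer):
BitSet encountered = IntStream.range(0, test.length()).parallel()
        .map(test::charAt)
        .collect(BitSet::new, BitSet::set, BitSet::or);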
But, as said, for this specific task it’s rather unlikely that you ever encounter strings large enough to make parallel processing beneficial.

You can do it by relying on String#indexOf(int ch), keeping only the values >= 0 to remove non-matching characters, and then getting the first value.
// Get the index of each character of other in test
// Keep only the non-negative values
// Then return the first match
// Or -1 if we have no match
int result = other.chars()
    .parallel()
    .map(test::indexOf)
    .filter(i -> i >= 0)
    .findFirst()
    .orElse(-1);
System.out.println(result);
Output:
1
NB 1: The result is 1 not 2 because indexes start from 0 not 1.
NB 2: Unless you have a very, very long String, using a parallel Stream in this case should not help much in terms of performance, because the tasks are not complex and creating, starting and synchronizing threads has a very high cost, so you will probably get your result more slowly than with a normal stream.

Upgrading Nicolas' answer here: using min() enforces consumption of the whole Stream. In such cases, it's better to use findFirst(), which stops execution after finding the first matching element rather than determining the minimum of all of them:
other.chars().parallel()
     .map(test::indexOf)
     .filter(i -> i >= 0)
     .findFirst()
     .ifPresent(System.out::println);

Related

Are java streams able to lazily reduce from map/filter conditions?

I am using a functional programming style to solve the Leetcode easy question, Count the Number of Consistent Strings. The premise of this question is simple: count the number of values for which the predicate "all characters are in another set" holds.
I have two approaches, one which I am fairly certain behaves as I want it to, and the other which I am less sure about. Both produce the correct output, but ideally they would stop evaluating other elements after the output is in a final state.
public int countConsistentStrings(String allowed, String[] words) {
    final Set<Character> set = allowed.chars()
            .mapToObj(c -> (char) c)
            .collect(Collectors.toCollection(HashSet::new));
    return (int) Arrays.stream(words)
            .filter(word ->
                    word.chars()
                            .allMatch(c -> set.contains((char) c))
            )
            .count();
}
In this solution, to the best of my knowledge, the allMatch statement will terminate and evaluate to false at the first instance of c for which the predicate does not hold true, skipping the other values in that stream.
public int countConsistentStrings(String allowed, String[] words) {
    Set<Character> set = allowed.chars()
            .mapToObj(c -> (char) c)
            .collect(Collectors.toCollection(HashSet::new));
    return (int) Arrays.stream(words)
            .filter(word ->
                    word.chars()
                            .mapToObj(c -> set.contains((char) c))
                            .reduce((a, b) -> a && b)
                            .orElse(false)
            )
            .count();
}
In this solution, the same logic is used, but instead of allMatch, I use map and then reduce. Logically, after a single false value comes out of the map stage, reduce will always evaluate to false. I know Java streams are lazy, but I am unsure when they "know" just how lazy they can be. Will this be less efficient than using allMatch, or will laziness ensure the same behavior?
Lastly, in this code, we can see that the value of x will always be 0: after filtering for non-negative numbers, their sum will always be non-negative (assume no overflow), so taking the minimum of that sum and a hardcoded 0 will be 0. Will the stream be lazy enough to always evaluate this to 0, or will it reduce every element after the filter anyway?
List<Integer> list = new ArrayList<>();
...
/* Some values added to list */
...
int x = list.stream()
        .filter(i -> i >= 0)
        .reduce((a, b) -> Math.min(a + b, 0))
        .orElse(0);
To summarize the above, how does one know when the Java stream will be lazy? There are lazy opportunities that I see in the code, but how can I guarantee that my code will be as lazy as possible?
The actual term you’re asking for is short-circuiting:
Further, some operations are deemed short-circuiting operations. An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result. A terminal operation is short-circuiting if, when presented with infinite input, it may terminate in finite time. Having a short-circuiting operation in the pipeline is a necessary, but not sufficient, condition for the processing of an infinite stream to terminate normally in finite time.
The term “lazy” only applies to intermediate operations and means that they only perform work when being requested by the terminal operation. This is always the case, so when you don’t chain a terminal operation, no intermediate operation will ever process any element.
Finding out whether a terminal operation is short-circuiting is rather easy. Go to the Stream API documentation and check whether the particular terminal operation’s documentation contains the sentence
This is a short-circuiting terminal operation.
allMatch has it, reduce does not.
This does not mean that such optimizations based on logic or algebra are impossible. But the responsibility lies with the JVM’s optimizer, which might do the same for loops. However, this requires inlining of all involved methods to be sure that this condition always applies and that there are no side effects which must be retained. This behavioral compatibility implies that even if the processing gets optimized away, a peek(System.out::println) would keep printing all elements as if they were processed. In practice, you should not expect such optimizations, as the Stream implementation code is too complex for the optimizer.
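For a quick illustration of the difference (my own toy example, not part of the answer), peek shows which elements each pipeline actually processes:
import java.util.stream.Stream;

public class ShortCircuitDemo {
    public static void main(String[] args) {
        // allMatch is short-circuiting: it stops at the first element failing the predicate
        boolean all = Stream.of("a", "b", "c", "d")
                .peek(s -> System.out.println("allMatch saw: " + s))
                .allMatch(s -> s.equals("a"));   // prints "a" and "b" only

        // reduce is not short-circuiting: every element flows through the pipeline
        boolean reduced = Stream.of("a", "b", "c", "d")
                .map(s -> s.equals("a"))
                .peek(b -> System.out.println("reduce saw: " + b))
                .reduce((x, y) -> x && y)
                .orElse(true);                   // prints all four mapped values

        System.out.println(all + " / " + reduced);
    }
}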

Is Java 8 filters on Stream of large list more time consuming than for loop?

I was replacing one of my legacy code paths, moving from a simple for loop over a list to a Java 8 stream with filters.
I have a for loop like the one below:
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
for (int i : numbers) {
    if ((i > 4) && (i % 2 == 0)) {
        System.out.println("First even number in list more than 4 : " + i);
        break;
    }
}
Here I have 6 iterations of the loop. When 6 is obtained, I print it.
Now, I am replacing it with below:
numbers.stream()
       .filter(e -> e > 4)
       .filter(e -> e % 2 == 0)
       .findFirst()
       .ifPresent(n -> System.out.println("First even number in list more than 4 : " + n));
I am confused: if we are applying filter twice on the list, is the time complexity higher than in the for loop case, or is there something I am missing?
The answer to your question is no. A stream with multiple filter, map, or other intermediate operations is not necessarily slower than an equivalent for loop.
The reason for this is that all intermediate Stream operations are lazy, meaning that they are not actually applied individually at the time the filter method is called, but are instead all evaluated at once during the terminal operation (findFirst() in this case). From the documentation:
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines. A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
Intermediate operations return a new stream. They are always lazy; executing an intermediate operation such as filter() does not actually perform any filtering, but instead creates a new stream that, when traversed, contains the elements of the initial stream that match the given predicate. Traversal of the pipeline source does not begin until the terminal operation of the pipeline is executed.
In practice, since your two pieces of code do not compile to exactly the same bytecode, there will likely be some minor performance difference between them, but logically they do very similar things and will perform very similarly.
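If you want to see the laziness at work, here is a small peek-based experiment (my own sketch, not part of the original answer); it shows that the source is only traversed up to the first match, no matter how many filter stages there are:
import java.util.Arrays;
import java.util.List;

public class LazyFilterDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        int first = numbers.stream()
                .peek(e -> System.out.println("pulled from source: " + e))
                .filter(e -> e > 4)
                .filter(e -> e % 2 == 0)
                .findFirst()
                .orElse(-1);

        System.out.println("result = " + first); // only 1..6 are pulled, then the pipeline stops
    }
}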

Arrays.sort() vs sorting using map

I have a requirement where I have to loop through an array containing a list of strings:
String[] arr = {"abc", "cda", "cka", "snd"};
and match the string "bca", ignoring the order of the characters, which will return true as it’s present in the array ("abc").
To solve this I have two approaches:
Use Arrays.sort() to sort both the strings and then use Arrays.equals to compare them.
Create 2 hashmaps, add the frequency of each letter of each string, and then compare the two maps of chars using the equals method.
I read that the complexity of the Arrays.sort() method is higher, so I thought of working on the 2nd approach, but when I run both, the 1st approach takes much less time to execute.
Any suggestions why this is happening?
The Time Complexity only tells you how the approach will scale with (significantly) larger input. It doesn’t tell you which approach is faster.
It’s perfectly possible that a solution is faster for small input sizes (string lengths and/or array length) but scales badly for larger sizes, due to its Time Complexity. But it’s even possible that you never encounter the point where an algorithm with a better Time Complexity becomes faster, when natural limits to the input sizes prevent it.
You didn’t show the code of your approaches, but it’s likely that your first approach calls a method like toCharArray() on the strings, followed by Arrays.sort(char[]). This implies that sort operates on primitive data.
In contrast, when your second approach uses a HashMap<Character,Integer> to record frequencies, it will be subject to boxing overhead, for the characters and the counts, and also use a significantly larger data structure that needs to be processed.
So it’s not surprising that the hash approach is slower for small strings and arrays, as it has a significantly larger fixed overhead and also a size dependent (O(n)) overhead.
So the first approach would have to suffer significantly from its O(n log n) time complexity to change this result. But this won’t happen: that time complexity is a worst case of sorting in general. As explained in this answer, the algorithms specified in the documentation of Arrays.sort should not be taken for granted. When you call Arrays.sort(char[]) and the array size crosses a certain threshold, the implementation switches to Counting Sort with an O(n) time complexity (but temporarily uses more memory).
So even with large strings, you won’t suffer from a worse time complexity. In fact, the Counting Sort shares similarities with the frequency map, but usually is more efficient, as it avoids the boxing overhead, using an int[] array instead of a HashMap<Character,Integer>.
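For reference, the two approaches presumably look roughly like this (the question does not show the actual code, so the method names and details below are guesses):
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class AnagramCheck {
    // Approach 1: sort the chars of both strings and compare (works on primitive char[])
    static boolean sameLettersSorted(String a, String b) {
        char[] ca = a.toCharArray(), cb = b.toCharArray();
        Arrays.sort(ca);
        Arrays.sort(cb);
        return Arrays.equals(ca, cb);
    }

    // Approach 2: compare per-character frequency maps (boxing for both keys and counts)
    static boolean sameLettersCounted(String a, String b) {
        Map<Character, Integer> fa = new HashMap<>(), fb = new HashMap<>();
        for (char c : a.toCharArray()) fa.merge(c, 1, Integer::sum);
        for (char c : b.toCharArray()) fb.merge(c, 1, Integer::sum);
        return fa.equals(fb);
    }
}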
Approach 1 will be O(N log N).
Approach 2 will be O(N * M), where M is the length of each string in your array.
You should search linearly in O(N):
for (String str : arr) {
    if (str.equals(target)) return true;
}
return false;
Let's decompose the problem:
You need a function to sort a string by its chars (bccabc -> abbccc) to be able to compare a given string with the existing ones.
Function<String, String> sortChars = s -> s.chars()
        .sorted()
        .mapToObj(i -> (char) i)
        .map(String::valueOf)
        .collect(Collectors.joining());
Instead of sorting the chars of the given strings every time you compare them, you can precompute the set of unique tokens (the values from your array, with their chars sorted):
Set<String> tokens = Arrays.stream(arr)
        .map(sortChars)
        .collect(Collectors.toSet());
This will result in the values "abc","acd","ack","dns".
Afterwards you can create a function which checks if a given string, when sorted by chars, matches any of the given tokens:
Predicate<String> match = s -> tokens.contains(sortChars.apply(s));
Now you can easily check any given string as follows:
boolean matches = match.test("bca");
Matching will only need to sort the given input and do a hash set lookup to check if it matches, so it's very efficient.
You can of course write the Function and Predicate as methods instead (String sortChars(String s) and boolean matches(String s)) if you're unfamiliar with functional programming.
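One possible shape for that (my own sketch; here the token set is passed in explicitly rather than captured):
static String sortChars(String s) {
    return s.chars()
            .sorted()
            .mapToObj(i -> String.valueOf((char) i))
            .collect(Collectors.joining());
}

static boolean matches(Set<String> tokens, String s) {
    return tokens.contains(sortChars(s));
}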
More of an addendum to the other answers. Of course, your two options have different performance characteristics. But: understand that performance is not necessarily the only factor to make a decision!
Meaning: if you are talking about a search that runs hundreds or thousands of times per minute, on large data sets, then for sure you should invest a lot of time to come up with a solution that gives you the best performance. Most likely, that includes doing various experiments with actual measurements when processing real data. Time complexity is a theoretical construct; in the real world, there are also elements such as CPU cache sizes, threading issues, IO bottlenecks, and whatnot that can have a significant impact on real numbers.
But: when your code will be doing its work just once a minute, even on a few dozen or a few hundred MB of data ... then it might not be worth it to focus on performance.
In other words: the "sort" solution sounds straightforward. It is easy to understand, easy to implement, and hard to get wrong (with some decent test cases). If that solution gets the job done "well enough", then consider using that: the simple solution.
Performance is a luxury problem. You only address it if there is a reason to.

Find all concatenations of two string in a huge set

Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set.
What I've tried
Currently, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by length-4 prefixes and, separately, by length-4 suffixes. Then, for every string composed of length at least 8, I look up the set of candidates for s using the first four characters of composed and the set of candidates for t using its last four characters. This works, but it needs to look at 30M candidate pairs (s, t) to find the 7k results.
This surprisingly high number of candidates comes from the fact that the strings are (mostly German) words from a limited vocabulary, and the words often start and end the same way. It's still much better than trying all 2.5G pairs, but much worse than I hoped.
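For illustration, the grouping approach described above looks roughly like this (my reconstruction, not the actual code; strings is assumed to be the input Set<String>, and it uses java.util.* plus java.util.stream.Collectors):
Map<String, List<String>> byPrefix = strings.stream()
        .filter(s -> s.length() >= 4)
        .collect(Collectors.groupingBy(s -> s.substring(0, 4)));
Map<String, List<String>> bySuffix = strings.stream()
        .filter(s -> s.length() >= 4)
        .collect(Collectors.groupingBy(s -> s.substring(s.length() - 4)));

for (String composed : strings) {
    if (composed.length() < 8) continue;
    // candidates for s share composed's first four characters,
    // candidates for t share its last four characters
    List<String> sCandidates =
            byPrefix.getOrDefault(composed.substring(0, 4), Collections.emptyList());
    List<String> tCandidates =
            bySuffix.getOrDefault(composed.substring(composed.length() - 4), Collections.emptyList());
    for (String s : sCandidates)
        for (String t : tCandidates)
            if (s.length() + t.length() == composed.length() && composed.equals(s + t))
                System.out.println(composed + " = " + s + " + " + t);
}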
What I need
As the additional constraint may get dropped and the set will grow, I'm looking for a better algorithm.
The "missing" question
There were complaints about me not asking a question. So the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without using the constraint?
Algorithm 1: Test pairs, not singles
One way could be, instead of working from all possible pairs to all possible composite strings containing those pairs, work from all possible composite strings and see if they contain pairs. This changes the problem from n^2 lookups (where n is the number of strings >= 4 characters) to m * n lookups (where m is the average length of all strings >= 8 characters, minus 7, and n is now the number of strings >= 8 characters). Here's one implementation of that:
int minWordLength = 4;
int minPairLength = 8;

Set<String> strings = Stream
        .of(
                "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
                "bear", "hug", "bearhug", "cur", "curlique", "curl",
                "down", "downstream", "stream"
        )
        .filter(s -> s.length() >= minWordLength)
        .collect(ImmutableSet.toImmutableSet());

strings
        .stream()
        .filter(s -> s.length() >= minPairLength)
        .flatMap(s -> IntStream
                .rangeClosed(minWordLength, s.length() - minWordLength)
                .mapToObj(splitIndex -> ImmutableList.of(
                        s.substring(0, splitIndex),
                        s.substring(splitIndex)
                ))
                .filter(pair ->
                        strings.contains(pair.get(0))
                        && strings.contains(pair.get(1))
                )
        )
        .map(pair ->
                pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
        )
        .forEach(System.out::println);
Gives the result:
downstream = down + stream
This has average algorithmic complexity of m * n as shown above. So in effect, O(n). In the worst case, O(n^2). See hash table for more on the algorithmic complexity.
Explanation
Put all strings four or more characters long into a hash set (which takes average O(1) complexity for search). I used Guava's ImmutableSet for convenience. Use whatever you like.
filter: Restrict to only the items that are eight or more characters in length, representing our candidates for being a composition of two other words in the list.
flatMap: For each candidate, compute all possible pairs of sub-words, ensuring each is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-deep list.
rangeClosed: Generate all integers representing the number of characters that will be in the first word of the pair we will check.
mapToObj: Use each integer combined with our candidate string to output a list of two items (in production code you'd probably want something more clear like a two-property value class, or an appropriate existing class).
filter: Restrict to only pairs where both are in the list.
map: Pretty up the results a little.
forEach: Output to the console.
Algorithm Choice
This algorithm is tuned to words that are way shorter than the number of items in the list. If the list were very short and the words were very long, then switching back to a composition task instead of a decomposition task would work better. Given that the list is 50,000 strings in size, and German words while long are very unlikely to exceed 50 characters, that is a 1:1000 factor in favor of this algorithm.
If on the other hand, you had 50 strings that were on average 50,000 characters long, a different algorithm would be far more efficient.
Algorithm 2: Sort and keep a candidate list
One algorithm I thought about for a little while was to sort the list, with the knowledge that if a string represents the start of a pair, all candidate strings that could be one of its pairs will be immediately after it in order, among the set of items that start with that string. Sorting my tricky data above, and adding some confounders (downer, downs, downregulate) we get:
a
abc
abcdef
bear
bearhug
cur
curl
curlique
def
down ---------\
downs |
downer | not far away now!
downregulate |
downstream ---/
hug
shine
stream
sun
sunshine
Thus if a running set of all items to check were kept, we could find candidate composites in essentially constant time per word, then probe directly into a hash table for the remainder word:
int minWordLength = 4;

Set<String> strings = Stream
        .of(
                "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
                "bear", "hug", "bearhug", "cur", "curlique", "curl",
                "down", "downs", "downer", "downregulate", "downstream", "stream")
        .filter(s -> s.length() >= minWordLength)
        .collect(ImmutableSet.toImmutableSet());

ImmutableList<String> orderedList = strings
        .stream()
        .sorted()
        .collect(ImmutableList.toImmutableList());

List<String> candidates = new ArrayList<>();
List<Map.Entry<String, String>> pairs = new ArrayList<>();

for (String currentString : orderedList) {
    List<String> nextCandidates = new ArrayList<>();
    nextCandidates.add(currentString);
    for (String candidate : candidates) {
        if (currentString.startsWith(candidate)) {
            nextCandidates.add(candidate);
            String remainder = currentString.substring(candidate.length());
            if (remainder.length() >= minWordLength && strings.contains(remainder)) {
                pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
            }
        }
    }
    candidates = nextCandidates;
}
pairs.forEach(System.out::println);
Result:
down=stream
The algorithmic complexity on this one is a little more complicated. The searching part I think is O(n) average, with O(n^2) worst case. The most expensive part might be the sorting—which depends on the algorithm used and the characteristics of the unsorted data. So use this one with a grain of salt, but it has possibility. It seems to me that this is going to be way less expensive than building a Trie out of an enormous data set, because you only probe it once comprehensively and don’t get any amortization of the build cost.
Also, this time I chose a Map.Entry to hold the pair. It's completely arbitrary how you do it. Making a custom Pair class or using some existing Java class would be fine.
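For example, a tiny value type like the one below would do (purely illustrative; records need Java 16+, otherwise a small final class or the Map.Entry used above works just as well):
// Illustrative replacement for Map.Entry in the snippet above
record Pair(String prefix, String suffix) { }
The loop would then collect new Pair(candidate, remainder) instead of an AbstractMap.SimpleEntry.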
You can improve Erik’s answer by avoiding most of the sub-String creation using CharBuffer views and altering their position and limit:
Set<CharBuffer> strings = Stream.of(
            "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
            "bear", "hug", "bearhug", "cur", "curlique", "curl",
            "down", "downstream", "stream"
        )
        .filter(s -> s.length() >= 4) // < 4 is irrelevant
        .map(CharBuffer::wrap)
        .collect(Collectors.toSet());

strings
        .stream()
        .filter(s -> s.length() >= 8)
        .map(CharBuffer::wrap)
        .flatMap(cb -> IntStream.rangeClosed(4, cb.length() - 4)
                .filter(i -> strings.contains(cb.clear().position(i)) && strings.contains(cb.flip()))
                .mapToObj(i -> cb.clear() + " = " + cb.limit(i) + " + " + cb.clear().position(i))
        )
        .forEach(System.out::println);
This is the same algorithm, hence doesn’t change the time complexity, unless you incorporate the hidden character data copying costs, which would be another factor (times the average string length).
Of course, the differences become significant only if you use a different terminal operation than printing the matches, as printing is quite an expensive operation. Likewise, when the source is a stream over a large file, the I/O will dominate the operation, unless you go in an entirely different direction, like using memory mapping and refactoring this operation to work on ByteBuffers.
A possible solution could be this.
You start with the first string as your prefix and the second string as your suffix.
You go through each string. If the string begins with the first string, you check if it ends with the second string, and keep going until the end. To save some time, before checking whether the letters themselves are the same, you could do a length check.
It's pretty much what you made, but with this added length check you might be able to trim off a few. That's my take on it at least.
Not sure if this is better than your solution but I think it's worth a try.
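A rough sketch of that idea, with the length check up front (the names are mine, not from the answer; strings is the full set and s, t the pair being tested):
static boolean containsConcatenation(Set<String> strings, String s, String t) {
    for (String candidate : strings) {
        // cheap length check first, then the prefix/suffix comparison
        if (candidate.length() == s.length() + t.length()
                && candidate.startsWith(s)
                && candidate.endsWith(t)) {
            return true;
        }
    }
    return false;
}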
Build two Tries, one with the candidates in normal order, the other with the words reversed.
Walk the forwards Trie from depth 4 inwards and use the remainder of the leaf to determine the suffix (or something like that) and look it up in the backwards Trie.
I've posted a Trie implementation in the past here https://stackoverflow.com/a/9320920/823393.

Making a Parallel IntStream more efficient/faster?

I've looked for a while for this answer, but couldn't find anything.
I'm trying to create an IntStream that can very quickly find primes (many, many primes, very fast -- millions in a few seconds).
I currently am using this parallelStream:
import java.util.stream.*;
import java.math.BigInteger;

public class Primes {
    public static IntStream stream() {
        return IntStream.iterate(3, i -> i + 2).parallel()
                .filter(i -> i % 3 != 0)
                .mapToObj(BigInteger::valueOf)
                .filter(i -> i.isProbablePrime(1))
                .flatMapToInt(i -> IntStream.of(i.intValue()));
    }
}
but it takes too long to generate numbers. (7546ms to generate 1,000,000 primes).
Is there any obvious way to making this more efficient/faster?
There are two general problems for efficient parallel processing with your code. First, you are using iterate, which unavoidably requires the previous element to calculate the next one and is therefore not a good starting point for parallel processing. Second, you are using an infinite stream. Efficient workload splitting requires at least an estimate of the number of elements to process.
Since you are processing ascending integer numbers, there is an obvious limit when reaching Integer.MAX_VALUE, but the stream implementation doesn’t know that you are actually processing ascending numbers, hence, will treat your formally infinite stream as truly infinite.
A solution fixing these issues is:
public static IntStream stream() {
    return IntStream.rangeClosed(1, Integer.MAX_VALUE / 2).parallel()
            .map(i -> i * 2 + 1)
            .filter(i -> i % 3 != 0)
            .mapToObj(BigInteger::valueOf)
            .filter(i -> i.isProbablePrime(1))
            .mapToInt(BigInteger::intValue);
}
but it must be emphasized that in this form, this solution is only useful if you truly want to process all or most of the prime numbers in the full integer range. As soon as you apply skip or limit to the stream, the parallel performance will drop significantly, as specified by the documentation of these methods. Also, using filter with a predicate that accepts values in a smaller numeric range only, implies that there will be a lot of unnecessary work that would better not be done than done in parallel.
You could adapt the method to receive a value range as a parameter and restrict the source IntStream to that range, to solve this.
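A minimal sketch of such a range-based variant (the method name and bounds handling are my own; like the version above, it emits neither 2 nor 3):
public static IntStream primes(int from, int to) {
    int firstOdd = Math.max(from | 1, 3);                 // smallest odd candidate >= from (and >= 3)
    return IntStream.rangeClosed(firstOdd >> 1, (to - 1) >> 1).parallel()
            .map(i -> i * 2 + 1)                          // back to the odd number itself
            .filter(i -> i % 3 != 0)
            .mapToObj(BigInteger::valueOf)
            .filter(b -> b.isProbablePrime(1))
            .mapToInt(BigInteger::intValue);
}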
This is the time to emphasize the importance of algorithms over parallel processing. Consider the Sieve of Eratosthenes. The following implementation
public static IntStream primes(int max) {
    BitSet prime = new BitSet(max >> 1);
    prime.set(1, max >> 1);
    for (int i = 3; i < max; i += 2)
        if (prime.get((i - 1) >> 1))
            for (int b = i * 3; b > 0 && b < max; b += i * 2)
                prime.clear((b - 1) >> 1);
    return IntStream.concat(IntStream.of(2), prime.stream().map(i -> i + i + 1));
}
turned out to be faster by an order of magnitude compared to the other approaches despite not using parallel processing, even when using Integer.MAX_VALUE as upper bound (measured using a terminal operation of .reduce((a,b) -> b) instead of toArray or forEach(System.out::println), to ensure complete processing of all values without adding additional storage or printing costs).
The takeaway is, isProbablePrime is great when you have a particular candidate or want to process a small range of numbers (or when the number is way outside the int or even long range)¹, but for processing a large ascending sequence of prime numbers there are better approaches, and parallel processing is not the ultimate answer to performance questions.
¹ consider, e.g.
Stream.iterate(new BigInteger("1000000000000"), BigInteger::nextProbablePrime)
.filter(b -> b.isProbablePrime(1))
It seems that I can roughly halve the time compared to what you have in place, by making some modifications:
return IntStream.iterate(3, i -> i + 2)
        .parallel()
        .unordered()
        .filter(i -> i % 3 != 0)
        .mapToObj(BigInteger::valueOf)
        .filter(i -> i.isProbablePrime(1))
        .mapToInt(BigInteger::intValue);
