I need a fast approach to search substrings

I'm reworking a framework and I need a fast algorithm to search for a substring in a collection of strings.
In short, a class is alerted when any event from a child association is triggered.
The event contains a path which is the path from the current class to the event that was triggered (usually a property change).
Each class has static bindings to paths that are loaded in a collection.
A binding consists of the actual path and a set of property names that are bound to that path.
When a class receives an event, it needs to check whether any property name is bound to the event's path and trigger something on every property that has a binding.
Now, I'm only looking for the best collection type to store these bindings and the best way to search for the event's path within the static bindings.
Right now my implementation is really basic. I am using a HashMap, the key being a possible path and the value being the set of properties bound to that path.
I am looping through the key set and calling startsWith with the event's path. (The event's path needs to be a substring of a binding starting from index 0.)
For example, a path would look like this: "association1.association2.propertyInAssociation2" or "association1.association2.association3"
The binding map would look like this (it is not actually initialised like this; it's just an example):
HashMap<String, Set<String>> bindings = new HashMap<>();
{
bindings.put("association1.association2.propertyInAssociation2", new HashSet<>());
bindings.get("association1.association2.propertyInAssociation2").add("property1");
bindings.get("association1.association2.propertyInAssociation2").add("property2");
bindings.get("association1.association2.propertyInAssociation2").add("property3");
bindings.put("association1.association2.association3.propertyInAssociation3", new HashSet<>());
bindings.get("association1.association2.association3.propertyInAssociation3").add("property4");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property5");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property6");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property7");
}
So a class with these bindings that receives an event with a path like "association1.association2.association3.propertyInAssociation3" or "association1.association2.association3"
would, in both cases, need to trigger something on property4, property5, property6 and property7.
Like I said, what I need is the most efficient way to find which properties (if any) are bound to an event's path.
I use Java 8, so I don't mind using lambdas or whatever else is available.
Reworking the bindings into a collection of strings in some other format is not out of the question either, if it helps.
Thanks a lot!

Since you say
I am looping through the keyset and I use startsWith with the event's path. (The event's path needs to be a substring of a binding starting from index 0)
You should consider using a different data structure. A HashMap provides for efficient whole-key lookups, but it doesn't help much at all for partial-key lookups. You could consider instead using a SortedMap such as TreeMap. For String keys, SortedMap.tailMap() or SortedMap.subMap() will help you navigate directly to the keys you're looking for, if they are present.
Of course, insertions, deletions, and whole-key lookups are less efficient in a TreeMap than in a HashMap (on average); this is a tradeoff against the much better efficiency of key substring searching.
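As a rough illustration of the SortedMap idea, here is a minimal sketch that assumes the bindings are kept in a TreeMap with natural String ordering. The class and method names (TreeMapPrefixLookup, propertiesFor) are hypothetical and not part of the original framework. Because keys that share a prefix sit next to each other in sorted order, tailMap can jump to the first candidate and the loop can stop at the first key that no longer matches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class TreeMapPrefixLookup {

    // Collects the properties of every binding whose path starts with eventPath.
    static Set<String> propertiesFor(TreeMap<String, Set<String>> bindings, String eventPath) {
        Set<String> result = new TreeSet<>();
        // tailMap jumps to the first key >= eventPath; keys sharing the prefix are contiguous from there.
        for (Map.Entry<String, Set<String>> e : bindings.tailMap(eventPath, true).entrySet()) {
            if (!e.getKey().startsWith(eventPath)) {
                break; // past the last key with this prefix
            }
            result.addAll(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Set<String>> bindings = new TreeMap<>();
        bindings.put("association1.association2.propertyInAssociation2",
                new HashSet<>(Arrays.asList("property1", "property2", "property3")));
        bindings.put("association1.association2.association3.propertyInAssociation3",
                new HashSet<>(Arrays.asList("property4", "property5", "property6", "property7")));

        // prints [property4, property5, property6, property7]
        System.out.println(propertiesFor(bindings, "association1.association2.association3"));
    }
}
```

Note that startsWith matches on characters, not on path segments, which is the same behaviour as the loop described in the question.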

I would suggest a Stream API approach:
String path = "association1.association2.association3";
List<Map.Entry<String, Set<String>>> result =
bindings.entrySet()
.stream()
.filter(e -> e.getKey().contains(path))
.collect(Collectors.toList());

Thanks for all the replies, but I've changed my approach.
I will still use a HashMap, but instead of adding:
"association1.association2.property"
and trying to match partial keys, I will add:
"association1"
"association1.association2"
"association1.association2.property"
This way I can efficiently use the hash and since the bindings are static and generated only once for each class type, changing the algorithm of the generation has no performance cost at all.
Thanks again for all your answers.
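A minimal sketch of that prefix-expansion idea, assuming paths are dot-delimited as in the examples above. The class PrefixExpandedBindings and its methods are hypothetical names chosen for illustration, not the framework's actual API. Each property is registered under the full path and under every dot-delimited prefix, so an event needs only a single hash lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, for illustration only.
class PrefixExpandedBindings {

    private final Map<String, Set<String>> byPrefix = new HashMap<>();

    // Registers the property under the full path and under every dot-delimited prefix of it.
    void bind(String path, String property) {
        int dot = path.indexOf('.');
        while (dot >= 0) {
            byPrefix.computeIfAbsent(path.substring(0, dot), k -> new TreeSet<>()).add(property);
            dot = path.indexOf('.', dot + 1);
        }
        byPrefix.computeIfAbsent(path, k -> new TreeSet<>()).add(property);
    }

    // One hash lookup per event, with no scan over the key set.
    Set<String> propertiesFor(String eventPath) {
        return byPrefix.getOrDefault(eventPath, Collections.emptySet());
    }

    public static void main(String[] args) {
        PrefixExpandedBindings bindings = new PrefixExpandedBindings();
        bindings.bind("association1.association2.propertyInAssociation2", "property1");
        bindings.bind("association1.association2.association3.propertyInAssociation3", "property4");

        // prints [property4]
        System.out.println(bindings.propertiesFor("association1.association2.association3"));
        // prints [property1, property4]
        System.out.println(bindings.propertiesFor("association1.association2"));
    }
}
```

The trade-off is memory: a path with k segments produces k map entries, which is usually acceptable when the bindings are generated once per class type.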

My suggestion is to use a parallel stream or to implement your own Map.
Here are the tests:
Solution proposed by John (TreeMap)
The best: 6 milliseconds
String path = "association1.association2.association3";
TreeMap<String, HashSet> bindings2 = new TreeMap<String, HashSet>(new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
if (o1.equals(o2))
return 0;
if (o1.startsWith(o2))
return 1;
return -1;
}
});
{
bindings2.put("association1.association2.propertyInAssociation2", new HashSet<>());
bindings2.get("association1.association2.propertyInAssociation2").add("property1");
bindings2.get("association1.association2.propertyInAssociation2").add("property2");
bindings2.get("association1.association2.propertyInAssociation2").add("property3");
bindings2.put("association1.association2.association3.propertyInAssociation3", new HashSet<>());
bindings2.get("association1.association2.association3.propertyInAssociation3").add("property4");
bindings2.get("association1.association2.association3.propertyInAssociation3").add("property5");
bindings2.get("association1.association2.association3.propertyInAssociation3").add("property6");
bindings2.get("association1.association2.association3.propertyInAssociation3").add("property7");
}
// test 1
long time = System.currentTimeMillis();
Object result1 = bindings2.tailMap(path).entrySet().stream().filter(e -> e.getKey().contains(path))
.collect(Collectors.toList());
System.out.println(System.currentTimeMillis() - time);
System.out.println(result1);
Solution proposed by Stefan (Stream)
The best: 16 milliseconds
HashMap<String, Set<String>> bindings = new HashMap<>();
{
bindings.put("association1.association2.propertyInAssociation2", new HashSet<>());
bindings.get("association1.association2.propertyInAssociation2").add("property1");
bindings.get("association1.association2.propertyInAssociation2").add("property2");
bindings.get("association1.association2.propertyInAssociation2").add("property3");
bindings.put("association1.association2.association3.propertyInAssociation3", new HashSet<>());
bindings.get("association1.association2.association3.propertyInAssociation3").add("property4");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property5");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property6");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property7");
}
// test 1
long time = System.currentTimeMillis();
String path = "association1.association2.association3";
List<Map.Entry<String, Set<String>>> result = bindings.entrySet().stream()
.filter(e -> e.getKey().contains(path)).collect(Collectors.toList());
System.out.println(System.currentTimeMillis() - time);
result.forEach(System.out::println);
Solution proposed by Me (parallel Stream)
The best: 9 milliseconds
HashMap<String, Set<String>> bindings = new HashMap<>();
{
bindings.put("association1.association2.propertyInAssociation2", new HashSet<>());
bindings.get("association1.association2.propertyInAssociation2").add("property1");
bindings.get("association1.association2.propertyInAssociation2").add("property2");
bindings.get("association1.association2.propertyInAssociation2").add("property3");
bindings.put("association1.association2.association3.propertyInAssociation3", new HashSet<>());
bindings.get("association1.association2.association3.propertyInAssociation3").add("property4");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property5");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property6");
bindings.get("association1.association2.association3.propertyInAssociation3").add("property7");
}
// test 1
long time = System.currentTimeMillis();
String path = "association1.association2.association3";
List<Map.Entry<String, Set<String>>> result = bindings.entrySet().stream().parallel()
.filter(e -> e.getKey().contains(path)).collect(Collectors.toList());
System.out.println(System.currentTimeMillis() - time);
result.forEach(System.out::println);
These timings are not reliable with so little data. Personally, I prefer the solution proposed by Fred.
UPDATE: as suggested by Dodgy, here is a more formal test using JMH:
https://github.com/venergiac/benchmark-jmh
git clone https://github.com/venergiac/benchmark-jmh.git
mvn install
java -jar target\benchmark-0.0.1-SNAPSHOT.jar
The tests showed better throughput for the parallel stream over a HashMap, but these tests should be repeated in a more controlled environment and with longer runs.

Related

Aggregate values and convert into single type within the same Java stream

I have a class with a collection of Seed elements. One of Seed's methods returns Optional<Pair<Boolean, String>>.
I'm trying to loop over all seeds, find out whether any boolean value is true and, at the same time, create a set with all the String values. For instance, my input is in the form Optional<Pair<Boolean, String>>, and the output should be Optional<Signal>, where Signal looks like:
class Signal {
public boolean exposure;
public Set<String> alarms;
// constructor and getters (can add anything to this class, it's just a bag)
}
This is what I currently have that works:
// Seed::hadExposure yields Optional<Pair<Boolean, String>> where Pair have key/value or left/right
public Optional<Signal> withExposure() {
if (seeds.stream().map(Seed::hadExposure).flatMap(Optional::stream).findAny().isEmpty()) {
return Optional.empty();
}
final var exposure = seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.anyMatch(Pair::getLeft);
final var alarms = seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.map(Pair::getRight)
.filter(Objects::nonNull)
.collect(Collectors.toSet());
return Optional.of(new Signal(exposure, alarms));
}
Now I have time to make it better, because Seed::hadExposure could become an expensive call, so I was trying to see if I could do all of this in only one pass. I've tried reduce and collectors (Collectors.collectingAndThen, Collectors.partitioningBy, etc.), following some suggestions from previous questions, but nothing has worked so far.

It's possible to do this in a single stream() expression using map to convert the non-empty exposure to a Signal and then a reduce to combine the signals:
Signal signal = exposures.stream()
.map(exposure ->
new Signal(
exposure.getLeft(),
exposure.getRight() == null
? Collections.emptySet()
: Collections.singleton(exposure.getRight())))
.reduce(
new Signal(false, new HashSet<>()),
(leftSig, rightSig) -> {
HashSet<String> alarms = new HashSet<>();
alarms.addAll(leftSig.alarms);
alarms.addAll(rightSig.alarms);
return new Signal(
leftSig.exposure || rightSig.exposure, alarms);
});
However, if you have a lot of alarms it would be expensive because it creates a new Set and adds the new alarms to the accumulated alarms for each exposure in the input.
In a language that was designed from the ground-up to support functional programming, like Scala or Haskell, you'd have a Set data type that would let you efficiently create a new set that's identical to an existing set but with an added element, so there'd be no efficiency worries:
filteredSeeds.foldLeft((false, Set[String]())) { (result, exposure) =>
(result._1 || exposure.getLeft, result._2 + exposure.getRight)
}
But Java doesn't come with anything like that out of the box.
You could create just a single Set for the result and mutate it in your stream's reduce expression, but some would regard that as poor style because you'd be mixing a functional paradigm (map/reduce over a stream) with a procedural one (mutating a set).
Personally, in Java, I'd just ditch the functional approach and use a for loop in this case. It'll be less code, more efficient, and IMO clearer.
If you have enough space to store an intermediate result, you could do something like:
List<Pair<Boolean, String>> exposures =
seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.collect(Collectors.toList());
Then you'd only be calling the expensive Seed::hadExposure method once per item in the input list.
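For comparison, the plain for-loop alternative mentioned above could look roughly like the following sketch. Pair, Seed and Signal here are hypothetical stand-ins for the question's types (the real Pair presumably comes from a library), and treating a null left value as false is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class SinglePassSignal {

    // Hypothetical stand-ins for the question's Pair, Seed and Signal types.
    static final class Pair<L, R> {
        final L left;
        final R right;
        Pair(L left, R right) { this.left = left; this.right = right; }
        L getLeft() { return left; }
        R getRight() { return right; }
    }

    interface Seed {
        Optional<Pair<Boolean, String>> hadExposure();
    }

    static final class Signal {
        final boolean exposure;
        final Set<String> alarms;
        Signal(boolean exposure, Set<String> alarms) { this.exposure = exposure; this.alarms = alarms; }
    }

    // Single pass: hadExposure() is called exactly once per seed.
    static Optional<Signal> withExposure(List<Seed> seeds) {
        boolean anyPresent = false;
        boolean exposure = false;
        Set<String> alarms = new HashSet<>();
        for (Seed seed : seeds) {
            Optional<Pair<Boolean, String>> result = seed.hadExposure();
            if (!result.isPresent()) {
                continue; // empty results contribute nothing
            }
            anyPresent = true;
            Pair<Boolean, String> pair = result.get();
            exposure |= Boolean.TRUE.equals(pair.getLeft()); // assumption: null counts as false
            if (pair.getRight() != null) {
                alarms.add(pair.getRight());
            }
        }
        return anyPresent ? Optional.of(new Signal(exposure, alarms)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Seed> seeds = new ArrayList<>();
        seeds.add(() -> Optional.of(new Pair<>(false, "a")));
        seeds.add(Optional::empty);
        seeds.add(() -> Optional.of(new Pair<>(true, null)));
        Signal s = withExposure(seeds).get();
        System.out.println(s.exposure + " " + s.alarms); // prints true [a]
    }
}
```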

Map merge-function (shouldn't be called!?)

I don't understand why the merger() function (which determines what happens with values that are associated with the same key) is called in the rather short method posted below.
The method is supposed to group the list of search configurations by their application and sort the map keys (the applications by their names) as well as the map values (the search configurations by their names). Maybe the second stream isn't straightforward and I could or should use another approach, but nonetheless I want to understand what's happening.
Output is something along the lines:
App1
Search Config Title1
Search Config Title2
App2
Search Config Title
App3
Search Config Title1
Search Config Title2
Search Config Title3
The ApplicationInfo implementation isn't overriding int hashCode() nor boolean equals(Object).
I would have thought that the map keys in the second stream are always different for each list of search configurations. However, in one particular situation the merge function is called, and I don't understand why.
public SortedMap<ApplicationInfo, List<SearchConfigInfo>> groupByApplications(final BusinessLogicProcessingContext ctx,
final List<SearchConfigInfo> searchConfigInfos) {
requireNonNull(ctx, "The processing context must not be null.");
requireNonNull(searchConfigInfos, "The search configuration informations must not be null.");
final String lang;
final RtInfoWithTitleComparator appComp;
lang = ContextLanguage.get(ctx);
appComp = new RtInfoWithTitleComparator(lang);
final Map<ApplicationInfo, List<SearchConfigInfo>> appToSearchConfigs;
appToSearchConfigs = searchConfigInfos.stream()
.collect(groupingBy(searchConfig -> RtCache.getApplication(searchConfig.getApplicationGuid())));
return appToSearchConfigs.entrySet()
.stream()
.collect(toMap(Map.Entry::getKey,
p_entry -> _sortValueList(p_entry.getValue()),
merger(),
() -> new TreeMap<>(appComp)));
}
The general contract of a map is:
"An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value."
That's why I really wonder what happens in this case.
private static BinaryOperator<List<SearchConfigInfo>> merger() {
return (list1, list2) -> { System.out.println(RtCache.getApplication(list1.get(0).getApplicationGuid()).hashCode());
System.out.println(RtCache.getApplication(list2.get(0).getApplicationGuid()).hashCode());
System.out.println(list1.get(0).getApplicationGuid().equals(list2.get(0).getApplicationGuid()));
list1.addAll(list2);
return list1;
};
}
As I can see from the simple STDOUT debugging statements, the hash codes are different and the application GUIDs are not equal to each other.

Note that you're supplying a TreeMap as the result of the supplier function given to the Collectors.toMap() method (that's the last argument):
toMap(Map.Entry::getKey,
p_entry -> _sortValueList(p_entry.getValue()),
merger(),
() -> new TreeMap<>(appComp)));
(A supplier function provides the collection that the collector will use to contain the results - so in this case it always supplies a TreeMap.)
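A TreeMap decides whether two keys are the same by comparing them, using the comparator passed to its constructor (here appComp) or compareTo() when no comparator is given; it does not use equals() or hashCode(). That is why you can get a key collision in this case: two ApplicationInfo objects for which the comparator returns 0 count as the same key. Collisions are determined by the map the supplier creates, not by the map the entries came from.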
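The effect can be reproduced in isolation. The sketch below uses a hypothetical App class as a stand-in for ApplicationInfo (no equals() or hashCode() override) and a comparator on the title only, which is an assumption about what RtInfoWithTitleComparator does. Two distinct objects with the same title end up as one TreeMap key, so the merge function runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

class TreeMapMergeDemo {

    // Hypothetical stand-in for ApplicationInfo: no equals()/hashCode() override.
    static final class App {
        final String guid;
        final String title;
        App(String guid, String title) { this.guid = guid; this.title = title; }
    }

    // Counts how many times toMap had to merge two values for the "same" key.
    static int countMerges(List<App> apps) {
        int[] merges = {0};
        Function<App, App> key = app -> app;
        Function<App, List<String>> value = app -> new ArrayList<>(Arrays.asList(app.guid));
        BinaryOperator<List<String>> merge = (a, b) -> {
            merges[0]++;
            a.addAll(b);
            return a;
        };
        // The TreeMap compares keys by title only, so equal titles mean "same key".
        Supplier<TreeMap<App, List<String>>> supplier =
                () -> new TreeMap<>(Comparator.comparing((App app) -> app.title));
        apps.stream().collect(Collectors.toMap(key, value, merge, supplier));
        return merges[0];
    }

    public static void main(String[] args) {
        List<App> apps = Arrays.asList(
                new App("guid-1", "Same title"),
                new App("guid-2", "Same title"),
                new App("guid-3", "Other title"));
        System.out.println(countMerges(apps)); // prints 1
    }
}
```

If the keys must stay distinct, one option is a comparator that breaks ties on a unique field such as the GUID.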

Not entirely clear on what this code does? (Includes Set, HashMap and .keySet())

So I've finished a program and have had help building it/worked with another person. I understand all of the program in terms of what each line of code does except one part. This is the code:
Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
Map<String, Integer> stringIntegerMap1 = set1.stream().collect(HashMap::new,
(hashMap, s) -> hashMap.put(s, s.length()), HashMap::putAll);
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
Some background info:
getWordCountsMap is a method that looks like this:
public HashMap<String, Integer> getWordCountsMap() {
// Maps each word to the number of times it occurs in the word array.
HashMap<String, Integer> myHashMap = new HashMap<String, Integer>();
for (String word : this.getWordArray()) {
if (myHashMap.containsKey(word)) {
myHashMap.put(word, myHashMap.get(word) + 1);
} else {
myHashMap.put(word, 1);
}
}
return myHashMap;
}
firstWordGroup is an object that stores a string of words.
If anybody could explain exactly what this block of code does and how it does it then that would be helpful, thanks.
P.S: I'm not sure if supplying the whole program to reproduce the code is relevant so if you think it is, just leave a comment saying so and I will edit the question so you can reproduce the program.

getWordCountsMap() returns a map where the key is a word and the value is how many times the word occurred in the array.
Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
The keySet() method returns just the keys of the map, so now you have a set of the words, but have lost the occurrence counts.
Map<String, Integer> stringIntegerMap1 =
set1.stream()
.collect(HashMap::new,
(hashMap, s) -> hashMap.put(s, s.length()),
HashMap::putAll);
This uses Java 8 streams to iterate through the set of words that were originally put into a map and to create a new hash map, where the key is the word (as it was before) and the value is the length of the word (whereas originally it was the word count). A new hash map is created, populated and returned.
What I don't understand is the final HashMap::putAll, which would seem to take the hash map just populated and re-add all of its entries (effectively a no-op, because the keys would simply be replaced). I haven't opened my IDE to put in the code and test it, since it's not my problem, but if you haven't done so yourself, I would recommend it.
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
In essence, this is a cleaner way to iterate through the entries in the map created, printing out the word and length for each.
After working through this and thinking about it, I have a feeling I'm doing your homework for you. The real way to figure this out is to break things down and step through the code in your IDE's debugger to see what each step does.

Set<String> set1 = firstWordGroup.getWordCountsMap().keySet();
This line calls getWordCountsMap, which returns a map from words to their counts. It then ignores the counts and just takes the words as a set. Note that this could be achieved in much simpler ways.
Map<String, Integer> stringIntegerMap1 = set1.stream()
.collect(HashMap::new, (hashMap, s) -> hashMap.put(s, s.length()), HashMap::putAll);
This converts the set of words to a stream and then collects the stream. The collector starts by creating a hash map and then, for each word, adds a mapping from the word to its length. If multiple maps are created (as is allowed in streams, for example when running in parallel), they are combined using putAll. Again, there are much simpler and clearer ways to achieve this.
stringIntegerMap1.forEach((key,value) ->System.out.println(key + " : "+value));
This line iterates through all entries in the map and prints out the key and value.
All this code could have been achieved with:
Arrays.stream(getWordArray())
.distinct().map(w -> w + ":" + w.length()).forEach(System.out::println);
This command converts the words to a stream, removes duplicates, maps the words to the output string then prints them.
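To make the role of the third collect argument concrete, here is a small sketch (the class CollectCombinerDemo and the method wordLengths are hypothetical names). In a sequential stream only the supplier and the accumulator are needed; in a parallel stream each worker may fill its own HashMap, and the combiner (HashMap::putAll) merges those partial maps into one. Both modes should yield the same result.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

class CollectCombinerDemo {

    static Map<String, Integer> wordLengths(Collection<String> words, boolean parallel) {
        Stream<String> stream = parallel ? words.parallelStream() : words.stream();
        HashMap<String, Integer> result = stream.collect(
                HashMap::new,                                // supplier: creates an empty container
                (map, word) -> map.put(word, word.length()), // accumulator: adds one element
                HashMap::putAll);                            // combiner: merges two partial containers
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "map", "set", "collector", "map");
        Map<String, Integer> sequential = wordLengths(words, false);
        Map<String, Integer> parallel = wordLengths(words, true);
        System.out.println(sequential.equals(parallel)); // prints true
    }
}
```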

Performance of algorithms sorting list entries into a map

Given a list in which each entry is an object that looks like
class Entry {
public String id;
public Object value;
}
Multiple entries could have the same id. I need a map where I can access all values that belong to a certain id:
Map<String, List<Object>> map;
My algorithm to achieve this:
for (Entry entry : listOfEntries) {
List<Object> listOfValues;
if (map.containsKey(entry.id)) {
listOfValues = map.get(entry.id);
} else {
// First value for this id: create its list and register it.
listOfValues = new ArrayList<Object>();
map.put(entry.id, listOfValues);
}
listOfValues.add(entry.value);
}
Simply: I transform a list that looks like
ID | VALUE
---+------------
a | foo
a | bar
b | foobar
To a map that looks like
a--+- foo
'- bar
b---- foobar
As you can see, contains is called for each entry of the source list. That's why I wonder if I could improve my algorithm, if I pre-sort the source list and then do this:
List<Object> listOfValues = new ArrayList<Object>();
String prevId = null;
for (Entry entry : listOfEntries) {
// The id changed: store the finished list and start a new one.
if (prevId != null && !prevId.equals(entry.id)) {
map.put(prevId, listOfValues);
listOfValues = new ArrayList<Object>();
}
listOfValues.add(entry.value);
prevId = entry.id;
}
// Store the list for the last id.
if (prevId != null) map.put(prevId, listOfValues);
The second solution has the advantage that I don't need to check the map for the key on every entry, but the disadvantage that I have to sort first. Furthermore, the first algorithm is easier to implement and less error-prone, since the second one needs extra code after the actual loop.
Therefore my question is: Which method has better performance?
The examples are written in Java pseudo code but the actual question applies to other programming languages as well.

If you have a hash map and a very large number of entries, then inserting items one by one will be faster than sorting and inserting them list by list (O(n) vs. O(n log n)). If you use a tree-based map, then the complexity is the same for both approaches.
However, I really doubt you have enough entries for that to matter. At smaller sizes, memory access patterns and the speed of the compare and hash functions come into play. You have two options: ignore it, since the difference is not going to be significant, or benchmark both options and see which one works better on your system. If you don't have millions of entries, I would ignore the issue and go with whatever is easier to understand.

Don't presort. Even fast sorting algorithms like quicksort take, on average, O(n log n) for n items. Afterwards, you still need O(n) to walk the list. contains on a (hash) map takes constant time on average, so don't worry about it. Walk the list in linear time and use contains.
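In Java 8 and later, the lookup-then-insert step can be folded into a single call with Map.computeIfAbsent. The following is a minimal sketch; the Entry class mirrors the one in the question, and GroupEntries and group are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupEntries {

    // Mirrors the Entry class from the question.
    static final class Entry {
        final String id;
        final Object value;
        Entry(String id, Object value) { this.id = id; this.value = value; }
    }

    // One linear pass; computeIfAbsent creates the list the first time an id is seen.
    static Map<String, List<Object>> group(List<Entry> entries) {
        Map<String, List<Object>> map = new HashMap<>();
        for (Entry entry : entries) {
            map.computeIfAbsent(entry.id, id -> new ArrayList<>()).add(entry.value);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("a", "foo"),
                new Entry("a", "bar"),
                new Entry("b", "foobar"));
        System.out.println(group(entries));
    }
}
```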

I would like to offer another solution using streams:
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;
Map<String, List<Object>> map = listOfEntries.stream()
.collect(groupingBy(entry -> entry.id, mapping(entry -> entry.value, toList())));
This code is more declarative: it only specifies that the List should be transformed into a Map.
It is then the library's responsibility to perform the transformation efficiently.

Access to the key-value pair of a Map with one element in Java

A method of mine returns a Map<A,B>. In some clearly identified cases, the map only contains one key-value pair, effectively only being a wrapper for the two objects.
Is there an efficient / elegant / clear way to access both the key and the value? It seems overkill to iterate over the one-element entry set. I'm looking for something that would lower the brain power required of the people who will maintain this, along the lines of:
(...)
// Only one result.
else {
A leKey = map.getKey(whicheverYouWantThereIsOnlyOne); // Is there something like this?
B leValue = map.get(leKey); // This actually exists. Any Daft Punk reference was non-intentional.
}
Edit: I ended up going with @akoskm's solution below. In the end, the only satisfying way of doing this without iteration was with a TreeMap, and the overhead made that unreasonable.
It turns out there is not always a silver bullet, especially as this would be a very small rabbit to kill with it.

If you need both key/value then try something like this:
Entry<Long, AccessPermission> onlyEntry = map.entrySet().iterator().next();
onlyEntry.getKey();
onlyEntry.getValue();
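If the one-entry assumption should be checked rather than trusted, the same call can be wrapped in a small helper. This is only a sketch; SingleEntry and onlyEntry are hypothetical names, and throwing IllegalArgumentException is one possible choice among several.

```java
import java.util.Collections;
import java.util.Map;

class SingleEntry {

    // Returns the sole entry, failing fast if the "exactly one entry" assumption is violated.
    static <K, V> Map.Entry<K, V> onlyEntry(Map<K, V> map) {
        if (map.size() != 1) {
            throw new IllegalArgumentException("expected exactly one entry, found " + map.size());
        }
        return map.entrySet().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, Integer> map = Collections.singletonMap("answer", 42);
        Map.Entry<String, Integer> entry = onlyEntry(map);
        System.out.println(entry.getKey() + " -> " + entry.getValue()); // prints answer -> 42
    }
}
```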

You can use TreeMap or ConcurrentSkipListMap.
TreeMap<String, String> myMap = new TreeMap<String, String>();
String firstKey = myMap.firstEntry().getKey();
String firstValue = myMap.firstEntry().getValue();
Another way to use this:
String firstKey = myMap.firstKey();
String firstValue = myMap.get(myMap.firstKey());
This can work as an alternate solution.

There is a method called keySet() that returns the set of keys.
else {
A leKey = map.keySet().iterator().next();
B leValue = map.get(leKey); // This actually exists. Any Daft Punk reference was non-intentional.
}

Using a for-each loop and var:
for(var entry : map.entrySet()){
A key = entry.getKey();
B value = entry.getValue();
}
