Find a complex element in a set of elements - java

I have a function that allows me to find a match between an incomplete element and at least one element in a set. An example of an incomplete element is 22.2.X.13, in which there is an item (defined with X) that could assume any value.
The goal of this function is to find at least one element in a set of elements that has 22 in the first position, 2 on the second, and 13 on the fourth.
For example, if we consider the set:
{
20.8.31.13,
32.3.29.13,
24.2.12.13,
19.2.37.13,
22.2.22.13,
27.17.22.13,
26.22.32.13,
22.3.22.13,
20.19.12.13,
17.4.37.13,
31.8.34.13
}
The function returns true, since the element 22.2.22.13 corresponds to 22.2.X.13.
My function compares each pair of elements like strings and each item of the elements as an integer:
public boolean containsElement(String element) {
StringTokenizer strow = null, st = null;
boolean check = true;
String nextrow = "", next = "";
for(String row : setOfElements) {
strow = new StringTokenizer(row, ".");
st = new StringTokenizer(element, ".");
check = true;
while(st.hasMoreTokens()) {
next = st.nextToken();
if(!strow.hasMoreTokens()) {
break;
}
nextrow = strow.nextToken();
if(next.compareTo("X") != 0) {
int x = Integer.parseInt(next);
int y = Integer.parseInt(nextrow);
if(x != y) {
check = false;
break;
}
}
}
if(check) return true;
}
return false;
}
However, it is an expensive operation, particularly if the size of the string increases. Can you suggest to me another strategy or data structure to quickly perform this operation?
My solution is closely related to strings. However, we can consider other types for elements (e.g. array, list, tree node, etc)
Thanks to all for your answers. I have tried almost all of the suggested functions; the benchmark results are:
myFunction: 0ms
hasMatch: 2ms
Stream API: 5ms
isIPMatch: 2ms
I think the main cost of the regular-expression approaches is the time needed to create the pattern and match the strings.

You want to use a regular expression (regex), which is made for exactly this kind of task. For the example above, the pattern is:
22\.2\.\d+\.13
Java 8 and higher
You can use Stream API as of Java 8 to find at least one matching the Regex using Pattern and Matcher classes:
Set<String> set = ... // the set of Strings (can be any collection)
Pattern pattern = Pattern.compile("22\\.2\\.\\d+\\.13"); // compiled Pattern
boolean matches = set.stream() // Stream<String>
.map(pattern::matcher) // Stream<Matcher>
.anyMatch(Matcher::matches); // true if at least one matches
Java 7 and lower
The approach is equivalent to the Stream API version: a for-each loop that short-circuits with a break statement as soon as a match is found.
boolean matches = false;
Pattern pattern = Pattern.compile("22\\.2\\.\\d+\\.13");
for (String str: set) {
Matcher matcher = pattern.matcher(str);
if (matcher.matches()) {
matches = true;
break;
}
}

You can solve this by approaching the problem in a regex-based manner, as suggested by Nikolas Charalambidis (+1), or you can do it differently. To avoid being redundant with another answer, I will focus on an alternative approach here, using the split method.
public boolean isIPMatch(String pattern[], String input[]) {
if ((pattern == null) || (input == null) || (pattern.length != input.length)) return false; //edge cases
for (int index = 0; index < pattern.length; index++) {
if ((!pattern[index].equals("X")) && (!pattern[index].equals(input[index]))) return false; //difference
}
return true; //everything matched
}
And you can call the method above in your loop, after converting the items to compare to String arrays via split.
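As a rough sketch of that calling loop: the wrapper name containsElement and the class name below are only illustrative, and the comparison is done on strings, so "02" and "2" would not be treated as equal the way the integer comparison in the question treats them.

```java
import java.util.Arrays;
import java.util.List;

class SplitMatchDemo {
    // Same comparison as isIPMatch above: "X" in the pattern matches any component.
    static boolean isIPMatch(String[] pattern, String[] input) {
        if (pattern == null || input == null || pattern.length != input.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (!pattern[i].equals("X") && !pattern[i].equals(input[i])) return false;
        }
        return true;
    }

    // Hypothetical wrapper: splits the pattern once, then each row, stopping at the first match.
    static boolean containsElement(Iterable<String> rows, String element) {
        String[] pattern = element.split("\\.");
        for (String row : rows) {
            if (isIPMatch(pattern, row.split("\\."))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("20.8.31.13", "22.2.22.13", "22.3.22.13");
        System.out.println(containsElement(rows, "22.2.X.13")); // true
        System.out.println(containsElement(rows, "23.X.X.13")); // false
    }
}
```

Splitting the pattern once outside the loop avoids repeating that work for every row.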

For strings, regular expressions solve the task a lot better:
private boolean hasMatch(String[] haystack, String partial) {
String patternString = partial.replace("X", "[0-9]+").replace(".", "\\.");
// "22.2.X.13" becomes "22\\.2\\.[0-9]+\\.13"
Pattern p = Pattern.compile(patternString);
for (String s : haystack) {
if (p.matcher(s).matches()) return true;
}
return false;
}
For other types of objects, it depends on their structure.
If there is some kind of order, you could consider making your elements implement Comparable - and then you can place them into a TreeSet (or as keys in a TreeMap), which will always be kept sorted. This way, you can compare only against the elements that can match: mySortedSet.subSet(fromElement, toElement) returns only the elements between those two.
If there is no order, you will simply have to compare all elements against your "pattern".
Note that strings are comparable, but their default sorting order ignores the special semantics of your .-separators. So, with some care you can implement a treeset-based approach to make the search better-than-linear.
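One way the TreeSet idea could be sketched, assuming the elements are stored as int arrays in lexicographic order (Arrays.compare requires Java 9+) and that only the fixed components before the first X are used to narrow the range. The class and method names are made up for this example, and the sketch assumes the last fixed prefix component is below Integer.MAX_VALUE.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

class SortedPrefixSearch {
    // Lexicographic order on int arrays; a proper prefix sorts before its extensions.
    static final Comparator<int[]> LEX = (a, b) -> Arrays.compare(a, b);

    static int[] parse(String s) {
        return Arrays.stream(s.split("\\.")).mapToInt(Integer::parseInt).toArray();
    }

    static boolean contains(TreeSet<int[]> sorted, String pattern) {
        String[] parts = pattern.split("\\.");
        // Length of the leading run of fixed (non-X) components.
        int fixed = 0;
        while (fixed < parts.length && !parts[fixed].equals("X")) fixed++;
        Iterable<int[]> candidates = sorted; // no fixed prefix: every element is a candidate
        if (fixed > 0) {
            int[] from = new int[fixed];
            for (int i = 0; i < fixed; i++) from[i] = Integer.parseInt(parts[i]);
            int[] to = from.clone();
            to[fixed - 1]++; // first prefix that sorts after every element starting with 'from'
            candidates = sorted.subSet(from, true, to, false);
        }
        for (int[] e : candidates) {
            if (matches(parts, e)) return true;
        }
        return false;
    }

    static boolean matches(String[] parts, int[] e) {
        if (parts.length != e.length) return false;
        for (int i = 0; i < parts.length; i++) {
            if (!parts[i].equals("X") && Integer.parseInt(parts[i]) != e[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        TreeSet<int[]> sorted = new TreeSet<>(LEX);
        for (String s : new String[] {"20.8.31.13", "22.2.22.13", "22.3.22.13", "31.8.34.13"}) {
            sorted.add(parse(s));
        }
        System.out.println(contains(sorted, "22.2.X.13")); // true
        System.out.println(contains(sorted, "22.4.X.13")); // false
    }
}
```

When the pattern starts with X there is no usable prefix, so the search falls back to a full scan.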

Other answers have already discussed using a regular expression by converting e.g. 22.2.X.13 to 22\.2\.\d+\.13 (don't forget to also escape the . or they mean "anything"). But while this will definitely be simpler and probably also a good bit faster, it does not lower the overall complexity. You still have to check each element in the set.
Instead, you might try to convert your set of IPs to a nested Map in this form:
{20: {8: {31: {13: null}}, 19: {12: {13: null}}}, 22: {2: {...}, 3: {...}}, ...}
(Of course, you should create this structure just once, and not for each search query.)
You can then write a recursive function match that works roughly as follows (pseudocode):
boolean match(ip: String, map: Map<String, Map<...>>) {
if (ip.empty) return true // done
first, rest = ip.splitfirst
if (first == "X") {
return map.values().any(submap -> match(rest, submap))
} else {
return first in map && match(rest, map[first])
}
}
This should reduce the complexity from O(n) to roughly O(log n) when the pattern contains no X. Every X forces the search to branch into all submaps, so the cost grows with the number of wildcards, up to O(n) for a pattern like X.X.X.123 (X.X.X.X is trivial again). For small sets, a regular expression might still be faster, as it has less overhead, but for larger sets the nested map should be faster.
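The pseudocode above could be turned into Java roughly as follows. This is only a sketch: the class NestedMapIndex, its Node type and the end flag are names chosen for the example, and components are kept as strings rather than parsed to integers.

```java
import java.util.HashMap;
import java.util.Map;

class NestedMapIndex {
    // One node per component; children are keyed by the component's text.
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean end; // true if a stored element ends at this node
    }

    private final Node root = new Node();

    // Build the structure once, then reuse it for every query.
    void add(String element) {
        Node node = root;
        for (String part : element.split("\\.")) {
            node = node.children.computeIfAbsent(part, k -> new Node());
        }
        node.end = true;
    }

    boolean matches(String pattern) {
        return match(pattern.split("\\."), 0, root);
    }

    private static boolean match(String[] parts, int i, Node node) {
        if (i == parts.length) return node.end; // whole pattern consumed
        if (parts[i].equals("X")) {
            // Wildcard: any child may continue the match.
            for (Node child : node.children.values()) {
                if (match(parts, i + 1, child)) return true;
            }
            return false;
        }
        Node child = node.children.get(parts[i]);
        return child != null && match(parts, i + 1, child);
    }

    public static void main(String[] args) {
        NestedMapIndex index = new NestedMapIndex();
        for (String s : new String[] {"20.8.31.13", "22.2.22.13", "22.3.22.13"}) {
            index.add(s);
        }
        System.out.println(index.matches("22.2.X.13")); // true
        System.out.println(index.matches("22.X.23.13")); // false
    }
}
```

Passing an index into the split array instead of re-splitting the string at each level keeps the recursion from allocating new strings.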


Count the Characters in a String Recursively & treat "eu" as a Single Character

I am new to Java, and I'm trying to figure out how to count the characters in a given string while treating the two-character combination "eu" as a single character and still counting every other character as one.
And I want to do that using recursion.
Consider the following example.
Input:
"geugeu"
Desired output:
4 // g + eu + g + eu = 4
Current output:
2
I've been trying a lot and still can't seem to figure out how to implement it correctly.
My code:
public static int recursionCount(String str) {
if (str.length() == 1) {
return 0;
}
else {
String ch = str.substring(0, 2);
if (ch.equals("eu")) {
return 1 + recursionCount(str.substring(1));
}
else {
return recursionCount(str.substring(1));
}
}
}
OP wants to count all characters in a string but adjacent characters "ae", "oe", "ue", and "eu" should be considered a single character and counted only once.
The code below does that:
public static int recursionCount(String str) {
int n;
n = str.length();
if(n <= 1) {
return n; // return 1 if one character left or 0 if empty string.
}
else {
String ch = str.substring(0, 2);
if(ch.equals("ae") || ch.equals("oe") || ch.equals("ue") || ch.equals("eu")) {
// consider as one character and skip next character
return 1 + recursionCount(str.substring(2));
}
else {
// don't skip next character
return 1 + recursionCount(str.substring(1));
}
}
}
Recursion explained
In order to address a particular task using Recursion, you need a firm understanding of how recursion works.
And the first thing you need to keep in mind is that every recursive solution should (either explicitly or implicitly) contain two parts: Base case and Recursive case.
Let's have a look at them closely:
Base case - a part that represents a simple edge-case (or a set of edge-cases), i.e. a situation in which recursion should terminate. The outcome for these edge-cases is known in advance. For this task, base case is when the given string is empty, and since there's nothing to count the return value should be 0. That is sufficient for the algorithm to work, outcomes for other inputs should be derived from the recursive case.
Recursive case - the part of the method where recursive calls are made and where the main logic resides. Every chain of recursive calls eventually hits the base case and starts building its return value.
In the recursive case, we need to check whether the given string starts with a particular string like "eu". For that we don't need to generate a substring (keep in mind that object creation is costly). Instead we can use the method String.startsWith(), which checks whether the bytes of the provided prefix match the bytes at the beginning of this string, which is cheaper (reminder: starting from Java 9, String is backed by an array of bytes, and each character is represented with either one or two bytes depending on the character encoding). We also don't need to worry about the length of the string, because if the string is shorter than the prefix, startsWith() returns false.
Implementation
That said, here's how an implementation might look:
public static int recursionCount(String str) {
if(str.isEmpty()) {
return 0;
}
return str.startsWith("eu") ?
1 + recursionCount(str.substring(2)) : 1 + recursionCount(str.substring(1));
}
Note that besides being able to implement a solution, you also need to evaluate its time and space complexity.
In this case, because we are creating a new string with every call, the time complexity is quadratic, O(n^2) (reminder: creating the new string requires allocating memory and copying the bytes of the original string). The worst-case space complexity would also be O(n^2).
There's a way of solving this problem recursively in linear time O(n) without generating a new string at every call. For that we need to introduce a second argument, the current index, and each recursive call should advance this index by either 1 or 2 (I'm not going to implement this solution; I'm leaving it to the OP/reader as an exercise).
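For readers who want to compare their attempt, the index-based variant might look roughly like this. It is a sketch rather than the answer author's code; the class name EuCounter and the private helper are invented for the example, and it relies on the two-argument String.startsWith(prefix, offset).

```java
class EuCounter {
    // Public entry point: start counting at index 0.
    static int recursionCount(String str) {
        return recursionCount(str, 0);
    }

    // Counts logical characters from 'index' to the end without creating substrings.
    private static int recursionCount(String str, int index) {
        if (index >= str.length()) {
            return 0; // base case: nothing left to count
        }
        // "eu" counts once but consumes two positions; anything else consumes one.
        int step = str.startsWith("eu", index) ? 2 : 1;
        return 1 + recursionCount(str, index + step);
    }

    public static void main(String[] args) {
        System.out.println(recursionCount("geugeu")); // 4
    }
}
```

The recursion depth still grows linearly with the input, so a very long string can overflow the call stack.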
In addition
In addition, here's a concise and simple non-recursive solution using String.replace():
public static int count(String str) {
return str.replace("eu", "_").length();
}
If you need to handle multiple combinations of characters (which were listed in the first version of the question), you can make use of regular expressions with String.replaceAll():
public static int count(String str) {
return str.replaceAll("ue|au|oe|eu", "_").length();
}

Find match with regex in arraylist

I'm trying to develop a function that reads an ArrayList of strings and is able to determine whether there exist at least two tuples that have the same values at a set of indices but differ at a supplementary index. I've developed a version of this function using a RegEx comparison, as follows:
public boolean checkMatching(){
ArrayList<String> rows = new ArrayList<String>();
rows.add("7,2,2,1,1");
rows.add("7,3,2,1,1");
rows.add("7,8,1,1,1");
rows.add("8,2,1,3,1");
rows.add("8,2,1,4,1");
rows.add("8,4,5,1,1");
int[] indices = new int[] {2,3};
int supplementaryIndex = 1;
String regex = "";
for(String r : rows){
String[] rt = r.split(",");
regex = "[a-zA-Z0-9,-.]*[,][a-zA-Z0-9,-.]*[,][" + rt[indices[0]] + "][,][" + rt[indices[1]] + "][,][a-zA-Z0-9,-.]*";
for(String r2 : rows){
if(r.equals(r2) == false){
if(Pattern.matches(regex, r2)){
String[] rt2 = r2.split(",");
if(rt[supplementaryIndex].equals(rt2[supplementaryIndex]) == false){
return true;
}
}
}
}
}
return false;
}
However, it is very expensive, especially if there are many rows. I've thought of creating a more complex RegEx that considers multiple choices (with the '|' condition), as follows:
public boolean checkMatching(){
ArrayList<String> rows = new ArrayList<String>();
rows.add("7,2,2,1,1");
rows.add("7,3,2,1,1");
rows.add("7,8,1,1,1");
rows.add("8,2,1,3,1");
rows.add("8,2,1,4,1");
rows.add("8,4,5,1,1");
int[] indices = new int[] {2,3};
int supplementaryIndex = 1;
String regex = "";
for(String r : rows){
String[] rt = r.split(",");
regex += "[a-zA-Z0-9,-.]*[,][a-zA-Z0-9,-.]*[,][" + rt[indices[0]] + "][,][" + rt[indices[1]] + "][,][a-zA-Z0-9,-.]*";
regex += "|"; //or
}
for(String r2 : rows){
if(Pattern.matches(regex, r2)){
//String rt2 = r.split(",");
//if(rt[supplementaryIndex].equals(rt2[supplementaryIndex]) == false){
return true;
//}
}
}
return false;
}
But the problem is that this way I can't compare the supplementary index values. Do you have any suggestions on how to define a regex that can directly satisfy this condition? Or, is it possible to leverage java streams to do this efficiently?
The main problem of your first approach is that you have two nested loops over the same list, which gets you a quadratic time complexity. To recall, that implies that the inner loop’s body gets executed 10,000 times for a list with 100 elements and 1,000,000 times for a list of 1,000 elements, and so on.
It doesn’t help to call Pattern.matches(regex, r2) in the inner loop’s body. That method exists only to support (as a delegation target) the String operation r2.matches(regex), a convenience method that performs Pattern.compile(regex).matcher(input).matches() in one go. If you have to apply the same regex multiple times, you should keep and re-use the result of Pattern.compile(regex).
But here, there is no point in using a regex at all. You have already decomposed the string using split and can access each component via a plain array access. Using this starting point to compose a regex to be applied on the string again, is complicated and expensive at the same time.
Just use something like
// return true when at least one string has the same values for indices
// but different value for supplementaryIndex
Map<List<String>,String> map = new HashMap<>();
for(String r : rows) {
String[] rt = r.split(",");
List<String> key = List.of(rt[indices[0]], rt[indices[1]]);
String old = map.putIfAbsent(key, rt[supplementaryIndex]);
if(old != null && !old.equals(rt[supplementaryIndex])) return true;
}
return false;
This loops over the list a single time, extracts the key elements from the array and composes a key for a HashMap. There are various ways to do this. But while it’s tempting to just concatenate these elements like rt[indices[0]] + "," + rt[indices[1]], which would work, using a List is preferable, as it avoids expensive string concatenation.
The code puts the value to check into the map which will return a previous value if this key has been encountered before. If so, the old and new values can be compared and the method can return immediately if they don’t match.
When you are using Java 8, you have to use Arrays.asList(rt[indices[0]], rt[indices[1]]) instead of List.of(rt[indices[0]], rt[indices[1]]).
This can be easily expanded to support variable lengths for indices, by changing
List<String> key = List.of(rt[indices[0]], rt[indices[1]]);
to
List<String> key = Arrays.stream(indices).mapToObj(i -> rt[i]).toList();
or, if you are using a Java version older than 16:
List<String> key
= Arrays.stream(indices).mapToObj(i -> rt[i]).collect(Collectors.toList());
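Put together as a self-contained method, the map-based check could look like the sketch below. The class name and the parameter list are assumptions made for the example (the question hard-codes the rows inside the method), and Collectors.toList() is used so that it also compiles on Java 8.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SupplementaryIndexCheck {
    // True if two rows agree on all 'indices' columns but differ at 'supplementaryIndex'.
    static boolean checkMatching(List<String> rows, int[] indices, int supplementaryIndex) {
        Map<List<String>, String> seen = new HashMap<>();
        for (String r : rows) {
            String[] rt = r.split(",");
            List<String> key = Arrays.stream(indices)
                    .mapToObj(i -> rt[i])
                    .collect(Collectors.toList());
            // putIfAbsent returns the value stored earlier for this key, or null if it is new.
            String old = seen.putIfAbsent(key, rt[supplementaryIndex]);
            if (old != null && !old.equals(rt[supplementaryIndex])) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "7,2,2,1,1", "7,3,2,1,1", "7,8,1,1,1",
                "8,2,1,3,1", "8,2,1,4,1", "8,4,5,1,1");
        System.out.println(checkMatching(rows, new int[] {2, 3}, 1)); // true
    }
}
```

With the sample rows, the first two share the values 2 and 1 at indices 2 and 3 but differ at index 1, so the method returns after the second row.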

Efficient alternative to nested For Loop

I am writing a profanity filter. I have two nested for loops, as shown below. Is there a better way that avoids the nested for loops and improves the time complexity?
boolean isProfane = false;
final String phraseInLowerCase = phrase.toLowerCase();
for (int start = 0; start < phraseInLowerCase.length(); start++) {
if (isProfane) {
break;
}
for (int offset = 1; offset < (phraseInLowerCase.length() - start + 1 ); offset++) {
String subGeneratedCode = phraseInLowerCase.substring(start, start + offset);
//BlacklistPhraseSet is a HashSet which contains all profane words
if (blacklistPhraseSet.contains(subGeneratedCode)) {
isProfane=true;
break;
}
}
}
Consider a Java 8 version of @Mad Physicist's implementation:
boolean isProfane = Stream.of(phrase.split("\\s+"))
.map(String::toLowerCase)
.anyMatch(w -> blacklistPhraseSet.contains(w));
or
boolean isProfane = Stream.of(phrase
.toLowerCase()
.split("\\s+"))
.anyMatch(w -> blacklistPhraseSet.contains(w));
If you want to check every possible combination of consecutive characters, then your algorithm is O(n^2), assuming that you use a Set with O(1) lookup characteristics, like a HashSet. You would probably be able to reduce this by breaking the data and the blacklist into Trie structures and walking along each possibility that way.
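A minimal sketch of the trie idea, assuming only the blacklist is stored in the trie and the phrase is walked from every start position. The BlacklistTrie class and its method names are invented for this example. Each walk stops as soon as no blacklisted word can continue, so the cost is bounded by the phrase length times the longest blacklisted word rather than by all substrings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class BlacklistTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a blacklisted word ends at this node
    }

    private final Node root = new Node();

    BlacklistTrie(Iterable<String> words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toLowerCase().toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
    }

    // True if any blacklisted word occurs anywhere inside the phrase.
    boolean containsAny(String phrase) {
        String s = phrase.toLowerCase();
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int i = start; i < s.length(); i++) {
                node = node.next.get(s.charAt(i));
                if (node == null) break;        // no blacklisted word continues this way
                if (node.terminal) return true; // a blacklisted word ends here
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BlacklistTrie trie = new BlacklistTrie(Arrays.asList("darn", "heck"));
        System.out.println(trie.containsAny("What the HECKler said")); // true
        System.out.println(trie.containsAny("hello world"));           // false
    }
}
```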
A simpler approach might be to use a heuristic like "profanity always starts and ends at a word boundary". Then you can do
isProfane = false;
for(String word: phrase.toLowerCase().split("\\s+")) {
if(blacklistPhraseSet.contains(word)) {
isProfane = true;
break;
}
}
You won't improve much on time complexity, because these approaches still iterate under the hood, but you could split the phrase on spaces and iterate over the array of words from your phrase.
Something like:
String[] arrayWords = phrase.toLowerCase().split(" ");
for(String word:arrayWords){
if(blacklistPhraseSet.contains(word)){
isProfane = true;
break;
}
}
The problem with this code is that it only matches whole words: unless your blacklist itself contains the compound words, it won't match them, whereas your code, as I understand it, will. The word "f**k" in the blacklist won't match "f**kwit" in my code, but it will in yours.

How to find duplicates inside a string?

I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (String split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}
No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check that the remaining string length is an exact multiple of that word plus the separating comma (i.e. (length - foundLength) % (foundLength + 1) == 0).
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
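Combining the three steps, a complete version might look like this sketch. The class and method names are made up, and a string without any comma is treated as holding a single value.

```java
class SingleValueCheck {
    static boolean hasOneValue(String str) {
        int comma = str.indexOf(',');
        if (comma < 0) return true; // no separator: just one value
        int startPos = comma + 1;   // first character after the first comma
        // The rest of the string must consist of whole ",word" units.
        if ((str.length() - comma) % startPos != 0) return false;
        // Each character must equal the one a full "word," unit earlier, commas included.
        for (int i = startPos; i < str.length(); i++) {
            if (str.charAt(i) != str.charAt(i - startPos)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasOneValue("test,asd,123,test")); // false
        System.out.println(hasOneValue("test,test,test"));    // true
        System.out.println(hasOneValue("bob,bobabob"));       // false
    }
}
```

No substrings or arrays are allocated, which matters when the method runs over a very large input.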
Calling split might be expensive, especially on 100 GB of data.
Consider something like the code below (NOT tested and might require a bit of tweaking of the index values, but I think you will get the idea):
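public static boolean stringHasOneValue(String string) {
    String separator = ",";
    int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
    if (firstSeparator < 0) return true; // no separator: the string holds a single value
    String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
    int lengthOfIncrement = firstValue.length() + 1; // the value plus one to account for the comma
    for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
        int end = i + firstValue.length();
        if (end > string.length()) return false; // last value is shorter than the first one
        String currentValue = string.substring(i, end);
        if (!firstValue.equals(currentValue)) return false;
        // the value must be followed by a comma or by the end of the string
        if (end < string.length() && string.charAt(end) != ',') return false;
    }
    return true;
}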
Complexity is O(n), assuming Java's implementation of substring is efficient. If it is not, you can write your own substring method that takes the required number of characters from the String.
Just for fun, here is a one-liner (@Tim's answer is more efficient):
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));

Search array for value containing all characters(in any order) and return value

I've searched high and low and finally have to ask.
I have an array containing, for example, ["123456","132457", "468591", ... ].
I have a string with a value of "46891".
How do I search through the array and find the object that contains all the characters from my string value? For example the object with "468591" contains all the digits from my string value even though it's not an exact match because there's an added "5" between the "8" and "9".
My initial thought was to split the string into its own array of numbers (i.e. ["4","6","8","9","1"] ), then to search through the array for objects containing the number, to create a new array from it, and to keep whittling it down until I have just one remaining.
Since this is likely a learning assignment, I'll give you an idea instead of an implementation.
Start by defining a function that takes two strings, and returns true if the first one contains all characters of the second in any order, and false otherwise. It should look like this:
boolean containsAllCharsInAnyOrder(String str, String chars) {
...
}
Inside the function set up a loop that picks characters ch from the chars string one by one, and then uses str.indexOf(ch) to see if the character is present in the string. If the index is non-negative, continue; otherwise, return false.
If the loop finishes without returning, you know that all characters from chars are present in str, so you can return true.
With this function in hand, set up another loop in your main function to go through elements of the array, and call containsAllCharsInAnyOrder on each one in turn.
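For reference, one possible way to fill in the function and the outer loop is sketched below. The wrapper class CharSearch and the sample data in main are only there to make the example runnable, and repeated characters in chars are not counted separately.

```java
class CharSearch {
    // True if every character of 'chars' occurs somewhere in 'str', in any order.
    static boolean containsAllCharsInAnyOrder(String str, String chars) {
        for (int i = 0; i < chars.length(); i++) {
            if (str.indexOf(chars.charAt(i)) < 0) {
                return false; // this character is missing from str
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] array = {"123456", "132457", "468591"};
        String search = "46891";
        for (String element : array) {
            if (containsAllCharsInAnyOrder(element, search)) {
                System.out.println(element); // prints 468591
            }
        }
    }
}
```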
I think you can use sets for this.
List<String> result = new ArrayList<>();
Set<String> chars = new HashSet<>(Arrays.asList(str.split("")));
for (String string : stringList) {
    Set<String> stringListChars = new HashSet<>(Arrays.asList(string.split("")));
    // the candidate must contain every character of the search string
    if (stringListChars.containsAll(chars)) {
        result.add(string);
    }
}
There is a caveat here: the set-based version ignores repeated characters, and you haven't specified how you want to handle them (for example, 1154 compared against 154 is considered a positive match). If you do want repeated characters to be accounted for, so that each occurrence has to exist in the other string, you can use a List instead of a Set and consume each matched character once:
List<String> result = new ArrayList<>();
List<String> chars = Arrays.asList(str.split(""));
for (String string : stringList) {
    // mutable copy, so that each character of the candidate can be matched only once
    List<String> remaining = new ArrayList<>(Arrays.asList(string.split("")));
    boolean allFound = true;
    for (String c : chars) {
        if (!remaining.remove(c)) { // remove(Object) returns false if c is not present
            allFound = false;
            break;
        }
    }
    if (allFound) {
        result.add(string);
    }
}
Your initial idea was a good start. Instead of an array, you can create a set, then use Guava's Sets#powerSet method to create all possible subsets, keep only those that have "46891".length() members, convert each set into a String, and look those strings up in the original array :)
You could do this with the List containsAll method. Note that Arrays.asList applied to a char[] produces a List<char[]> rather than a list of characters, so the characters are boxed via String.chars() here:
List<Character> lookingForChars = lookingForString.chars()
        .mapToObj(c -> (char) c)
        .collect(Collectors.toList());
for (String toSearchString : array) {
    List<Character> toSearchChars = toSearchString.chars()
            .mapToObj(c -> (char) c)
            .collect(Collectors.toList());
    if (toSearchChars.containsAll(lookingForChars)) {
        System.out.println("Match Found!");
    }
}
You can use String#charAt() in a nested for loop to compare your string with each of the array's elements.
This method would help you check whether a character is contained in both strings.
This is trickier than a straightforward solution.
There are better algorithms, but here is one that is easy to implement and understand.
Ways of solving:
A. Go through every char of your given string and check whether it occurs in the given array.
B. For the current char, collect a list of every string from the array that contains it.
C. Check whether there are more chars to check.
If there are, perform A again, but on the collected list (the result list).
Otherwise, return all possible matches.
try this
public static void main(String args[]) {
String[] array = {"123456", "132457", "468591"};
String search = "46891";
for (String element : array) {
boolean isPresent = true;
for (int index = 0; index < search.length(); index++) {
if(element.indexOf(search.charAt(index)) == -1){
isPresent = false;
break;
}
}
if(isPresent)
System.out.println("Element "+ element + " contains search string");
else
System.out.println("Element "+ element + " does not contain search string");
}
}
This sorts the char[]s of the search string and of the string to search in. Pretty sure (?) this is O(n log n) vs O(n^2) without sorting.
private static boolean contains(String searchMe, String searchOn){
char[] sm = searchMe.toCharArray();
Arrays.sort(sm);
char[] so = searchOn.toCharArray();
Arrays.sort(so);
boolean found = false;
for(int i = 0; i<so.length; i++){
found = false; // necessary to reset 'found' on subsequent searches
for(int j=0; j<sm.length; j++){
if(sm[j] == so[i]){
// Match! Break to the next char of the search string.
found = true;
break;
}else if(sm[j] > so[i]){ // No need to continue because they are sorted.
break;
}
}
if(!found){
// We can quit here because the arrays are sorted.
// I know if I did not find a match of the current character
// for so in sm, then no other characters will match because they are
// sorted.
break;
}
}
return found;
}
public static void main(String[] args0){
String value = "12345";
String[] testValues = { "34523452346", "1112", "1122009988776655443322",
"54321","7172839405","9495929193"};
System.out.println("\n Search where order does not matter.");
for(String s : testValues){
System.out.println(" Does " + s + " contain " + value + "? " + contains(s , value));
}
}
And the results
Search where order does not matter.
Does 34523452346 contain 12345? false
Does 1112 contain 12345? false
Does 1122009988776655443322 contain 12345? true
Does 54321 contain 12345? true
Does 7172839405 contain 12345? true
Does 9495929193 contain 12345? true
