Java Code optimization - Fastest list to search and remove?

Java Code optimization - Fastest list to search and remove? - java

so I've been working on a task. I have two giant strings, both consist of the same characters just scrambled. The task is to find the lowest possible number of changes you can make to turn the first string in the other one, while 1 change = switching neighbour chars in the string. I found a solution that works just fine but there is a problem. It works under 5 seconds only for input of about 100 000 char Strings. I need to make it work for up to 1000 000 char. I tried ArrayList, LinkedList, regular arrays, substrings and different variations of the algorythm, this one is the best so far I came up with but I'm out of ideas. Any help? Is there a faster collection I can use? Maybe the algoryth is wrong here?
"jas" ArrayList is the first string converted into a list
"mal" is the other one. "steps" is the output
int steps=0;
int index=0;
while(jas.size()>1) {
if(jas.get(0)!=mal.get(index)) {
int distance = jas.indexOf(mal.get(index));
jas.remove(distance);
steps+=distance;
} else {
jas.remove(0);
}
index++;
}
System.out.println(steps);
Thanks in advance for any ideas!

I know its a bit off topic here (you should go to codereview) but idea I got is to use something already implemented in Java like:
listToSort.sort(Comparator.comparing(listWithOrder::indexOf));
Go to functions the function and look at them and you can either take them out and create your own based on that and put a counting there or inspire by it.
I believe that whatever is implemented there is probably very fast.

Related

Efficient way for checking if a string is present in an array of strings [duplicate]

This question already has answers here:
How do I determine whether an array contains a particular value in Java?
(30 answers)
Closed 2 years ago.
I'm working on a little project in java, and I want to make my algorithm more efficient.
What I'm trying to do is check if a given string is present in an array of strings.
The thing is, I know a few ways to check if a string is present in an array of strings, but the array I am working with is pretty big (around 90,000 strings) and I am looking for a way to make the search more efficient, and the only ways I know are linear search based, which is not good for an array of this magnitude.
Edit: So I tried implementing the advices that were given to me, but the code i wrote accordingly is not working properly, would love to hear your thoughts.`
public static int binaryStringSearch(String[] strArr, String str) {
int low = 0;
int high = strArr.length -1;
int result = -1;
while (low <= high) {
int mid = (low + high) / 2;
if (strArr[mid].equals(str)) {
result = mid;
return result;
}else if (strArr[mid].compareTo(str) < 0) {
low = mid + 1;
}else {
high = mid - 1;
}
}
return result;
}
Basically what it's supposed to do is return the index at which the string is present in the array, and if it is not in the array then return -1.

So you have a more or less fixed array of strings and then you throw a string at the code and it should tell you if the string you gave it is in the array, do I get that right?
So if your array pretty much never changes, it should be possible to just sort them by alphabet and then use binary search. Tom Scott did a good video on that (if you don't want to read a long, messy text written by someone who isn't a native english speaker, just watch this, that's all you need). You just look right in the middle and then check - is the string you have before or after the string in the middle you just read? If it is already precisely the right one, you can just stop. But in case it isn't, you can eliminate every string after that string in case it's after the string you want to find, otherwise every string that's before the just checked string. Of course, you also eliminate the string itself if it's not equal because - logic. And then you just do it all over again, check the string in the middle of the ones which are left (btw you don't have to actually delete the array items, it's enough just to set a variable for the lower and upper boundary because you don't randomly delete elements in the middle) and eliminate based on the result. And you do that until you don't have a single string in the list left. Then you can be sure that your input isn't in the array. So this basically means that by checking and comparing one string, you can't just eliminate 1 item like you could with checking one after the other, you can remove more then half of the array, so with a list of 256, it should only take 8 compares (or 9, not quite sure but I think it takes one more if you don't want to find the item but know if it exists) and for 65k (which almost matches your number) it takes 16. That's a lot more optimised.
If it's not already sorted and you can't because that would take way too long or for some reason I don't get, then I don't quite know and I think there would be no way to make it faster if it's not ordered, then you have to check them one by one.
Hope that helped!
Edit: If you don't want to really sort all the items and just want to make it a bit (26 times (if language would be random)) faster, just make 26 arrays for all letters (in case you only use normal letters, otherwise make more and the speed boost will increase too) and then loop through all strings and put them into the right array matching their first letter. That way it is much faster then sorting them normally, but it's a trade-off, since it's not so neat then binary search. You pretty much still use linear search (= looping through all of them and checking if they match) but you already kinda ordered the items. You can imagine that like two ways you can sort a buncha cards on a table if you want to find them quicker, the lazy one and the not so lazy one. One way would be to sort all the cards by number, let's just say the cards are from 1-100, but not continuously, there are missing cards. But nicely sorting them so you can find any card really quickly takes some time, so what you can do instead is making 10 rows of cards. In each one you just put your cards in some random order, so when someone wants card 38, you just go to the third row and then linearly search through all of them, that way it is much faster to find items then just having them randomly on your table because you only have to search through a tenth of the cards, but you can't take shortcuts once you're in that row of cards.

Depending on the requirements, there can be so many ways to deal with it. It's better to use a collection class for the rich API available OOTB.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and the insertion order does not matter: Use Set<String> set = new HashSet<>() and then you can use Set#contains to check the presence of a particular string.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and also the insertion order needs to be preserved: Use Set<String> set = new LinkedHashSet<>() and then you can use Set#contains to check the presence of a particular string.
Can the list contain duplicate strings. If yes, you can use a List<String> list = new ArrayList<>() to benefit from its rich API as well as get rid of the limitation of fixed size (Note: the maximum number of elements can be Integer.MAX_VALUE) beforehand. However, a List is navigated always in a sequential way. Despite this limitation (or feature), the can gain some efficiency by sorting the list (again, it's subject to your requirement). Check Why is processing a sorted array faster than processing an unsorted array? to learn more about it.

You could use a HashMap which stores all the strings if
Contains query is very frequent and lookup strings do not change frequently.
Memory is not a problem (:D) .

is this use of objects redundant and/or inefficient?

I'm fairly inexperienced with using objects so I would really like some input.
I'm trying to remove comments from a list that have certain "unwanted words" in them, both the comments and the list of "unwanted words" are in ArrayList objects.
This is inside of a class called FormHelper, which contains the private member comments as an ArrayList, the auditList ArrayList is created locally in a member function called populateComments(), which then calls this function (below). PopulateComments() is called by the constructor, and so this function only gets called once, when an instance of FormHelper is created.
private void filterComments(ArrayList <String> auditList) {
for(String badWord : auditList) {
for (String thisComment : this.comments) {
if(thisComment.contains(badWord)) {
int index = this.comments.indexOf(thisComment);
this.comments.remove(index);
}
}
}
}
something about the way I implemented this doesn't feel right, I'm also concerned that I'm using ArrayList functions inefficiently. Is my suspicion correct?

It is not particularly efficient. However, finding a more efficient solution is not straightforward.
Lets step back to a simpler problem.
private void findBadWords(List <String> wordList, List <String> auditList) {
for(String badWord : auditList) {
for (String word : wordList) {
if (word.equals(badWord)) {
System.err.println("Found a bad word");
}
}
}
}
Suppose that wordList contains N words and auditList contains M words. Some simple analysis will show that the inner loop is executed N x M times. The N factor is unavoidable, but the M factor is disturbing. It means that the more "bad" words you have to check for the longer it takes to check.
There is a better way to do this:
private void findBadWords(List <String> wordList, HashSet<String> auditWords) {
for (String word : wordList) {
if (auditWords.contains(word))) {
System.err.println("Found a bad word");
}
}
}
Why is that better? It is better (faster) because HashSet::contains doesn't need to check all of the audit words one at a time. In fact, in the optimal case it will check none of them (!) and the average case just one or two of them. (I won't go into why, but if you want to understand read the Wikipedia page on hash tables.)
But your problem is more complicated. You are using String::contains to test if each comment contains each bad word. That is not a simple string equality test (as per my simplified version).
What to do?
Well one potential solution is to split the the comments into an array of words (e.g. using String::split and then user the HashSet lookup approach. However:
That changes the behavior of your code. (In a good way actually: read up on the Scunthorpe problem!) You will now only match the audit words is they are actual words in the comment text.
Splitting a string into words is not cheap. If you use String::split it entails creating and using a Pattern object to find the word boundaries, creating substrings for each word and putting them into an array. You can probably do better, but it is always going to be a non-trivial calculation.
So the real question will be whether the optimization is going to pay off. That is ultimately going to depend on the value of M; i.e. the number of bad words you are looking for. The larger M is, the more likely it will be to split the comments into words and use a HashSet to test the words.
Another possible solution doesn't involve splitting the comments. You could take the list of audit words and assemble them into a single regex like this: \b(word-1|word-2|...|word-n)\b. Then use this regex with Matcher::find to search each comment string for bad words. The performance will depend on the optimizing capability of the regex engine in your Java platform. It has the potential to be faster than splitting.
My advice would be to benchmark and profile your entire application before you start. Only optimize:
when the benchmarking says that the overall performance of the requests where this comment checking occurs is concerning. (If it is OK, don't waste your time optimizing.)
when the profiling says that this method is a performance hotspot. (There is a good chance that the real hotspots are somewhere else. If so, you should optimize them rather than this method.)
Note there is an assumption that you have (sufficiently) completed your application and created a realistic benchmark for it before you think about optimizing. (Premature optimization is a bad idea ... unless you really know what you are doing.)

As a general approach, removing individual elements from an ArrayList in a loop is inefficient, because it requires shifting all of the "following" elements along one position in the array.
A B C D E
^ if you remove this
^---^ you have to shift these 3 along by one
/ / /
A C D E
If you remove lots of elements, this will have a substantial impact on the time complexity. It's better to identify the elements to remove, and then remove them all at once.
I suggest that a neater way to do this would be using removeIf, which (at least for collection implementations such as ArrayList) does this "all at once" removal:
this.comments.removeIf(
c -> auditList.stream().anyMatch(c::contains));
This is concise, but probably quite slow because it has to keep checking the entire comment string to see if it contains each bad word.
A probably faster way would be to use regex:
Pattern p = Pattern.compile(
auditList.stream()
.map(Pattern::quote)
.collect(joining("|")));
this.comments.removeIf(
c -> p.matcher(c).find());
This would be better because the compiled regex would search for all of the bad words in a single pass over each comment.
The other advantage of a regex-based approach is that you can check case insensitively, by supplying the appropriate flag when compiling the regex.

Multiple arithimetic operator store

I have been thinking to solve any problem like 1+2*4-5 with user entering it and program to solve it. I've read some questions on this site about storing arithmetic operator and the solution says to check by using switch which can't be applied here. I would be thankful if anybody could suggest any idea of how to make it.

I had a similar exercise not long ago, but in the question it was stated that the seperation is a space. So the user input would be 1 + 2 * 4 - 5, and i solved it that way. I will give you some tips but not paste the whole code.
-you read the input as a String
-you can use the String.split() method to devide the String into the pieces you need and they will be put in an array.(in this case: strArray[0]='1',strArray[1]='+', etc)
-you will need a for-loop to go trough every String in the array:
-the decimals will need to be converted to integers with the Integer.parseInt() method.
-The + - * / will need to be put in switch-statement.
(be careful how you construct your loop, think about how many times you want to go trough it and what you need in each loop)
I hope these tips helped.

Would Java indexOf (brute force method) be more practical for me or some other substring algorithm?

I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.
Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.
So for example, I want to find the number of food related words that are used in an average text message log such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave out English examples, I will actually be using Korean for my program. Each of these designated words will have their own respective score, which I put in a hashmap as key and value separately. I then show the top scorers for food related words as well as the most frequent words used by those users for food words.
My current method is to eliminate each line of text by whitespaces, and process each individual word from the haystack by using contains method (which uses the indexOf method and the naive substring search algorithm) of the haystack contains the pattern.
wordFromInput.contains(wordFromPattern);
To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.
Originally, I tried to use a hash map and to merely get the pattern instead of searching for it in the ArrayList, but I then realized that is...
not possible with hash table
for what I am trying to do with a substring.
I've looked around through Stackoverflow and found a lot of helpful and related questions, such as these two:
1 and 2. I'm somewhat more familiar with the various string algorithms (Boyer Moore, KMP, etc.)
I initially thought then that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that my case (short pattern, short text), might actually be more effective with the naive method. But I wanted to know if there was something that I was neglecting completely.
Here is a snippet of my code though if anyone wants to see my issue more concretely.
While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.
So to clean it up:
lines of text from chat logs (2000-10,000+), haystack
1600+ patterns, needle(s)
mostly using Korean characters, although some English is included
Brute force naive method is simply too slow, but debating whether there are other alternatives and even if there are, whether they are practical given the nature of short patterns and text.
I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.

You can replace the hashtable with a Trie.
Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.
This appraoch is O(C) where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once. Thus this approach should be as good as you can get at least in terms of big O.
However, it sounds like you may not want to list all of the possible words you are searching for. Therefore, you might want to simply use you could build a counting Trie from all of the words. If nothing else that'll probably make it easier for any pattern matching algorithm you use. Although, it might require some modifications to the Trie.

What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.
There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
Hope this helps!

I'm pretty sure string.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.
So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.
The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.

You can build an index of the words you need to match and count them as you process them. If you can use a HashMap to lookup the patterns for each word, the cost will be O(n * m)
You can use a HashMap for all the possible words, you can then dissect the words later.
e.g. say you need to match red and apple, you can combine the sum of
redapple = 1
applered = 0
red = 10
apple = 15
This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)

I don't know Korean so I imagine the same strategies used to tinker with Strings in Korean isn't necessarily possible in the way it is with English, but perhaps this strategy in pseudocode can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but for example, in Korean is it still highly likely for the letters "ough" to be in succession? Are there even letters for "ough"? But with that being said, hopefully the principle can be applied
I would use String.toCharArray to create a two-dimensional array (or ArrayList if variable size needed). The
if (first letter of word matches keyword's first letter)//we have a candidate
skip to last letter of the current word //see comment below
if(last letter of word matches keyword's last letter)//strong candidate
iterate backwards to start+1 checking remainder of letters
The reason I suggest to skip to the last letter is because statistically a "consonant, vowel" for the first two letters of a word is significantly high, especially nouns, which will consist of alot of your keywords since any food is a noun (almost all the keyword examples you gave were matched that structure of consonant, vowel). And since there are only 5 vowels(plus y), the likelihood of the second letter "i" showing up in the keyword "pizza" is inherently highly likely, yet after that point there is still a good chance that the word may turn out to not be a match.
However if you know that the first letter and the last letter match, then you probably have a much stronger candidate and can then iterate in reverse. I think over larger sets of data, this would eliminate candidates much faster than checking letters in order. Basically you'd be letting too many fake candidates past the second iteration, thus increasing your overall conditional operations. It might sound like something small, but in a project like this there's lots of reiterating, so micro-optimizations will accumulate very quickly.
If this approach can be applied in a language that's probably structurally very different from English(I'm speaking from ignorance here though), then I think it might provide some efficiency for you whether you make it happen through iterating a char array or with a scanner, or any other construct.

The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.
At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary but believe me most of them will terminate immediately anyway so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt at each step so they are close to the ultimate in light-weight.
Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.
private static class Matcher {
private final int where;
private final String s;
private int i = 0;
public Matcher ( String s, int where ) {
this.s = s;
this.where = where;
}
public boolean match(char ch) {
return s.charAt(i++) == ch;
}
public int matched() {
return i == s.length() ? where: -1;
}
}
// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};
// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";
public void test() {
// Use a LinkedList because it is O(1) to remove anywhere.
List<Matcher> matchers = new LinkedList<> ();
int pos = 0;
for ( char c : test.toCharArray()) {
// Fire off all of the matchers at this point.
for ( String s : watchFor ) {
matchers.add(new Matcher(s, pos));
}
// Discard all matchers that fail here.
for ( Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
Matcher m = i.next();
// Should it be removed?
boolean remove = !m.match(c);
if ( !remove ) {
// Still matches! Is it complete?
int matched = m.matched();
if ( matched >= 0 ) {
// Todo - Should use getters.
System.out.println(" "+m.s +" found at "+m.where+" active matchers "+matchers.size());
// Complete!
remove = true;
}
}
// Remove it where necessary.
if ( remove ) {
i.remove();
}
}
// Step pos to keep track.
pos += 1;
}
}
prints
flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2
There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.

This is a pretty broad question, so I won't go into too much detail, but roughly:
Pre-process the haystacks using something like broad lemmatizer to create "topic word only" versions of the messages by noting which topics all words in it cover. For example, any occurrences of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the "topic" word "food" to be collected for that message. Some words may have multiple topics, eg "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.
After this process, you'll have haystacks consisting of only "topic" words. Then create a Map<String, Set<Integer>> and populate it with the topic word and the Set of chat message ids that contain it. This is reverse index of topic word to the chat messages that contain it.
The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):
private Map<String, Set<Integer>> index; // pre-populated
Set<Integer> search(String... topics) {
Set<Integer> results = null;
for (String topic : topics) {
Set<Integer> hits = index.get(topic);
if (hits == null)
return Collections.emptySet();
if (results == null)
results = new HashSet<Integer>(hits);
else
results.retainAll(hits);
if (results.isEmpty())
return Collections.emptySet(); // exit early
}
return results;
}
This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.

Binary Search multidimensional array in string

Actually, I'm working on my homework assignment. And, I'm really stuck.
I need to learn Java the right way. My teacher hasn't been teaching us about Binary Search with String. So, I had to end up at least few hours researching about the topic.
I need some simple explanation and code.
for example :
String[][] data={{"John abc","123"},{"Nike cbd","321"}};
I need input for searching 'John' and it will show the output 'John abc, 123'.
Can somebody suggest some guidance on the principles of binary search?

Strings can be sorted and compared just like numbers, using alphabetic string comparison. let's assume only English for simplicity, "ABD" is bigger than "ABC" and so forth.
So any binary search algorithm example that you find for numbers will work on strings, provided that the list you have is sorted of course. the idea is simple of course - narrow your candidates in half for each iteration until you find the right one.

Arrays.binarySearch is currently supported for one dimensional array.
So you have to narrow down your array to one dimension then call binarySearch().
Example:
for(String[] oneDimension : multiDimension ){
Arrays.sort(oneDimension);
Arrays.binarySearch(oneDimension, 'search-field');
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.