Java - Search keywords list in another string list

I have a list of keywords in a List and I have data coming from some source which will be a list too.
I would like to find if any of keywords exists in the data list, if yes add those keywords to another target list.
E.g.
Keywords list = FIRSTNAME, LASTNAME, CURRENCY & FUND
Data list = HUSBANDFIRSTNAME, HUSBANDLASTNAME, WIFEFIRSTNAME, SOURCECURRENCY & CURRENCYRATE.
From the above example, I would like to make a target list with the keywords FIRSTNAME, LASTNAME & CURRENCY; FUND should not be included, as it doesn't exist in the data list.
I have a solution below that works by using two for loops (one inside another) and check with String contains method, but I would like to avoid two loops, especially one inside another.
for (int i=0; i<dataList.size();i++) {
for (int j=0; j<keywordsList.size();j++) {
if (dataList.get(i).contains(keywordsList.get(j))) {
targetSet.add(keywordsList.get(j));
break;
}
}
}
Is there any other alternate solution for my problem?

Here's a one loop approach using regex. You construct a pattern using your keywords, and then iterate through your dataList and see if you can find a match.
import java.util.*;
import java.util.regex.*;

public static void main(String[] args) throws Exception {
    List<String> keywords = Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND");
    List<String> dataList = Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME", "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE");
    Set<String> targetSet = new HashSet<>();
    // Compile the alternation once, outside the loop.
    Pattern pattern = Pattern.compile(String.join("|", keywords));
    for (String data : dataList) {
        Matcher matcher = pattern.matcher(data);
        // Loop with while so every keyword in an entry is collected, not just the first.
        while (matcher.find()) {
            targetSet.add(matcher.group());
        }
    }
    System.out.println(targetSet);
}
Results:
[CURRENCY, LASTNAME, FIRSTNAME]

Try the Aho–Corasick algorithm. It can report every occurrence of every keyword in the data (you only need to know whether each keyword appeared or not).
The complexity is O(Sum(Length(Keyword)) + Length(Data) + Count(number of matches)).
Here is the wiki-page:
In computer science, the Aho–Corasick algorithm is a string searching
algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is
a kind of dictionary-matching algorithm that locates elements of a
finite set of strings (the "dictionary") within an input text. It
matches all patterns simultaneously. The complexity of the algorithm
is linear in the length of the patterns plus the length of the
searched text plus the number of output matches.
I implemented it (about 200 lines) years ago for a similar case, and it works well.
If you only care whether a keyword appeared or not, you can modify the algorithm for your case and get a better complexity:
O(Sum(Length(Keyword)) + Length(Data)).
Implementations of the algorithm are easy to find online, but I think it's worth understanding the algorithm and implementing it yourself.
EDIT:
I think you want to eliminate the nested loops, so we need to find all keywords in a single pass. This is the set matching problem: matching a set of patterns (keywords) against a text (data). The Aho–Corasick algorithm is designed for exactly that case, and it gives a one-loop solution:
for (int i=0; i < dataList.size(); i++) {
// Ac stands for an automaton built once from keywordsList; run it on each data string.
targetSet.addAll(Ac.run(dataList.get(i)));
}
You can find an implementation here.
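As a rough illustration of the approach described above, here is a minimal Aho–Corasick sketch. The class name `AhoCorasickSketch` and its `search` method are hypothetical names chosen for this example, not the `Ac.run` from the answer and not any particular library's API. It builds a trie of the keywords, adds failure links with a breadth-first pass, and then scans each data string once.

```java
import java.util.*;

// Hypothetical helper for illustration; not a library class.
class AhoCorasickSketch {
    // One trie node: child edges, failure link, and keywords that end here.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> out = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(Collection<String> keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String kw : keywords) {
            Node n = root;
            for (char c : kw.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.out.add(kw);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure target (proper suffixes).
                child.out.addAll(child.fail.out);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every keyword that occurs in it.
    Set<String> search(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) {
                n = n.fail;
            }
            Node target = n.next.get(c);
            n = (target != null) ? target : root;
            found.addAll(n.out);
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND");
        List<String> dataList = Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME",
                "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE");
        AhoCorasickSketch ac = new AhoCorasickSketch(keywords);
        Set<String> targetSet = new TreeSet<>();   // sorted for stable printing
        for (String data : dataList) {
            targetSet.addAll(ac.search(data));
        }
        System.out.println(targetSet);             // [CURRENCY, FIRSTNAME, LASTNAME]
    }
}
```

The automaton is built once from the keywords and reused for every data string, so the remaining visible loop is only the one over `dataList`.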

Related

Iterate through only part of a large list in java

I'm trying to make a Boggle game in Java, and for my program once I randomize the board I have a method which iterates through the possible combinations and compares each one to a dictionary list to check if it's a valid word, and if yes, I put it in the key. It works fine, however the program takes three or four minutes to generate the key, which is mostly due to the size of the dictionary. The one I'm using has about 19k words and comparing every combination takes up a ton of time. Here's the part of the code I'm trying to make faster:
if (str.length()>3&&!key.contains(str)&&prefixes.contains(str.substring(0,3))&&dictionary.contains(str)){
key.add(str);
}
where str is the combination generated. prefixes is a list I generated based on dictionary that goes like this:
public void buildPrefixes(){
for (String word:dictionary){
if(!prefixes.contains(word.substring(0,3))){
prefixes.add(word.substring(0,3));
}
}
}
which just adds all the three-letter prefixes in the dictionary, such as "abb" and "mar", so that when str is gibberish like "xskfjh" it won't get checked against the whole dictionary, just against prefixes, which has something like 1k entries.
What I'm trying to do is cut down on time by iterating through only the words in the dictionary that have the same first letter as str, so if str is "abbey" then it will only check str against words that start with "a" instead of the whole list, which would cut down on time significantly. Or even better, it only checks str against words that have the same prefix. I am pretty new to Java so I would really appreciate if you're very descriptive in your answers, thanks!
What the comments are trying to say is: do not reinvent the wheel. Java is not Assembler or C, and it is powerful enough to handle such trivial cases.
Here is simple code which shows that a plain Set can handle your vocabulary easily:
import java.util.Set;
import java.util.TreeSet;
public class Work {
public static void main(String[] args) {
long startTime=System.currentTimeMillis();
Set<String> allWords=new TreeSet<String>();
for (int i=0; i<20000;i++){
allWords.add(getRandomWord());
}
System.out.println("Total words "+allWords.size()+" in "+(System.currentTimeMillis()-startTime)+" milliseconds");
}
static String getRandomWord() {
int length=3+(int)(Math.random()*10);
String r = "";
for(int i = 0; i < length; i++) {
r += (char)(Math.random() * 26 + 97);
}
return r;
}
}
On my computer it shows
Total words 19875 in 47 milliseconds
As you can see, 125 words out of 20,000 were duplicates. And those 47 milliseconds cover not only generating 20,000 words in a very inefficient way but also storing them and checking for duplicates.
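Applied to the question's own check, the same idea might look like the sketch below. It assumes `dictionary`, `prefixes`, and `key` can all be held as sets; the answer above used a `TreeSet`, and a `HashSet` is swapped in here on the assumption that ordering is not needed. The class and method names (`BoggleLookup`, `isWord`) are made up for illustration.

```java
import java.util.*;

// Hypothetical wrapper around the dictionary; names are illustrative only.
class BoggleLookup {
    private final Set<String> dictionary;
    private final Set<String> prefixes = new HashSet<>();

    BoggleLookup(Collection<String> words) {
        dictionary = new HashSet<>(words);
        // Set.add ignores duplicates, so no contains() check is needed first.
        for (String w : dictionary) {
            if (w.length() >= 3) {
                prefixes.add(w.substring(0, 3));
            }
        }
    }

    // Same condition as in the question, but each contains() is a hash lookup.
    boolean isWord(String str) {
        return str.length() > 3
                && prefixes.contains(str.substring(0, 3))
                && dictionary.contains(str);
    }
}
```

If `key` is also a `Set<String>`, the `!key.contains(str)` test from the question becomes unnecessary, since `key.add(str)` does nothing for a word that is already present.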

Looping through an ArrayList with another Arraylist in Java

I have a large array list of sentences and another array list of words.
My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.
The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
for (int k = 0; k < WORDS.size(); k++) {
if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
//Do something
}
}
}
Is there a more efficient way of doing this than a nested for loop?
There are a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.
That said, there are a couple of things to try.
First, make WORDS a HashSet, the contains method will be far quicker than for an ArrayList because it's doing a hash look-up to get the value.
Second, switch the logic about a bit like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
String sentence = sentenceIterator.next();
for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
{
if (WORDS.contains(word))
{
sentenceIterator.remove();
continue sentenceLoop;
}
}
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex) both of which should be quicker.
But bear in mind, as with all things performance you'll need to test these changes to see they improve the situation.
What you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:
The sentences array list can be very large (...). While this works for when there are not many sentences, in cases where their are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removing an element from an ArrayList takes O(N), and since you're doing this inside a loop, the whole operation can take O(N^2).
I recommend using LinkedList rather than ArrayList to store the sentences, and use Iterator rather than your naive List#get since it already offers Iterator#remove in time O(1) for LinkedList.
In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and in the end replace the contents of your original List with this new List, thus saving lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).
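A minimal sketch of the "build a new List" suggestion combined with a Set lookup follows. The class and method names are hypothetical, and it assumes the words in the set are lower-case and that splitting on non-word characters is an acceptable way to tokenize a sentence.

```java
import java.util.*;

// Hypothetical helper; assumes `words` holds lower-case entries.
class SentenceFilter {
    // Returns the sentences that contain none of the given words.
    static List<String> keepClean(List<String> sentences, Set<String> words) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            boolean hit = false;
            // Split on runs of non-word characters so punctuation is ignored.
            for (String token : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (words.contains(token)) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                kept.add(sentence);   // appending to an ArrayList is cheap
            }
        }
        return kept;
    }
}
```

Appending to the new list avoids the element shifting that each `ArrayList` removal causes, and the caller can replace the original list with the returned one.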
What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
// Copy WORDS into a HashSet once; the copy constructor replaces a manual add loop.
Set<String> wordSet = new HashSet<>(WORDS);
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
for (String word : sentenceWords) {
if (wordSet.contains(word)) {
// The sentence contains one of the special words.
// DO SOMETHING
break;
}
}
}
I would create a set of words from the second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I would then iterate over the set and the first ArrayList and use contains:
for (String word : listOfWords) {
for(String sentence : SENTENCES) {
if (sentence.contains(word)) {
// do something
}
}
}
Also, if you are free to use any open source jar, check this out:
searching string in another string
First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If words is relatively small (a few hundred items or so) you could use regex to speed things up considerably: construct a pattern like this, and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
if (!first) {
regex.append('|');
} else {
first = false;
}
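// Quote each word so regex metacharacters in it are matched literally.
regex.append(Pattern.quote(w));
}
regex.append(")\\b");
// Now regex is equivalent to "\b(?:quick|brown|fox)\b" (Pattern.quote wraps
// each word in \Q...\E), i.e. your list of words separated by OR signs,
// enclosed in a non-capturing group
// anchored to word boundaries by '\b's on both sides.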
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
if (p.matcher(SENTENCES.get(i)).find()) {
// Do something
}
}
Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.
If you have enough memory you can tokenize SENTENCES and put them in a Set. Then it would be better in performance and also more correct than current implementation.
Well, looking at your code I would suggest two things that will improve the performance from each iteration:
Remove " == true". The contains operation already returns a boolean, so it is enough for the if, comparing it with true adds one extra operation for each iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " ") as it is a quite expensive operation because + operator creates new objects. Better use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.
Besides that, for this case I do not know any other approach, maybe you can use regular expressions if you can abstract a pattern out of those words you want to remove and have then only one loop.
Hope it helps!
If you are concerned about efficiency, I think the most effective way to do this is to use the Aho-Corasick algorithm. Here you have 2 nested loops and a contains() call (which I think takes, at best, length of sentence + length of word time). Aho-Corasick gives you one loop over the sentences, and checking a sentence for all the words takes time proportional to the length of the sentence, so the word loop disappears (plus a preprocessing step to build the finite state machine, which is relatively small).
I'll approach this from a more theoretical angle. If you don't have a memory limitation, you can try to mimic the logic of counting sort.
Say M1 = sentences.size, M2 = number of words per sentence, and N = words.size.
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1.M2.N).
We can create a mapping from words to their positions in the sentences.
Loop through your ArrayList of sentences and turn it into a two-dimensional jagged array of words. Loop through the new array and create a HashMap where key, value = word, ArrayList of word positions (say with length X). That's O(2M1.M2.X) = O(M1.M2.X).
Then loop through your words ArrayList, look each word up in the HashMap, and loop through its list of positions, removing each one. That's O(N.X).
Say you need to return the result as an ArrayList of strings: we need another loop to concatenate everything. That's O(M1.M2).
Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2).
Assuming X is way smaller than N, you'll probably get better performance.

Fastest way to find String in List in Java

I want to use the fastest possible method to match a String with a String in List.
I'm iterating through a list to match the product name and set the price for that product.
I'm trying to match each of 400 000 items by name against another list where I can find the price; that list also contains 400 000 items.
Doing a "contains()" on String to match 400 000 items 400 000 times takes a long time to finish.
I also tried startsWith(), since I don't search by substring; I'm using the String because there is certainly a full match in the second list.
There has to be a faster way to find a match in the inner for loop to get the price?
ProductData t = null;
for (int i = 0; i < ParseCSV.products.size(); i++) { // List of 400K+ items
    t = ParseCSV.products.get(i);
    for (int j = 0; j < ParseCSVprice.productPrice.size(); j++) { // another List of 400K+ items
        // Index the price list with j, the inner loop variable.
        if (ParseCSVprice.productPrice.get(j).getpairID()
                .contains(t.pairID)) {
            t.price = ParseCSVprice.productPrice.get(j).getPrice();
        }
    }
}
You need to use another structure probably.
Possibly a HashMap or a HashSet.
There is no much faster way using a List.
Searching in a List is O(N).
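A sketch of the HashMap idea, assuming the IDs match exactly (the question says there is a full match in the second list). The `Product` and `PriceRow` classes below are simplified stand-ins invented for this example; they are not the question's `ProductData` or `ParseCSVprice` types.

```java
import java.util.*;

// Simplified stand-in types for illustration only.
class PriceJoin {
    static final class Product {
        final String pairID;
        double price;
        Product(String pairID) { this.pairID = pairID; }
    }

    static final class PriceRow {
        final String pairID;
        final double price;
        PriceRow(String pairID, double price) { this.pairID = pairID; this.price = price; }
    }

    static void assignPrices(List<Product> products, List<PriceRow> rows) {
        // Index the price list once: pairID -> price.
        Map<String, Double> priceById = new HashMap<>();
        for (PriceRow row : rows) {
            priceById.put(row.pairID, row.price);
        }
        // One hash lookup per product instead of a scan of the whole price list.
        for (Product p : products) {
            Double price = priceById.get(p.pairID);
            if (price != null) {
                p.price = price;
            }
        }
    }
}
```

This replaces the 400 000 x 400 000 comparisons with one pass over each list.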
If you only expect one or zero matches, you can increase the speed of your code in some cases by stopping your loops using the break keyword after you have found the match.
Also you might consider changing your id fields to contain numeric values which would be faster to compare than strings.
Because you are having to call a method on each object in the List in order to make the comparison, there isn't much else you can do to speed this up.

Faster String Matching/Iteration Method?

In the program I'm currently working on, there's one part that's taking a bit long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word (of), I want to extract all Strings from the list that contains one of the three words: "inventory", "finished", and "goods". Right now, I implemented the idea as follows:
String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();
for (int i = 0; i < listOfWords.size(); i++) {
String[] words = listOfWords.get(i).split(" ");
outerloop:
for (int j = 0; j < words.length; j++) {
for (int k = 0; k < targetWords.length; k++) {
if (words[j].equalsIgnoreCase(targetWords[k])) {
extractedStrings.add(listOfWords.get(i));
break outerloop;
}
}
}
}
The list contains over 100k words, and with this it takes roughly .4 to .8 seconds to complete the task for each target phrase. The thing is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knows of a more efficient way to complete this task. Thanks for the help in advance!
Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for this, so not affected by the size of the list.
You can take your giant list of words and add them to a hash map and then when your phrase comes in, just loop over the words in your phrase and check against the hash map. Currently you are doing a linear search and what I'm proposing would cut it down to a constant time search.
The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.
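One possible reading of the hashing suggestions, sketched as a variant: since each entry in the list may hold several words, this version hashes the small set of target words and does one constant-time lookup per word in each entry, which removes the innermost loop from the question. The class and method names are hypothetical.

```java
import java.util.*;

// Hypothetical helper; mirrors the question's variable names where possible.
class TargetWordFilter {
    static List<String> extract(List<String> listOfWords, String[] targetWords) {
        // Lower-case the targets once so lookups are case-insensitive.
        Set<String> targets = new HashSet<>();
        for (String t : targetWords) {
            targets.add(t.toLowerCase(Locale.ROOT));
        }
        List<String> extractedStrings = new ArrayList<>();
        for (String entry : listOfWords) {
            for (String word : entry.split(" ")) {
                if (targets.contains(word.toLowerCase(Locale.ROOT))) {
                    extractedStrings.add(entry);
                    break;   // one hit is enough for this entry
                }
            }
        }
        return extractedStrings;
    }
}
```

The list is still scanned once per target phrase; avoiding that as well would need an index over the list, built once and reused across phrases.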
You are passing through each of the elements of targetWords, instead of checking for all words from targetWords simultaneously. In addition, you are splitting your list of words in each iteration without really needing to, creating overhead.
I would suggest that you combine your targetWords into one (compiled) regular expression:
(?xi) # turn on comments, use case insensitive matching
\b # word boundary, i.e. start/end of string, whitespace
( # begin of group containing 'inventory' or 'finished' or 'goods'
inventory|finished|goods # bar separates alternatives
) # end of group
\b # word boundary
Don't forget to double the backslashes in your regular expression string.
import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
if (targetPattern.matcher(singleString).find()) {
extractedStrings.add(singleString);
}
}
If you are not satisfied with the speed of regular expressions - although regular expression engines are usually optimized for performance - you need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.
I'm a little confused as to whether you want the whole phrase or just single words from listOfWords. If you are trying to get each string from listOfWords that contains one of your target words, this should work for you.
String[] targetWords= new String[]{"inventory", "finished", "goods"};
List<String> listOfWords = new ArrayList<String>();
// build lookup map
Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
for(String words : listOfWords) {
for(String word : words.split(" ")) {
if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
lookupMap.get(word).add(words);
}
}
// find phrases
Set<String> extractedStrings = new HashSet<String>();
for(String target : targetWords) {
if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
}
I would try to implement it with ExecutorService to parallelize search for each word.
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html
For example with fixed thread pool size:
Executors.newFixedThreadPool(20);
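A sketch of how an ExecutorService might be used here. It is a variant of the suggestion: it splits the list into chunks, one task per chunk, rather than running one task per target word. The class and method names are hypothetical, and `targets` is assumed to hold lower-case words.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical helper; chunked parallel scan over the list.
class ParallelSearch {
    static List<String> extract(List<String> listOfWords, Set<String> targets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Ceiling division so every entry lands in exactly one chunk.
            int chunk = (listOfWords.size() + threads - 1) / threads;
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < listOfWords.size(); start += chunk) {
                List<String> slice =
                        listOfWords.subList(start, Math.min(start + chunk, listOfWords.size()));
                Callable<List<String>> task = () -> scan(slice, targets);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order to keep the original ordering.
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.addAll(f.get());
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Sequential scan of one chunk; only reads shared data.
    private static List<String> scan(List<String> slice, Set<String> targets) {
        List<String> hits = new ArrayList<>();
        for (String entry : slice) {
            for (String word : entry.split(" ")) {
                if (targets.contains(word.toLowerCase(Locale.ROOT))) {
                    hits.add(entry);
                    break;
                }
            }
        }
        return hits;
    }
}
```

Whether this pays off depends on the list size and the machine; thread start-up and result merging can outweigh the gain for small inputs, so it is worth measuring.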

Find a char optimization

So this part of the homework wants us to take a Set of Strings and return a List of Strings. In the String Set we will have email addresses, i.e. myname#uark.edu. We are to pull the first part of the email address, the name, and put it in the String List. From the above example, myname would be put into the List.
The code I currently have uses an iterator to pull a string from the Set. I then use String.contains("#") as an error check to make sure the String has a # symbol in it. I then start at the end of the string and use String.charAt() to check each char against '#'. Once it's found, I make a substring with the correct part and add it to the List.
My problem is that I wanted to use something recursive and cut down on operations. I was thinking of something that would split the string at string.length()/2 and then use String.contains("#") on the second half first. If that half contains the # symbol, it would call the function recursively on it again. If the back half does not contain the # symbol, then the front half must have it, and we would call the function recursively on that half instead.
So my problem is when I call the function recursively and send it the "substring" once I find the # symbol I will only have the index of the substring and not the index of the original string. Any ideas on how to keep track of it or maybe a command/method I should be looking at. Below is my original code. Any advice welcome.
public static List<String> parseEmail(Set<String> emails)
{
List<String> _names = new LinkedList<String>();
Iterator<String> eMailIt=emails.iterator();
while(eMailIt.hasNext())
{
String address=new String(eMailIt.next());
boolean check=true;
if(address.contains("#"))//if else will catch addresses that do not contain '#' .
{
String _address="";
for(int i=address.length(); i>0 && check; i--)
{
if('#'==address.charAt(i-1))
{
_address=new String(address.substring(0,i-1));
check=false;
}
}
_names.add(_address);
//System.out.println(_address);//fill in with correct sub string
}
else
{
//System.out.println("Invalid address");
_names.add("Invalid address");//This is whats shownn when you have an address that does not have an # in it.
} // could have it insert some other char i.e. *%# s.t. if you use the returned list it can skip over invalid emails
}
return _names;
}
**It was suggested I use the String.indexOf("#") BUT according to the API this method only gives back the first occurrence of the symbol and I have to work on the assumption that there could be multiple "#" in the address and I have to use the last one. Thank you for the suggestion though. Am looking at the other suggestion and will report back.
***So there is a string.lastindexOf() and that was what I needed.
public static List<String> parseEmail(Set<String> emails)
{
List<String> _names = new LinkedList<String>();
Iterator<String> eMailIt=emails.iterator();
while(eMailIt.hasNext())
{
String address=new String(eMailIt.next());
if(address.contains("#"))//if else will catch addresses that do not contain '#' .
{
int endex=address.lastIndexOf('#');
_names.add(address.substring(0,endex)); // the end index is exclusive, so this already drops the '#'
// System.out.println(address.substring(0,endex));
}
else
{
// System.out.println("Invalid address");
_names.add("Invalid address");//This is whats shownn when you have an address that does not have an # in it.
} // could have it insert some other char i.e. *%# s.t. if you use the returned list it can skip over invalid emails
}
return _names;
}
Don't reinvent the wheel (unless you were asked to, of course). Java already has built-in methods for what you are attempting: String.indexOf(String str) and, for the last occurrence, String.lastIndexOf(String str). Use them.
final String email = "someone#example.com";
final int atIndex = email.lastIndexOf("#");
if(atIndex != -1) {
final String name = email.substring(0, atIndex);
}
I agree to the previous two answers, if you are allowed to use the built-in functions split or indexOf then you should. However if it is part of your homework to find the substrings yourself you should definitely just go through the string's characters and stop when you found the # aka linear search.
You should under no circumstances try to do this recursively. The idea of divide and conquer should not be abused in a situation where there is nothing to gain: recursion means function-call overhead, and doing this recursively would only have a chance of being faster than a simple linear search if the substrings were searched in parallel; and even then, the synchronization overhead would kill the speedup for all but the most gigantic strings.
Unless recursion is specified in the homework, you would be best served by looking into String.split. It will split the String into a String array (if you specify it to be around '#'), and you can access both halves of the e-mail address.
