Finding a loose match for a string in arraylist - java

I have a huge ArrayList with 1000 entries, one of which is "world". I also have the phrase "big world", and I want "big world" to match the entry "world" in the ArrayList.
What is the most cost-effective way of doing this? I cannot use the .contains() method of ArrayList, and if I traverse all 1000 entries and match each one by pattern, it is going to be very costly in terms of performance. I am using Java for this.
Could you please let me know the best way to do this?
Cheers,
J

You can split every element of the ArrayList into words and stop as soon as one of them matches.
Judging by your profile you develop in Java; with Lucene you could easily do something like this:
public class NodesAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(reader)
        TokenFilter lowerCaseFilter = new LowerCaseFilter(tokenizer)
        TokenFilter stopFilter = new StopFilter(lowerCaseFilter, Data.stopWords.collect{ it.text } as String[])
        SnowballFilter snowballFilter = new SnowballFilter(stopFilter, new org.tartarus.snowball.ext.ItalianStemmer())
        return snowballFilter
    }
}
Analyzer analyzer = new NodesAnalyzer()
TokenStream ts = analyzer.tokenStream(null, new StringReader(str));
Token token = ts.next()
while (token != null) {
    String cur = token.term()
    token = ts.next();
}
Note: this is Groovy code copied from a personal project, so you will have to translate things like Data.stopWords.collect{ it.text } as String[] to plain Java.
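For reference, here is a minimal plain-Java sketch of the simpler word-splitting idea mentioned at the start of this answer, without Lucene. The class and method names (LooseMatch, findLooseMatch) are illustrative only:

import java.util.List;

public class LooseMatch {
    /**
     * Returns the first list entry that appears as a whole word in the query
     * (e.g. the entry "world" matches the query "big world"), or null if none does.
     */
    public static String findLooseMatch(List<String> entries, String query) {
        String[] queryWords = query.toLowerCase().split("\\s+");
        for (String entry : entries) {
            for (String word : queryWords) {
                if (entry.equalsIgnoreCase(word)) {
                    return entry; // stop as soon as a match is found
                }
            }
        }
        return null;
    }
}

This still visits the list entries in the worst case, which is O(n); if the entries are single words and the list does not change, you could instead put them in a HashSet once and test each query word with contains().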

Assuming you don't know the content of the ArrayList elements, you will have to traverse the whole ArrayList.
Traversing the ArrayList costs O(n).
Sorting the ArrayList wouldn't help, because you are searching for a string within a set of strings, and the sort itself would be more expensive anyway: O(n log n).

If you have to search the list repeatedly, it may make sense to use the sort() and binarySearch() methods of Collections.
Addendum: As noted by @user177883, the cost of an O(n log n) sort must be weighed against the benefit of subsequent O(log n) searches.
Note, however, that the question asks for loose matches: the word "heart" should match the [contained] word "ear", for example.
As binary search only finds exact matches, this approach would be inadequate on its own.
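For completeness, a minimal sketch of the sort-then-binary-search pattern (exact matches only, so it does not solve the loose-match requirement by itself):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedLookup {
    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList("world", "hello", "java"));
        Collections.sort(entries);                              // O(n log n), done once
        int index = Collections.binarySearch(entries, "world"); // O(log n) per lookup
        System.out.println(index >= 0);                         // true if an exact match exists
    }
}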

I had a very similar issue.
Solved it by using this if/else if statement.
if (myArrayList.contains(wordThatIsEntered)
        && wordThatCantBeMatched.equals(wordThatIsEntered)) {
    Toast.makeText(getApplicationContext(),
            "WORD CAN'T BE THE SAME OR THAT WORD ISN'T HERE",
            Toast.LENGTH_SHORT).show();
} else if (myArrayList.contains(wordThatIsEntered)) {
    Toast.makeText(getApplicationContext(),
            "FOUND THE EXACT WORD YOU ARE LOOKING FOR!",
            Toast.LENGTH_SHORT).show();
}

Related

Word list search with hashmap in Java

I have a word list with more than 50,000 words. As you can see, I read the words and add them to an ArrayList, but after that, looking a word up is very slow. That's why a HashMap came to mind: I want to read my words in and, when I receive a word from the user, check whether it is in the HashMap. Even though I did some research, I could not find out exactly how to do this. How can I do it?
public ArrayList<String> wordReader() throws FileNotFoundException {
    File txt = new File(path);
    Scanner scanner = new Scanner(txt);
    ArrayList<String> words = new ArrayList<String>();
    while (scanner.hasNextLine()) {
        String data = scanner.nextLine();
        words.add(data);
    }
    scanner.close();
    return words;
}
If I have understood your problem correctly, you're having performance issues when traversing an ArrayList of 50,000 words to check whether a specific word exists in the list or not.
This is because looking for an element in an unsorted List has O(n) complexity. You could improve performance by employing a sorted data structure like a BST (a Binary Search Tree), which brings the search operation down to O(log n) complexity.
Your idea of using a Map is also definitely viable, since a HashMap offers add and get operations with a complexity between O(1) (for a theoretically perfect hashing algorithm with no collisions at all among the keys) and O(n) (for a bad hashing algorithm with a high chance of collision). Furthermore, since Java 8 the HashMap implementation includes an optimization: under high-collision conditions, with multiple elements added to the same bucket, the bucket's data structure is switched from a list to a balanced tree, giving O(log n) complexity in the worst case.
https://www.logicbig.com/tutorials/core-java-tutorial/java-collections/java-map-cheatsheet.html
However, using a HashMap for what I assume is a dictionary (only distinct words) could be unnecessary, since you would use each word as both key and value. Instead of a HashMap, you could use a Set as others have pointed out, or better a HashSet. In fact, a HashSet is implemented via a HashMap instance under the hood, which gives you all the performance advantages discussed above (this is why I wrote that preface).
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/HashSet.html
Your implementation could look like this:
public Set<String> wordReader(String path) throws FileNotFoundException {
    File txt = new File(path);
    Scanner scanner = new Scanner(txt);
    Set<String> words = new HashSet<>();
    while (scanner.hasNextLine()) {
        String data = scanner.nextLine();
        words.add(data);
    }
    scanner.close();
    return words;
}

public boolean isWordContained(Set<String> set, String word) {
    return set.contains(word);
}
Since you will be checking whether the word input is present in your list of words read from the file, you can use a HashSet<String> instead of using an ArrayList<String>.
Your method would then become
public HashSet<String> wordReader() throws FileNotFoundException {
    File txt = new File(path);
    Scanner scanner = new Scanner(txt);
    HashSet<String> words = new HashSet<String>();
    while (scanner.hasNextLine()) {
        String data = scanner.nextLine();
        words.add(data);
    }
    scanner.close();
    return words;
}
Now after you read the word input, you can check whether it is present in the HashSet. This would be a much faster operation as lookup would take constant time.
public boolean isWordPresent(String word, HashSet<String> words) {
    return words.contains(word);
}
As a side note, HashSet internally uses a HashMap to perform the operations.
I would use a Set, not a List, since sets automatically ignore duplicates: add() returns true and inserts the element if it was not already present, and returns false otherwise.
public Set<String> wordReader() throws FileNotFoundException {
    File txt = new File(path);
    Scanner scanner = new Scanner(txt);
    Set<String> words = new HashSet<>();
    while (scanner.hasNextLine()) {
        String data = scanner.nextLine();
        if (!words.add(data)) {
            // present - do something
        }
    }
    scanner.close();
    return words;
}
Because sets are not ordered, they are not random-access collections. If you need index-based access, you can copy the set into a list as follows:
Set<String> words = wordReader();
List<String> wordList = new ArrayList<>(words);
Now you can retrieve them with an index.
You may also want to make your method more versatile by passing the file name as an argument.

Efficient way to test if a string is substring of any in a list of strings

I want to know the best way to compare a string to a list of strings. Here is the code I have in my mind, but it's clear that it's not good in terms of time complexity.
for (String large : list1) {
    for (String small : list2) {
        if (large.contains(small)) {
            // DO SOMETHING
        } else {
            // NOT FOR ME
        }
    }
    // FURTHER MANIPULATION OF STRING
}
Both lists can contain more than a thousand values, so the worst-case complexity rises to 1000 × 1000 × length, which is a mess. I want to know the best way to compare a string against a list of strings in the scenario above.
You could just do this:
for (String small : list2) {
    if (set1.contains(small)) {
        // DO SOMETHING
    } else {
        // NOT FOR ME
    }
}
set1 should hold the larger collection of strings; instead of keeping it as a List<String>, store it in a Set<String>, e.g. a HashSet<String>.
Thanks to the first answer by sandeep. Here is the solution:
List<String> firstCollection = new ArrayList<>();
Set<String> secondCollection = new HashSet<>();
// POPULATE BOTH LISTS HERE.
for (String string : firstCollection) {
    if (secondCollection.contains(string)) {
        // YES, THE STRING IS THERE IN THE SECOND LIST
    } else {
        // NOPE, THE STRING IS NOT THERE IN THE SECOND LIST
    }
}
This is, unfortunately, a difficult and messy problem, because you're checking whether a small string is a substring of a bunch of large strings, instead of checking whether the small string is equal to any of the large strings.
The best solution depends on exactly what problem you need to solve, but here is a reasonable first attempt:
In a temporary buffer, concatenate all the large strings together (ideally with a separator character that cannot occur inside them), then construct a suffix tree over this long concatenated string. With this structure, all substring matches of any given small string among all the large strings can be found quickly.
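A suffix tree is fiddly to implement by hand. As an alternative sketch in the same spirit, but indexing the small strings instead of the large ones, an Aho-Corasick automaton can mark every small string that occurs as a substring while scanning each large string only once. The class and method names below are illustrative only:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiSubstringMatcher {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<Integer> out = new ArrayList<>(); // indices of small strings ending here
    }

    private final Node root = new Node();
    private final int patternCount;

    MultiSubstringMatcher(List<String> smallStrings) {
        patternCount = smallStrings.size();
        // 1. Build a trie of the small strings.
        for (int i = 0; i < smallStrings.size(); i++) {
            Node cur = root;
            for (char c : smallStrings.get(i).toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.out.add(i);
        }
        // 2. Breadth-first pass to set failure links (longest proper suffix that is also in the trie).
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                child.out.addAll(child.fail.out);
                queue.add(child);
            }
        }
    }

    /** Scans every large string once and returns, per small string, whether it occurred anywhere. */
    boolean[] matchAgainst(List<String> largeStrings) {
        boolean[] found = new boolean[patternCount];
        for (String large : largeStrings) {
            Node cur = root;
            for (char c : large.toCharArray()) {
                while (cur != root && !cur.next.containsKey(c)) {
                    cur = cur.fail;
                }
                cur = cur.next.getOrDefault(c, root);
                for (int idx : cur.out) {
                    found[idx] = true;
                }
            }
        }
        return found;
    }
}

Construction costs roughly the total length of the small strings, and each large string is then scanned once, so the 1000 × 1000 nested comparison is avoided.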

Java: Getting the 500 most common words in a text via HashMap

I'm storing my word counts in the values of a HashMap; how can I then get the top 500 words in the text?
public ArrayList<String> topWords(int numberOfWordsToFind, ArrayList<String> theText) {
    //ArrayList<String> frequentWords = new ArrayList<String>();
    ArrayList<String> topWordsArray = new ArrayList<String>();
    HashMap<String, Integer> frequentWords = new HashMap<String, Integer>();
    int wordCounter = 0;
    for (int i = 0; i < theText.size(); i++) {
        if (frequentWords.containsKey(theText.get(i))) {
            // find value and increment
            wordCounter = frequentWords.get(theText.get(i));
            wordCounter++;
            frequentWords.put(theText.get(i), wordCounter);
        } else {
            // new word
            frequentWords.put(theText.get(i), 1);
        }
    }
    for (int i = 0; i < theText.size(); i++) {
        if (frequentWords.containsKey(theText.get(i))) {
            // what to write here?
            frequentWords.get(theText.get(i));
        }
    }
    return topWordsArray;
}
One other approach you may wish to look at is to think of this another way: is a Map really the right conceptual object here? It may be good to think of this as being a good use of a much-neglected-in-Java data structure, the bag. A bag is like a set, but allows an item to be in the set multiple times. This simplifies the 'adding a found word' very much.
Google's guava-libraries provides a Bag structure, though there it's called a Multiset. Using a Multiset, you could just call .add() once for each word, even if it's already in there. Even easier, though, you could throw your loop away:
Multiset<String> words = HashMultiset.create(theText);
Now you have a Multiset, what do you do? Well, you can call entrySet(), which gives you a collection of Multiset.Entry objects. You can then stick them in a List (they come in a Set) and sort them using a Comparator. Full code might look like this (using a few other fancy Guava features to show them off):
Multiset<String> words = HashMultiset.create(theText);
List<Multiset.Entry<String>> wordCounts = Lists.newArrayList(words.entrySet());
Collections.sort(wordCounts, new Comparator<Multiset.Entry<String>>() {
    public int compare(Multiset.Entry<String> left, Multiset.Entry<String> right) {
        // Note reversal of 'right' and 'left' to get descending order;
        // getCount() returns a primitive int, so compare the values directly
        return Integer.compare(right.getCount(), left.getCount());
    }
});
// wordCounts now contains all the words, sorted by count descending
// Take the first 500 entries (alternative: use a loop; this is simple because
// it copes easily with < 500 elements)
Iterable<Multiset.Entry<String>> first500 = Iterables.limit(wordCounts, 500);
// Guava-ey alternative: use a Function and Iterables.transform, but in this case
// the 'manual' way is probably simpler:
for (Multiset.Entry<String> entry : first500) {
    topWordsArray.add(entry.getElement());
}
and you're done!
Here you can find a guide on how to sort a HashMap by its values. After sorting, you can just iterate over the first 500 entries.
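A minimal plain-Java sketch of that idea, assuming the counts are already in a HashMap<String, Integer> such as the frequentWords map from the question (the other names are illustrative; limit would be 500 here):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWords {
    public static List<String> topWords(Map<String, Integer> frequentWords, int limit) {
        // Copy the entries into a list and sort by count, descending
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(frequentWords.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Keep only the words of the first 'limit' entries
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}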
Take a look at the TreeBidiMap provided by the Apache Commons Collections package. http://commons.apache.org/collections/api-release/org/apache/commons/collections/bidimap/TreeBidiMap.html
It allows you to order the map by either the key set or the value set.
Hope it helps.
Zhongxian

Searching a Hashmap

Hi, I am populating a HashMap from a dictionary.txt file and splitting it into sets by word length.
I'm having trouble searching the HashMap for a pattern such as "a*d**k".
Can anyone help me?
I need to know how to search the HashMap.
I would really appreciate any help you could give.
Thank you.
A HashMap is simply the wrong data structure for a pattern search.
You should look into technologies that feature pattern searching out of the box, like Lucene
And in answer to this comment:
I'm using it for Android, and it's the
fastest way of searching.
HashMaps are awfully fast, that's true, but only if you use them as intended. In your scenario, hash codes are not important, as you know that all keys are numeric and you probably won't have any word that's longer than, say, 30 letters.
So why not just use an array or ArrayList of Sets instead of a HashMap, and replace map.get(string.length()) with list.get(string.length()-1) or array[string.length()-1]? I bet the performance will be better than with a HashMap (but we won't be able to tell the difference unless you have a reaaaallly old machine or gazillions of entries).
I'm not saying my design with a List or Array is nicer, but you are using a data structure for a purpose it wasn't intended for.
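A minimal sketch of that length-indexed structure (the class name LengthIndex and the 30-character bound are assumptions, not part of the original answer):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LengthIndex {
    private static final int MAX_LENGTH = 30; // assumed upper bound on word length
    private final List<Set<String>> wordsByLength = new ArrayList<>();

    public LengthIndex() {
        for (int i = 0; i < MAX_LENGTH; i++) {
            wordsByLength.add(new HashSet<>());
        }
    }

    public void addWord(String word) {
        // assumes no word is longer than MAX_LENGTH
        wordsByLength.get(word.length() - 1).add(word.toLowerCase());
    }

    /** All words with the same length as the pattern, i.e. the candidates to match "a*d**k" against. */
    public Set<String> candidatesFor(String pattern) {
        return wordsByLength.get(pattern.length() - 1);
    }
}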
Seriously: how about writing all your words to a flat file (one word per line, sorted by word length and then alphabetically) and just running the regex query on that file? Stream the file and test the individual lines if it's too large, or read it into a String and keep that in memory if IO is too slow.
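A sketch of that flat-file variant, streaming the file line by line and testing each line against the pattern (the file name words.txt is just a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatFileSearch {
    public static List<String> find(String patternString) throws IOException {
        // Translate the '*' wildcards into the regex equivalent '.'
        Pattern pattern = Pattern.compile(patternString.toLowerCase().replace('*', '.'));
        try (Stream<String> lines = Files.lines(Paths.get("words.txt"))) {
            return lines.filter(line -> pattern.matcher(line).matches())
                        .collect(Collectors.toList());
        }
    }
}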
Or how about just using a TreeSet with a custom Comparator?
Sample code:
public class PatternSearch {

    enum StringComparator implements Comparator<String> {
        LENGTH_THEN_ALPHA {
            @Override
            public int compare(final String first, final String second) {
                // compare lengths
                int result =
                    Integer.valueOf(first.length()).compareTo(
                        Integer.valueOf(second.length()));
                // and if they are the same, compare contents
                if (result == 0) {
                    result = first.compareTo(second);
                }
                return result;
            }
        }
    }

    private final SortedSet<String> data =
        new TreeSet<String>(StringComparator.LENGTH_THEN_ALPHA);

    public boolean addWord(final String word) {
        return data.add(word.toLowerCase());
    }

    public Set<String> findByPattern(final String patternString) {
        final Pattern pattern =
            Pattern.compile(patternString.toLowerCase().replace('*', '.'));
        final Set<String> results = new TreeSet<String>();
        for (final String word : data.subSet(
                // this should probably be optimized :-)
                patternString.replaceAll(".", "a"),
                patternString.replaceAll(".", "z"))) {
            if (pattern.matcher(word).matches()) {
                results.add(word);
            }
        }
        return results;
    }
}

how to do sorting using java

I have a text file with a list of letters and numbers. I want to sort it with respect to the numbers using Java.
My text file looks like this:
a--->12347
g--->65784
r--->675
I read the text file and split it, but I don't know how to perform the sorting. I am new to Java. Please give me an idea.
My output should be:
g--->65784
a--->12347
r--->675
Please help me. Thanks in advance.
My code is:
String str = "";
BufferedReader br = new BufferedReader(new FileReader("counts.txt"));
while ((str = br.readLine()) != null) {
String[] get = str.split("---->>");
When I searched the internet, every suggestion involved arrays. I tried, but with no luck. How do I get get[1] into an array?
int arr[]=new int[50]
arr[i]=get[1];
for(int i=0;i<50000;i++){
for(int j=i+1;j<60000;j++){
if(arr[i]>arr[j]){
System.out.println(arr[i]);
}
}
You should use the Arrays.sort() or Collections.sort() methods, which allow you to specify a custom Comparator, and implement such a Comparator to determine how the strings should be compared for sorting (since you don't want the default lexicographic order). It looks like that should involve parsing the numbers as integers.
Your str.split looks good to me. Use Integer.parseInt to get an int out of the string portion representing the number. Then put the "labels" and numbers in a TreeMap as described below. The TreeMap will keep the entries sorted according to the keys (the numbers in your case).
import java.util.TreeMap;

public class Test {
    public static void main(String[] args) {
        TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
        tm.put(12347, "a");
        tm.put(65784, "g");
        tm.put(675, "r");
        for (Integer num : tm.keySet())
            System.out.println(tm.get(num) + "--->" + num);
    }
}
Output:
r--->675
a--->12347
g--->65784
From the API for TreeMap:
The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used.
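Note that a plain TreeMap iterates in ascending key order, while the desired output above is descending. Continuing the example above, one way to sketch that is to iterate over the map's descending key view (or construct the TreeMap with Collections.reverseOrder()):

// Same TreeMap as above, printed largest number first: g--->65784, a--->12347, r--->675
for (Integer num : tm.descendingKeySet()) {
    System.out.println(tm.get(num) + "--->" + num);
}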
You can use a TreeMap and print its contents by iterating over the keys. You may have to implement your own Comparator.
Rather than give you the code, I would point you along the following path: TreeMap. Read, learn, implement.
What you want to do is:
1) Convert the numbers into integers
2) Store them in a collection
3) Use Collections.sort() to sort the list (see the sketch below)
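A minimal sketch of those three steps, assuming the lines have already been read and split as in the question (the class and field names here are illustrative only):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortByNumber {
    // Simple holder for one parsed line, e.g. letter "a" and number 12347
    static class Entry {
        final String letter;
        final int number;
        Entry(String letter, int number) { this.letter = letter; this.number = number; }
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        // In the real program these would come from str.split(...) and Integer.parseInt(get[1])
        entries.add(new Entry("a", 12347));
        entries.add(new Entry("g", 65784));
        entries.add(new Entry("r", 675));

        // Sort by number, largest first
        Collections.sort(entries, new Comparator<Entry>() {
            public int compare(Entry left, Entry right) {
                return Integer.compare(right.number, left.number);
            }
        });

        for (Entry e : entries) {
            System.out.println(e.letter + "--->" + e.number);
        }
    }
}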
I assume that you are an absolute beginner.
You are correct up to the split part.
You need to place the split values immediately into fields of a custom object.
You would create something like:
class MyClass //please, a better name,
{
//and better field names, based on your functionality
int number;
String string;
}
Note: you will need to make the class Comparable (or supply a Comparator) for sorting, and implement equals and hashCode as well.
After the split (your first snippet), create an object of this class, place get[0] into string and get[1] into number (after converting the string to an integer).
You then place these objects into a TreeMap.
Now you have a sorted collection.
I have deliberately not spelled out the details. Feel free to Google any term or phrase you don't understand. That way you learn, rather than copy-pasting some code.
