Efficient data structure that checks for existence of String - java

I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I need to repeatedly check whether a given string exists in it.
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?

If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for the presence or absence of other strings in constant time:
Set<String> knownStrings = new HashSet<String>();
... // fill the set with strings
if (knownStrings.contains(myString)) {
    ...
}

It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number by advance, or have a basic idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the wikipedia entry (a followup to the radix tree definition is on this page). There are Java implementations (one of them is here; however, I have a fixed set of strings to inject, which is why I use a builder).
If your number of strings is really huge and you don't expect a minimal miss ratio then you might also consider using a bloom filter; the problem however is that it is probabilistic; but you can get very quick answers to "not there". Here also, there are implementations in Java (Guava has an implementation for instance).
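For illustration, here is a minimal sketch using Guava's BloomFilter (the expected-insertions and false-positive-rate values below are placeholders you would tune for your data set):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterDemo {
    public static void main(String[] args) {
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,  // expected number of strings (placeholder)
                0.01);      // acceptable false-positive probability (placeholder)

        filter.put("hello");

        // "Definitely not there" answers are exact; "might be there" can be wrong.
        System.out.println(filter.mightContain("hello"));   // true
        System.out.println(filter.mightContain("goodbye")); // false (almost certainly)
    }
}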
Otherwise, well, a HashSet...

A HashSet is probably the right answer, but if you choose (for simplicity, say) to search a list, it's probably more efficient to concatenate your words into a String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.
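Putting the pieces together, a minimal sketch of this approach (the word list contents are made up):

public class SeparatorSearch {
    public static void main(String[] args) {
        // Build the word list once, with a separator around every word.
        String wordList = "$word1$word2$word3$word4$";

        // Wrap the search term in the same separators so a partial match
        // (e.g. "word" inside "word1") cannot give a false hit.
        String searchWord = "word3";
        String searchArg = "$" + searchWord + "$";

        boolean wordFound = wordList.contains(searchArg);
        System.out.println(wordFound); // true
    }
}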

As others mentioned, HashSet is the way to go. But if the size is going to be large and you are fine with false positives (e.g. checking whether a username exists), you can use a Bloom filter (a probabilistic data structure) as well.

Related

Efficient way for checking if a string is present in an array of strings [duplicate]

This question already has answers here:
How do I determine whether an array contains a particular value in Java?
I'm working on a little project in java, and I want to make my algorithm more efficient.
What I'm trying to do is check if a given string is present in an array of strings.
The thing is, I know a few ways to check if a string is present in an array of strings, but the array I am working with is pretty big (around 90,000 strings), and I am looking for a way to make the search more efficient. The only ways I know are based on linear search, which is not good for an array of this magnitude.
Edit: So I tried implementing the advice that was given to me, but the code I wrote accordingly is not working properly; would love to hear your thoughts.
public static int binaryStringSearch(String[] strArr, String str) {
    // Note: binary search only works if strArr is sorted (e.g. via Arrays.sort).
    int low = 0;
    int high = strArr.length - 1;
    while (low <= high) {
        int mid = (low + high) >>> 1; // unsigned shift avoids int overflow on huge arrays
        if (strArr[mid].equals(str)) {
            return mid;
        } else if (strArr[mid].compareTo(str) < 0) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return -1;
}
Basically what it's supposed to do is return the index at which the string is present in the array, and if it is not in the array then return -1.
So you have a more or less fixed array of strings and then you throw a string at the code and it should tell you if the string you gave it is in the array, do I get that right?
So if your array pretty much never changes, it should be possible to just sort it alphabetically and then use binary search. Tom Scott did a good video on that (if you don't want to read a long, messy text written by someone who isn't a native English speaker, just watch this, that's all you need). You look right in the middle and check: does the string you want come before or after the string in the middle? If it is already precisely the right one, you can just stop. If it isn't, you eliminate every string after the middle one when your target comes before it, or every string before it otherwise, and of course the middle string itself, since it wasn't equal. Then you do the same thing all over again on the strings which are left (by the way, you don't have to actually delete array items; it's enough to keep a variable for the lower and the upper boundary of the range still in play), and you keep eliminating based on the result until you don't have a single string left. Then you can be sure your input isn't in the array.
The payoff is that with each comparison you don't just eliminate one item like you would by checking one after the other; you remove more than half of the remaining array. So a list of 256 strings should only take 8 compares (or 9, one more if you want to confirm the item exists rather than just locate it), and 65k strings (which almost matches your number) takes 16. That's a lot more optimised.
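Here is a minimal sketch of that idea using the JDK's built-in sort and binary search, so you don't have to hand-roll the loop (the sample data is made up):

import java.util.Arrays;

public class SortedLookup {
    public static void main(String[] args) {
        String[] words = { "pear", "apple", "cherry", "banana" };

        // One-time cost: sort the array (O(n log n)).
        Arrays.sort(words);

        // Arrays.binarySearch returns the index of the match,
        // or a negative value if the string is absent (O(log n) per lookup).
        boolean found = Arrays.binarySearch(words, "cherry") >= 0;
        System.out.println(found); // true
    }
}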
If it's not already sorted and you can't sort it (because that would take way too long, or for some other reason), then I'm afraid there is no way to make it faster: if it's not ordered, you have to check the strings one by one.
Hope that helped!
Edit: If you don't want to fully sort the items and just want to make it a bit faster (about 26 times, if first letters were evenly distributed), just make 26 arrays, one for each letter (assuming you only use plain letters; otherwise make more, and the speed boost will increase too), then loop through all strings and put each into the array matching its first letter. That way it is much faster to set up than sorting them properly, but it's a trade-off: it's not as neat as binary search. You pretty much still use linear search (= looping through all of them and checking if they match), but you have already roughly ordered the items. You can imagine it like two ways to sort a bunch of cards on a table when you want to find them quicker, a lazy one and a not-so-lazy one. One way would be to sort all the cards by number; say the cards are from 1-100, but not continuous, with missing cards. Nicely sorting them so you can find any card really quickly takes some time, so what you can do instead is make 10 rows of cards. Within each row the cards sit in random order, so when someone wants card 38, you go straight to the row for the thirties and linearly search through it. That way it is much faster to find items than just having them randomly on the table, because you only have to search through a tenth of the cards, but you can't take shortcuts once you're inside that row.
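A rough sketch of that first-letter bucketing idea (the sample data is made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstLetterBuckets {
    public static void main(String[] args) {
        String[] words = { "apple", "avocado", "banana", "cherry" };

        // Group the words by their first character (one "row of cards" per letter).
        Map<Character, List<String>> buckets = new HashMap<>();
        for (String w : words) {
            buckets.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
        }

        // Lookup: jump straight to the right bucket, then scan it linearly.
        String target = "banana";
        List<String> bucket = buckets.get(target.charAt(0));
        boolean found = bucket != null && bucket.contains(target);
        System.out.println(found); // true
    }
}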
Depending on the requirements, there can be so many ways to deal with it. It's better to use a collection class for the rich API available OOTB.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and the insertion order does not matter: Use Set<String> set = new HashSet<>() and then you can use Set#contains to check the presence of a particular string.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and also the insertion order needs to be preserved: Use Set<String> set = new LinkedHashSet<>() and then you can use Set#contains to check the presence of a particular string.
Can the list contain duplicate strings? If yes, you can use a List<String> list = new ArrayList<>() to benefit from its rich API, as well as to avoid having to fix the size beforehand (note: the maximum number of elements is Integer.MAX_VALUE). However, a List is always navigated sequentially. Despite this limitation (or feature), you can gain some efficiency by sorting the list (again, subject to your requirements). Check Why is processing a sorted array faster than processing an unsorted array? to learn more about it.
You could use a HashMap which stores all the strings if
Contains query is very frequent and lookup strings do not change frequently.
Memory is not a problem (:D) .

Map a set of strings with similarities to shorter strings

I have a set of strings, each of the same length (10 chars), with the following properties.
The size of the set is around 5,000-10,000 strings. The data set can change frequently.
Although each string is unique, a substring of a particular pattern appears in most of these strings, though not necessarily at the same position.
Some examples are
123abc7gh0
t123abcmla
wp12123abc
123abc being the substring which appears in most of the strings
The problem is to map each string to a shorter string, and such mapping should be deterministic.
I could use a simple enumeration algorithm which maps each string encountered to an incremented counter value (on the set of sorted strings). But since the set is bound to change frequently, I cannot use this algorithm to compute the map deterministically across runs.
I could also use a data compression algorithm like Huffman encoding to compress each individual string. But I do not believe that would be effective, as each string in itself has very few repeated characters.
What approach should I adopt to solve the problem, taking advantage of the properties of the data set? Note that I do not want to compress the whole set of data, but would like to map each string in the set to a shortened string.
1. Replace the 'common string' by a character not appearing elsewhere in any string.
2. Do a frequency analysis of all strings.
3. Create a Huffman tree based on the analysis, i.e. the most frequent characters are at the top of the tree, resulting in short codes.
4. Replace sample strings by their Huffman encoding based on the tree from step 3 and compare the resulting size with the original. If the characters are spread mostly uniformly, even between the strings, then the Huffman coding will not reduce but increase the size.
If Huffman does not gain any improvement, you might try LZW or another dictionary-based compression method. However, this only works if the structure of the strings (i.e. the distribution of characters/substrings) does not completely change over time. For example, if the strings consisted of English words, substring dictionary compression (LZW) might be a good candidate.
But if the distribution changes, or the character distribution is nearly equal across all characters, I am afraid there is no compression method capable of reducing the string size.
But the last question remains: What for? Why bother compressing 10000 strings?
Edit: The answer is: The strings are used to create folder names (paths). As there is a restriction on the total length, it should be as compact as possible.
You might try to create a database (i.e. dictionary) and use the index (coded e.g. as Base64) as a compressed string. This gives you a maximum of six chars when assuming a maximum dictionary size of 2^32-1 (six Base64 characters cover 36 bits).
If you can pre-process the set of strings and could know the pattern which occurs in each of the strings, you could treat that as a single character (use some encoding) which would shorten that string.
I'm confronted with the same kind of task and wonder whether it is possible to achieve the mapping without making use of persistence.
If persisting the mappings in (previous) use is allowed, then the solution is simple:
you can just assign a number to each of the strings (using a representation of a sufficiently high base so that you get the required maximum size of the numbers' string representation). For each of the source strings you would assign a next number and using the persisted mappings make sure not to use the same number a second time.
This policy would give you consistent results, even if you go through the procedure multiple times with a changing set of data: a string occurring for the first time would receive its private number and this private number would stay reserved to it forever - numbers that are no longer in use would never be reused.
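A minimal in-memory sketch of that numbering policy (the persistence itself, loading and saving the map across runs, is omitted here but is exactly what makes the mapping deterministic; base 36 is an arbitrary choice for a compact representation):

import java.util.HashMap;
import java.util.Map;

public class StableShortNames {
    // In a real setup this map would be loaded from and saved to disk,
    // so that a number stays reserved for its string across runs.
    private final Map<String, Long> assigned = new HashMap<>();
    private long nextId = 0;

    public String shorten(String source) {
        // A string seen before keeps its original number forever;
        // a brand-new string gets the next free number.
        Long id = assigned.get(source);
        if (id == null) {
            id = nextId++;
            assigned.put(source, id);
        }
        // Base 36 keeps the mapped name short (e.g. id 46655 -> "zzz").
        return Long.toString(id, 36);
    }
}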
The more challenging question is: is it possible to guarantee uniqueness without the aid of a persisted mapping? I'm afraid it is not, since size reduction is always prone to lead to collisions.

Efficiently checking for substrings and replacing them - can I improve performance here?

I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.
I have a lookup table that contains Abbreviation->Fullversion pairs, it contains about 600 pairs.
My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:
public static class ShortForm {
    public String fullword;
    public String abbreviation;
}

List<ShortForm> shortForms = new ArrayList<ShortForm>();
// CSV code omitted
And some code that uses the list
for (ShortForm f : shortForms) {
    if (address.contains(f.abbreviation + ","))
        address = address.replace(f.abbreviation + ",", f.fullword + ",");
}
Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas in place, but what else could I do?
====== UPDATE
Changed the code to work the other way around: it splits strings into words and checks a map to see if each word is an abbreviation.
StringBuilder fullFormed = new StringBuilder();
for (String s : Splitter.on(" ").split(add)) {
    if (shortFormMap.containsKey(s))
        fullFormed.append(shortFormMap.get(s));
    else
        fullFormed.append(s);
    fullFormed.append(" ");
}
return fullFormed.toString().trim();
Testing shows this to be over 13x faster than the original approach. Cheers davecom!
It would already be a bit faster if you skipped the contains() part :)
What could really improve performance would be to use a better data structure than a simple array for storing your ShortForms. All of the shortForms could be stored sorted alphabetically by abbreviation. You could therefore reduce the lookup time from O(N) to the O(log N) of a binary search.
I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of using a custom object at all:
http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html
Here's what I'm thinking:
Put abbreviation/full word pairs into TreeMap
Tokenize the address into words.
Check each word to see if it is a key in the TreeMap
Replace it if it is
Put the corrected tokens back together as an address
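A possible sketch of those steps (the map and method names are illustrative, and the asker's own update above does essentially the same thing with Guava's Splitter):

import java.util.TreeMap;

public class AbbreviationExpander {
    // fullForms maps abbreviation -> full word, e.g. "St" -> "Street".
    static String expand(String address, TreeMap<String, String> fullForms) {
        StringBuilder out = new StringBuilder();
        for (String token : address.split(" ")) {
            // Per the requirements, only abbreviations terminated by a comma are replaced.
            if (token.endsWith(",")) {
                String bare = token.substring(0, token.length() - 1);
                String full = fullForms.get(bare);
                if (full != null) {
                    token = full + ",";
                }
            }
            out.append(token).append(" ");
        }
        return out.toString().trim();
    }
}

With {"St" -> "Street"}, expand("12 Main St, Springfield", map) would yield "12 Main Street, Springfield". A HashMap would make each lookup O(1) instead of the TreeMap's O(log n), as the next answer points out.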
I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.
This makes each lookup O(1), for a total of O(n) lookups where n is the number of abbreviations found, and I don't think there's likely a more efficient method.

How to compare two paragraphs of text?

I need to remove duplicated paragraphs in a text with many paragraphs.
I use the class java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values into a Set.
If the add() fails (returns false), it means the latest paragraph is a duplicate.
Is there any risk in doing it this way?
Other than String.equals(), is there any other way to do it?
Before hashing you could normalize the paragraphs, e.g. removing punctuation, converting to lower case and collapsing extra whitespace.
After normalizing, paragraphs that differ only in those respects would get the same hash.
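A minimal sketch of that normalize-then-hash approach (the exact normalization rules here are an assumption; adjust them to whatever you consider a duplicate):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ParagraphDedup {
    private final Set<String> seenHashes = new HashSet<>();

    // Lower-case, strip punctuation, collapse runs of whitespace.
    static String normalize(String paragraph) {
        return paragraph.toLowerCase()
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Returns true if this paragraph (after normalization) was seen before.
    boolean isDuplicate(String paragraph) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(normalize(paragraph).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Set#add returns false if the hash was already present.
        return !seenHashes.add(hex.toString());
    }
}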
If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true. So if you find that the hash is already in the set, you could potentially have a non-duplicate with the same hash value. This would be very unlikely, but you'll have to test that paragraph against all others to be sure. For that String.equals would do.
Moreover, you should very well consider what you call unique (regarding typo's, whitespaces, capitals, and so on), but that would be the case with any method.
There's no need to calculate the MD5 hash, just use a HashSet and try to put the strings itself into this set. This will use the String#hashCode() method to compute a hash value for the String and check if it's already in the set.
public Set<String> removeDuplicates(String[] paragraphs) {
    Set<String> set = new LinkedHashSet<String>();
    for (String p : paragraphs) {
        set.add(p);
    }
    return set;
}
Using a LinkedHashSet even keeps the original order of the paragraphs.
As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.
Perhaps you should consider a less brittle metric, such as cosine similarity, which is well suited to matching paragraphs.
I think this is a good way. However, there are some things to keep in mind:
Please note that calculating a hash is a heavy operation. This could make your program slow if you had to repeat it for millions of paragraphs.
Even this way, you could end up with slightly different paragraphs (with typos, for example) going undetected. If this is a concern, you should normalize the paragraphs before calculating the hash (putting them into lower case, removing extra spaces and so on).

Efficient way to implement a String Array "is in" method using Java

I have a requirement to present highly structured information picked from a highly unstructured web service. In order to display the info correctly, I have to do a lot of String matches and duplicate removals to ensure I'm picking the right combination of elements.
One of my challenges involves determining if a String is in an Array of Strings.
My dream is to do "searchString.isIn(stringArray);" but I realize the String class doesn't provide for that.
Is there a more efficient way of doing this beyond this stub?:
private boolean isIn(String searchString, String[] searchArray)
{
    for (String singleString : searchArray)
    {
        if (singleString.equals(searchString))
            return true;
    }
    return false;
}
Thanks!
You may want to look into HashMap or HashSet, both of which give constant time retrieval, and it's as easy as going:
hashSet.contains(searchString)
Additionally, HashSet (and HashMap for its keys) prevents duplicate elements.
If you need to keep them in order of insertion, you can look into their Linked counterparts, and if you need to keep them sorted, TreeSet and TreeMap can help (note, however, that the TreeSet and TreeMap do not provide constant time retrieval).
Everybody else seems to be viewing this question in a broader scope (which is certainly valid). I am only answering this bit:
One of my challenges involves determining if a String is in an Array of Strings.
That's simple:
return Arrays.asList(arr).contains(str)
Reference:
Arrays.asList(array)
If you are doing this a lot, you can initially sort the array and do a binary search for your strings.
As mentioned a HashMap or HashSet can provide reasonable performance above what you've mentioned. It depends greatly on how well distributed your hash algorithm is and how many buckets are in the Map.
You could also keep a sorted list and perform a binary search on that list which could perform slightly better, though you pay the cost of sorting. If it's a one time sort, then that's not a big deal. If the list is constantly changing, you may pay a larger cost.
Lastly, you could consider a Trie structure. I think this would be the fastest way to search, but that's a gut reaction. I don't have the numbers to support that.
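To make the trie idea concrete, here is a minimal sketch of a trie used purely for membership tests (no compression or removal, just insert and contains):

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false; // no stored word continues along this prefix
            }
        }
        return node.terminal;
    }
}

Lookup cost is proportional to the length of the query string, independent of how many strings are stored.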
As explained before, you can use a Set (see http://download.oracle.com/javase/1.5.0/docs/api/java/util/Set.html and especially the boolean contains(Object o) method) for that purpose. Here is a quick 'n dirty example that demonstrates this:
String[] a = {"a", "2"};
Set<String> hashSet = new HashSet<String>();
Collections.addAll(hashSet, a);
System.out.println(hashSet.contains("a")); // Returns true
System.out.println(hashSet.contains("2")); // Returns true
System.out.println(hashSet.contains("e")); // Returns false
Hope this helps ;)
As Zach has pointed out, you can use a HashSet to prevent duplicates and use the contains method to search for a string, which returns true when a match is found. If you store your own objects rather than Strings, you also need to override equals in your class:
@Override
public boolean equals(Object other) {
    return other != null && other instanceof L && this.l == ((L) other).l;
}
// remember to override hashCode() consistently with equals()
If the search space (your collection of strings) is limited, then I agree with the answers already posted. If, however, you have a large collection of strings and need to perform a sufficient number of searches on it (to outweigh the setup overhead), you might also consider encoding the search strings in a trie data structure. Again, this would only be advantageous if there are enough strings and you search enough times to justify the setup overhead.
