Efficient way to implement a String Array "is in" method using Java

I have a requirement to present highly structured information picked from a highly un-structured web service. In order to display the info correctly, I have to do a lot of String matches and duplicate removals to ensure I'm picking the right combination of elements.
One of my challenges involves determining if a String is in an Array of Strings.
My dream is to do "searchString.isIn(stringArray);" but I realize the String class doesn't provide for that.
Is there a more efficient way of doing this beyond this stub?:
private boolean isIn(String searchString, String[] searchArray)
{
    for (String singleString : searchArray)
    {
        if (singleString.equals(searchString))
            return true;
    }
    return false;
}
Thanks!

You may want to look into HashMap or HashSet, both of which give constant time retrieval, and it's as easy as going:
hashSet.contains(searchString)
Additionally, HashSet (and HashMap for its keys) prevents duplicate elements.
If you need to keep them in order of insertion, you can look into their Linked counterparts, and if you need to keep them sorted, TreeSet and TreeMap can help (note, however, that the TreeSet and TreeMap do not provide constant time retrieval).
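A small sketch of the ordering difference described above. The class name SetOrderingDemo and the sample values are made up for illustration; only the java.util classes are real.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

final class SetOrderingDemo {
    /** Drops duplicates and keeps first-seen (insertion) order. */
    static Set<String> inInsertionOrder(String... values) {
        return new LinkedHashSet<>(Arrays.asList(values));
    }

    /** Drops duplicates and keeps natural (lexicographic) order; lookups are O(log n). */
    static Set<String> sorted(String... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        System.out.println(inInsertionOrder("pear", "apple", "pear", "fig")); // [pear, apple, fig]
        System.out.println(sorted("pear", "apple", "pear", "fig"));           // [apple, fig, pear]
    }
}
```

Both variants still answer contains(searchString) directly, so the choice is mostly about which iteration order you need.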

Everybody else seems to be viewing this question in a broader scope (which is certainly valid). I am only answering this bit:
One of my challenges involves
determining if a String is in an Array
of Strings.
That's simple:
return Arrays.asList(arr).contains(str)
Reference:
Arrays.asList(array)

If you are doing this a lot, you can initially sort the array and do a binary search for your strings.
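For what it's worth, a minimal sketch of that idea using Arrays.sort and Arrays.binarySearch. The wrapper class SortedLookup is a hypothetical name, and the sketch assumes the array contains no null entries and does not change after construction.

```java
import java.util.Arrays;

final class SortedLookup {
    private final String[] sorted;

    SortedLookup(String[] values) {
        sorted = values.clone();   // copy so the caller's array is left untouched
        Arrays.sort(sorted);       // one-time O(n log n) cost
    }

    /** O(log n) per lookup; binarySearch returns a negative value when absent. */
    boolean contains(String s) {
        return Arrays.binarySearch(sorted, s) >= 0;
    }
}
```

The sort only pays off if you search the same array many times. For a single lookup the plain loop is cheaper.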

As mentioned a HashMap or HashSet can provide reasonable performance above what you've mentioned. It depends greatly on how well distributed your hash algorithm is and how many buckets are in the Map.
You could also keep a sorted list and perform a binary search on that list which could perform slightly better, though you pay the cost of sorting. If it's a one time sort, then that's not a big deal. If the list is constantly changing, you may pay a larger cost.
Lastly, you could consider a Trie structure. I think this would be the fastest way to search, but that's a gut reaction. I don't have the numbers to support that.

As explained before, you can use a Set (see http://download.oracle.com/javase/1.5.0/docs/api/java/util/Set.html and especially the boolean contains(Object o) method) for that purpose. Here is a quick 'n dirty example that demonstrates this:
String[] a = {"a", "2"};
Set<String> hashSet = new HashSet<String>();
Collections.addAll(hashSet, a);
System.out.println(hashSet.contains("a")); // Prints true
System.out.println(hashSet.contains("2")); // Prints true
System.out.println(hashSet.contains("e")); // Prints false
Hope this helps ;)

As Zach has pointed out, you can use a HashSet to prevent duplicates and use its contains method to search for a string; it returns true when a match is found. If you store objects of your own class rather than Strings, you also need to override equals (and hashCode) in your class:
public boolean equals(Object other) {
    // instanceof is already false for null, so no separate null check is needed
    return other instanceof L && this.l == ((L) other).l;
}

If the search space (your collection of strings) is limited, then I agree with the answers already posted. If, however, you have a large collection of strings and need to perform a sufficient number of searches on it (to outweigh the setup overhead), you might also consider encoding the search strings in a trie data structure. Again, this would only be advantageous if there are enough strings and you search enough times to justify the setup overhead.
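To make the trie idea concrete, here is a minimal sketch, not a tuned implementation. The class and method names are hypothetical, and it uses a HashMap per node for simplicity, which costs more memory than an array-based or compressed trie would.

```java
import java.util.HashMap;
import java.util.Map;

final class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;   // true if a stored string ends at this node
    }

    private final Node root = new Node();

    /** Inserts a string, creating one node per character as needed. */
    void add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Lookup cost depends on the length of s, not on how many strings are stored. */
    boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```

Note that a prefix of a stored string (say "ca" when only "car" was added) is not reported as present, because only the node where a string ends is marked terminal.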

Related

How to increase efficiency

I have the following homework question:
Suppose you are given two sequences S1 and S2 of n elements, possibly containing duplicates, on which a total order relation is defined. Describe an efficient algorithm for determining if S1 and S2 contain the same set of elements. Analyze the running time of this method.
To solve this question I have compared the elements of the two arrays using retainAll and a HashSet.
Set1.retainAll(new HashSet<Integer>(Set2));
This would solve the problem in constant time.
Do I need to sort the two arrays before the retainAll step to increase efficiency?

I suspect from the code you've posted that you are missing the point of the assignment. The idea is not to use a Java library to check if two collections are equal (for that you could use collection1.equals(collection2)). Rather, the point is to come up with an algorithm for comparing the collections. The Java API does not specify an algorithm: it's hidden away in the implementation.
Without providing an answer, let me give you an example of an algorithm that would work, but is not necessarily efficient:
for each element in coll1
    if element not in coll2
        return false
    remove element from coll2
return coll2 is empty
The problem specifies that a total order relation is defined on the elements (i.e. they can be compared, and therefore sorted), which means you can do much better than the algorithm above.
In general if you are asked to demonstrate an algorithm it's best to stick with native data types and arrays - otherwise the implementation of a library class can significantly impact efficiency and hide the data you want to collect on the algorithm itself.
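One possible way to exploit the ordering, offered as a sketch rather than as the intended homework answer: sort copies of both sequences, then walk them in lockstep while skipping repeated values. This assumes int elements and that "same set" means duplicates are ignored. The class and method names are hypothetical. Sorting costs O(n log n) and the walk is O(n).

```java
import java.util.Arrays;

final class SameElements {
    static boolean sameSet(int[] s1, int[] s2) {
        int[] a = s1.clone();   // work on copies so the inputs stay unchanged
        int[] b = s2.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] != b[j]) return false;           // next distinct values differ
            int v = a[i];
            while (i < a.length && a[i] == v) i++;    // skip duplicates of v in a
            while (j < b.length && b[j] == v) j++;    // skip duplicates of v in b
        }
        // Equal only if both sequences ran out at the same time.
        return i == a.length && j == b.length;
    }
}
```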

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I later need to constantly check for existence of the string in it.
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?

If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for presence or absence of other strings in constant time:
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownStrings.contains(myString)) {
    ...
}

It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number in advance, or have a basic idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the Wikipedia entry (a followup to the radix tree definition is in this page). There are Java implementations (one of them is here; however I have a fixed set of strings to inject, which is why I use a builder).
If your number of strings is really huge and you expect a significant share of lookups to miss, then you might also consider using a Bloom filter. The catch is that it is probabilistic: a "present" answer may be a false positive, but you get very quick and definite answers to "not there". Here also, there are implementations in Java (Guava has an implementation for instance).
Otherwise, well, a HashSet...

A HashSet is probably the right answer, but if you choose (for simplicity, e.g.) to search a list, it's probably more efficient to concatenate your words into a String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.

As others mentioned, HashSet is the way to go. But if the size is going to be large and you are fine with false positives (e.g. checking whether a username exists), you can use Bloom filters (a probabilistic data structure) as well.
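To show what "probabilistic" means here, below is a hand-rolled sketch of a Bloom filter on top of java.util.BitSet. It is not Guava's implementation or API. The class name, the sizing parameters and the way the second hash is derived from String.hashCode() are all assumptions made for illustration, and a real filter would use stronger hash functions.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;        // number of bits in the filter
    private final int hashCount;   // number of bit positions set per element

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit index from two cheap base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1;                   // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);    // floorMod keeps the index non-negative
    }

    void add(String s) {
        for (int i = 0; i < hashCount; i++) bits.set(index(s, i));
    }

    /** false means definitely absent; true means only "possibly present". */
    boolean mightContain(String s) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(s, i))) return false;
        }
        return true;
    }
}
```

A typical use is to put the filter in front of a slower exact check: only when mightContain returns true do you consult the real set or database.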

Best practice for using array as the key of memoization in Java

I am doing some algorithm problems in Java, and from time to time the problem needs memoization to optimize speed. Oftentimes the key is an array. What I usually use is
HashMap<ArrayList<Integer>, Integer> mem;
The main reason to use ArrayList<Integer> here instead of int[] is that the hashCode() of a primitive array is calculated based on the reference, whereas for ArrayList<Integer> the actual contents are hashed and compared, which is the desired behavior.
However, it is not very efficient and code can be pretty lengthy as well. So I am wondering if there is any best practice for this kind of memoization in Java? Thanks.
UPDATE: As many have pointed out in the comments, it is a very bad idea to use mutable objects as the key of a HashMap, with which I totally agree.
And I am going to clarify the question a little bit more: when I use this type of memoization, I will NOT change the ArrayList<Integer> once it is inserted to the map. Normally the array represents some status, and I need to cache the corresponding value for that status in case it is visited again.
So please do not focus on how bad it is to use a mutable object as the key to a HashMap. Do suggest some better way to do this kind of memoization please.
UPDATE2: So in the end I chose the Arrays.toString() approach since I am doing algorithm problems on TopCoder/Codeforces, and it is just quick and dirty to code.
However, I do think HashMap is the more reasonable and readable way to do this.

You can create a new class, Key, put an array with some numbers in it as a field, and implement your own hashCode() (and equals()) based on the contents of the array.
It will improve the readability as well:
HashMap<Key, Integer> mem;
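A minimal sketch of such a Key class, assuming the state is an int[] that should be compared by content. The field names and the cached hash are illustrative choices, not something from the answer above.

```java
import java.util.Arrays;

final class Key {
    private final int[] values;
    private final int hash;   // computed once, since the key never changes

    Key(int[] values) {
        this.values = values.clone();               // defensive copy keeps the key immutable
        this.hash = Arrays.hashCode(this.values);   // content-based, unlike int[].hashCode()
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Key && Arrays.equals(values, ((Key) o).values);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

With this in place, mem.put(new Key(state), result) followed by mem.get(new Key(sameState)) finds the cached value even though the two int[] instances are different objects.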

If your ArrayList usually contains 3-4 elements, I would not worry much about performance. Your approach is OK. But as others pointed out, your key is thus mutable, which is a bad idea.
Another approach is to append all elements of the ArrayList together using some separator (say #) and thus have this kind of string for the key: 123#555#66678 instead of an ArrayList of these 3 integers. You can just call Arrays.toString(int[]) and get a decent string key out of an array of integers.
I would choose the second approach.

If the input array is large, the main problem seems to be the efficiency of lookup. On the other hand, your computation is probably much more expensive than that, so you've got some CPU cycles to spare.
Lookup time will depend both on the hashcode calculation and on the brute-force equals needed to pinpoint the key in a hash bucket. This is why the array as a key is out of the question.
The suggestion already given by user:XpressOneUp, creating a class which wraps the array and provides its custom hash code, seems like your best bet and you can optimize hashcode calculation to involve only some array elements. You'll know best which elements are the most salient.

If the values in the array are small integers, then here is a way to do it efficiently:
HashMap<String, Integer> map;
public String encode(List<Integer> arr) {
    // StringBuilder avoids allocating a new String on every iteration
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < arr.size(); i++) {
        key.append(arr.get(i)).append(',');
    }
    return key.toString();
}
Use the encode method to convert your array to a unique string, then use that string to add and look up values in the HashMap.

How should I map string keys to values in Java in a memory-efficient way?

I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but I'm memory constrained and need to store 2 million pairs with 7-character keys, so I need something that's memory efficient; retrieval speed is a secondary concern.
Currently I'm going along the line of:
List<Tuple<String, Integer>> list = new ArrayList<Tuple<String, Integer>>(); // generics cannot take the primitive int
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.

Edit: I just saw you mentioned the String were UK postcodes so I'm fairly confident you couldn't get very wrong by using a Trove TLongIntHashMap (btw Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and, that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about memory and speed of Trove compared to the default Collections. Here's a snippet:
                  100000 put operations   100000 contains operations
java collections  1938 ms                 203 ms
trove             234 ms                  125 ms
pcj               516 ms                  94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections  oscillates between 6644536 and 7168840 bytes
trove             1853296 bytes
pcj               1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, these numbers suggest that Trove will not only save memory but will usually be faster than the standard collections as well.
So our goal now becomes to use Trove (given that, by putting millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7 characters long keys and a String/int mapping.
2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String,Integer>, which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge object creation altogether and represent your 7 characters in a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of keys/values pair into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) say you only have at most 256 codepoints/characters used, then it fits on 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into long's (assuming ASCII characters, one byte per character for simplification - 7 bits would be enough):
long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // character i occupies bits 8*i .. 8*i+7; the mask keeps one byte
        // per character and avoids sign extension for values above 127
        result |= (key.charAt(i) & 0xFFL) << (i * 8);
    }
    return result;
}
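If pulling in Trove is not an option, the same long encoding can be combined with the sorted-list-plus-binary-search idea from the question. The following is only a sketch under assumptions: keys are ASCII and at most 8 characters, keys are unique, the table is built once and never modified, and the class name PackedKeyTable is made up. It stores everything in two primitive arrays, so there is no per-entry object once construction is finished.

```java
import java.util.Arrays;

final class PackedKeyTable {
    private final long[] keys;    // encoded keys, sorted ascending
    private final int[] values;   // values[i] belongs to keys[i]

    PackedKeyTable(String[] rawKeys, int[] rawValues) {
        int n = rawKeys.length;
        long[] encoded = new long[n];
        Integer[] order = new Integer[n];   // temporary index array, discarded after sorting
        for (int i = 0; i < n; i++) {
            encoded[i] = encode(rawKeys[i]);
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(encoded[a], encoded[b]));
        keys = new long[n];
        values = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = encoded[order[i]];
            values[i] = rawValues[order[i]];
        }
    }

    /** Packs up to 8 one-byte characters into a long. */
    static long encode(String key) {
        if (key.length() > 8) {
            throw new IllegalArgumentException("key is longer than 8 characters");
        }
        long result = 0;
        for (int i = 0; i < key.length(); i++) {
            result |= (key.charAt(i) & 0xFFL) << (i * 8);
        }
        return result;
    }

    /** O(log n) lookup; returns defaultValue when the key is absent. */
    int get(String key, int defaultValue) {
        int pos = Arrays.binarySearch(keys, encode(key));
        return pos >= 0 ? values[pos] : defaultValue;
    }
}
```

For 2 million entries this comes to roughly 8 bytes per key plus 4 bytes per value, plus the two array headers. Lookups are logarithmic rather than constant time, which the question says is acceptable.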

Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.

First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want to take that approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory, but the first two would provide constant-time access and the tree map logarithmic access, with a nicer interface than a plain list. I would try to do some measurements to see how much the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific for strings, such as tries (especially patricia tries), which might reduce the storage space needed for the strings.

What you are looking for is a succinct-trie - a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie class libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meanwhile, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.

Have you looked at tries? I've not used them, but they may fit with what you're doing.

A custom tree would have the same complexity of O(log n), don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.

As Erick writes, using the Trove library is a good place to start, as you save space by storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
You could write the map to a file and then memory map a portion or all of the file.
You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.

Maybe you can go with a RadixTree?

Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. No extra buckets, unlike HashMap or Hashtable.

I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.

I'd consider using some cache as these often have the overflow-to-disk ability.

You might create a key class that matches your needs. Perhaps like this:
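import java.util.Arrays;

public class MyKey implements Comparable<MyKey>
{
    // Fixed-width storage: 7 chars, padded with '\0' (or truncated) as needed.
    // The method bodies below are one possible way to fill in the blanks.
    private final char[] keyValue;

    public MyKey(String keyValue)
    {
        this.keyValue = Arrays.copyOf(keyValue.toCharArray(), 7);
    }

    public int compareTo(MyKey rhs)
    {
        // lexicographic comparison, character by character
        for (int i = 0; i < keyValue.length; i++)
        {
            int diff = keyValue[i] - rhs.keyValue[i];
            if (diff != 0) return diff;
        }
        return 0;
    }

    public boolean equals(Object rhs)
    {
        return rhs instanceof MyKey && Arrays.equals(keyValue, ((MyKey) rhs).keyValue);
    }

    public int hashCode()
    {
        return Arrays.hashCode(keyValue);
    }
}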

Try this one:
// needs: import java.util.Collection; import java.util.HashMap;
OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for (int i = 0; i < 2000000; i++)
{
    myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));
public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    // Compares int[] values by their first element instead of by reference.
    public boolean containsValue(Object value) {
        if (value != null)
        {
            Class<? extends Object> aClass = value.getClass();
            if (aClass.isArray())
            {
                Collection<V> values = this.values();
                for (Object val : values)
                {
                    int[] newval = (int[]) val;
                    int[] newvalue = (int[]) value;
                    if (newval[0] == newvalue[0])
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}

Actually, HashMap and List are too general for such a specific task as looking up an int by zipcode. You should take advantage of what you know about the data. One option is to use a prefix tree whose leaves store the int value. It could also be pruned if (my guess) a lot of codes with the same prefix map to the same integer.
Looking up the int by zipcode in such a tree is linear in the length of the code and does not grow as the number of codes increases, compared to O(log(N)) for binary search.

Since you are intending to use hashing, you can try numerical conversions of the strings based on ASCII values.
The simplest idea would be:
int sum = 0;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i];
}
Then hash "sum" using a well-defined hash function. You would choose the hash function based on the expected input patterns.
E.g. if you use the division method:
public int hasher(int sum) {
    return sum % (a prime number);
}
Selecting a prime number which is not close to an exact power of two improves performance and gives a more uniform distribution of hashed keys.
Another method is to weight the characters based on their respective positions.
E.g. if you use the above method, both "abc" and "cab" will be hashed to the same location. But if you need them to be stored in two distinct locations, give weights to the positions, as we do in number systems.
int sum = 0;
int weight = 1;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i] * weight;
    weight = weight * 2; // using powers of 2 gives better results. (you know why :))
}
As your sample is quite large, you'd handle collisions with a chaining mechanism rather than a probe sequence.
After all, which method you choose depends entirely on the nature of your application.
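A small runnable sketch of the two schemes above, to show the "abc" versus "cab" difference. The class name is hypothetical and the modulus 1009 is just an arbitrary prime picked for illustration. Nothing here is tuned for real data.

```java
final class PositionalHash {
    /** Plain character sum: any two anagrams collide. */
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    /** Position-weighted sum with weights 1, 2, 4, ... as described above. */
    static int weightedHash(String s) {
        int sum = 0;
        int weight = 1;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) * weight;
            weight *= 2;
        }
        return sum;
    }

    /** Division method; floorMod keeps the bucket non-negative if the sum overflows. */
    static int bucket(int hash) {
        return Math.floorMod(hash, 1009);
    }

    public static void main(String[] args) {
        System.out.println(sumHash("abc") + " " + sumHash("cab"));           // 294 294
        System.out.println(weightedHash("abc") + " " + weightedHash("cab")); // 689 685
    }
}
```

For comparison, String.hashCode() in the JDK is itself a position-weighted polynomial hash (base 31), so for ordinary String keys the built-in hash already distinguishes anagrams.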

The problem is objects' memory overhead, but using some tricks you can try to implement your own hashset. Something like this. Like others said strings have quite large overhead so you need to "compress" it somehow. Also try not to use too many arrays(lists) in hashtable (if you do chaining type hashtable) as they are also objects and also have overhead. Better yet do open addressing hashtable.

Searching a HashSet for any element in a string array

I have a HashSet of strings and an array of strings. I want to find out if any of the elements in the array exists in the HashSet. I have the following code that works, but I feel that it could be done faster.
public static boolean check(HashSet<String> group, String elements[]){
    for(int i = 0; i < elements.length; i++){
        if(group.contains(elements[i]))
            return true;
    }
    return false;
}
Thanks.

It's O(n) in this case (array is used), it cannot be faster.
If you just want to make the code cleaner:
return !Collections.disjoint(group, Arrays.asList(elements));

That seems somewhat reasonable. HashSet has an O(1) (usually) contains() since it simply has to hash the string you give it to find an index, and there is either something there or there isn't.
If you need to check each element in your array, there simply isn't any faster way to do it (sequentially, of course).

... but I feel that it could be done faster.
I don't think there is a faster way. Your code is O(N) on average, where N is the number of strings in the array. I don't think that you can improve on that.

As others have said, the slowest part of the algorithm is iterating over every element of the array. The only way you could make it faster would be if you knew some information about the contents of the array beforehand which allowed you to skip over certain elements, like if the array was sorted and had duplicates in known positions or something. If the input is essentially random, then there's not a lot you can do.

If you know that the set is a sorted set, and that the array is sorted, you can get the interval set from the start to the end to possibly do better than O(|array| * access-time(set)), and which especially allows for some better than O(|array|) negative results, but if you're hashing you can't.
