I have a HashMap of 60k key/value pairs.
I have 100 strings, and some of those strings contain a substring that exists as a key in the HashMap.
I would have to repeat this process a thousand times. Is there an efficient approach to do this?
Let's say the hash contains keys like:
journal of america, rev su arabia, computational journal, etc.
And the strings are like:
published in rev su arabia
the publication event happened in
computational journal 230:34
The first and third strings contain keys from the hash, and I need to find those.
Code (not efficient):
private String contains(String candidateLine)
{
    // journalNames holds the 60k keys; this scans all of them per line
    for (String journalName : journalNames)
    {
        if (candidateLine.contains(journalName))
            return journalName;
    }
    return null;
}
Please suggest.
Given your requirements, the only answer is: wrong design point. You are basically asking how to efficiently support "full text" search capabilities. And for that problem, the answer is: don't do it yourself.
Meaning: forget about re-inventing the wheel here. Instead, pick up an existing solution, such as Lucene (a library) or products such as Solr or ElasticSearch (see here for more information).
You see, most likely we are looking at a "real world" production problem here. So even when you find a clever way to build your own data structure to support your current requirements, chances are high that sooner or later "more" requirements will be coming your way.
Therefore I seriously suggest that you clarify the exact problem to solve, and then identify the existing product that best solves it. Otherwise you will be fighting uphill battles forever.
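For orientation only, here is a minimal sketch of what the Lucene route could look like. This is my illustration, not part of the original answer; it assumes Lucene 8.x with the lucene-core and lucene-queryparser artifacts on the classpath, and the field name "name" is arbitrary:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class JournalSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index the ~60k journal names once, up front.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String journal : new String[] {"journal of america", "rev su arabia"}) {
                Document doc = new Document();
                doc.add(new TextField("name", journal, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // For each candidate line, ask the index which journal names match best.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("name", analyzer);
            ScoreDoc[] hits =
                searcher.search(parser.parse("published in rev su arabia"), 5).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}

Note that a text query matches whole analyzed tokens rather than raw substrings, so the semantics differ from String.contains; that trade-off is part of what "full text search" means here.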
Related
I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I later need to constantly check whether a given string exists in it.
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for the presence or absence of other strings in constant time:
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownStrings.contains(myString)) {
    ...
}
It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number in advance, or at least have a basic idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the Wikipedia entry (a followup to the radix tree definition is on this page). There are Java implementations (one of them is here; however, I had a fixed set of strings to inject, which is why I used a builder).
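For instance, here is a minimal sketch using the PatriciaTrie from Apache Commons Collections 4 (my choice of implementation for illustration; the answer above links a different one):

import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;

public class TrieSketch {
    public static void main(String[] args) {
        // The trie is a Map; using Boolean values turns it into a set of strings.
        PatriciaTrie<Boolean> trie = new PatriciaTrie<>();
        trie.put("cat", Boolean.TRUE);
        trie.put("car", Boolean.TRUE);
        trie.put("cart", Boolean.TRUE);

        // Exact membership test, like Set.contains:
        System.out.println(trie.containsKey("car")); // true
        System.out.println(trie.containsKey("ca"));  // false

        // Tries also give cheap prefix queries, which hash sets cannot do:
        SortedMap<String, Boolean> hits = trie.prefixMap("car");
        System.out.println(hits.keySet());           // [car, cart]
    }
}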
If your number of strings is really huge and you expect a lot of misses, you might also consider using a Bloom filter; the catch is that it is probabilistic (it can report false positives), but it gives very quick, definitive answers to "not there". Here also, there are implementations in Java (Guava has one, for instance).
Otherwise, well, a HashSet...
A HashSet is probably the right answer, but if you choose (for simplicity, e.g.) to search a list, it's probably more efficient to concatenate your words into a single String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.
As others mentioned, HashSet is the way to go. But if the size is going to be large and you are fine with false positives (e.g., checking whether a username exists), you can use a Bloom filter (a probabilistic data structure) as well.
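A minimal sketch of the Guava option mentioned in both answers (the expected-insertions count and the 1% false-positive rate below are assumptions for illustration):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomSketch {
    public static void main(String[] args) {
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000, // expected number of insertions
                0.01);     // acceptable false-positive probability

        seen.put("alice");

        // A Bloom filter never lies about absence: false means definitely
        // not present; true means "probably present" and needs a real check
        // if false positives matter.
        System.out.println(seen.mightContain("alice")); // true
        System.out.println(seen.mightContain("bob"));   // almost certainly false
    }
}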
I have a question regarding Lucene/Solr.
I am trying to solve a general (company) name matching problem.
Let me present one oversimplified example:
We have two (possibly large) lists of names viz., list_A and list_B.
We want to find the intersection of the two lists, but the names in the two lists may not always exactly match. For each distinct name in list_A, we will want to report one or more best matches from list_B.
I have heard that Lucene/Solr can solve this problem. Can you tell me if this is true? If it is, please point me to some minimal working example(s).
You could solve this with Lucene, yes, but if you just need to solve this one problem, creating a Lucene index would be a bit of a roundabout way to do it.
I'd be more inclined to take a simpler approach. You could just find a library for fuzzy comparison between strings, iterate through both lists, and report only those pairs under a certain threshold of similarity as matches.
org.apache.commons.lang3.StringUtils comes to mind, something like:
for (String a : alist) {
    for (String b : blist) {
        int dist = StringUtils.getLevenshteinDistance(a, b);
        if (dist < threshold) {
            // b is a good enough match for a, do something with it!
        }
    }
}
Depending on your intent, other algorithms might be more appropriate (Soundex or Metaphone, for instance).
SOLR can solve your problem. Index list_B in SOLR. Then run a search for every item in list_A; you will get one or more likely matches from list_B for each.
You need to configure analyzers and filters for the field according to your data set and the kind of similar results you want.
I am trying to do something similar, and I would like to point out to the other commenters that their proposed solutions (like Levenshtein distance or Soundex) may not be appropriate if the problem is matching accurately spelled variant names, as opposed to misspelled ones.
For example: I doubt either one is much use for matching
John S W Edward
with
J Samuel Woodhouse Edward
I suppose it is possible, but this is a different class of problem from the one those algorithms were intended to solve.
I ran across some code that was doing something like this:
Map<String,String> fullNameById = buildMap1(dataSource1);
Map<String,String> nameById = buildMap2(dataSource2);
Map<String,String> nameByFullName = new HashMap<String,String>();
Map<String,String> idByName = new HashMap<String,String>();
Set<String> ids = fullNameById.keySet();
for (String nextId : ids) {
    String name = nameById.get(nextId);
    String fullName = fullNameById.get(nextId);
    nameByFullName.put(fullName, name);
    idByName.put(name, nextId);
}
I had to stare at it for several minutes to figure out what was going on. All of that amounts to a join operation on id's and an inversion of one of the original maps. Since Id, FullName and Name are always 1:1:1 it seemed to me that there should be some way to simplify this. I also discovered that the first two maps are never used again, and I find that the above code is a bit hard to read. So I'm considering replacing it with something like this that (to me) reads much cleaner
Table<String, String, String> relations = HashBasedTable.create();
addRelationships1(dataSource1, relations);
addRelationships2(dataSource2, relations);
Map<String,String> idByName = relations.column("hasId");
Map<String,String> nameByFullName = relations.column("hasName");
relations = null; // not used hereafter
In addRelationships1 I do
relations.put(id, "hasFullName", fullname);
And in addRelationships2 where my query yields values for id and name I do
relations.put(relations.remove(id,"hasFullName"), "hasName", name);
relations.put(name, "hasId", id);
So my questions are these:
Is there a lurking inefficiency in what I have done, whether in processor time, memory, or GC load? I don't think so, but I'm not that familiar with the efficiency of Table. I am aware that the Table object won't be GC'd after relations = null; I just want to communicate that it's not used again in the rather lengthy section of code that follows.
Have I gained any efficiency? I keep convincing and unconvincing myself that I have and have not.
Do you find this more readable? Or is this only easy for me to read because I wrote it? I'm a tad worried on that front due to the fact Table is not well known. On the other hand, the top level now pretty clearly says, "gather data from two sources and make these two maps from it." I also like the fact that it doesn't leave you wondering if/where the other two maps are being used (or not).
Do you have an even better, cleaner, faster, simpler way to do it than either of the above?
Please, let's not have the optimize-early/optimize-late discussion here. I'm well aware of that pitfall. If it improves readability without hurting performance, I am satisfied. A performance gain would be a nice bonus.
Note: my variable and method names have been sanitized here to keep the business area from distracting from the discussion, I definitely won't name them addRelationships1 or datasource1! Similarly, the final code will of course use constants not raw strings.
So I did some mini benchmarking myself and came to the conclusion that there is little difference between the two methods in terms of execution time. I kept the total size of the data being processed constant by trading runs for data-set size. I did 4 runs and chose the lowest time for each implementation from among all 4 runs. Reassuringly, both implementations were always fastest on the same run. My code can be found here. Here are my results:
Case                     Maps (ms)   Table (ms)   Table vs Maps
100000 runs of size 10        2931         3035            104%
10000 runs of size 100        2989         3033            101%
1000 runs of size 1000        3129         3160            101%
100 runs of size 10000        4126         4429            107%
10 runs of size 100000        5081         5866            115%
1 run of size 1000000         5489         5160             94%
So using Table seems to be slightly slower for small data sets. Something interesting happens around 100,000 and then by 1 million the table is actually faster. My data will hang out in the 100 to 1000 range, so at least in execution time the performance should be nearly identical.
As for readability, my opinion is that if someone is trying to figure out what is happening nearby and reads the code, it will be significantly easier to see the intent. If they have to actually debug this bit of code, it may be a bit harder, since Table is less common and requires some sophistication to understand.
Another thing I am unsure of is whether or not it's more efficient to create the hash maps, or to just query the table directly in the case where all keys of the map will subsequently be iterated. However that's a different question :)
And the comedic ending is that as I analyzed the code further (hundreds of lines), I found that the only significant use of nameByFullName.get() outside of logging (of questionable value) was to pass its result to idByName.get(). So in the end I'll actually build an idByFullName map and an idByName map instead, with no need for any joining, and drop the whole table thing anyway. But it made for an interesting SO question, I guess.
tl;dr, but I'm afraid that you'd need to make a bigger step away from the original design. Simulating DB tables might be a nice exercise, but for me your code isn't really readable.
Is there a lurking inefficiency in what I have done... No idea.
Have I gained any efficiency? I'm afraid you need to measure it first. Removing some indirections surely helps, but using a more complicated data structure might offset it. And performance in general is simply too complicated.
Do you find this more readable? I'm afraid not.
Do you have an even better, cleaner, faster, simpler way to do it than either of the above? I hope so....
Where I get lost in such code is the use of strings for everything - it's just too easy to pass a wrong string as an argument. So I'd suggest to aggregate them into an object and provide maps for accessing the objects via any part of them. Something as trivial as this should do:
class IdNameAndFullName {
String id, name, fullName;
}
class IdNameAndFullNameMaps {
Map<String, IdNameAndFullName> byId;
Map<String, IdNameAndFullName> byName;
Map<String, IdNameAndFullName> byFullName;
}
You could obviously replace the class IdNameAndFullNameMaps by a Table. However, besides using a nice pre-existing data structure I see no advantages therein. The disadvantages are:
loss of efficiency
loss of readability (I wouldn't use Table here for the very same reason Tuple should be avoided)
use of String keys (your "hasId" and "hasName").
Edit: I should have probably mentioned that I am extremely new to Java programming. I just started with the language about two weeks ago.
I have tried looking for an answer to this question, but so far I haven't found one, so that is why I am asking it here.
I am writing Java code for a Dungeons and Dragons initiative tracker and I am using a TreeMap for its ability to sort on entry. I am still very new to Java, so I don't know everything that is out there.
My problem is that when I have two of the same keys, the tree merges the values such that one of the values no longer exists. I understand this can be desirable behavior but in my case I cannot have that happen. I was hoping there would be an elegant solution to fix this behavior. So far what I have is this:
TreeMap<Integer, Character> initiativeList = new TreeMap<Integer, Character>(Collections.reverseOrder());
Character[] cHolder = new Character[3];

out.println("Thank you for using the Initiative Tracker Project.");

cHolder[0] = new Character("Fred", 2);
cHolder[1] = new Character("Sam", 3, 23);
cHolder[2] = new Character("John", 2, 23);

for (int i = 0; i < cHolder.length; ++i)
{
    initiativeList.put(cHolder[i].getInitValue(), cHolder[i]);
}

out.println("Initiative List: " + initiativeList);
Character is a class that I have defined that keeps track of a player's character name and initiative values.
Currently the output is this:
Initiative List: {23=John, 3=Fred}
I considered using a TreeMap with some sort of subCollection but I would also run into a similar problem. What I really need to do is just find a way to disable the merge. Thank you guys for any help you can give me.
EDIT: In Dungeons and Dragons, a character rolls a 20-sided die and then adds their initiative mod to the result to get their total initiative. Sometimes two players can get the same values. I've thought about having the key formatted like this:
Key = InitiativeValue.InitiativeMod
So for Sam his key would be 23.3 and John's would be 23.2. I understand that I would need to change the key type to float instead of int.
However, even with that two players could have the same Initiative Mod and roll the same Initiative Value. In reality this happens more than you might think. So for example,
Say both Peter and Scott join the game. They both have an initiative modifier of 2, and they both roll a 10 on the 20-sided die. That would make both of their initiative values 12.
When I put them into the existing map, they both need to show up even though they have the same value.
Initiative List: {23=John, 12=Peter, 12=Scott, 3=Fred}
I hope that helps to clarify what I am needing.
If I understand you correctly, you have a bunch of characters and their initiatives, and want to "invert" this structure to key by initiative, with the value being all characters that have that initiative. This is perfectly captured by a MultiMap data structure, of which one implementation is the Guava TreeMultimap.
There's nothing magical about this. You could achieve something similar with a
TreeMap<Initiative,List<Character>>
This is not exactly how a Guava multimap is implemented, but it's the simplest data structure that could support what you need.
If I were doing this I would write my own class that wrapped the above TreeMap and provided an add(K key, V value) method that handled the duplicate detection and list management according to your specific requirements.
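A minimal sketch of that wrapper idea (class and method names are mine; Character refers to the asker's own class, not java.lang.Character):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class InitiativeTracker {
    // Highest initiative first, matching the reverse order in the question.
    private final TreeMap<Integer, List<Character>> byInitiative =
            new TreeMap<>(Collections.reverseOrder());

    // computeIfAbsent creates the bucket on first use, so characters with
    // equal initiative values accumulate in a list instead of overwriting.
    void add(int initiative, Character c) {
        byInitiative.computeIfAbsent(initiative, k -> new ArrayList<>()).add(c);
    }

    @Override
    public String toString() {
        return byInitiative.toString();
    }
}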
You say you are using "...a TreeMap for its ability to sort on entry...", but maybe you could just use a TreeSet instead. You'll need to implement a suitable compareTo method on your Character class that performs the comparison you want, including a tie-breaker (such as the name), because a TreeSet drops elements that compare as equal; and I strongly recommend that you implement hashCode and equals too.
Then, when you iterate through the TreeSet, you'll get the Character objects in the appropriate order. Note that Map classes are intended for lookup purposes, not for ordering.
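A hedged sketch of what such a compareTo might look like (the class is renamed PlayerCharacter here to avoid clashing with java.lang.Character, and the field names are guessed from the question's code):

import java.util.Comparator;

class PlayerCharacter implements Comparable<PlayerCharacter> {
    final String name;
    final int initValue;

    PlayerCharacter(String name, int initValue) {
        this.name = name;
        this.initValue = initValue;
    }

    // Highest initiative first; tie-break on name so two characters with
    // the same initiative both survive in the TreeSet.
    @Override
    public int compareTo(PlayerCharacter other) {
        return Comparator.comparingInt((PlayerCharacter c) -> c.initValue).reversed()
                .thenComparing(c -> c.name)
                .compare(this, other);
    }
}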
I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out-of-the-box or using some tiny library that just implements one or two testing algorithms. (In other words without bringing-in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: It seems like Levenshtein is closest to what I need, out of what has been advised. Only it gets easily fooled when the search query is "John" and the names in the list are significantly longer.
You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman etc. Look up SimMetrics - a F/OSS library that offers a very comprehensive set of string comparison algorithms.
If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
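For illustration, a hedged example using the Soundex implementation in Apache Commons Codec (one readily available option; not necessarily the implementation the answer links):

import org.apache.commons.codec.language.Soundex;

public class SoundexSketch {
    public static void main(String[] args) throws Exception {
        Soundex soundex = new Soundex();
        System.out.println(soundex.soundex("Robert")); // R163
        System.out.println(soundex.soundex("Rupert")); // R163

        // difference() scores 0..4; 4 means the phonetic encodings fully
        // agree, which gives a coarse "how much" measure.
        System.out.println(soundex.difference("Robert", "Rupert")); // 4
    }
}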
Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
In my opinion, the Jaro-Winkler algorithm will suit your requirement best.
Here is a short summary of the Jaro-Winkler distance algorithm.
And here is a PDF which compares different algorithms --> Link to PDF
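For completeness, a hedged sketch using the JaroWinklerSimilarity class from Apache Commons Text (my library choice; the answer itself doesn't name one):

import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class JaroWinklerSketch {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();

        // Returns 1.0 for identical strings and 0.0 for nothing alike;
        // transposed letters are penalized only lightly.
        System.out.println(jw.apply("martha", "marhta")); // roughly 0.96
        System.out.println(jw.apply("john", "jon"));      // high similarity
    }
}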