I have to process 450 unique strings about 500 million times. Each string has a unique integer identifier. There are two options I could use.
I can append the identifier to the string, and when a string arrives, split it to recover the identifier and use it.
I can store the 450 strings in a HashMap<String, Integer> and, when a string arrives, query the HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters rather than the entire string; that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or first two characters, if they're unique, as a "perfect index" into a 256- or 65,536-element array that points to the identifier (a sketch follows).
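A minimal sketch of that last suggestion, assuming the first character of each of the 450 strings is unique and stays in the 0-255 range (the class and method names are invented for illustration):

    // Minimal sketch, not a drop-in solution: it assumes the first character
    // (restricted to the 0-255 range) of each of the 450 strings is unique,
    // which you must verify for your data.
    class FirstByteIndex {
        private final int[] idByFirstByte = new int[256];

        // Build once, up front, from the 450 known strings.
        void register(String s, int id) {
            idByFirstByte[s.charAt(0) & 0xFF] = id;
        }

        // Per-arrival lookup: one array access, no hashing, no splitting.
        int idOf(String s) {
            return idByFirstByte[s.charAt(0) & 0xFF];
        }
    }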
Also, if your identifier is numeric, it's better to pre-calculate it rather than convert it on the fly every time. Text -> binary is actually rather expensive (binary -> text is worse), so it's worth avoiding if possible.
But it behooves you to work the problem. 1 million of anything at 1 ms each is roughly 17 minutes of processing. At 500M, every microsecond wasted adds up to 8+ minutes of extra processing. You may well not care, but it demonstrates that at these scales "every little bit helps".
So, don't take our word for it: test different things to find what gives you the best result for your work set, and then go with that. Also watch out for excessive object creation, and avoid it. Normally I don't give it a second thought; object creation is fast, but a nanosecond is a nanosecond.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters in the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16, and Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few microseconds here and there; at 500M iterations, 8 microseconds apiece adds over an hour to your processing.
So, again, look in all the corners.
Or, don't: write it the easy way, fire it up, run it over the weekend and be done with it.
If each String has a unique identifier, then retrieval is O(1) only in the case of a HashMap.
I wouldn't suggest the first method, because you would be splitting strings 450 × 500M times, unless your order is the same string 500M times in a row before moving on to the next. As Will said, appending a numeric identifier to the string and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use a HashMap<String, Integer>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
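For example, a rough System.nanoTime harness along these lines (not a proper JMH benchmark; the "id:payload" layout, the ':' separator and the sample strings are all invented) will tell you more than any guess. Note that the parsing side uses indexOf plus Integer.parseInt rather than String.split, in line with the point above:

    import java.util.HashMap;
    import java.util.Map;

    // Rough benchmark sketch: compares a HashMap lookup against parsing an
    // "id:payload" style string. Substitute your real 450 strings.
    public class LookupBench {
        public static void main(String[] args) {
            Map<String, Integer> ids = new HashMap<>();
            String[] plain = new String[450];
            String[] tagged = new String[450];
            for (int i = 0; i < 450; i++) {
                plain[i] = "string-" + i;
                ids.put(plain[i], i);
                tagged[i] = i + ":" + plain[i];
            }

            long sum = 0;
            long t0 = System.nanoTime();
            for (int n = 0; n < 10_000_000; n++) {
                sum += ids.get(plain[n % 450]);              // option 2: HashMap lookup
            }
            long t1 = System.nanoTime();
            for (int n = 0; n < 10_000_000; n++) {
                String s = tagged[n % 450];
                sum += Integer.parseInt(s.substring(0, s.indexOf(':'))); // option 1: parse
            }
            long t2 = System.nanoTime();

            System.out.println("hashmap: " + (t1 - t0) / 1_000_000 + " ms");
            System.out.println("parse:   " + (t2 - t1) / 1_000_000 + " ms");
            System.out.println(sum); // keeps the JIT from discarding the loops
        }
    }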
Splitting the string should work faster if you write your code well enough. In fact if you already have the int-id, I see no reason to send only the string and maintain a mapping.
Looking the string up in a HashMap would require hashing the incoming string every time. So you are basically comparing the performance of the hashing function against the code you write to append (prepending might be a bit trickier) on the sending end and to parse on the receiving end.
OTOH, only 450 strings aren't a big deal, and if you're into it, writing your own hashing algorithm/function would actually be the most elegant and performant option.
I have the following:
private static List<Pattern> pats;
This list contains around 90 patterns and is instantiated before iteration. The patterns are complex, like:
System.out.println("pat: " + pats.get(0).toString());
// pat: \bsingle1\b|\bsingle2\b|(?=.*\bcombo1\b)(?=.*\bcombo2\b)|\bsingle3\b|\bwild.*card\b ...
Some of the patterns contain around 40-50 single words or combinations of words, as the regex above shows. The words can contain wildcards.
Now, I have a list of strings, sentences of around 30-60 characters each. I iterate through them, and for every string in the list I iterate through the list of patterns and perform pattern.matcher("This is one of the strings in my list").find() until I get a match, which I mark down and save somewhere else; then I break out of the iteration through the patterns and continue with the next string in the list.
This is a categorization job, so several strings can match on the same pattern.
My problem is that this of course takes a lot of execution time; I am looking for a more efficient way to solve this problem.
Any suggestions?
One thing that solved my problem (to 90%) was to partially give up regex where String.indexOf() made more sense from a performance perspective.
This post inspired me: Quickest way to return list of Strings by using wildcard from collection in Java
I wrote my own implementation since the one in the link handles only full words, while I'm dealing with sentences.
It helped with wildcards ("*") and alternations ("hel(l|lo)") from a performance perspective, the former more than the latter.
I went in this direction on the strength of several recommendations, and it improved performance by cutting the time for 200,000 sentences from 1.5 hours down to 15 minutes.
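A rough sketch of this kind of hybrid is below. The category data is invented, and note that indexOf drops the \b word-boundary semantics of the original regexes, so it only fits when plain substring matching is acceptable: literal words are checked with String.indexOf, and only the alternatives that genuinely need wildcards fall back to a pre-compiled regex.

    import java.util.List;
    import java.util.regex.Pattern;

    // Sketch of the hybrid approach: literal alternatives via indexOf,
    // wildcard alternatives via pre-compiled regexes. Category data is invented.
    class Category {
        final List<String> literals;       // e.g. "single1", "single2"
        final List<Pattern> wildcardPats;  // e.g. Pattern.compile("\\bwild.*card\\b")

        Category(List<String> literals, List<Pattern> wildcardPats) {
            this.literals = literals;
            this.wildcardPats = wildcardPats;
        }

        boolean matches(String sentence) {
            for (String w : literals) {
                if (sentence.indexOf(w) >= 0) return true;    // cheap check first
            }
            for (Pattern p : wildcardPats) {
                if (p.matcher(sentence).find()) return true;  // regex only if needed
            }
            return false;
        }
    }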
You could also offload the regular expression matching to a dedicated service? I believe it could be faster (and perhaps safer) than partially giving up regex.
If your app is intended to run on multiple servers, you may also gain performance by centralizing the computation cost.
Here is an example of such implementation via a REST api : http://www.rex-daemon.com/tutorial/more-advanced-queries/
I have a set of strings, each of the same length (10 chars), with the following properties.
The size of the set is around 5,000-10,000 strings. The data set can change frequently.
Although each string is unique, a substring of a particular pattern appears in most of these strings, though not necessarily at the same position.
Some examples are
123abc7gh0
t123abcmla
wp12123abc
123abc being the substring which appears in most of the strings
The problem is to map each string to a shorter string, and such mapping should be deterministic.
I could use a simple enumeration algorithm that maps each string encountered to an incrementing counter value (over the sorted set of strings). But since the set is bound to change frequently, I cannot use this algorithm to compute the map deterministically across runs.
I could also use a data compression algorithm like Huffman encoding to compress each individual string. But I do not believe that would be effective, as each string in itself has very few duplicate characters.
What approach should I adopt to solve this problem, taking advantage of the properties of the data set? Note that I do not want to compress the whole set of data; I would like to map each string in the set to a shortened string.
1. Replace the 'common string' by a character not appearing elsewhere in any string.
2. Do a probabilistic analysis of all strings.
3. Create a Huffman tree based on the analysis, i.e. the most frequent characters are at the top of the tree, resulting in short codes.
4. Replace sample strings by their Huffman encoding based on the tree of #3 and compare the resulting size with the original. If most of the characters are spread uniformly, even across the strings, then the Huffman coding will not reduce but increase the size.
If Huffman does not gain any improvement, you might try LZW or any other dictionary-based compression method. However, this only works if the structure of the strings (i.e. the distribution of characters/substrings) does not completely change over time. For example, if the strings consisted of English words, substring dictionary compression (LZW) might be a good candidate.
But if the distribution changes, or the character distribution is nearly uniform across all characters, I am afraid there is no compression method capable of reducing the string size.
But the last question remains: What for? Why bother compressing 10000 strings?
Edit: The answer is: The strings are used to create folder names (paths). As there is a restriction on the total length, it should be as compact as possible.
You might try to create a database (i.e. a dictionary) and use the index (encoded e.g. as Base64) as the compressed string. This gives you a maximum of 6 chars when assuming a maximum dictionary size of 2^32-1.
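A minimal sketch of that dictionary idea (the map would of course have to be persisted between runs to stay deterministic; persistence is omitted here, and the class name is invented):

    import java.nio.ByteBuffer;
    import java.util.Base64;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch: every distinct 10-char string gets the next free index, and the
    // index is rendered as a short, filesystem-safe Base64 token.
    class ShortNameDictionary {
        private final Map<String, Integer> index = new LinkedHashMap<>();

        String shorten(String original) {
            int id = index.computeIfAbsent(original, k -> index.size());
            byte[] raw = ByteBuffer.allocate(4).putInt(id).array();
            // URL-safe alphabet avoids '/' in folder names; 4 bytes -> 6 chars.
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        }
    }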
If you can pre-process the set of strings and know the pattern which occurs in each of the strings, you could treat that pattern as a single character (using some encoding), which would shorten the strings.
I'm confronted with the same kind of task and wonder whether it is possible to achieve the mapping without making use of persistence.
If persisting the mappings in (previous) use is allowed, then the solution is simple:
you can just assign a number to each of the strings (using a representation in a sufficiently high base so that the numbers' string representation stays within the required maximum size). For each new source string you would assign the next number, and by consulting the persisted mappings you make sure never to use the same number twice.
This policy gives you consistent results even if you go through the procedure multiple times with a changing set of data: a string occurring for the first time receives its private number, and this private number stays reserved for it forever; numbers that are no longer in use are never reused.
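A sketch of that policy using a plain Properties file as the persisted mapping (the file name, the base-36 rendering and the class name are all assumptions, not requirements):

    import java.io.*;
    import java.nio.file.*;
    import java.util.Properties;

    // Sketch of the persisted-assignment policy: each new string gets the next
    // free number, rendered in base 36; previously seen strings keep the number
    // recorded in the mapping file, so results stay consistent across runs.
    class PersistentEnumerator {
        private final Path file;
        private final Properties map = new Properties();

        PersistentEnumerator(Path file) throws IOException {
            this.file = file;
            if (Files.exists(file)) {
                try (InputStream in = Files.newInputStream(file)) {
                    map.load(in);
                }
            }
        }

        synchronized String idFor(String s) throws IOException {
            String existing = map.getProperty(s);
            if (existing != null) return existing;
            String id = Integer.toString(map.size(), 36); // stays reserved forever
            map.setProperty(s, id);
            try (OutputStream out = Files.newOutputStream(file)) {
                map.store(out, "string -> short id");
            }
            return id;
        }
    }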
The more challenging question is: is it possible to guarantee uniqueness without the aid of a persisted mapping? I'm afraid it is not, since size reduction is always prone to lead to collisions.
I need to look at all web requests received by the Application Server to check whether the URL has an extension like .css, .gif, etc.
I looked at how Tomcat listens for every request and picks the right configured Servlet to serve it: CharChunk, MessageBytes, Mapper.
Here is my idea for the implementation:
Load all the extensions we want to compare and get their byte representations.
Get a unique value for each extension by summing up the bytes in the byte array // e.g. "css".getBytes()
Add the resulting value to a sorted list.
Whenever we receive a request, get the byte representation of the URL // e.g. "flipkart.com/eshopping/images/theme.css".getBytes()
Sum the bytes starting from the byte array's last index and stop when we encounter the "." (dot) byte value.
Search for the summed value in the sorted list // use binary search here
Kindly give your feedback on the implementation and point out any issues.
-With thanks, Krishna
This sounds way more complicated than it needs to be.
Use String.lastIndexOf to find the last dot in the URL
Use String.substring to get the extension based on that
Have a HashSet<String> for a set of supported extensions, or a HashMap<String, Whatever> if you want to map the extension to something else
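Putting those three pieces together, a sketch could look like this (the extension list is just an example set):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the simple approach: lastIndexOf + substring + HashSet lookup.
    class StaticResourceCheck {
        private static final Set<String> STATIC_EXTENSIONS = new HashSet<>();
        static {
            STATIC_EXTENSIONS.add("css");
            STATIC_EXTENSIONS.add("gif");
            STATIC_EXTENSIONS.add("js");
            STATIC_EXTENSIONS.add("png");
        }

        static boolean isStaticResource(String url) {
            int dot = url.lastIndexOf('.');
            if (dot < 0 || dot == url.length() - 1) return false; // no extension
            String ext = url.substring(dot + 1).toLowerCase();
            return STATIC_EXTENSIONS.contains(ext);
        }
    }

isStaticResource("flipkart.com/eshopping/images/theme.css") then returns true; a URL with a query string after the last dot would need a little extra handling.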
I would be absolutely shocked to discover that this simple approach turned out to be a performance bottleneck - and indeed I suspect it would be more efficient than the approach you suggested, given that it doesn't require the entire URL to be converted into a byte array... (It's not clear why your approach uses byte arrays anyway instead of forming the hash from char values.)
Fundamentally, my preferred approach to performance is:
Do up-front design and testing around things which are hard to change later, architecturally
For everything else:
Determine the performance criteria first so you know when you can stop
Write the simplest code that works
Test it with realistic data
If it doesn't perform well enough, use profilers (etc) to work out where the bottleneck is, and optimize that making sure that you can prove the benefits using your existing tests
I'm writing a routine that takes a string and formats it as quoted-printable, and it has to be as fast as possible. My first attempt copied characters from one StringBuffer to another, encoding and line-wrapping along the way. Then I thought it might be quicker to just modify the original StringBuffer rather than copy all that data, which is mostly identical. It turns out the inserts are far worse than copying: the second version (with the StringBuffer inserts) was 8 times slower, which makes sense, as it must be moving a lot of memory.
What I was hoping for was some kind of gap buffer data structure so the inserts wouldn't involve physically moving all the characters in the rest of the stringbuffer.
So any suggestions about the fastest way to rip through a string inserting characters every once in a while?
Suggestions to use the standard mimeutils library are not helpful because I'm also dot escaping the string so it can be dumped out to an smtp server in one shot.
At the end, your gap data structure would have to be transformed into a String, which would require assembling all the chunks into a single array by appending them to a StringBuilder.
So using a StringBuilder directly will be faster; I don't think you'll find a faster technique than that. Make sure to initialize the StringBuilder with a large enough capacity to avoid copies of the whole buffer once the capacity is exhausted.
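In code, that advice amounts to something like the sketch below: it assumes the input is already ASCII/ISO-8859-1 text (real quoted-printable works on bytes after charset encoding) and leaves out the 76-column soft line breaks and SMTP dot-escaping mentioned in the question.

    // Sketch only: reserve the quoted-printable worst case (3 output chars per
    // input char) once, up front, so append() never has to grow the buffer.
    class QpSketch {
        private static final char[] HEX = "0123456789ABCDEF".toCharArray();

        static String encode(String input) {
            StringBuilder out = new StringBuilder(input.length() * 3);
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (c >= 33 && c <= 126 && c != '=') {
                    out.append(c);                      // printable ASCII as-is
                } else {
                    out.append('=').append(HEX[(c >> 4) & 0xF]).append(HEX[c & 0xF]);
                }
            }
            return out.toString();
        }
    }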
So, taking the advice of some of the other answers here, I've been writing many versions of this function to see what goes quickest; for future reference, if anybody can gain from what I found:
1) The slowest: StringBuffer.append(), but we knew that.
2) Almost twice as fast: StringBuilder.append(). Locks are very expensive, it seems.
3) Another 20% faster is... copying from one char[] to another.
4) And finally, coming in three times faster than even that... a JNI call to the exact same code compiled in C that copies from one char array to another.
You may consider #4 cheating, but cheaters win. It is by far the fastest way to go.
There is a risk of the GetCharArrayElements call causing the java char array to be copied so it can be handed to the C program, but I can't tell if that's happening, and it's still wicked fast compared to any java implementation.
I think a good balance between speed and coding grace would be using Matcher.appendReplacement. Formulate a regex that will catch all insertion points. In a loop you use find, analyze Matcher.group() to see what exactly has matched, and use your program logic to decide what to give to appendReplacement.
In any case, it is important not to copy the text over char by char. You must copy in the largest chunks possible.
The Matcher API is quite unfortunately bound to the StringBuffer, but, as you found, that only steals the final 5% from you.
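A sketch of that shape, with a deliberately simplified notion of which characters need encoding (a real encoder would also track line length):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the appendReplacement approach: the regex marks only the
    // characters that need work, and everything between matches is copied in
    // bulk by appendReplacement/appendTail rather than char by char.
    class QpViaMatcher {
        // Simplified "insertion point" pattern: '=' plus anything outside
        // printable ASCII.
        private static final Pattern NEEDS_ENCODING =
                Pattern.compile("[^\\x21-\\x3C\\x3E-\\x7E]");

        static String encode(String input) {
            Matcher m = NEEDS_ENCODING.matcher(input);
            StringBuffer out = new StringBuffer(input.length() * 3);
            while (m.find()) {
                char c = m.group().charAt(0);
                m.appendReplacement(out, String.format("=%02X", (int) c));
            }
            m.appendTail(out);
            return out.toString();
        }
    }

If you can target Java 9 or later, Matcher also offers StringBuilder overloads of appendReplacement/appendTail, which removes the StringBuffer locking cost.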
I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it always only contains phrases which contain the text box's value as a substring.
At the moment I have one of Google's Multimaps, specifically:
LinkedHashMultimap<String, String>
with every single possible substring mapped to its possible matches. This takes a while to load (about a second) and I think it must be taking up quite a bit of space (which may be a concern in the future.) It's very fast with the lookups though.
Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?
If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.
If it grows bigger and/or is too slow, you can apply some easy optimizations:
Don't search immediately as the user types each character, but introduce some delay. So if he types "foobar" really fast, you only search for "foobar", not first "f" then "fo" then "foo",...
Reuse your previous results: if the user first types "foo" and then extends that to "foobar", don't search in the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").
In my experience, these basic optimizations already get you quite far.
Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.
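A sketch of those two basic optimizations combined (minus the keystroke delay, which belongs in the UI layer): filter with contains(), and when the new query merely extends the previous one, filter the previous result list instead of all 2500 phrases.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: linear contains() filtering that reuses the previous result when
    // the new query is an extension of the old one ("foo" -> "foobar").
    class PhraseFilter {
        private final List<String> allPhrases;
        private String lastQuery = "";
        private List<String> lastResult;

        PhraseFilter(List<String> allPhrases) {
            this.allPhrases = allPhrases;
            this.lastResult = allPhrases;
        }

        List<String> filter(String query) {
            List<String> source =
                    (!lastQuery.isEmpty() && query.startsWith(lastQuery)) ? lastResult : allPhrases;
            List<String> result = new ArrayList<>();
            for (String phrase : source) {
                if (phrase.contains(query)) result.add(phrase);
            }
            lastQuery = query;
            lastResult = result;
            return result;
        }
    }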
You'll want to look into using the Trie data structure.
Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
You definitely need a Suffix Tree (wiki).
(I think this implementation could be OK: link)
EDIT:
I've read your comment: you shouldn't blindly check whether the string is a substring somewhere in your phrase; you usually start with a word, not with a space. So maybe it's better to tokenize the words inside your phrase?
Are you allowed to do that? Otherwise the best way is to build an automaton for every phrase, or to use similar algorithms (for example the Karp-Rabin string search algorithm).
Wouter Coekaerts has a good approach, but I would go a bit further.
Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
Precompute the results for two characters. When there are two characters bring up the precomputed list.
When a third character is added do the 'contains' search on the list you have currently displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
Similarly for four characters etc.
As said above, put in a little delay before starting the search. Or better still arrange for a search to be killed if another character is typed before it finishes.
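A sketch of the precomputation part (class and method names invented): every two-character substring occurring anywhere in a phrase maps to the phrases that contain it, and longer queries are refined with contains() inside the, by then small, bucket.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the precomputed two-character index described above.
    class TwoCharIndex {
        private final Map<String, List<String>> byPair = new HashMap<>();

        TwoCharIndex(List<String> phrases) {
            for (String phrase : phrases) {
                for (int i = 0; i + 2 <= phrase.length(); i++) {
                    List<String> bucket =
                            byPair.computeIfAbsent(phrase.substring(i, i + 2), k -> new ArrayList<>());
                    // identity check on purpose: skip the phrase we just added
                    // when the same pair repeats inside it
                    if (bucket.isEmpty() || bucket.get(bucket.size() - 1) != phrase) {
                        bucket.add(phrase);
                    }
                }
            }
        }

        List<String> candidates(String query) {
            if (query.length() < 2) return Collections.emptyList();  // nothing useful yet
            List<String> bucket =
                    byPair.getOrDefault(query.substring(0, 2), Collections.emptyList());
            if (query.length() == 2) return bucket;
            List<String> result = new ArrayList<>();
            for (String phrase : bucket) {
                if (phrase.contains(query)) result.add(phrase);       // refine within the bucket
            }
            return result;
        }
    }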