Adding a prefix to the String in Java?

I know that appending a character to a string should take O(1) time. For example:
String S = "abc";
S = S+'z';
What if I want to do the reverse, prepending a char to a String? Is it possible like this?
S = 'z'+S;
If yes, then how much time will it take? Does Java copy the whole content of String S (O(n)) or just adjust pointers in memory (O(1))?
Thanks!

String is immutable, so there is no way for this operation (adding the prefix) to be O(1): the result has to be a new String, and building it is at least linear in the size of S. And since nothing beyond copying the characters is needed, it makes no sense for it to be worse than linear either, so it is O(N). Pretty sure about this just from common sense.

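The order of concatenation does not matter: S + 'z' and 'z' + S both build a new String, so both are O(N). In recent versions of compilers this (usually) turns into bytecode which uses a StringBuilder; since Java 9, javac emits an invokedynamic call to StringConcatFactory instead.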
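To make the cost concrete, here is a minimal sketch. The class and method names are illustrative and not from the question. It shows that prepending yields a new String while the original stays unchanged, and that prepending repeatedly through StringBuilder.insert(0, ...) still shifts the existing characters on every call.

```java
// Illustrative sketch; PrefixDemo and its methods are hypothetical names.
final class PrefixDemo {
    // c + s allocates a new String and copies all of s: O(n).
    static String prepend(String s, char c) {
        return c + s;
    }

    // insert(0, ...) shifts the existing characters on each call,
    // so k prepends onto a long buffer cost O(k * n) in total.
    static String prependAll(char[] prefixChars, String s) {
        StringBuilder sb = new StringBuilder(s);
        for (char c : prefixChars) {
            sb.insert(0, c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "abc";
        String t = prepend(s, 'z');
        System.out.println(t); // zabc
        System.out.println(s); // abc (the original String is unchanged)
        System.out.println(prependAll(new char[] {'y', 'x'}, t)); // xyzabc
    }
}
```

If many prefixes have to be added, appending to a StringBuilder and calling reverse() once at the end is one way to avoid the repeated shifting, provided the pieces are single characters.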

Related

Operation performance between ArrayList or single String

Performance wise, is it better to use ArrayLists to store a list of values, or a String (using concat/+)? Intuitively, I think Strings would perform better since it'd likely use less overhead than ArrayLists but I haven't been able to find anything online.
Also, the entries wouldn't be anything too large (~10).
ArrayList operations
You can get a value from an ArrayList in O(1) and add a value in amortized O(1).
Furthermore, ArrayList already has built-in operations that help you retrieve and add elements.
String operations
Concatenation: with concat and slice operations, it will be worse. A String is, roughly speaking, an array of characters. For example, in "Hello" + "Stack" the operands can be represented as the array ['H', 'e', 'l', 'l', 'o'] and the array ['S', 't', 'a', 'c', 'k'].
Now, if you want to concatenate these two Strings, you have to combine all elements of both arrays, which gives you an array of length 10. Therefore the concatenation, that is, creating the new char array, is an O(n + m) operation.
Worse, if you concatenate n Strings one after another, you end up with a total complexity of O(n^2).
Splitting: the complexity of splitting a String is usually O(N) or more. It depends on the regex you pass to the split operation.
Operations on Strings are often not that readable and can be tricky to debug.
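The quadratic behaviour described above can be sketched as follows. JoinDemo is a hypothetical name used only for illustration. The first method re-copies everything built so far on each +=, while the second appends into a growable buffer.

```java
import java.util.List;

// Illustrative sketch; JoinDemo is a hypothetical class name.
final class JoinDemo {
    // Each += allocates a new String and copies everything built so far,
    // so joining n parts costs O(n^2) character copies overall.
    static String joinWithPlus(List<String> parts) {
        String result = "";
        for (String p : parts) {
            result += p;
        }
        return result;
    }

    // StringBuilder appends into a growable buffer: amortized linear
    // in the total number of characters.
    static String joinWithBuilder(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```

Both methods return the same result; only the amount of copying differs. For about 10 short entries, as in the question, the difference is unlikely to be measurable.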
Long story short
An ArrayList is usually better than String operations, but it all depends on your use case.
Just use ArrayList. It stores references to your values, and a reference is not big at all; that is the point of using references. I keep wondering why you would want to store the values inside a String... that's just odd. Storing values in an ArrayList and getting them back is fast enough, and the String implementation uses arrays internally as well... so... use ArrayList.
Java performance doesn't come out of "clever" Java source code.
It comes out of the JIT doing a good job at runtime.
You know what the JIT needs to be able to do a good job?
code that looks like code everybody else is writing (it is optimised to produce optimal results for the sort of code everybody else writes)
many thousands of method invocations.
Meaning: the JIT decides whether it makes sense to re-compile code. When the JIT decides to do so, and only then, you want to make sure it can do that.
"Custom" clever java source code ideas, such as what you are proposing here might achieve the opposite.
Thus: don't invent a clever strategy to mangle values into a String. Write simple, human understandable code that uses a List. Because Lists are the concept that Java offers here.
The only exception would be: if you experience a real performance bottleneck, and you did good profiling, and then you figure: the list isn't good enough, then you would start making experiments using other approaches. But I guess that didn't happen yet. You assume you have a performance problem, and you assume that you should fix it that way. Simply wrong, a waste of time and energy.
ArrayList is the better choice because a String is, underneath, an array of chars. So every concatenation copies the whole old string to a new place with the new value added: O(n) time for each operation.
An ArrayList has an initial capacity, and until it is filled every add operation takes O(1) time. Adding a new String to an ArrayList is only slower when the backing array is full; then it needs to be copied to a new place with more capacity. But only references need to be moved, not the whole Strings, which is way faster than moving whole Strings.
You can even improve ArrayList performance by setting the initial capacity when you know how many elements you have:
List<String> list = new ArrayList<String>(elementsCount);

String concatenation for repeated string values in Java

I have two JAVA code snippets below, want to know which is better in terms of memory/performance.
First snippet:
String s1 = "USER.DELETE";
String s2 = "RESOURCE.DELETE";
String s3 = "ENTITY.DELETE";
Second snippet:
one static final variable
private static final String DELETE = ".DELETE";
and then using this variable
String s1 = "USER" + DELETE;
String s2 = "RESOURCE" + DELETE;
String s3 = "ENTITY" + DELETE;
The first approach will create 3 String instances in memory.
The second approach will create 4 String instances in memory.
Performance impact:
There will be no impact from a performance point of view: in this scenario the string concatenation is done at compile time, because the values are already known.
Java spec:
Strings computed by constant expressions (§15.28) are computed at compile time and then treated as if they were literals.
http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.10.5
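The rule quoted from the spec can be observed with reference comparison. This is a minimal sketch; ConstantFoldingDemo and the non-final local variable suffix are illustrative additions, not part of the question's code.

```java
// Illustrative sketch; ConstantFoldingDemo is a hypothetical class name.
final class ConstantFoldingDemo {
    private static final String DELETE = ".DELETE";

    static boolean foldedAtCompileTime() {
        // Both operands are compile-time constants, so the expression is a
        // constant expression: it is treated like the literal "USER.DELETE"
        // and both sides refer to the same interned String object.
        String s1 = "USER" + DELETE;
        return s1 == "USER.DELETE";
    }

    static boolean concatenatedAtRuntime() {
        // 'suffix' is not final, so it is not a constant variable. The
        // concatenation happens at runtime and creates a new String object.
        String suffix = ".DELETE";
        String s1 = "USER" + suffix;
        return s1 == "USER.DELETE";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime());   // true
        System.out.println(concatenatedAtRuntime()); // false
    }
}
```

The == comparisons are used here only to show object identity; normal code should compare strings with equals.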
Memory Impact :
There will be one extra string created in the Java heap with the second approach.
From a code maintainability point of view I would go with the second approach.
Suppose we later want to change .DELETE to .ASYNCDELETE.
With the second approach we only have to make the change in one place.
With the first approach we have to make 3 modifications.
Actually there is no difference. The compiler will perform the concatenation and store the resulting string.
So choose according to your style.
The second snippet will store 4 strings in memory while the first will store three.
You'll "waste" the space required to store the ".DELETE".
You have a good article about String concatenation here
Little difference in this scenario, as others described above.
However, if the subject is of interest to you in wider usage, for example if you create lots of strings dynamically from combinations of static data, check out the String intern() method. It lets the String class act as a factory, so you get the same String object for the same string contents. This hurts performance a bit but can save a lot of memory usage and garbage collection overhead if you're working with lots of data. It can also make hash lookups faster if you always intern the keys: in specific situations you can override equals and hashCode, or supply comparators, that use only the built-in Object '==' comparison, so the comparator does not need to compare the string contents.
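As a small illustration of what intern() does, consider the sketch below. InternDemo is a hypothetical name. Two strings built at runtime are distinct objects with equal contents, and interning both yields a single canonical instance.

```java
// Illustrative sketch; InternDemo is a hypothetical class name.
final class InternDemo {
    static boolean sameObjectAfterIntern() {
        // Built at runtime, so a and b are two distinct objects
        // that happen to have equal contents.
        String a = new StringBuilder("USER").append(".DELETE").toString();
        String b = new StringBuilder("USER").append(".DELETE").toString();
        boolean distinctBefore = (a != b) && a.equals(b);

        // intern() returns the canonical instance from the JVM's string pool,
        // so equal contents map to the same object.
        String ia = a.intern();
        String ib = b.intern();
        return distinctBefore && (ia == ib);
    }

    public static void main(String[] args) {
        System.out.println(sameObjectAfterIntern()); // true
    }
}
```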

Performance of HashMap

I have to process 450 unique strings about 500 million times. Each string has unique integer identifier. There are two options for me to use.
1. I can append the identifier to the string and, on arrival of the string, split the string to get the identifier and use it.
2. I can store the 450 strings in a HashMap<String, Integer> and, on arrival of the string, query the HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters, rather than the entire string, that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or the first two characters, if they're unique, as a "perfect index" into a 256-entry or 65K-entry array that points to the identifier.
Also, if your identifier is numeric, it's better to pre-calculate that, rather than convert it on the fly all the time. Text -> Binary is actually rather expensive (Binary -> Text is worse). So it's probably nice to avoid that if possible.
But it behooves you to work the problem. 1 million of anything at 1 ms each is about 17 minutes of processing. At 500 million, every microsecond wasted per item adds up to 8+ minutes of extra processing. You may well not care, but this just demonstrates that at these scales "every little bit helps".
So, don't take our words for it, test different things to find what gives you the best result for your work set, and then go with that. Also consider excessive object creation, and avoiding that. Normally, I don't give it a second thought. Object creation is fast, but a nano-second is a nano-second.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters in the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few nanoseconds. At this scale, 8 microseconds per item adds more than an hour to your processing.
So, again, look in all the corners.
Or, don't, write it easy, fire it up, run it over the weekend and be done with it.
If each String has a unique identifier then retrieval is O(1) only in case of hashmaps.
I wouldn't suggest the first method, because you would be splitting every string 450 * 500m times, unless your order is one string 500m times and then on to the next. As Will said, appending a number to strings and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use a HashMap<String, Integer>, keyed by the string so that the identifier can be looked up on arrival. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
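A minimal sketch of both options follows, under stated assumptions. The "name#id" wire format, the '#' separator, and all class and method names are hypothetical, since the question does not say how the identifier is appended. The parser reads the trailing digits in place, so it creates no intermediate Strings.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; names and the "name#id" format are assumptions.
final class IdLookupDemo {
    // Option 2 from the question: look the identifier up by string.
    static Map<String, Integer> buildIndex(String[] names, int[] ids) {
        Map<String, Integer> index = new HashMap<>(names.length * 2);
        for (int i = 0; i < names.length; i++) {
            index.put(names[i], ids[i]);
        }
        return index;
    }

    // Option 1 without String.split: reads a non-negative decimal id after
    // the last '#'. Overflow is not checked in this sketch.
    static int parseTrailingId(String message) {
        int sep = message.lastIndexOf('#');
        if (sep < 0 || sep == message.length() - 1) {
            throw new IllegalArgumentException("no id in: " + message);
        }
        int id = 0;
        for (int i = sep + 1; i < message.length(); i++) {
            char c = message.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("bad digit in: " + message);
            }
            id = id * 10 + (c - '0');
        }
        return id;
    }
}
```

Which variant is faster depends on string length and hash caching, so both would need to be benchmarked on the real data.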
Splitting the string should work faster if you write your code well enough. In fact if you already have the int-id, I see no reason to send only the string and maintain a mapping.
Putting into HashMap would need hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs the code you write to append (prepending might be a bit more tricky) on sending end and to parse on receiving end.
OTOH, only 450 strings aren't a big deal, and if you're into it, writing your own hashing algo/function would actually be the most elegant and performant.

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I later need to constantly check for existence of the string in it.
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for presence or absence of other strings in constant time:
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownStrings.contains(myString)) {
...
}
It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number in advance, or have a basic idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the wikipedia entry (a followup to the radix tree definition is in this page). There are Java implementations (one of them is here; however I have a fixed set of strings to inject, which is why I use a builder).
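For readers who have not seen one, a trie can be sketched in a few lines. This is a simplified illustration, not one of the implementations the answer links to; SimpleTrie and its members are hypothetical names, and a production radix tree would compress chains of single-child nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; SimpleTrie is a hypothetical class name.
final class SimpleTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    // Walks or creates one node per character: O(length of s).
    void add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), k -> new Node());
        }
        node.terminal = true;
    }

    // Follows the characters of s; a miss can be detected early,
    // as soon as a character has no matching child.
    boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }
}
```

Lookup time depends on the length of the searched string rather than on the number of stored strings.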
If your number of strings is really huge and you don't expect a minimal miss ratio then you might also consider using a bloom filter; the problem however is that it is probabilistic; but you can get very quick answers to "not there". Here also, there are implementations in Java (Guava has an implementation for instance).
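The probabilistic behaviour can be illustrated with a toy filter built on java.util.BitSet. This is not Guava's API; TinyBloomFilter and its two hash functions are assumptions made for brevity, and a real implementation would use stronger, independent hashes and size the bit array from the expected element count.

```java
import java.util.BitSet;

// Illustrative sketch; TinyBloomFilter is a hypothetical class name.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash functions derived from hashCode(); chosen only to
    // keep the sketch short.
    private int h1(String s) {
        return Math.floorMod(s.hashCode(), size);
    }

    private int h2(String s) {
        return Math.floorMod(s.hashCode() * 31 + s.length(), size);
    }

    void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }
}
```

A false answer is always reliable, which is why the filter gives quick "not there" answers; a true answer still needs a check against the real data if false positives matter.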
Otherwise, well, a HashSet...
A HashSet is probably the right answer, but if you choose (for simplicity, eg) to search a list it's probably more efficient to concatenate your words into a String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.
As others mentioned HashSet is the way to go. But if the size is going to be large and you are fine with false positives (checking if the username exists) you can use BloomFilters (probabilistic data structure) as well.

Why does appending "" to a String save memory?

I used a variable with a lot of data in it, say String data.
I wanted to use a small part of this string in the following way:
this.smallpart = data.substring(12,18);
After some hours of debugging (with a memory visualizer) I found out that the object's field smallpart retained all the data from data, although it only contained the substring.
When I changed the code into:
this.smallpart = data.substring(12,18)+"";
..the problem was solved! Now my application uses very little memory!
How is that possible? Can anyone explain this? I think this.smallpart kept referencing towards data, but why?
UPDATE:
How can I clear the big String then? Will data = new String(data.substring(0,100)) do the thing?
Doing the following:
data.substring(x, y) + ""
creates a new (smaller) String object, and throws away the reference to the String created by substring(), thus enabling garbage collection of this.
The important thing to realise is that substring() gives a window onto an existing String - or rather, the character array underlying the original String. Hence it will consume the same memory as the original String. This can be advantageous in some circumstances, but problematic if you want to get a substring and dispose of the original String (as you've found out).
Take a look at the substring() method in the JDK String source for more info.
EDIT: To answer your supplementary question, constructing a new String from the substring will reduce your memory consumption, provided you bin any references to the original String.
NOTE (Jan 2013). The above behaviour has changed in Java 7u6. The flyweight pattern is no longer used and substring() will work as you would expect.
If you look at the source of substring(int, int), you'll see that it returns:
new String(offset + beginIndex, endIndex - beginIndex, value);
where value is the original char[]. So you get a new String but with the same underlying char[].
When you do, data.substring() + "", you get a new String with a new underlying char[].
Actually, your use case is the only situation where you should use the String(String) constructor:
String tiny = new String(huge.substring(12,18));
When you use substring, it doesn't actually create a new string. It still refers to your original string, with an offset and size constraint.
So, to allow your original string to be collected, you need to create a new string (using new String, or what you've got).
I think this.smallpart kept referencing towards data, but why?
Because Java strings consist of a char array, a start offset and a length (and a cached hashCode). Some String operations like substring() create a new String object that shares the original's char array and simply has different offset and/or length fields. This works because the char array of a String is never modified once it has been created.
This can save memory when many substrings refer to the same basic string without replicating overlapping parts. As you have noticed, in some situations, it can keep data that's not needed anymore from being garbage collected.
The "correct" way to fix this is the new String(String) constructor, i.e.
this.smallpart = new String(data.substring(12,18));
BTW, the overall best solution would be to avoid having very large Strings in the first place, and to process any input in smaller chunks, a few KB at a time.
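The layout described above (a char array, a start offset and a length) can be modelled with a small class. This is a hypothetical model of the old String layout for illustration, not the JDK's code; SharedSlice, window and compact are made-up names. window mimics the old sharing substring(), and compact mimics what new String(substring) or + "" achieve.

```java
import java.util.Arrays;

// Hypothetical model of the old shared-array String layout; not JDK code.
final class SharedSlice {
    final char[] value; // backing array, possibly shared with other slices
    final int offset;
    final int count;

    SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Like the old substring(): O(1), but keeps the whole backing array alive.
    SharedSlice window(int begin, int end) {
        return new SharedSlice(value, offset + begin, end - begin);
    }

    // Copies only the visible characters into a fresh, right-sized array,
    // so the large original array can be garbage collected.
    SharedSlice compact() {
        return new SharedSlice(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

In this model, a 6-character window over a 20-character array still references all 20 characters until compact() is called.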
In Java, strings are immutable objects, and once a string is created it remains in memory until it is cleaned up by the garbage collector (and this cleaning is not something you can take for granted).
When you call the substring method, Java does not create a truly new string, but just stores a range of characters inside the original string.
So, when you created a new string with this code:
this.smallpart = data.substring(12, 18) + "";
you actually created a new string when you concatenated the result with the empty string.
That's why.
As documented by jwz in 1997:
If you have a huge string, pull out a substring() of it, hold on to the substring and allow the longer string to become garbage (in other words, the substring has a longer lifetime) the underlying bytes of the huge string never go away.
Just to sum up, if you create lots of substrings from a small number of big strings, then use
String substring = string.substring(5,23);
since you only use the space needed to store the big strings. But if you are extracting just a handful of small strings from lots of big strings, then
String substring = new String(string.substring(5,23));
will keep your memory use down, since the big strings can be reclaimed when no longer needed.
That you call new String is a helpful reminder that you really are getting a new string, rather than a reference to the original one.
First, calling java.lang.String.substring creates a new window onto the original String, using an offset and a length, instead of copying the relevant part of the underlying array.
If we take a closer look at the substring method, we notice a call to the String(int, int, char[]) constructor that passes it the whole char[] representing the string. That means the substring occupies as much memory as the original string.
OK, but why does + "" result in less memory being used than without it?
A + on strings is implemented via a StringBuilder.append call. Looking at the implementation of this method in the AbstractStringBuilder class shows that it ultimately does an arraycopy of just the part we really need (the substring).
Any other workarounds?
this.smallpart = new String(data.substring(12,18));
this.smallpart = data.substring(12,18).intern();
Appending "" to a string will sometimes save memory.
Let's say I have a huge string containing a whole book, one million characters.
Then I create 20 strings containing the chapters of the book as substrings.
Then I create 1000 strings containing all paragraphs.
Then I create 10,000 strings containing all sentences.
Then I create 100,000 strings containing all the words.
I still only use 1,000,000 characters. If you add "" to each chapter, paragraph, sentence and word, you use 5,000,000 characters.
Of course it's entirely different if you only extract one single word from the whole book, and the whole book could be garbage collected but isn't because that one word holds a reference to it.
And it's different again if you have a one-million-character string and remove tabs and spaces at both ends, making say 10 calls to create a substring. The way Java works, or worked, avoids copying a million characters each time. There is a compromise, and it's good to know what the trade-offs are.
