After reading answers on this old question, I'm a bit curious to know whether there are any frameworks now that store large numbers (millions) of small (15-25 characters long) Strings more efficiently than java.lang.String.
If possible, I would like to store/represent the strings using a byte[] instead of a char[].
My Strings are going to be constants, and I don't really require the numerous utility methods provided by the java.lang.String class.
Java 6 does this with -XX:+UseCompressedStrings, which is on by default in some updates.
It's not in Java 5.0 or 7. The flag is still listed as on by default, but it's not actually supported in Java 7. :P
Depending on what you want to do you could write your own classes, but if you only have a few hundred MBs of Strings, I suspect it's not worth it.
Most likely this optimization is not worth the effort and complexity it brings with it. Either live with what the VM offers you (as Peter Lawrey suggests), or go to great lengths to work out your own solution (not using java.lang.String).
There is an interface, CharSequence, that your own String class could implement. Unfortunately, very few JRE methods accept a CharSequence, so be prepared for toString() to be called frequently on your class if you need to pass any of your 'Strings' to any other API.
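To illustrate, here is a minimal sketch of a byte[]-backed CharSequence for strings known to fit in Latin-1; the class name and the one-byte-per-char restriction are mine, not from any library:

public final class ByteString implements CharSequence {
    private final byte[] data;

    public ByteString(String s) {
        data = new byte[s.length()];
        for (int i = 0; i < data.length; i++) {
            char c = s.charAt(i);
            if (c > 0xFF) throw new IllegalArgumentException("not Latin-1: " + c);
            data[i] = (byte) c;
        }
    }

    private ByteString(byte[] data) {
        this.data = data;
    }

    public int length() {
        return data.length;
    }

    public char charAt(int index) {
        return (char) (data[index] & 0xFF); // widen the byte back to a char
    }

    public CharSequence subSequence(int start, int end) {
        byte[] sub = new byte[end - start];
        System.arraycopy(data, start, sub, 0, sub.length);
        return new ByteString(sub);
    }

    public String toString() {
        // Needed whenever an API insists on java.lang.String.
        char[] chars = new char[data.length];
        for (int i = 0; i < chars.length; i++) chars[i] = (char) (data[i] & 0xFF);
        return new String(chars);
    }
}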
You could also hack String to create your Strings in a more memory-efficient (and less GC-friendly) way. String has a (package-access-level) constructor, String(int offset, int count, char[] value), that does not copy the chars but takes the char[] as a direct reference. You could put all your strings into one big char[] array and construct the Strings using reflection; this would avoid much of the overhead normally introduced by the per-String char[] array. I can't really recommend this method, since it relies on JRE-private functionality. A rough sketch follows.
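For completeness, a sketch of that trick, assuming an old JDK (roughly pre-7u6) that still has the package-private String(int, int, char[]) constructor; on later JDKs the constructor lookup simply fails:

import java.lang.reflect.Constructor;

public class SharedStrings {
    public static void main(String[] args) throws Exception {
        char[] pool = "foobarbaz".toCharArray(); // one big shared backing array

        // The non-copying constructor; this throws on JDKs where it was removed.
        Constructor<String> ctor =
            String.class.getDeclaredConstructor(int.class, int.class, char[].class);
        ctor.setAccessible(true);

        String foo = ctor.newInstance(0, 3, pool); // shares pool, no copy
        String bar = ctor.newInstance(3, 3, pool);
        System.out.println(foo + ", " + bar); // prints: foo, bar
    }
}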
Related
The JDK 9 team is putting effort into helping us remove non-public dependencies (using jdeps). I am using the Unsafe class for faster access to a String's inner char array, without creating a new char array. If I want to drop the dependency on the Unsafe class, I would need to load it dynamically and call Unsafe.getObject and other methods using reflection.
Now I wonder about the performance: when I access Unsafe via reflection, how does this compare to String.toCharArray()? Would it make sense to keep using Unsafe?
I assume JDK >= 7.
EDIT
Yes, I know that anyone can write these tests using e.g. JMH, but it takes a lot of time to measure with different inputs and different VM versions (7, 8). So I wonder if someone has already done this, as many libraries use Unsafe. A sketch of such a harness is below.
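For reference, here is what such a JMH harness could look like, assuming a pre-Java-9 HotSpot where String is backed by a private char[] field named value:

import java.lang.reflect.Field;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import sun.misc.Unsafe;

@State(Scope.Thread)
public class CharsBench {
    static final Unsafe UNSAFE;
    static final long VALUE_OFFSET;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            VALUE_OFFSET = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    String s = new String(new char[100]).replace('\0', 'x'); // 100-char input

    @Benchmark
    public char[] copyViaToCharArray() {
        return s.toCharArray(); // defensive copy
    }

    @Benchmark
    public char[] directViaUnsafe() {
        // No copy: aliases the String's internal array (dangerous if mutated).
        return (char[]) UNSAFE.getObject(s, VALUE_OFFSET);
    }
}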
There is a chance that there will be no backing char[] array at all in the Java 9 version of String; see JEP 254. That is, toCharArray() will be your only option.
Generally you should never use Unsafe APIs unless you are absolutely sure it is necessary. But since you are asking this question, I guess you are not. On my laptop, toCharArray() completes in 25 nanoseconds for a 100-char string, i.e. I could call it 40 million times a second. Do you really have such workloads?
If it is absolutely needed, use MethodHandles instead of both Reflection and Unsafe. MethodHandles are as fast as direct field access, but unlike Unsafe they are a public, supported, and safe API.
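A sketch of that approach, again assuming a pre-Java-9 String backed by a private char[] field named value:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Field;

public class StringValueAccess {
    private static final MethodHandle VALUE_GETTER;

    static {
        try {
            Field f = String.class.getDeclaredField("value");
            f.setAccessible(true); // accessibility is still required, as with reflection
            VALUE_GETTER = MethodHandles.lookup().unreflectGetter(f);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns the backing array without copying; do not mutate it.
    static char[] valueOf(String s) throws Throwable {
        return (char[]) VALUE_GETTER.invokeExact(s);
    }
}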
Java strings are immutable, and creating multiple Strings from the same literal yields a reference to the same object. (Is there a term for this? "Pooling" seems to fit, but that already refers to caching done to save time by performing fewer instantiations.)
Does Java also do this (the thing without a term) with other (user-defined) immutable classes? Can Java even detect that a class is immutable, or is this something unique to the String class?
With regard to Strings, the word you're looking for is interning.
Java won't do this for your own immutable objects. It does have cached versions of boxed primitives, though. See this article on wrapper class caching for more info.
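For example, Integer.valueOf caches the range -128 to 127:

Integer a = Integer.valueOf(127), b = Integer.valueOf(127);
Integer c = Integer.valueOf(128), d = Integer.valueOf(128);
System.out.println(a == b); // true: both come from the Integer cache (-128..127)
System.out.println(c == d); // false: 128 is outside the cached range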
As others here have said, this process for Strings is known as interning.
It's worth mentioning that where interned Strings are stored changed in Java 7. From 7 onwards:
In JDK 7, interned strings are no longer allocated in the permanent generation of the Java heap, but are instead allocated in the main part of the Java heap (known as the young and old generations), along with the other objects created by the application. This change will result in more data residing in the main Java heap, and less data in the permanent generation, and thus may require heap sizes to be adjusted. Most applications will see only relatively small differences in heap usage due to this change, but larger applications that load many classes or make heavy use of the String.intern() method will see more significant differences.
Take a look at the Java SE 7 RFE for the full details on this.
With regard to your own immutable objects, Java doesn't do anything special with them - it doesn't know that they're immutable. It may inline methods a little more than otherwise if it can detect that it's worthwhile/possible, but as far as the compiler and JVM are concerned they're just another object.
The term you are looking for is interning. Java optimizes strings "automatically" during compilation, and gives the developer the possibility to do it at runtime. (The details of what is optimized when depend on the JVM version.)
As far as immutable objects go, I do not think Java supports any general mechanism that resolves equal instances to the same object. The String type is no exception to this rule when you use the new operator: if you use new to create a String instance, you will always get a distinct object.
Interning is built in only for the String type. But the concept is free to copy: you can give your own immutable class a factory method that does the same pooling (see the sketch below).
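A minimal sketch of such a hand-rolled interning factory (the class and method names are illustrative):

import java.util.concurrent.ConcurrentHashMap;

public final class Point {
    private static final ConcurrentHashMap<Point, Point> POOL =
        new ConcurrentHashMap<Point, Point>();

    private final int x, y;

    private Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Canonicalizing factory: equal coordinates always yield the same instance.
    public static Point of(int x, int y) {
        Point candidate = new Point(x, y);
        Point existing = POOL.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    public int hashCode() {
        return 31 * x + y;
    }
}

With this in place, Point.of(1, 2) == Point.of(1, 2) is true.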
String interning. Wikipedia: String Interning
String interning is unique to the String class. I suppose the JVM does not apply these rules to user-defined classes.
In Thread.java, line 146, I noticed that the author used a char[] instead of a String for the name field. Are there any performance reasons that I am not aware of? getName() also wraps the characters in a String before returning the name. Isn't it better to just use a String?
In general, yes. I suspect char[] was used in Thread for performance reasons, back in the days when such things in Java required every effort to get decent performance. With the advent of modern JVMs, such micro-optimizations have long since become unimportant, but it's just been left that way.
There's a lot of weird code in the old Java 1.0-era source, I wouldn't pay too much attention to it.
Hard to say. Perhaps they had some optimizations in mind, perhaps the person who wrote this code was simply more used to C-style char* arrays for strings, or perhaps by the time this code was written they were not sure whether strings would be immutable or not. But with this code, every time Thread.getName() is called, a new char array is created, so this code is actually heavier on the GC than just using a String.
Maybe the reason was security protection? A String can be changed with reflection, so the author may have wanted copy-on-read and copy-on-write. If you are doing that, you might as well use a char array for faster copying.
I was recently reviewing some Java Swing code and saw this:
byte[] fooReference;

String getFoo() {
    return new String(fooReference);
}

void setFoo(String foo) {
    this.fooReference = foo.getBytes();
}
The above can be useful to save on your memory footprint, or so I'm told.
Is this overkill? Is anyone else encapsulating their Strings in this way?
That's a really, really bad idea. Don't use the platform default encoding. There's nothing to guarantee that if you call setFoo and then getFoo you'll get back the same data.
If you must do something like this, then use UTF-8, which can represent the whole of Unicode for certain... but I really wouldn't do it. It potentially saves some memory, but at the cost of performing unnecessary conversions most of the time - and it is error-prone, in terms of failing to use an appropriate encoding.
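If you do go down this road anyway, at least pin the charset so the round trip is lossless. A sketch of the same accessor pair with an explicit encoding (StandardCharsets requires Java 7+; older JDKs can use Charset.forName("UTF-8")):

import java.nio.charset.StandardCharsets;

byte[] fooReference;

String getFoo() {
    return new String(fooReference, StandardCharsets.UTF_8);
}

void setFoo(String foo) {
    this.fooReference = foo.getBytes(StandardCharsets.UTF_8);
}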
I dare say there are some applications where this would be appropriate, but for 99.99% of them, it's a terrible idea.
This is not really useful:
1. You are copying the string every time getFoo or setFoo is called, increasing both CPU and memory usage
2. It's obscure.
A little historical excursion...
Using byte arrays instead of String objects actually used to have some considerable advantages in the early days of Java (1.0/1.1), if you could be sure that you would never need anything outside of ISO-8859-1. With the VMs of that time it was more than 10 times faster to use drawBytes() compared to drawString(), and it actually did save memory, which was still very scarce at that time; applets used to have a hard-coded memory barrier of 32 and later 64 MB anyway. Not only is a byte[] smaller than the embedded char[] of a String object, but you could also save the comparatively heavy String object itself, which made quite a difference if you had lots of short strings. Besides that, accessing a plain byte array is also faster than using the accessor methods of String with all their extra bounds checks.
But since drawBytes() ceased to be any faster in Java 1.2, and since current JITs are much better than the Symantec JIT of that time, the remaining minimal performance advantage of byte[] arrays over strings is no longer worth the hassle. The memory advantage is still there, and it might thus still be an option in some very rare extreme scenarios, but nowadays it's nothing that should be considered if it's not really necessary.
It may well be overkill, and it may even consume more memory, since you now have two copies of the string. How long the actual string lives depends upon the client, but as with many such hacks, it smells a lot like premature optimization.
If you anticipate that you'll have a lot of identical strings, another much better way you can save memory is with the String.intern() method.
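For example:

String a = new String("config-key"); // a distinct object on the heap
String b = new String("config-key"); // another distinct object
System.out.println(a == b);                   // false: two copies
System.out.println(a.intern() == b.intern()); // true: one pooled instance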
Each call to getFoo() instantiates a new String. How is this saving memory? If anything, you're adding overhead for your garbage collector, which has to clean up these new instances once they become unreferenced.
This indeed does not make any sense. If it were a compile-time constant which you didn't need to massage back to a String, then it would make a bit more sense. You would still have the character-encoding problem, though.
It would make more sense to me if it were a char[] constant. In the real world, there are several JSP compilers which optimize String constants into a char[], which in turn can easily be written to a Writer#write(char[]). This is ultimately "slightly" more efficient, but those little bits count a lot in large and heavily used applications like Google Search and so on.
Tomcat's JSP compiler Jasper does this as well; check the genStringAsCharArray setting. It then generates
static final char[] text1 = "some static text".toCharArray();
instead of
static final String text1 = "some static text";
which ends up with less overhead. It doesn't need a whole String instance around those characters.
If, after profiling your code, you find that memory usage for strings is a problem, you're much better off using a general string compressor and storing compressed strings, rather than trying to use UTF-8 strings for the minor reduction in space they give you. With English language strings, you can generally compress them to 1-2 bits per character; most other languages are probably similar. Getting to <1 bit per character is hard, but possible if you have a lot of data.
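As an illustration, here is a sketch of such a scheme using java.util.zip; note that Deflate has per-entry overhead, so this pays off mainly for longer or batched strings:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedText {
    static byte[] compress(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length / 2 + 16);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] data) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length * 3);
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf)); // throws DataFormatException on bad input
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}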
Does anybody know a faster way to do what java.nio.charset.Charset.decode(..)/encode(..) does?
It's currently one of the bottlenecks of a technology that I'm using.
[EDIT]
Specifically, in my application, I changed one segment from a Java solution to a JNI solution (because there was a C++ technology that was more suitable for my needs than the Java technology I was using).
This change brought about a significant decrease in speed (and a significant increase in CPU & memory usage).
Looking deeper into the JNI solution I used: the Java application communicates with the C++ application via byte[]. These byte[] are produced by Charset.encode(..) on the Java side and passed to the C++ side. Then, when the C++ side responds with a byte[], it gets decoded on the Java side via Charset.decode(..).
Running this against a profiler, I see that Charset.decode(..) and Charset.encode(..) both take a significantly long time compared to the whole execution time of the JNI solution. (I profiled only the JNI solution because it's something I could whip up quite quickly. I'll profile the whole application at a later date once I free up my schedule :-) )
Reading further about my problem, it seems that this is a known problem with Charset.encode(..) and decode(..), and that it's being addressed in Java 7. However, moving to Java 7 is not an option for me (for now) due to some constraints.
Which is why I'm asking here if somebody knows a Java 5 solution/alternative to this. (Sorry, I should have mentioned sooner that this was for Java 5.) :-)
The javadoc for encode() and decode() make it clear that these are convenience methods. For example, for encode():
Convenience method that encodes Unicode characters into bytes in this charset.

An invocation of this method upon a charset cs returns the same result as the expression

cs.newEncoder()
  .onMalformedInput(CodingErrorAction.REPLACE)
  .onUnmappableCharacter(CodingErrorAction.REPLACE)
  .encode(bb);

except that it is potentially more efficient because it can cache encoders between successive invocations.
The language is a bit vague there, but you might get a performance boost by not using these convenience methods. Create and configure the encoder once, and then re-use it:
CharsetEncoder encoder = cs.newEncoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.encode(...);
encoder.encode(...);
encoder.encode(...);
encoder.encode(...);
It always pays to read the javadoc, even if you think you already know the answer.
First part: it is a bad idea in general to pass arrays into JNI code. Because of GC, Java may have to copy arrays. In the worst case an array will be copied twice - on the way into the JNI code and on the way back :)
Because of that, the Buffer class hierarchy was introduced. And of course the Java dev team created a nice way to encode/decode chars:
Charset#newDecoder returns a CharsetDecoder, which can be used to convert a ByteBuffer to a CharBuffer according to a Charset. There are two main method versions:
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
CharBuffer decode(ByteBuffer in)
For maximum performance you need the first one: it has no hidden memory allocations inside.
Note that an encoder/decoder may maintain internal state, so be careful (for example, if you decode from a two-byte encoding and the input buffer holds only half of a character). Also, encoders/decoders are not thread-safe. A sketch of such a decode loop follows.
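A sketch of the allocation-free decode loop (the buffer sizes and class name are illustrative; Charset.forName keeps it Java 5-compatible, and direct buffers can be shared with JNI code):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class StreamingDecode {
    private final CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    private final ByteBuffer in = ByteBuffer.allocateDirect(8192); // shareable with JNI
    private final CharBuffer out = CharBuffer.allocate(8192);

    // Decodes whatever the native side has written into 'in'.
    void drain(boolean endOfInput) throws CharacterCodingException {
        in.flip();
        CoderResult result = decoder.decode(in, out, endOfInput);
        if (result.isError()) {
            result.throwException();
        }
        in.compact(); // preserve a half-consumed multi-byte sequence for the next round
        // 'out' accumulates the decoded chars; flip() it when ready to consume.
    }
}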
There are very few reasons to "squeeze" a string in a byte array.
I would recommend writing the C functions to take UTF-16 strings as parameters.
This way there is no need for any conversion.