Why is Java's String memory usage said to be high?

On this blog post, it's said that the minimum memory usage of a String is:
8 * (int) ((((number of chars) * 2) + 45) / 8) bytes.
So for the String "Apple Computers", the minimum memory usage would be 72 bytes.
Even if I have 10,000 String objects of twice that length, the memory usage would be less than 2 MB, which isn't much at all. So does that mean I'm underestimating the number of Strings present in an enterprise application, or is that formula wrong?
Thanks

String storage in Java depends on how the string was obtained. The backing char array can be shared between multiple instances. If that isn't the case, you have the usual object overhead plus storage for one pointer and three ints, which usually comes out to 16 bytes of overhead. Then the backing array requires 2 bytes per char, since chars are UTF-16 code units.
For "Apple Computers" where the backing array is not shared, the minimum cost is going to be
backing array for the 15 chars -- 30 B, padded to 32 B, which aligns nicely on a word boundary.
pointer to the array -- 4 or 8 B depending on the platform.
three ints for the offset, length, and memoized hashcode -- 12 B.
2 × object overhead -- depends on the VM, but 8 B each is a good rule of thumb.
one int for the array length -- 4 B.
So roughly 72 B in total, of which the actual payload constitutes 44.4%. The payload fraction is higher for longer strings.
In Java 7, some JDK implementations did away with backing-array sharing to avoid pinning large char[]s in memory. That allowed them to drop two of the three ints.
That changes the calculation to 64 B for a string of length 16, of which the actual payload constitutes 50%.
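To make that arithmetic concrete, here is a minimal sketch (the class and method names are just illustrative) that evaluates the formula quoted in the question, which approximates the unshared, pre-Java-7u6 layout described above:

public class StringFootprintEstimate {

    // The formula from the question: 2 bytes per char plus ~45 bytes of fixed
    // overhead (headers, three ints, array reference, array length), with the
    // integer division snapping the result to a multiple of 8.
    static long estimateBytes(String s) {
        return 8L * ((s.length() * 2 + 45) / 8);
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes("Apple Computers")); // prints 72
    }
}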

Is it possible to save character data using less memory than a Java String? Yes.
Does it matter for "enterprise" applications (or even Android or J2ME applications, which have to get by on a lot less memory)? Almost never.
Premature optimization is the root...

Compared to the other data types that you have, it is definitely high. The primitive types use 32 bits, 64 bits, etc.
And given that String is immutable, every time you perform any operation on it, you end up creating a new String object, consuming even more memory.
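For example (an illustrative sketch only), repeated concatenation in a loop allocates a new String on every iteration, whereas a StringBuilder reuses one growing buffer:

public class ConcatDemo {
    public static void main(String[] args) {
        // Each concatenation creates a brand-new String (and backing array),
        // so this loop allocates many short-lived objects:
        String s = "";
        for (int i = 0; i < 1000; i++) {
            s = s + i;
        }

        // A StringBuilder appends into one growing buffer instead:
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(i);
        }
        String t = sb.toString();

        System.out.println(s.equals(t)); // true
    }
}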

Related

How do you tell if the ratio of Strings to char[] in a heap dump is indicative of an issue?

An application is exceeding the expected amount of memory allocated to it, and the top three entries in the heap dump are as follows:
num #instances #mb class name
----------------------------------------------
1: 11759890 465.61 [C
2: 3659043 292.89 [Ljava.util.HashMap$Node;
3: 11762204 282.29 java.lang.String
As strings are represented by char arrays, a correlation between the two is expected, and in this case the number of instances of each is fairly similar, with a difference of only ~3000 instances.
What appears confusing is the disparity in memory usage between the two, given that char[] accounts for over 50% more bytes than String.
As the program in question does not directly create char arrays (although a dependency I have overlooked may do so), it seems as if the Strings are the sole source of the char arrays, but given the memory usage I am unsure whether this is expected.
What are the expected ratios between char arrays and strings with respect to instances and memory usage? At what point would the data suggest a leak with one of the two?
Before Java 7, update 6, a String instance consisted of a reference to a char[] array and two int fields, offset and length. Together with the JVM-specific object overhead of at least two words for HotSpot, that can explain the average of 24 bytes per String in your statistics.
In contrast, a char[] array has the JVM-specific object overhead, a length stored as an int, and variable-sized content. So seeing a significantly bigger memory usage, like the average of 40 bytes per char[] array shown in your statistics, is not unusual. Without knowing more about your specific JVM configuration, it's a bit of guesswork, but if we assume 16 bytes for the array header, including the length and perhaps padding, the remaining 24 bytes suggest an average array size of 12 chars, which is not unreasonable.
Starting with Java 7, update 6, the offset and length fields of String are gone, which should reduce the size to 16 bytes per instance for most configurations; that could save you ~100 MB for the string instances and change the ratio significantly, so char[] instances dominate the heap even more.
Which is the reason why starting with Java 9, String instances consist of just a reference to a byte[] array. If the string consists of iso-latin-1 characters only, which applies to a lot of strings in typical applications, it will use only one byte per character, halving the amount of memory for these strings (for their array, to be precise). Since there will be also strings with characters outside that charset, of course, the saving won’t be exactly like that. But perhaps another ~100MB for your application.
But still, assuming the average string length guessed above, the memory used by these arrays will be significantly bigger than the memory used by the string instances, even under the most recent implementations. That’s normal.
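As a rough sanity check (assuming the "#mb" column is MiB; the exact unit may differ), the per-instance averages discussed above can be reproduced directly from the histogram numbers:

public class HistogramAverages {
    public static void main(String[] args) {
        long charArrayBytes = (long) (465.61 * 1024 * 1024); // reported size of all [C instances
        long charArrayCount = 11_759_890L;
        long stringBytes    = (long) (282.29 * 1024 * 1024); // reported size of all String instances
        long stringCount    = 11_762_204L;

        System.out.println(charArrayBytes / charArrayCount); // ~41 bytes per char[] on average
        System.out.println(stringBytes / stringCount);       // ~25 bytes per String on average
    }
}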

Java hashmap with single byte[] array for all keys

I have a rather large dataset consisting of 2.3 GB of data spread over 160 million byte[] arrays with an average length of 15 bytes. The value for each byte[] key is only an int, so nearly half the memory usage of the hashmap (which is over 6 GB) is made up of the 16-byte overhead of each byte array:
overhead = 8-byte header + 4-byte length, rounded up by the VM to 16 bytes.
So my overhead is about 2.5 GB.
Does anybody know of a hashmap implementation that stores its (variable length) byte[] keys in one single large byte array so there would be no overhead (apart from 1 byte length field)?
I would rather not use an in-memory DB, as those usually have a performance overhead compared to the plain Trove TObjectIntHashMap I'm using, and I value CPU cycles even more than memory usage.
Thanks in advance
Since most personal computers have 16 GB these days and servers often 32-128 GB or more, is there a real problem with there being a degree of bookkeeping overhead?
If we consider the alternative -- the byte data concatenated into a single large array -- we should think about what an individual key would have to look like in order to reference a slice of that larger array.
Most generally you would start with:
public class ByteSlice {
    protected byte[] array;
    protected int offset;
    protected int len;
}
However, that's 8 bytes + the size of a pointer (perhaps just 4 bytes?) + the JVM object header (12 bytes on a 64-bit JVM). So perhaps total 24 bytes.
If we try and make this single-purpose & minimalist, we're still going to need 4 bytes for the offset.
public class DedicatedByteSlice {
    protected int offset;
    protected byte len;

    // somebody else knows about the (single, shared) backing array
    protected static byte[] backingArray;

    protected static byte[] getArray() {
        return backingArray;
    }
}
This is still 5 bytes (probably padded to 8) + the JVM object header. Probably still total 20 bytes.
It seems that the cost of dereferencing with an offset & length, and having an object to track that, are not substantially less than the cost of storing the small array directly.
One further theoretical possibility -- de-structuring the Map Key so it is not an object
It is possible to conceive of destructuring the "length & offset" data such that it is no longer in an object. It is then passed as a set of scalar parameters, e.g. (length, offset), and -- in a hashmap implementation -- would be stored by means of arrays of the separate components (e.g. instead of a single Object[] keyArray).
However I expect it is extremely unlikely that any library offers an existing hashmap implementation for your (very particular) usecase.
If you were talking about the values it would probably be pointless, since Java does not offer multiple returns or method OUT parameters, which makes communication impractical without "boxing" destructured data back into an object. But since here you are asking about Map Keys specifically, and these are passed as parameters but need not be returned, such an approach may be theoretically possible to consider.
[Extended]
Even given this it becomes tricky -- for your use case the map API probably has to become asymmetrical for population vs lookup, as population has to be by (offset, len) to define keys, whereas practical lookup is likely still by concrete byte[] arrays.
OTOH: Even quite old laptops have 16 GB now. And your time to write this (times 4-10 to maintain) should be worth far more than the small cost of extra RAM.
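For what it's worth, here is a minimal sketch of that destructured idea (all names are illustrative; open addressing, no resizing or deletion, keys assumed shorter than 128 bytes, and the table assumed never to fill up completely):

import java.util.Arrays;

public class PackedByteKeyMap {
    private final int[] offsets;   // offset of each key in 'store'; -1 marks an empty slot
    private final byte[] lengths;  // length of each key (so keys must be < 128 bytes)
    private final int[] values;    // value stored for each slot
    private byte[] store = new byte[1024]; // all key bytes, concatenated back to back
    private int storeSize = 0;
    private final int mask;

    public PackedByteKeyMap(int capacityPowerOfTwo) {
        offsets = new int[capacityPowerOfTwo];
        lengths = new byte[capacityPowerOfTwo];
        values  = new int[capacityPowerOfTwo];
        Arrays.fill(offsets, -1);
        mask = capacityPowerOfTwo - 1;
    }

    public void put(byte[] key, int value) {
        int slot = findSlot(key);
        if (offsets[slot] < 0) {   // new key: append its bytes to the shared store
            if (storeSize + key.length > store.length) {
                store = Arrays.copyOf(store, Math.max(store.length * 2, storeSize + key.length));
            }
            System.arraycopy(key, 0, store, storeSize, key.length);
            offsets[slot] = storeSize;
            lengths[slot] = (byte) key.length;
            storeSize += key.length;
        }
        values[slot] = value;
    }

    // Returns the value mapped to the key, or defaultValue if the key is absent.
    public int get(byte[] key, int defaultValue) {
        int slot = findSlot(key);
        return offsets[slot] < 0 ? defaultValue : values[slot];
    }

    // Linear probing over the parallel arrays.
    private int findSlot(byte[] key) {
        int slot = hash(key) & mask;
        while (offsets[slot] >= 0 && !matches(slot, key)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    private boolean matches(int slot, byte[] key) {
        if (lengths[slot] != key.length) return false;
        int off = offsets[slot];
        for (int i = 0; i < key.length; i++) {
            if (store[off + i] != key[i]) return false;
        }
        return true;
    }

    private static int hash(byte[] key) {
        int h = 1;
        for (byte b : key) h = 31 * h + b;
        return h ^ (h >>> 16);
    }
}

The per-key cost here is 4 bytes for the offset, 1 byte for the length, 4 bytes for the value, and the key bytes themselves -- at the price of giving up a general-purpose, well-tested map implementation, which is exactly the trade-off discussed above.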

size of java byte[] in memory

So I created a very large array of byte[], something like byte[50,000,000][16]. According to my math that's 800,000,000 bytes, which is 0.8 GB plus some overhead.
To my surprise, when I do
memoryBefore = runtime.totalMemory() - runtime.freeMemory()
it's using 1.8 GB of memory. When I point a profiler at it I get this:
https://www.dropbox.com/s/el9vlav5ylg781z/Screen%20Shot%202015-08-30%20at%202.55.00%20pm.png?dl=0
I can see most byte[] are 24 bytes, not 16 as expected, and I see quite a few much larger byte[] of size 472 or more. Does anyone know what's going on here?
Thanks for reading
All objects have an overhead to maintain the object: information like its "type", a.k.a. Class. See What is the memory consumption of an object in Java?.
Arrays also have a length, and the overhead might be bigger in 64-bit Java.
Since you are allocating 50,000,000 arrays of 16 bytes, you get 50_000_000 * (16 + overhead) + (overhead + 50_000_000 * pointerSize). The second part is the outer array of arrays.
Depending on your requirements, you might be able to improve this in one of two ways:
Flip the indices of your 2-dimensional array to byte[16][50_000_000]. This reduces the number of array objects (and their headers) from 50,000,001 to 17 and shrinks the outer array.
Use a single array byte[16 * 50_000_000] and do the offset logic yourself, as sketched below. This will keep each record's 16 bytes contiguous and eliminate almost all of the overhead.
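A minimal sketch of that second option (the names are hypothetical), with the offset arithmetic spelled out:

public class FlatRecords {
    static final int RECORD_SIZE  = 16;
    static final int RECORD_COUNT = 50_000_000;

    // 800,000,000 bytes of payload, a single array header, no per-record overhead.
    final byte[] data = new byte[RECORD_COUNT * RECORD_SIZE];

    byte get(int record, int field) {
        return data[record * RECORD_SIZE + field];
    }

    void set(int record, int field, byte value) {
        data[record * RECORD_SIZE + field] = value;
    }
}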
I can see most byte[] are 24 bytes, not 16
An object in Java has a couple of words of header information, in addition to the space that holds the object's fields. In the case of an array, there is an additional word to hold the array's length field. The length actually needs 4 bytes ('cos length is an int), but the JVM is most likely aligning to an 8 byte boundary on your platform.
If you are seeing 16 byte arrays occupying 24 bytes, then that accounting of space is most likely including (just) the length word.
(Note that the actual space occupied by object / array headers is JVM and platform specific.)
... I see quite a few much larger byte[] of size 472 or more.
Those are unrelated. If some other part of your code is not creating them explicitly, they are most likely created by Java runtime libraries.

Does using small datatypes reduce memory usage (From memory allocation not efficiency)?

Actually, my question is very similar to this one, but that post focuses on C# only. Recently I read an article which said that Java will 'promote' some short types (like short) to 4 bytes in memory even if some bits are not used, so using them can't reduce memory usage. (Is that true?)
So my question is how languages, especially C, C++ and Java (as Manish said in this post, talking about Java), handle memory allocation of small datatypes. References or any approaches to figure it out are preferred. Thanks
C/C++ uses only the specified amount of memory but aligns the data (by default) to an address that is a multiple of some value, typically 4 bytes for 32-bit applications or 8 bytes for 64-bit.
So for example if the data is aligned on a 4 or 8 byte boundary then a "char" uses only one byte. An array of 5 chars will use 5 bytes. But the data item that is allocated after the 5 byte char array is placed at an address that skips 3 bytes to keep it correctly aligned.
This is for performance on most processors. There are usually pragmas like "pack" and "align" that can be used to change the alignment or disable it.
In C and C++, different approaches may be taken depending on how you've requested the memory.
For T* p = (T*)malloc(n * sizeof(T)); or T* p = new T[n]; then the data will occupy sizeof(T)*n bytes of memory, so if sizeof(T) is reduced (e.g. to int16_t instead of int32_t) then that space is reduced accordingly. That said, heap allocations tend to have some overheads, so few large allocations are better than a great many allocations for individual data items or very small arrays, where the overheads may be much more significant than small differences in sizeof(T).
For structures, static and stack usage, padding is more significant than for large arrays, as the following data item might be of a different type with different alignment requirements, resulting in more padding.
At the other extreme, you can apply bitfields to effectively pack values into the minimum number of bits they need - very dense compression indeed, though you need to rely on compiler pragmas/attributes if you want explicit control - the Standard leaves it unspecified when a bitfield might start in a new memory "word" (e.g. a 32-bit memory word for a 32-bit process, 64 for 64) or wrap across separate words, and where in the words the bits hold data vs padding, etc. Data types like C++ bitsets and vector<bool> may be more efficient than arrays of bool (which may well use an int for each element, but it's unspecified in the C++03 Standard).

What is the memory consumption of an object in Java?

Is the memory space consumed by one object with 100 attributes the same as that of 100 objects, with one attribute each?
How much memory is allocated for an object?
How much additional space is used when adding an attribute?
Mindprod points out that this is not a straightforward question to answer:
A JVM is free to store data any way it pleases internally, big or little endian, with any amount of padding or overhead, though primitives must behave as if they had the official sizes.
For example, the JVM or native compiler might decide to store a boolean[] in 64-bit long chunks like a BitSet. It does not have to tell you, so long as the program gives the same answers.
It might allocate some temporary Objects on the stack.
It may optimize some variables or method calls totally out of existence replacing them with constants.
It might version methods or loops, i.e. compile two versions of a method, each optimized for a certain situation, then decide up front which one to call.
Then of course the hardware and OS have multilayer caches, on chip-cache, SRAM cache, DRAM cache, ordinary RAM working set and backing store on disk. Your data may be duplicated at every cache level. All this complexity means you can only very roughly predict RAM consumption.
Measurement methods
You can use Instrumentation.getObjectSize() to obtain an estimate of the storage consumed by an object.
To visualize the actual object layout, footprint, and references, you can use the JOL (Java Object Layout) tool.
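For instance, here is a minimal sketch of an agent that exposes Instrumentation.getObjectSize() (the class name is illustrative; the agent jar needs a Premain-Class manifest entry and must be loaded with -javaagent):

import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {
    private static volatile Instrumentation instrumentation;

    // Called by the JVM before main() when the agent is attached via -javaagent.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size estimate, in bytes, of the given object (not a deep walk).
    public static long sizeOf(Object o) {
        return instrumentation.getObjectSize(o);
    }
}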
Object headers and Object references
In a modern 64-bit JDK, an object has a 12-byte header, padded to a multiple of 8 bytes, so the minimum object size is 16 bytes. For 32-bit JVMs, the overhead is 8 bytes, padded to a multiple of 4 bytes. (From Dmitry Spikhalskiy's answer, Jayen's answer, and JavaWorld.)
Typically, references are 4 bytes on 32-bit platforms, and also on 64-bit platforms with a heap up to -Xmx32G; above 32 GB they are 8 bytes. (See compressed object references.)
As a result, a 64-bit JVM would typically require 30-50% more heap space. (Should I use a 32- or a 64-bit JVM?, 2012, JDK 1.7)
Boxed types, arrays, and strings
Boxed wrappers have overhead compared to primitive types (from JavaWorld):
Integer: The 16-byte result is a little worse than I expected because an int value can fit into just 4 extra bytes. Using an Integer costs me a 300 percent memory overhead compared to when I can store the value as a primitive type
Long: 16 bytes also: Clearly, actual object size on the heap is subject to low-level memory alignment done by a particular JVM implementation for a particular CPU type. It looks like a Long is 8 bytes of Object overhead, plus 8 bytes more for the actual long value. In contrast, Integer had an unused 4-byte hole, most likely because the JVM I use forces object alignment on an 8-byte word boundary.
Other containers are costly too:
Multidimensional arrays: it offers another surprise.
Developers commonly employ constructs like int[dim1][dim2] in numerical and scientific computing.
In an int[dim1][dim2] array instance, every nested int[dim2] array is an Object in its own right. Each adds the usual 16-byte array overhead. When I don't need a triangular or ragged array, that represents pure overhead. The impact grows when array dimensions greatly differ.
For example, an int[128][2] instance takes 3,600 bytes. Compared to the 1,040 bytes an int[256] instance uses (which has the same capacity), 3,600 bytes represent a 246 percent overhead. In the extreme case of byte[256][1], the overhead factor is almost 19! Compare that to the C/C++ situation, in which the same syntax does not add any storage overhead.
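As a sanity check of those figures, here is a tiny calculation under the assumptions quoted above (16 bytes of per-array overhead, 4-byte references; actual numbers are VM-specific):

public class NestedArrayOverhead {
    public static void main(String[] args) {
        int header = 16;                          // assumed per-array overhead (header + length, padded)
        long nested = header + 128 * 4            // the outer int[128] holds 128 4-byte references
                    + 128L * (header + 2 * 4);    // plus 128 separate inner int[2] arrays
        long flat   = header + 256 * 4;           // one int[256] with the same capacity
        System.out.println(nested);               // 3600
        System.out.println(flat);                 // 1040
    }
}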
String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
Alignment
Consider this example object:
class X {                       // 8 bytes for reference to the class definition
    int a;                      // 4 bytes
    byte b;                     // 1 byte
    Integer c = new Integer(0); // 4 bytes for a reference
}
A naïve sum would suggest that an instance of X would use 17 bytes. However, due to alignment (also called padding), the JVM allocates the memory in multiples of 8 bytes, so instead of 17 bytes it would allocate 24 bytes.
It depends on the CPU architecture and the JDK. For a modern JDK and a 64-bit architecture, an object has a 12-byte header and is padded to a multiple of 8 bytes - so the minimum object size is 16 bytes. You can use a tool called Java Object Layout to determine the size and get details about any entity's object layout and internal structure, or estimate this information from a class reference. Example of output for an Integer instance on my environment:
Running 64-bit HotSpot VM.
Using compressed oop with 3-bit shift.
Using compressed klass with 3-bit shift.
Objects are 8 bytes aligned.
Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
java.lang.Integer object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 12 (object header) N/A
12 4 int Integer.value N/A
Instance size: 16 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
For Integer, the instance size is 16 bytes: the 4-byte int is placed right after the 12-byte header. It doesn't need any additional padding, because 16 is already a multiple of 8 (the alignment unit on a 64-bit architecture).
Code sample:
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.util.VMSupport;

public class JolExample {
    public static void main(String[] args) {
        System.out.println(VMSupport.vmDetails());
        System.out.println(ClassLayout.parseClass(Integer.class).toPrintable());
    }
}
If you use maven, to get JOL:
<dependency>
    <groupId>org.openjdk.jol</groupId>
    <artifactId>jol-core</artifactId>
    <version>0.3.2</version>
</dependency>
Each object has a certain overhead for its associated monitor and type information, as well as the fields themselves. Beyond that, fields can be laid out pretty much however the JVM sees fit (I believe) - but as shown in another answer, at least some JVMs will pack fairly tightly. Consider a class like this:
public class SingleByte
{
    private byte b;
}
vs
public class OneHundredBytes
{
    private byte b00, b01, ..., b99;
}
On a 32-bit JVM, I'd expect 100 instances of SingleByte to take 1200 bytes (8 bytes of overhead + 4 bytes for the field due to padding/alignment). I'd expect one instance of OneHundredBytes to take 108 bytes - the overhead, and then 100 bytes, packed. It can certainly vary by JVM though - one implementation may decide not to pack the fields in OneHundredBytes, leading to it taking 408 bytes (= 8 bytes overhead + 4 * 100 aligned/padded bytes). On a 64 bit JVM the overhead may well be bigger too (not sure).
EDIT: See the comment below; apparently HotSpot pads to 8-byte boundaries rather than 4-byte (32-bit) ones, so each instance of SingleByte would take 16 bytes.
Either way, the "single large object" will be at least as efficient as multiple small objects - for simple cases like this.
It appears that every object has an overhead of 16 bytes on 32-bit systems (and 24 bytes on 64-bit systems).
http://algs4.cs.princeton.edu/14analysis/ is a good source of information; its discussion of typical object memory requirements is one example among many.
http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf is also very informative, for example:
The total used / free memory of a program can be obtained in the program via
java.lang.Runtime.getRuntime();
The runtime has several methods which relate to the memory. The following coding example demonstrates its usage.
import java.util.ArrayList;
import java.util.List;

public class PerformanceTest {
    private static final long MEGABYTE = 1024L * 1024L;

    public static long bytesToMegabytes(long bytes) {
        return bytes / MEGABYTE;
    }

    public static void main(String[] args) {
        // I assume you will know how to create an object Person yourself...
        List<Person> list = new ArrayList<>();
        for (int i = 0; i <= 100_000; i++) {
            list.add(new Person("Jim", "Knopf"));
        }
        // Get the Java runtime
        Runtime runtime = Runtime.getRuntime();
        // Run the garbage collector
        runtime.gc();
        // Calculate the used memory
        long memory = runtime.totalMemory() - runtime.freeMemory();
        System.out.println("Used memory is bytes: " + memory);
        System.out.println("Used memory is megabytes: " + bytesToMegabytes(memory));
    }
}
Is the memory space consumed by one object with 100 attributes the same as that of 100 objects, with one attribute each?
No.
How much memory is allocated for an object?
The overhead is 8 bytes on 32-bit, 12 bytes on 64-bit; and then rounded up to a multiple of 4 bytes (32-bit) or 8 bytes (64-bit).
How much additional space is used when adding an attribute?
Attributes range from 1 byte (byte) to 8 bytes (long/double). References are 4 bytes on 32-bit JVMs; on 64-bit JVMs they are 4 or 8 bytes depending on whether -Xmx is below 32 GB: typical 64-bit JVMs have an optimisation, compressed oops (-XX:+UseCompressedOops), which compresses references to 4 bytes if the heap is below 32 GB.
No, each object itself takes a bit of memory too. 100 objects with 1 attribute each will take up more memory.
This is a very broad question.
It depends on the class's fields, or what you might call the object's state.
There is also some additional memory requirement for headers and references.
The heap memory used by a Java object includes
memory for primitive fields, according to their size (see below for Sizes of primitive types);
memory for reference fields (4 bytes each);
an object header, consisting of a few bytes of "housekeeping" information;
Objects in Java also require some "housekeeping" information, such as a record of the object's class, an ID, and status flags such as whether the object is currently reachable or currently synchronization-locked.
The Java object header size varies between 32-bit and 64-bit JVMs.
Although these are the main memory consumers, the JVM sometimes also requires additional space, for example for alignment/padding.
Sizes of primitive types
boolean & byte -- 1 byte
char & short -- 2 bytes
int & float -- 4 bytes
long & double -- 8 bytes
I've gotten very good results from the java.lang.instrument.Instrumentation approach mentioned in another answer. For good examples of its use, see the entry, Instrumentation Memory Counter from the JavaSpecialists' Newsletter and the java.sizeOf library on SourceForge.
In case it's useful to anyone, you can download from my web site a small Java agent for querying the memory usage of an object. It'll let you query "deep" memory usage as well.
No, 100 small objects need more information (memory) than one big one.
The rules about how much memory is consumed depend on the JVM implementation and the CPU architecture (32-bit versus 64-bit, for example).
For the detailed rules for the Sun JVM, check my old blog.
Regards,
Markus
