Java 7 String - substring complexity - java

Until Java 6, we had a constant time substring on String. In Java 7, why did they decide to go with copying char array - and degrading to linear time complexity - when something like StringBuilder was exactly meant for that?

The reasoning is discussed in Oracle bug #4513622, "(str) keeping a substring of a field prevents GC for object":
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String cannot be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimizations for space/size trade-offs are generally complex and can often be platform-specific.
There's also this note, noting that what once was an optimization had become a pessimization according to tests:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks but with current real world code and benchmarks it's actually better to not share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers however current testing shows that overall this change is beneficial.

If you have a long lived small substring of a short lived, large parent string, the large char[] backing the parent string will not be eligible for garbage collection until the small substring moves out of scope. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
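The retention problem described above can be sketched as follows; the class and method names here are illustrative, not from the original post. On Java 6, returning the bare substring would have pinned the entire 1 MB backing array; the explicit copy constructor detaches the result (and is a harmless extra copy on Java 7+).

```java
import java.util.Arrays;

public class SubstringLeakDemo {
    // Returns a detached two-character prefix. On Java 6 the new String(...)
    // wrapper was the workaround; on Java 7+ substring already copies.
    static String firstTwoChars(String large) {
        return new String(large.substring(0, 2));
    }

    public static void main(String[] args) {
        char[] big = new char[1_000_000];
        Arrays.fill(big, 'x');
        String large = new String(big);
        // Only "xx" is retained; the 1 MB array behind `large` can be GC'd
        // once `large` itself goes out of scope.
        String small = firstTwoChars(large);
        System.out.println(small); // prints "xx"
    }
}
```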

This will impact the complexity of data structures like Suffix Arrays by a fair margin. Java should provide some alternate method for getting a part of the original string.

It's just their crappy way of fixing some JVM garbage collection limitations.
Before Java 7, if we wanted to avoid the GC-pinning issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0, 2));
But now, we can no longer have a constant time substring. What a disaster.

The main motivation, I believe, is the eventual "co-location" of a String and its char[]. Right now they can live far apart on the heap, which is a significant penalty in terms of cache behavior. If every String owns its char[], the JVM can merge them together, and reading will be much faster.

Related

Difference between ArrayList.trimToSize() and Array?

Generally, they say that we have moved from arrays to ArrayList for the following reason:
Arrays are fixed size, whereas ArrayLists are not.
One of the disadvantages of ArrayList is:
When it reaches its capacity, an ArrayList grows to 3/2 of its current size. As a result, memory can be wasted if we do not utilize the space properly. In this scenario, arrays are preferred.
If we use ArrayList.trimToSize(), will that make ArrayList the clear choice, eliminating the only advantage (fixed size) arrays have over it?
One short answer would be: trimToSize doesn't solve everything, because shrinking an array after it has grown is not the same as preventing growth in the first place; the former has the cost of copying plus garbage collection.
The longer answer would be: int[] is low level, while ArrayList is high level, which means it's more convenient but gives you less control over the details. Thus in business-oriented code (e.g. manipulating a short list of "Products") I'll prefer ArrayList so that I can forget about the technicalities and focus on the business. In mathematically-oriented code I'll probably go for int[].
There are additional subtle differences, but I'm not sure how relevant they are to you. E.g. concurrency: if you structurally modify an ArrayList while it is being iterated, it will typically fail fast with a ConcurrentModificationException, because that's the intuitive requirement for most business code. An int[] will allow you to do whatever you want, leaving it up to you to make sure it makes sense. Again, this can all be summarized as "low level"...
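That fail-fast behavior can be demonstrated even in a single thread: adding to the list from inside a for-each loop trips the same internal modification check. A minimal sketch (the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class FailFastDemo {
    // Returns true if the iterator detected the structural modification.
    static boolean modifyWhileIterating() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer i : list) {
                list.add(i); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // the fail-fast check fired
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyWhileIterating()); // prints "true"
    }
}
```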
If you are developing an extremely memory-critical application, need resizability as well, and can trade off some performance, then trimming an ArrayList is your best bet. That is the only time an ArrayList with trimming is the clear choice.
In other situations, what you are actually doing is:
1. You create an ArrayList. The default capacity of the list is 10.
2. You add an element and apply the trim operation, so both size and capacity are now 1. How does trimToSize() work? It creates a new array with the actual size of the list and copies the old array's data into it; the old array is left for garbage collection.
3. You add another element. Since the list is full, it is reallocated with 50% more capacity, following a procedure similar to step 2.
4. You call trimToSize() again, and it follows the same procedure as step 2.
And so on...
So you see, we incur a lot of performance overhead just to keep the list's capacity and size the same. The fixed size is not buying you anything here except saving a few extra slots, which is hardly an issue on modern machines.
In a nutshell, if you want resizability without writing lots of boilerplate code, then ArrayList is the clear choice. But if the size never changes and you don't need any dynamic operations such as removal, then an array is the better choice. A few extra bytes are hardly an issue.
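A minimal sketch of the cheaper pattern implied above (names are illustrative): presize once and trim once after bulk-loading, instead of trimming between additions, which would force a full array copy on every subsequent insert.

```java
import java.util.ArrayList;

public class TrimDemo {
    // Builds a list of n elements with one allocation, then drops any
    // spare capacity in a single trim at the end.
    static ArrayList<Integer> buildAndTrim(int n) {
        ArrayList<Integer> list = new ArrayList<>(n); // presized, no regrowth
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        list.trimToSize(); // one copy at most, after loading is done
        return list;
    }

    public static void main(String[] args) {
        System.out.println(buildAndTrim(5)); // prints "[0, 1, 2, 3, 4]"
    }
}
```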

Which is the best way to reduce complexity of time and space in Java?

I am writing a program and I am concerned about the time it takes to run and the space it takes.
In the program I used a variable to store the length of an array:
int len = newarray3.length;
Now, I wanted to know whether I can reduce the space complexity by not using the len variable and instead calling newarray3.length whenever needed.
Note: There are only two occasions when the length needs to be used.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth
First off, a single int variable uses a negligible amount of space. More generally, don't worry about these minor performance ("time and space") issues. Focus on writing clear, readable programs and efficient algorithms. If performance becomes an issue down the line, you'll be able to optimize then as long as your general architecture is solid. So focus on that architecture.
I really doubt that this is the thing that will improve your performance.
Anyway, I would use newarray3.length (instead of assigning it to a new variable, which takes memory and adds another operation). length is a field, not even a method call. Reading it costs the same as reading the int value you copied, but you save the copy and the 4 bytes of memory the int consumes.
Java arrays are not resizable, so the length field is constant for a given object; this can enable various JIT optimizations.
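For illustration, here are the two equivalent styles side by side (a sketch; the names are mine). Because length is a final field read that the JIT can hoist out of the loop, the two forms should perform the same:

```java
public class ArrayLengthDemo {
    // Variant 1: cache the length in a local variable.
    static int sumCached(int[] a) {
        int len = a.length; // read once
        int sum = 0;
        for (int i = 0; i < len; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Variant 2: read a.length in the loop condition each time.
    static int sumDirect(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(sumCached(data) == sumDirect(data)); // prints "true"
    }
}
```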

Java Character vs char: what about memory usage?

In one of my classes I have a field of type Character. I preferred it over char because sometimes the field has "no value" and null seems to me the cleanest way to represent this (lack of) information.
However I'm wondering about the memory footprint of this approach. I'm dealing with hundreds of thousands of objects, and the normally negligible difference between the two options may now deserve some investigation.
My first bet is that a char takes two bytes whereas a Character is an object, and so it takes much more in order to support its life cycle. But I know boxed primitives like Integer, Character and so on are not ordinary classes (think about boxing and unboxing), so I wonder if the JVM can make some kind of optimization under the hood.
Furthermore, are Characters garbage collected like the other stuff or have a different life cycle? Are they pooled from a shared repository? Is this standard or JVM implementation-dependent?
I wasn't able to find any clear information on the Internet about this issue. Can you point me to some information?
If you are creating Character objects, prefer to use
Character.valueOf('c'); // it returns cached value for 'c'
Character c = new Character('c');// prefer to avoid
Following is an excerpt from javadoc.
If a new Character instance is not required, the method Character.valueOf(char) should generally be used in preference to the constructor Character(char), as it is likely to yield significantly better space and time performance by caching frequently requested values.
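A quick sketch of the caching the javadoc describes (the class name is mine). The identity comparison relies on valueOf's specified cache for char values 0 to 127, so it is guaranteed to hold for ASCII characters:

```java
public class CharCacheDemo {
    // True when both valueOf calls return the same cached instance.
    // Guaranteed for chars in the range 0..127 by the Character javadoc.
    static boolean sameInstance(char c) {
        return Character.valueOf(c) == Character.valueOf(c);
    }

    public static void main(String[] args) {
        System.out.println(sameInstance('c')); // prints "true" (cached)
    }
}
```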
Use int instead. Use -1 to represent "no char".
There is plenty of precedent for this pattern, for example int read() in java.io.Reader.
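A minimal sketch of that sentinel convention (firstCharOf is a hypothetical accessor of mine, not part of java.io.Reader; only the -1 convention is borrowed from it):

```java
public class NoCharDemo {
    static final int NO_CHAR = -1; // same sentinel convention as Reader.read()

    // Returns the first char widened to an int, or -1 for "no value".
    static int firstCharOf(String s) {
        return s.isEmpty() ? NO_CHAR : s.charAt(0);
    }

    public static void main(String[] args) {
        int c = firstCharOf("abc");
        if (c != NO_CHAR) {
            System.out.println((char) c); // prints "a"
        }
        System.out.println(firstCharOf("") == NO_CHAR); // prints "true"
    }
}
```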
As you stated, a Character reference can be null, so it has to take more space in RAM than a regular char, which cannot be null: in a way, Character is a superset of char.
However, in a given part of your code, the JIT compiler might be able to detect that your Character is never null and is always used as a regular char, and optimize that part so that your Character uses no more RAM and executes no slower. I'm just speculating on this very point, though; I don't know whether a JIT can actually perform this precise optimization.

Using boolean instead of byte or int in Java

Is using boolean instead of byte (if I need two states) in Java useful for performance, or is it just an illusion? Is any space saving cancelled out by alignment?
You should use whichever is clearer, unless you have profiled your code, and decided that making this optimization is worth the cost in readability. Most of the time, this sort of micro-optimization isn't worth the performance increase.
According to Oracle,
boolean: ... This data type represents one bit of information, but its "size" isn't something that's precisely defined.
To give you an idea, I once consulted in a mini-shop (16-bit machines).
Sometimes people would have a "flag word", a global int containing space for 16 boolean flags.
This was to save space.
Never mind that to test a flag required two 16-bit instructions, and to set a flag required three or more.
Yes, boolean may use only 1 bit. But more important, it makes it clearer for another developer reading your code that there are only two possible states.
The answer depends on your JVM, and on your code. The only way to find out for sure is by profiling your actual code.
If you only have 2 states that you want to represent, and you want to reduce memory usage you can use a java.util.BitSet.
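A short sketch of the BitSet approach (the sizes and indices here are illustrative). A BitSet packs flags roughly 8 per byte, versus one byte per flag for a typical boolean[]:

```java
import java.util.BitSet;

public class BitSetDemo {
    // Builds a bit set with the given flag indices turned on.
    static BitSet flagsFor(int... indices) {
        BitSet flags = new BitSet(1_000_000); // ~125 KB for a million flags
        for (int i : indices) {
            flags.set(i);
        }
        return flags;
    }

    public static void main(String[] args) {
        BitSet flags = flagsFor(42, 99_999);
        System.out.println(flags.get(42));       // prints "true"
        System.out.println(flags.get(43));       // prints "false"
        System.out.println(flags.cardinality()); // prints "2"
    }
}
```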
On most JVMs a boolean uses the same amount of space as a byte. Accessing a byte/boolean can be more work than accessing an int or long, so if performance is the only consideration, an int or long can be faster. When you share a value between threads, there can be an advantage to reserving a whole cache line for the field (in the most extreme cases); this is 64 bytes on many CPUs.

Remove unused allocated Memory from HashMaps

I want to read some XML files and convert them to a graph (no graphics, just a model). But because the files are very large (2.2 GB), my model object, which holds all the information, becomes even larger (4x the size of the file...).
Googling through the net I tried to find ways to reduce the object size. I tried different collection types but would like to stick with a HashMap (because I need random access). The actual keys and values make up just a small amount of the allocated memory; most of the hash table is empty...
If I'm not totally wrong, a garbage collection doesn't help me free the allocated memory and reduce the size of the hashmap. Is there any other way to release unused memory and shrink the hashmap? Or is there a way to do perfect hashing? Or should I just use another collection?
Thanks in advance,
Sebastian
A HashMap is typically just a large array of references filled to a certain percentage of capacity. If only 80% of the map is filled, the remaining 20% of the array cells are unused (i.e., are null). The extra overhead is really only just the empty (null) cells.
On a 32-bit CPU, each array cell is usually 4 bytes in size (although some JVM implementations may allocate 8 bytes). That's not really that much unused space overall.
Once your map is filled, you can copy it to another HashMap with a more appropriate (smaller) size giving a larger fill percentage.
Your question seems to imply that there are more allocated but unused objects that you're worried about. But how is that the case?
Addendum
Once a map is filled beyond its load factor (75% of capacity by default), a larger array is allocated, the old array's contents are copied to the new array, and the smaller array is left to be garbage collected. This is obviously an expensive operation, so choosing a reasonably large initial size for the map is key to improving performance.
If you can (over)estimate the number of cells needed, preallocating a map can reduce or even eliminate the resizing operations.
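A common presizing sketch based on the default 0.75 load factor (the helper name is mine): requesting a capacity of expected / 0.75 + 1 lets the map hold the expected number of entries without ever resizing.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    // Returns a HashMap sized so that `expected` entries fit without a
    // resize, assuming the default 0.75 load factor.
    static <K, V> Map<K, V> presizedMap(int expected) {
        return new HashMap<>((int) (expected / 0.75f) + 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> m = presizedMap(10_000);
        m.put("a", 1); // no rehashing will occur up to 10,000 entries
        System.out.println(m.get("a")); // prints "1"
    }
}
```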
What you are asking is not so clear; it is not clear whether the memory is taken by the objects that you put inside the hashmap or by the hashmap itself, which shouldn't be the case since it only holds references.
In any case, take a look at WeakHashMap; maybe it is what you are looking for: it is a hashmap which doesn't guarantee that keys are kept inside it. It should be used as a sort of cache, but from your description I don't really know whether that fits your case.
If you get nowhere with reducing the memory footprint of your hashmap, you could always put the data in a database. Depending on how the data is accessed, you might still get reasonable performance if you introduce a cache in front of the db.
One thing that might come into play is that you might have substrings that are referencing old larger strings, and those substrings are then making it impossible for the GC to collect the char arrays that are too big.
This happens when you are using some XML parsers that are returning attributes/values as substring from a larger string. (A substring is only a limited view of the larger string).
Try to put your strings in the map by doing something like this:
map.put(new String(key), new String(value));
Note that the GC then might get more work to do when you are populating the map, and this might not help you if you don't have that many substrings that are referencing larger strings.
If you're really serious about this and you have time to spare, you can make your own implementation of the Map interface based on minimal perfect hashing.
If your keys are Strings, then there apparently is a map available for you here.
I haven't tried it myself but it brags about reduced memory usage.
You might give the Trove collections a shot. They are advertised as a more time- and space-efficient drop-in replacement for the java.util collections.
