Consider the following code in a method:
count += new String(text.getBytes()).length();
I am facing a memory issue. I am using this line to count the number of characters in a file. When I take a heap dump, I see a huge amount of memory occupied by String objects. Is it because of this line of code? I am just looking for suggestions.
Assuming text is a String, this code is roughly equivalent to count += text.length(). The differences are mostly:
it needlessly uses more memory (and CPU time) by encoding the string in the platform default encoding and then decoding it again
if the platform default encoding can't represent some of the characters in text, those will be replaced with a ?. If those characters aren't in the BMP, this can actually result in a decreased length.
So it's arguably strictly worse than just taking text.length() (and if the second effect is actually intentional, there are more efficient ways to check for it).
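To see the second effect concretely, here is a minimal sketch; the sample string and the explicit ISO-8859-1 charset are assumptions standing in for the platform default encoding:

import java.nio.charset.StandardCharsets;

public class RoundTripLoss {
    public static void main(String[] args) {
        String text = "a\uD83D\uDE00"; // 'a' plus an emoji (a surrogate pair), so length() == 3
        // Encoding replaces the unmappable emoji with a single '?' byte...
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // ...so the round-tripped string is shorter than the original.
        String roundTripped = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(text.length() + " -> " + roundTripped.length()); // prints "3 -> 2"
    }
}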
Apart from that, the major problem is probably the size of the content of text: if it's a whole file or some other huge chunk of data, then keeping it all in memory instead of processing it as a stream will always produce some memory pressure. You "just" increased it with this code, but the fundamental solution is to not keep the whole thing in memory in the first place (which is possible more often than not).
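For example, here is a minimal sketch of counting characters while streaming the file; the file name and charset are assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingCharCount {
    public static void main(String[] args) throws IOException {
        long count = 0;
        // Read the file a buffer at a time instead of holding it all in one String.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                count += n;
            }
        }
        System.out.println(count + " characters");
    }
}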
I think you can get the character count like this:
for (int i = 0; i < text.length(); i++) {
    count++;
}
As per the API, these are the facts:
The seek(long bytePosition) method, simply put, moves the pointer to the position specified by the bytePosition parameter.
When bytePosition is greater than the file length, the file length does not change unless a byte is written at the (new) end.
If data is present in the length skipped over, such data is left untouched.
However, the situation I'm curious about is: When there is a file with no data (0 bytes) and I execute the following code:
file.seek(100000-1);
file.write(0);
All 100,000 bytes are filled with 0 almost instantly. I can clock over 200 GB in, say, 10 ms.
But when I try to write 100,000 bytes using other methods such as BufferedOutputStream, the same process takes vastly longer.
What is the reason for this difference in time? Is there a more efficient way to create a file of n bytes and fill it with 0s?
EDIT:
If the data is not actually written, how is the file filled with data?
Sample this code:
RandomAccessFile out = new RandomAccessFile("D:/out", "rw");
out.seek(100000 - 1);
out.write(0);
out.close();
Also, if the file is huge enough, I can no longer write to the disk due to lack of space.
When you write 100,000 bytes to a BufferedOutputStream, your program is explicitly accessing each byte of the file and writing a zero.
When you use RandomAccessFile.seek() on a local file, you are indirectly using the C library's fseek() (and ultimately the lseek() system call). How that gets handled depends on the operating system.
In most modern operating systems, sparse files are supported. This means that if you ask for an empty 100,000 byte file, 100,000 bytes of disk space are not actually used. When you write to byte 100,001, the OS still doesn't use 100,001 bytes of disk. It allocates a small amount of space for the block containing "real" data, and keeps track of the empty space separately.
When you read a sparse file, for example, by fseek()ing to byte 50,000, then reading, the OS can say "OK, I have not allocated disk space for byte 50,000 because I have noted that bytes 0 to 100,000 are empty. Therefore I can return 0 for this byte.". This is invisible to the caller.
This has the dual purpose of saving disk space, and improving speed. You have noticed the speed improvement.
More generally, fseek() goes directly to a position in a file, so it's O(1) rather than O(n). If you compare a file to an array, it's like doing x = arr[n] instead of for(i = 0; i<=n; i++) { x = arr[i]; }
This description, and the one on Wikipedia, is probably sufficient to understand why seeking to byte 100,000 and then writing is faster than writing 100,000 zeros. If you want more detail, you can read the Linux kernel source code to see how sparse files are implemented, and the RandomAccessFile source code in the JDK and the JRE to see how they interact; however, this is probably more than you need.
Your operating system and filesystem support sparse files, and when that is the case, seek is implemented to take advantage of this feature.
This is not really related to Java; it's just a feature of the fseek and fwrite functions from the C library, which are most likely the backend behind the file implementation in the JRE you are using.
More info: https://en.wikipedia.org/wiki/Sparse_file
Is there a more efficient way to create a file of n bytes and fill it with 0s?
On operating systems that support it, you could set the file length to the desired size instead of issuing a write call. In Java this is available as RandomAccessFile.setLength(long); the API leaves the contents of the extended portion undefined, but on common filesystems it reads as zeros.
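Here is a minimal sketch of both approaches; the file names are placeholders, and whether the result is actually sparse on disk depends on the filesystem:

import java.io.IOException;
import java.io.RandomAccessFile;

public class PreallocateDemo {
    public static void main(String[] args) throws IOException {
        // Approach 1: seek past the end and write a single byte, as in the question.
        try (RandomAccessFile out = new RandomAccessFile("seeked.bin", "rw")) {
            out.seek(100000 - 1);
            out.write(0);
        }
        // Approach 2: set the length directly; no byte is explicitly written at all.
        try (RandomAccessFile out = new RandomAccessFile("stretched.bin", "rw")) {
            out.setLength(100000);
        }
        // Reading inside the unwritten region returns zeros in either case.
        try (RandomAccessFile in = new RandomAccessFile("seeked.bin", "r")) {
            in.seek(50000);
            System.out.println(in.read()); // prints 0
        }
    }
}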
I am writing a simulation that will use a lot of memory and needs to be fast. Instead of using ints I am using chars (16 bits instead of 32). I need to operate on them as if they were ints.
To achieve that I have done something like
char a = 1;
char b = 2;
System.out.println(a*1 + b*1); // it gives me 3 in the console, so it has int-like behavior
I don't know what's going on "under the hood" when I multiply a char by an integer. Is this the fastest way to do it?
Thank you for your help!
Performance-wise it's not worth using char instead of int, because all modern hardware architectures are optimized for 32- or 64-bit-wide memory and register access.
The only reason to use char would be to reduce the memory footprint, i.e. if you work with large amounts of data.
Additional info: Performance of built-in types : char vs short vs int vs. float vs. double
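The savings only show up in bulk storage such as arrays; here is a minimal sketch (the sizes describe element data only, ignoring object headers):

public class ArrayFootprint {
    public static void main(String[] args) {
        char[] chars = new char[1000000]; // ~2 MB of element data (2 bytes per char)
        int[] ints = new int[1000000];    // ~4 MB of element data (4 bytes per int)
        System.out.println(chars.length + " chars and " + ints.length + " ints allocated");
    }
}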
A char is simply a number (a 16-bit unsigned value rather than a 32-bit int) that is normally interpreted as a character code. When you multiply it by an integer, the compiler inserts an implicit widening conversion from char to int; that's why you get 3 printed on the console instead of a character.
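Here is a minimal sketch making that implicit conversion visible; the cast back to char is required because the result of the arithmetic is an int:

public class CharPromotion {
    public static void main(String[] args) {
        char a = 1;
        char b = 2;
        int sum = a + b;         // binary numeric promotion widens both chars to int
        // char c = a + b;       // would not compile: the int result needs a cast
        char c = (char) (a + b); // explicit narrowing back to 16 bits
        System.out.println(sum + " " + (int) c); // prints "3 3"
    }
}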
You should use ints. The size of the type will not affect the fetch speed of the data, nor the processing time, unless a value is composed of multiple machine words, which is unlikely here. There is no reason to do this. Moreover, chars are widened to integers for arithmetic, so in local variables and on the operand stack they occupy the same size anyway. An alternative would be short, but again this is not helpful for what you want.
Also, I notice your code has System.out.println, which means you are using Java. There are several implications of this. The first is that no, you are not going to be extremely fast or use very little memory. Period, it will not happen: there is a large amount of overhead involved in running the JVM, the JIT, the garbage collector and other parts of the Java platform. If efficiency is a relevant factor, you are starting off wrong. The second implication is that your choice of data types will have no impact on processing time, because they will be identical on the physical hardware; only the virtual machine distinguishes between them, and in the case of primitives there is no difference anyway.
Until Java 6, we had a constant-time substring on String. In Java 7, why did they decide to go with copying the char array, degrading to linear time complexity, when something like StringBuilder was exactly meant for that?
Why they decided is discussed in Oracle bug #4513622 : (str) keeping a substring of a field prevents GC for object:
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String can not be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimizations for space/size trade-offs are generally complex and can often be platform-specific.
There's also this note, observing that what was once an optimization had become a pessimization according to tests:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks, but with current real-world code and benchmarks it's actually better not to share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers; however, current testing shows that overall this change is beneficial.
If you have a long lived small substring of a short lived, large parent string, the large char[] backing the parent string will not be eligible for garbage collection until the small substring moves out of scope. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
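Here is a minimal sketch of the hidden retention on Java 6; the 10-million-char string is just a stand-in for any short-lived, large parent string:

import java.util.ArrayList;
import java.util.List;

public class SubstringRetention {
    static final List<String> keepers = new ArrayList<String>();

    public static void main(String[] args) {
        String big = new String(new char[10000000]); // backed by a ~20 MB char[]
        String tag = big.substring(0, 4);
        keepers.add(tag); // on Java 6, tag shares big's array, keeping all ~20 MB alive
        big = null;       // the parent String is unreachable, but its char[] is not
        // The workaround at the time: keepers.add(new String(tag)); // forces a copy
    }
}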
This will impact the complexity of data structures like Suffix Arrays by a fair margin. Java should provide some alternate method for getting a part of the original string.
It's just their crappy way of fixing some JVM garbage collection limitations.
Before Java 7, if we wanted to avoid the garbage collection issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0,2));
But now we can no longer have a constant-time substring. What a disaster.
The main motivation, I believe, is the eventual "co-location" of a String and its char[]. Right now they can live far apart on the heap, which is a major penalty for cache lines. If every String owns its char[], the JVM can merge them together, and reading will be much faster.
I am writing a program and I am concerned about the time it takes to run and the space it takes.
In the program I used a variable to store the length of an array:
int len = newarray3.length;
Now, I wanted to know whether I would be able to reduce the space complexity by not using the len variable and instead calling newarray3.length whenever needed.
Note: There are only two occasions when the length needs to be used.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth
First off, a single int variable uses a negligible amount of space. More generally, don't worry about these minor performance ("time and space") issues. Focus on writing clear, readable programs and efficient algorithms. If performance becomes an issue down the line, you'll be able to optimize then as long as your general architecture is solid. So focus on that architecture.
I really doubt that this is the thing that will improve your performance.
Anyway, I would use newarray3.length (instead of assigning it to a new variable, which takes memory and adds another operation). length is a field, not even a method call; reading it costs the same as reading the int value you copied, and you save the copy and the 4 bytes of memory that the int consumes.
Java arrays are not resizable, so the length field is constant for a given array object; this can enable various JIT optimizations.
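Here is a minimal sketch of the two options (the array and the summing loops are arbitrary examples). Because the length cannot change, the JIT can typically hoist the field read out of the loop, so the two forms perform the same:

public class LengthDemo {
    public static void main(String[] args) {
        int[] newarray3 = new int[1000];

        // Option 1: copy the length into a local variable (costs one extra int).
        int len = newarray3.length;
        long sum1 = 0;
        for (int i = 0; i < len; i++) {
            sum1 += newarray3[i];
        }

        // Option 2: read the field directly on each iteration.
        long sum2 = 0;
        for (int i = 0; i < newarray3.length; i++) {
            sum2 += newarray3[i];
        }

        System.out.println(sum1 + " " + sum2);
    }
}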
In one of my classes I have a field of type Character. I preferred it over char because sometimes the field has "no value", and null seems to me the cleanest way to represent this lack of information.
However, I'm wondering about the memory footprint of this approach. I'm dealing with hundreds of thousands of objects, and the otherwise negligible difference between the two options may now deserve some investigation.
My first bet is that a char takes two bytes whereas a Character is an object, and so it takes much more memory to support its life cycle. But I know boxed primitives like Integer, Character and so on are not ordinary classes (think about boxing and unboxing), so I wonder whether the JVM can make some kind of optimization under the hood.
Furthermore, are Characters garbage collected like everything else, or do they have a different life cycle? Are they pooled from a shared repository? Is this standard or JVM implementation-dependent?
I wasn't able to find any clear information on the Internet about this issue. Can you point me to some information?
If you are using Character to create character objects, then prefer
Character.valueOf('c'); // returns a cached instance for 'c'
Character c = new Character('c'); // always allocates a new object; prefer to avoid
The following is an excerpt from the Javadoc:
If a new Character instance is not required, Character.valueOf(char) should generally be used in preference to the constructor Character(char), as this method is likely to yield significantly better space and time performance by caching frequently requested values.
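A minimal sketch of what that caching means in practice; the Javadoc guarantees caching for '\u0000' through '\u007F', while values outside that range may or may not be cached:

public class CharacterCacheDemo {
    public static void main(String[] args) {
        Character a = Character.valueOf('c');
        Character b = Character.valueOf('c');
        System.out.println(a == b); // true: 'c' is within '\u007F', so both refer to the cached instance

        Character x = Character.valueOf('\u20AC'); // the euro sign, outside the guaranteed range
        Character y = Character.valueOf('\u20AC');
        System.out.println(x == y); // typically false: caching here is implementation-dependent
    }
}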
Use int instead, with -1 representing "no char".
There is plenty of precedent for this pattern, for example int read() in java.io.Reader.
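A minimal sketch of the pattern, mirroring how Reader.read() signals "no more chars":

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SentinelDemo {
    public static void main(String[] args) throws IOException {
        try (Reader r = new StringReader("hi")) {
            int c;
            while ((c = r.read()) != -1) { // -1 is the "no char" sentinel
                char ch = (char) c;        // an actual char always fits in the non-negative range
                System.out.println(ch);
            }
        }
    }
}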
As you stated, a Character object can be null, so it has to take more room in RAM than a regular char, which cannot be null: in a way, Character is a superset of char.
However, in a given part of your code, the JIT compiler might be able to detect that your Character is never null and is always used as a regular char, and optimize that part so that your Character uses no more RAM and execution is no slower. I'm just speculating on this point, though; I don't know whether a JIT can actually perform this precise optimization.