How does RandomAccessFile.seek() work? - java

As per the API, these are the facts:
The seek(long bytePosition) method, simply put, moves the pointer to
the position specified by the bytePosition parameter.
When bytePosition is greater than the file length, the file
length does not change unless a byte is written at the (new) end.
If data is present in the range skipped over, such data is left
untouched.
However, the situation I'm curious about is: When there is a file with no data (0 bytes) and I execute the following code:
file.seek(100000-1);
file.write(0);
All 100,000 bytes are filled with 0 almost instantly. I can clock over 200 GB in, say, 10 ms.
But when I try to write 100,000 bytes using other methods such as BufferedOutputStream, the same process takes vastly longer.
What is the reason for this difference in time? Is there a more efficient way to create a file of n bytes and fill it with 0s?
EDIT:
If the data is not actually written, how is the file filled with data?
Sample this code:
RandomAccessFile out = new RandomAccessFile("D:/out", "rw");
out.seek(100000-1);
out.write(0);
out.close();
The result is a file of 100,000 bytes. Plus, if the file is huge enough, I can no longer write to the disk due to lack of space.

When you write 100,000 bytes to a BufferedOutputStream, your program is explicitly accessing each byte of the file and writing a zero.
When you use RandomAccessFile.seek() on a local file, you are indirectly using the C library's seek machinery (fseek()/lseek()). How that gets handled depends on the operating system.
In most modern operating systems, sparse files are supported. This means that if you ask for an empty 100,000 byte file, 100,000 bytes of disk space are not actually used. When you write to byte 100,001, the OS still doesn't use 100,001 bytes of disk. It allocates a small amount of space for the block containing "real" data, and keeps track of the empty space separately.
When you read a sparse file, for example by fseek()ing to byte 50,000 and then reading, the OS can say "OK, I have not allocated disk space for byte 50,000 because I have noted that bytes 0 to 100,000 are empty. Therefore I can return 0 for this byte." This is invisible to the caller.
This has the dual purpose of saving disk space, and improving speed. You have noticed the speed improvement.
More generally, fseek() goes directly to a position in a file, so it's O(1) rather than O(n). If you compare a file to an array, it's like doing x = arr[n] instead of for(i = 0; i<=n; i++) { x = arr[i]; }
This description, and the one on Wikipedia, is probably sufficient to understand why seeking to byte 100,000 and then writing is faster than writing 100,000 zeros. If you want more detail, you can read the Linux kernel source code to see how sparse files are implemented, and the RandomAccessFile source in the JDK and JRE to see how they interact, but that is probably more than you need.
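A hedged sketch illustrating the difference described above (the file names, the 100 MB size and the buffer size are arbitrary choices for this example):
import java.io.IOException;
import java.io.RandomAccessFile;

public class SparseFileDemo {
    public static void main(String[] args) throws IOException {
        long size = 100_000_000L; // logical size: ~100 MB

        // Seek past the end and write a single byte: on filesystems with
        // sparse-file support, the skipped range is not physically allocated.
        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile("sparse.bin", "rw")) {
            raf.seek(size - 1);
            raf.write(0);
        }
        System.out.printf("seek+write: %.2f ms%n", (System.nanoTime() - start) / 1e6);

        // Explicitly writing the zeros forces every block to be allocated.
        start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile("dense.bin", "rw")) {
            byte[] zeros = new byte[8192];
            long written = 0;
            while (written < size) {
                int n = (int) Math.min(zeros.length, size - written);
                raf.write(zeros, 0, n);
                written += n;
            }
        }
        System.out.printf("explicit zeros: %.2f ms%n", (System.nanoTime() - start) / 1e6);
    }
}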

Your operating system and filesystem support sparse files, and when that is the case, seek is implemented to take advantage of this feature.
This is not really specific to Java; it's a feature of the fseek and fwrite functions of the C library, which are most likely the backend behind the file implementation of the JRE you are using.
more info: https://en.wikipedia.org/wiki/Sparse_file
Is there a more efficient way to create a file of n bytes and fill it with 0s?
On operating systems that support it, you can simply set the file length to the desired size instead of issuing a write call. In Java, RandomAccessFile.setLength() can extend a file this way: the API leaves the content of the extended region unspecified, but on sparse-file-capable filesystems it typically reads back as zeros without allocating the space.
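A minimal sketch (the file name is a placeholder):
import java.io.IOException;
import java.io.RandomAccessFile;

// Create an n-byte file without writing the bytes yourself. The content of
// the extended region is unspecified by the API, but on common filesystems
// it reads back as zeros and may be stored sparsely.
public class PreallocateDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("out.bin", "rw")) {
            raf.setLength(100_000); // extend (or truncate) to exactly 100,000 bytes
        }
    }
}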

Related

System.String is consuming too much heap memory

Consider the below code in any method.
count += new String(text.getBytes()).length()
I am facing a memory issue.
I am using this to count the number of characters in a file. When I fetch a heap dump, I see a huge amount of memory occupied by String objects. Is it because of this line of code? I am just looking for suggestions.
Assuming text is a String, this code is roughly equivalent to count += text.length(). The differences are mostly:
it needlessly requires more memory (and CPU time) by encoding the string in the platform default encoding and then decoding it again
if the platform default encoding can't represent some characters in text, those will be replaced with a ?. If those characters aren't in the BMP, this can actually result in a decreased length.
So it's arguably strictly worse than just taking the length() of text (and if the second effect is actually intentional, there are more efficient ways to check for it).
Apart from that, the major problem is probably the size of the content of text: if it's a whole file or some other huge chunk of data, then keeping it all in memory instead of processing it as a stream will always produce memory pressure. You "just" increased it with this code, but the fundamental solution is to not keep the whole thing in memory in the first place (which is possible more often than not).
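A rough sketch of the streaming approach (the file path and charset are assumptions for this example):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Count characters while streaming the file instead of holding the whole
// content in a String.
public class CharCount {
    public static void main(String[] args) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                count += n; // counts chars (UTF-16 units), like String.length()
            }
        }
        System.out.println(count);
    }
}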
I think you can get the character count like this
for (int i = 0; i < text.length(); i++) {
    count++;
}

What happens if a file doesn't end exactly at the last byte?

For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte belong to the file and the last 4 do not (useless data).
So how is this prevented when reading a file using the FileInputStream.read() method in Java, or similar functions in other programming languages?
You'll notice, if you ever use assembly, that there's no way to actually read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. If you ever use a specific bit, you have to access it with bitwise operators like |, & and ^. So in this situation, if you store 100 bits, you're actually storing a minimum of 13 bytes, and a few bits just default to 0, so the results are the same.
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably chose this design out of the consideration that there is very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
Comments about bit-length files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit lengths, but that's not tied tightly to file size.
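As a rough illustration of the header idea mentioned above, here is a hypothetical sketch: an 8-byte header records the payload's length in bits, and the reader masks off the padding bits of the last byte. The class, file layout and MSB-first packing are inventions for this example, not any standard format:
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Store the payload's length in bits so a reader knows how many bits of the
// last byte are meaningful (bits assumed packed most-significant-bit first).
public class BitLengthFile {
    static void write(String path, byte[] data, long bitLength) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            out.writeLong(bitLength);   // 8-byte header: number of meaningful bits
            out.write(data);            // payload padded to whole bytes
        }
    }

    static byte[] read(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            long bitLength = in.readLong();
            byte[] data = new byte[(int) ((bitLength + 7) / 8)];
            in.readFully(data);
            int spareBits = (int) (data.length * 8L - bitLength);
            if (spareBits > 0) {
                // zero the padding bits of the last byte so they can't be mistaken for data
                data[data.length - 1] &= (byte) (0xFF << spareBits);
            }
            return data;
        }
    }
}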

Java - Large array advice on how to break it down [duplicate]

I'm trying to find a counterexample to the Pólya Conjecture, which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check for (it checks all numbers from (2, x)). So, if the counterexample is in the 900 millions, I need an array that is just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes)
Java arrays are indexed by int, so an array can't have more than 2^31 - 1 elements (there are no unsigned ints). So, the maximum size of an array is 2,147,483,647 entries, which consumes (for a plain int[]) roughly 8.6 billion bytes (about 8 GB).
Thus, the int index is usually not the limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31 entries. This can get tricky, since the "usual" implementations ArrayList (and HashMap) use arrays internally. You will have to implement a custom data structure, e.g. a 2-level array (a list of arrays), as sketched below. While you are at it, you can also try to pack the bits more tightly.
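A minimal sketch of such a two-level structure (the class name and chunk size are arbitrary; for 900 million ints you would still need roughly 3.4 GB of heap, this only removes the single-array size limit):
// Two-level "big array" of ints: the index is split into a chunk index and
// an offset within the chunk.
public class BigIntArray {
    private static final int CHUNK_BITS = 20;               // 2^20 ints per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final int[][] chunks;

    public BigIntArray(long size) {
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new int[chunkCount][CHUNK_SIZE];
    }

    public int get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, int value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}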
Java will allow up to about 2 billion array entries. It's your machine (and your limited memory) that cannot handle such a large amount.
900 million 32 bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64 bit JVM (on a machine with at least 8 GB of RAM) or use some disk backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You are probably getting an OutOfMemoryError, so add more memory with the -Xmx command line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (ie a file or database).
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes found already are used to mark the array before looking for the next one.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
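A rough sketch of that chunked idea (not the original Project Euler code; this version recomputes each prime's first multiple per segment rather than keeping an offset table, and the method and parameter names are made up):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Segmented sieve: only one segment of the range is held in memory at a
// time, together with the primes found so far.
public class SegmentedSieve {
    public static List<Long> primesUpTo(long limit, int segmentSize) {
        List<Long> primes = new ArrayList<>();
        boolean[] composite = new boolean[segmentSize];

        for (long low = 2; low <= limit; low += segmentSize) {
            long high = Math.min(low + segmentSize - 1, limit);
            Arrays.fill(composite, false);

            // Mark multiples of the primes found in earlier segments.
            for (long p : primes) {
                if (p * p > high) break;
                long start = Math.max(p * p, ((low + p - 1) / p) * p);
                for (long m = start; m <= high; m += p) {
                    composite[(int) (m - low)] = true;
                }
            }

            // Anything still unmarked is a new prime; mark its multiples
            // within this segment as we go.
            for (long n = low; n <= high; n++) {
                if (!composite[(int) (n - low)]) {
                    primes.add(n);
                    for (long m = n * n; m <= high; m += n) {
                        composite[(int) (m - low)] = true;
                    }
                }
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        // Example: primes up to 1,000,000 using 100,000-element segments.
        System.out.println(primesUpTo(1_000_000, 100_000).size()); // 78498
    }
}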
I second #sfossen's idea and #Aaron Digulla. I'd go for disk access. If your algorithm can work with a List interface rather than a plain array, you could write an adapter from the List to the memory-mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
Could you get by with 900 million bits? (Maybe stored as a byte array.)
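For example, a hedged sketch using java.util.BitSet, which packs one bit per number (900 million bits is roughly 110 MB instead of ~3.4 GB of ints):
import java.util.BitSet;

// One bit of information per number, if that is all the algorithm needs.
public class BitFlags {
    public static void main(String[] args) {
        int n = 900_000_000;
        BitSet flags = new BitSet(n);
        flags.set(123_456_789);                     // mark a number
        System.out.println(flags.get(123_456_789)); // true
    }
}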
You can try splitting it up into multiple arrays.
List<Integer> myFirstList = new ArrayList<>();
List<Integer> mySecondList = new ArrayList<>();
for (int x = 0; x <= 1000000; x++) {
    myFirstList.add(x);
}
for (int x = 1000001; x <= 2000000; x++) {
    mySecondList.add(x);
}
then iterate over them.
for (int x : myFirstList) {
    for (int y : myFirstList) {
        // Remove multiples
    }
}
//repeat for second list
Use a memory mapped file (the NIO package, available since Java 1.4) instead. Or move the sieve into a small C library and use Java JNI.
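A hedged sketch of the memory-mapped approach (the file name and sizes are placeholders; note that a single mapping is limited to 2 GB, so 900 million ints would need several mappings or a smaller element type, as used here):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// A disk-backed "array" of 900 million bytes via a memory-mapped file.
public class MappedArray {
    public static void main(String[] args) throws IOException {
        long size = 900_000_000L; // one byte per number
        try (RandomAccessFile raf = new RandomAccessFile("sieve.dat", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.put(123_456_789, (byte) 1);           // mark a number
            System.out.println(buf.get(123_456_789)); // 1
        }
    }
}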

Combining text- and bit-information in a file in Java?

Alright, so we need to store a list of words and their respective positions in a much bigger text. We've been asked whether it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best, since the text "1024" takes up 4*8=32 bits while the number needs only 11 bits in binary.
The follow-up question is whether the index should be saved in one file or two. Here I thought, "Perhaps you can't combine a text and a bitwise representation in one file?", and that that's the reason you'd need two files.
So the question, first and foremost, is: can I store text information (the word) combined with bitwise information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general-purpose data compression available: by just wrapping your Input/OutputStreams with a deflater or gzip stream (already built into the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression, there is XZ for Java (which implements LZMA compression), open source.
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.
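A minimal sketch of the gzip wrapping mentioned above (the file name and record format are placeholders):
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Write the index as plain text, compressed transparently by GZIPOutputStream.
public class CompressedIndexWriter {
    public static void main(String[] args) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream("index.gz")),
                StandardCharsets.UTF_8))) {
            out.write("someword 1024");
            out.newLine();
        }
    }
}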
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would be better than 9 bytes as a string representation).
Using binary files, you'll also need a way to know where the string starts and ends. You can do this by storing a byte with its length before it, and reading the string like this:
byte[] arr = new byte[in.readByte()]; // in.readByte() * 2 if the string is encoded in 16-bit chars
in.readFully(arr);                    // in is a DataInputStream / RandomAccessFile
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (00), but then you would need to create your own implementation for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are (because the strings, if in the text version they are separated from their positions with a space, would take the same number of bytes).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though splitting it into two separate files will always take a bit more space, and it might make it messier to edit).
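As an illustration of that binary layout, here is a hedged sketch using DataOutputStream/DataInputStream (writeUTF already stores a 2-byte length prefix; the file name and record layout are inventions for this example):
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Each record: length-prefixed word followed by a fixed 4-byte position.
public class WordIndex {
    static void writeEntry(DataOutputStream out, String word, int position) throws IOException {
        out.writeUTF(word);     // 2-byte length prefix + modified UTF-8 bytes
        out.writeInt(position); // fixed 4 bytes, so the reader knows where it ends
    }

    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("index.bin"))) {
            writeEntry(out, "example", 1024);
        }
        try (DataInputStream in = new DataInputStream(new FileInputStream("index.bin"))) {
            while (true) {
                try {
                    System.out.println(in.readUTF() + " @ " + in.readInt());
                } catch (EOFException eof) {
                    break;
                }
            }
        }
    }
}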

Reading N lines with optimization - Java

I have a large text file with N lines. I have to read these lines in i iterations, which means that I have to read n = Math.floor(N/i) lines in a single iteration. In each iteration I have to fill a string array of length n. So the basic question is: how should I read n lines in optimal time? The simplest way to do this is to use a BufferedReader and read one line at a time with BufferedReader.readLine(), but it will significantly decrease performance if n is too large. Is there a way to read exactly n lines at a time?
To read n lines from a text file, from a system point of view there is no other way than reading as many characters as necessary until you have seen n end-of-line delimiters (unless the file has been preprocessed to detect these, but I doubt this is allowed here).
As far as I know, no file I/O system in the world supports a function to read "until the nth occurrence of some character", nor "the next n lines" (but I am probably wrong).
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
I agree with Yves Daoust's answer, except for the paragraph recommending
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
There's no need to "detect the end-of-lines yourself". Something like
new BufferedReader(new InputStreamReader(is, charset), 8192);
creates a reader with a buffer of 8192 chars. The question is how useful this is for reading data in blocks. For this a byte[] is needed and there's a sun.nio.cs.StreamDecoder in between which I haven't looked into.
To be sure use
new BufferedReader(new InputStreamReader(new BufferedInputStream(is, 8192), charset));
so you get a byte[] buffer.
Note that 8192 is the default size for both BufferedReader and InputStreamReader, so leaving it out would change nothing in my above examples. Note that using much larger buffers makes no sense and can even be detrimental for performance.
Update
So far you get all the buffering needed and this should suffice. In case it doesn't, you can try:
to avoid the decoder overhead. When your lines are terminated by \n, you can look for (byte) \n in the file content without decoding it (unless you're using some exotic Charset).
to prefetch the data yourself. Normally, the OS should take care of it, so when your buffer becomes empty, Java calls into the OS, and it has the data already in memory.
to use a memory mapped file, so that no OS calls are needed for fetching more data (as all data "are" there when the mapping gets created).
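For reference, a rough sketch of the buffered approach (BufferedReader does the block I/O; the file path, charset and chunk size are assumptions):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Collect the next n lines per call; the underlying BufferedReader reads the
// file in 8 KB blocks regardless of how the lines are grouped.
public class ChunkedLineReader {
    static String[] readLines(BufferedReader reader, int n) throws IOException {
        String[] lines = new String[n];
        int count = 0;
        String line;
        while (count < n && (line = reader.readLine()) != null) {
            lines[count++] = line;
        }
        // The last chunk of the file may be shorter than n.
        return count == n ? lines : Arrays.copyOf(lines, count);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("big.txt"), StandardCharsets.UTF_8)) {
            String[] chunk;
            while ((chunk = readLines(reader, 1000)).length > 0) {
                // process chunk ...
            }
        }
    }
}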
