I am implementing some kind of file viewer/file explorer as a Web-Application. Therefore I need to read files from the hard disk of the system. Of course I have to deal with small files and large files and I want the fastest and most performant way of doing this.
Now I have the following code and want to ask the "big guys" who have a lot of knowledge about efficiently reading (large) files if I am doing it the right way:
RandomAccessFile fis = new RandomAccessFile(filename, "r");
FileChannel fileChannel = fis.getChannel();
// Don't load the whole file into the memory, therefore read 4096 bytes from position on
MappedByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
byte[] buf = new byte[4096];
StringBuilder sb = new StringBuilder();
while (mappedByteBuffer.hasRemaining()) {
// Math.min(..) to avoid BufferUnderflowException
mappedByteBuffer.get(buf, 0, Math.min(4096, map1.remaining()));
sb.append(new String(buf));
}
LOGGER.debug(sb.toString()); // Debug purposes
I hope you can help me and give me some advices.
When you are going to view arbitrary files, including potentially large files, I’d assume that there’s possibility that these files are not actually text files or that you may encounter files which have different encodings.
So when you are going to view such files as text on a best-effort basis, you should think about which encoding you want to use and make sure that failures do not harm your operation. The constructor you use with new String(buf) does replace invalid characters, but it is redundant to construct a new String instance and append it to a StringBuilder afterwards.
Generally, you shouldn’t go so many detours. Since Java 7, you don’t need a RandomAccessFile (or FileInputStream) to get a FileChannel. A straight-forward solution would look like
// Instead of StandardCharsets.ISO_8859_1 you could also use Charset.defaultCharset()
CharsetDecoder decoder = StandardCharsets.ISO_8859_1.newDecoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE)
.replaceWith(".");
try(FileChannel fileChannel=FileChannel.open(Paths.get(filename),StandardOpenOption.READ)) {
//Don't load the whole file into the memory, therefore read 4096 bytes from position on
ByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
CharBuffer cb = decoder.decode(mappedByteBuffer);
LOGGER.debug(cb.toString()); // Debug purposes
}
You can operate with the resulting CharBuffer directly or invoke toString() on it to get a String instance (but of course, avoid doing it multiple times). The CharsetDecoder also allows to re-use a CharBuffer, however, that may not have such a big impact on the performance. What you should definitely avoid, is to concatenate all these chunks to a big string.
Related
I have been trying to implement decompressing text compressed in GZIP format
Below we have method I implemented
private byte[] decompress(String compressed) throws Exception {
ByteArrayOutputStream out = new ByteArrayOutputStream();
ByteArrayInputStream in = new
ByteArrayInputStream(compressed.getBytes(StandardCharsets.UTF_8));
GZIPInputStream ungzip = new GZIPInputStream(in);
byte[] buffer = new byte[256];
int n;
while ((n = ungzip.read(buffer)) >= 0) {
out.write(buffer, 0, n);
}
return out.toByteArray();
}
And now I am testing the solution for following compressed text:
H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==
And there is Not a gzip format exception.
I tried different ways but there still is this error. Maybe anyone has idea what am I doing wrong?
That's not gzip formatted. In general, compressed cannot be a string (because compressed data is bytes, and a string isn't bytes. Some languages / tutorials / 1980s thinking conflate the 2, but it's the 2020s. We don't do that anymore. There are more characters than what's used in english).
It looks like perhaps the following has occurred:
Someone has some data.
They gzipped it.
They then turned the gzipped stream (which are bytes) into characters using Base64 encoding.
They sent it to you.
You now want to get back to the data.
Given that 2 transformations occurred (first, gzip it, then, base64 it), you need to also do 2 transformations, in reverse. You need to:
Take the input string, and de-base64 it, giving you bytes.
You then need to take these bytes and decompress them.
and now you have the original data back.
Thus:
byte[] gzipped = java.util.Base64.getDecoder().decode(compressed);
var in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
return in.readAllBytes();
Note:
Pushing the data from input to outputstream like this is a waste of resources and a bunch of finicky code. There is no need to write this; just call readAllBytes.
If the incoming Base64 is large, there are ways to do this in a streaming fashion. This would require that this method takes in a Reader (instead of a String which cannot be streamed), and would return an InputStream instead of a byte[]. Of course if the input is not particularly large, there is no need. The above approach is somewhat wasteful - both the base64-ed data, and the un-base64ed data, and the decompressed data is all in memory at the same time and you can't avoid this nor can the garbage collector collect any of this stuff in between (because the caller continues to ref that base64-ed string most likely).
In other words, if the compressed ratio is, say, 50%, and the total uncompressed data is 100MB in size, this method takes MORE than:
100MB (uncompressed ) + 50MB (compressed) + 50*4/3 = 67MB (compressed but base64ed) = ~ 217MB of memory.
You know better than we do how much heap your VM is running on, and how large the input data is likely to ever get.
NB: Base64 transfer is extremely inefficient, taking 4 bytes of base64 content for every 3 bytes of input data, and if the data transfer is in UTF-16, it's 8 bytes per 3, even. Ouch. Given that the content was GZipped, this feels a bit daft: First we painstakingly reduce the size of this thing, and then we casually inflate it by 33% for probably no good reason. You may want to check the 'pipe' that leads you to this, possibly you can just... eliminate the base64 aspect of this.
For example, if you have a wire protocol and someone decided that JSON was a good idea, then.. simply.. don't. JSON is not a good idea if you have the need to transfer a bunch of raw data. Use protobuf, or send a combination of JSON and blobs, etc.
I'm trying to read a large file by chunks and save them in an ArrayList of bytes.
My code, in short, looks like this:
public ArrayList<byte[]> packets = new ArrayList<>();
FileInputStream fis = new FileInputStream("random_text.txt");
byte[] buffer = new byte[512];
while (fis.read(buffer) > 0){
packets.add(buffer);
}
fis.close();
But it has a behavior that I don't know how to solve, for example: If a file has only the words "hello world", this chunk does not necessarily need to be 512 bytes long. In fact, I want each chunk to be a maximum of 512 bytes not that they all necessarily have that size.
First of all, what you are doing is probably a bad idea. Storing a file's contents in memory like this is liable to be a waste of heap space ... and can lead to OutOfMemoryError exceptions and / or a requirement for an excessively large heap if you process large (enough) input files.
The second problem is that your code is wrong. You are repeatedly reading the data into the same byte array. Each time you do, it overwrites what was there before. So you will end up will a list containing lots of reference to a single byte array ... containing just the last chunk of data that you read.
To solve the problem that you asked about1, you will need to copy the chunk that you read to a new (smaller) byte array.
Something like this:
public ArrayList<byte[]> packets = new ArrayList<>();
try (FileInputStream fis = new FileInputStream("random_text.txt")) {
byte[] buffer = new byte[512];
int len;
while ((len = fis.read(buffer)) > 0) {
packets.add(Arrays.copyOf(buffer, len));
}
}
Note that this also deals with the second problem I mentioned. And fixes a potential resource leak by using try with resource syntax to manage the closure of the input stream.
A final issue: If this is really a text file that you are reading, you probably should be using a Reader to read it, and char[] or String to hold it.
But even if you do that there are some awkward edge cases if your text contains Unicode codepoints that are not in code plane 0. For example, emojis. The edge cases will involve code points that are represented as a surrogate pair AND the pair being split on a chunk boundary. Reading and storing the text as lines would avoid that.
1 - The issue here is not the "wasted" space. Unless you are reading and caching a large number of small file, any space wastage due to "short" chunks will be unimportant. The important issue is knowing which bytes in each byte[] are actually valid data.
What are some practical areas where ByteArrayInputStream and/or ByteArrayOutputStream are used? Examples are also welcome.
If one searches for examples, one finds usually something like:
byte[] buf = { 16, 47, 12 };
ByteArrayInputStream byt = new ByteArrayInputStream(buf);
It does not help where or why should one use it. I know that they are used when working with images, ZIP files, or writing to ServletOutputStream.
ByteArrayInputStream: every time you need an InputStream (typically because an API takes that as argument), and you have all the data in memory already, as a byte array (or anything that can be converted to a byte array).
ByteArrayOutputStream: every time you need an OutputStream (typically because an API writes its output to an OutputStream) and you want to store the output in memory, and not in a file or on the network.
I am new to Java IO. Currently, I have these lines of code which generates an input stream based on string.
String sb = new StringBuilder();
for(...){
sb.append(...);
}
String finalString = sb.toString();
byte[] objectBytes = finalString.getBytes(StandardCharsets.UTF_8);
InputStream inputStream = new ByteArrayInputStream(objectBytes);
Maybe, I am misunderstanding something, but is there a better way to generate InputStream from String other than using getBytes()?
For instance, if String is really large, 50MB, and there is no way to create another copy (getBytes() for another 50MB) of it due to resource constraints, it could potentially throw an out of memory error.
I just wanted to know if above lines of code is the efficient way to generate InputStream from String. For instance, is there a way which I can "stream" String into input stream without using additional memory? Like a Reader-like abstraction on top of String?
I think what you're looking for is a StringReader which is defined as:
A character stream whose source is a string.
To use this efficiently, you would need to know exactly where the bytes are located that you wish to read. It supports both random and sequential access, so you can read the entire String, char by char, if you prefer.
You are producing data, actually writing and you want to almost immediately consume the data, reading.
The Unix technique is to pipe the output of one process to the input of an other process. In java one also needs at least two threads. They will synchronize on producing and consuming.
PipedInputStream in = new PipedInputStream();
PipedOutputStream out = new PipedOutputStream(in);
new Thread(() -> writeAllYouveGot(out)).start();
readAllYouveGot(in);
Here I started a Thread for writing with a Runnable that calls some self-defined method on out. Instead of using new Thread you might prefer an ExecutorService.
Piped I/O is rather seldomly used, though the asynchrone behaviour is optimal. One can even set the pipe's size on the PipedInputStream. The reason for that rare usage, is the need for a second thread.
To complete things, one would probably wrap the binary Input/OutputStreams in new InputStreamReader(in, "UTF-8") and new OutputStreamWriter(out, "UTF-8").
Try something like this (no promises about typos:)
BufferedReader reader = new BufferedRead(new InputStreamReader(yourInputStream), Charset.defaultCharset());
final char[] buffer = new char[8000];
int charsRead = 0;
while(true) {
charsRead = reader.read(buffer, 0, 8000);
if (charsRead == -1) {
break;
}
// Do something with buffer
}
The InputStreamReader converts from byte to char, using the Charset. BufferedReader allows you to read blocks of char.
For really large input streams, you may want to process the input in chunks, rather than reading the entire stream into memory and then processing.
Currently I do have the problem that this piece of code will be called >500k of times. The size of the compressed byte[] is less than 1KB. Every time the method is called all of the streams has to been created. So I am looking for a way to improve this code.
private byte[] unzip(byte[] data) throws IOException, DataFormatException {
byte[] unzipData = new byte[4096];
try (ByteArrayInputStream in = new ByteArrayInputStream(data);
GZIPInputStream gzipIn = new GZIPInputStream(in);
ByteArrayOutputStream out = new ByteArrayOutputStream()) {
int read = 0;
while( (read = gzipIn.read(unzipData)) != -1) {
out.write(unzipData, 0, read);
}
return out.toByteArray();
}
}
I already tried to replace ByteArrayOutputStream with a ByteBuffer, but at the time of creation I don't know how many bytes I need to allocate.
Also, I tried to use Inflater but I stumbled across the problem descriped here.
Any other idea what I could do to improve the perfomance of this code.
UPDATE#1
Maybe this lib helps someone.
Also there is an open JDK-Bug.
Profile your application, to be sure that you're really spending optimizable time in this function. It doesn't matter how many times you call this function; if it doesn't account for a significant fraction of overall program execution time, then optimization is wasted.
Pre-size the ByteArrayOutputStream. The default buffer size is 32 bytes, and resizes require copying all existing bytes. If you know that your decoded arrays will be around 1k, use new ByteArrayOutputStream(2048).
Rather than reading a byte at a time, read a block at a time, using a pre-allocated byte[]. Beware that you must use the return value from read as an input to write. Better, use something like Jakarta Commons IOUtils.copy() to avoid mistakes.
I'm not sure if it applies in your case, but I've found incredible speed difference when comparing using the default buffer size of GZIPInputStream vs increasing to 65536.
example: using a 500M input file ->
new GZIPInputStream(new FileInputStream(path.toFile())) // takes 4 mins to process
vs
new GZIPInputStream(new FileInputStream(path.toFile()), 65536) // takes 10s
J
More details can be found here http://java-performance.info/java-io-bufferedinputstream-and-java-util-zip-gzipinputstream/
Both BufferedInputStream and GZIPInputStream have internal buffers.
Default size for the former one is 8192 bytes and for the latter one
is 512 bytes. Generally it worth increasing any of these sizes to at
least 65536.
You can use the Inflater class method reset() to reuse the Inflater object without having to recreate it each time. You will have a little bit of added programming to do in order to decode the gzip header and perform the integrity check with the gzip trailer. You would then use Inflater with the nowrap option to decompress the raw deflated data after then gzip header and before the trailer.