I am new to Java IO. Currently, I have these lines of code, which generate an input stream from a string:
StringBuilder sb = new StringBuilder();
for(...){
sb.append(...);
}
String finalString = sb.toString();
byte[] objectBytes = finalString.getBytes(StandardCharsets.UTF_8);
InputStream inputStream = new ByteArrayInputStream(objectBytes);
Maybe I am misunderstanding something, but is there a better way to generate an InputStream from a String other than using getBytes()?
For instance, if the String is really large, say 50 MB, and there is no way to create another copy (another 50 MB via getBytes()) due to resource constraints, this could potentially throw an out-of-memory error.
I just wanted to know whether the above lines of code are an efficient way to generate an InputStream from a String. For instance, is there a way to "stream" a String into an input stream without using additional memory, like a Reader-like abstraction on top of the String?
I think what you're looking for is a StringReader which is defined as:
A character stream whose source is a string.
To use this efficiently, you would need to know exactly where the characters you wish to read are located. It supports both sequential reading and repositioning (via skip(), mark(), and reset()), so you can read the entire String char by char if you prefer.
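For the original question, a minimal sketch might look like this (finalString is the String from the question; StringReader reads directly from the String's own characters, so no up-front copy of the data is made):
Reader reader = new StringReader(finalString);
char[] buffer = new char[4096];
int n;
while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
    // consume buffer[0..n) here, e.g. feed it to a parser
}
reader.close();
If the consumer strictly requires an InputStream (bytes) rather than a Reader, a wrapper such as Apache Commons IO's ReaderInputStream can encode the characters to bytes on the fly instead of materializing the whole byte array first.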
You are producing data (writing) and you want to consume the data (reading) almost immediately.
The Unix technique is to pipe the output of one process to the input of another process. In Java this also requires at least two threads; they synchronize on producing and consuming.
PipedInputStream in = new PipedInputStream();
PipedOutputStream out = new PipedOutputStream(in);
new Thread(() -> writeAllYouveGot(out)).start();
readAllYouveGot(in);
Here I started a Thread for writing, with a Runnable that calls some self-defined method on out. Instead of using new Thread, you might prefer an ExecutorService.
Piped I/O is rather seldom used, though its asynchronous behaviour is optimal. One can even set the pipe's size on the PipedInputStream. The reason for that rare usage is the need for a second thread.
To complete things, one would probably wrap the binary Input/OutputStreams in new InputStreamReader(in, "UTF-8") and new OutputStreamWriter(out, "UTF-8").
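Putting it together, a sketch might look like this (writeAllYouveGot and readAllYouveGot are the self-defined methods from above, here assumed to take a Writer and a Reader; using StandardCharsets.UTF_8 instead of the string "UTF-8" avoids a checked UnsupportedEncodingException):
PipedInputStream in = new PipedInputStream();
PipedOutputStream out = new PipedOutputStream(in);
Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);

new Thread(() -> writeAllYouveGot(writer)).start(); // producer thread
readAllYouveGot(reader);                            // consumer on the current thread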
Try something like this (no promises about typos:)
BufferedReader reader = new BufferedReader(new InputStreamReader(yourInputStream, Charset.defaultCharset()));
final char[] buffer = new char[8000];
int charsRead;
while (true) {
    charsRead = reader.read(buffer, 0, buffer.length);
    if (charsRead == -1) {
        break;
    }
    // Do something with the first charsRead chars of buffer
}
The InputStreamReader converts from byte to char, using the Charset. BufferedReader allows you to read blocks of char.
For really large input streams, you may want to process the input in chunks, rather than reading the entire stream into memory and then processing.
I am implementing a kind of file viewer/file explorer as a web application. Therefore I need to read files from the system's hard disk. Of course, I have to deal with both small and large files, and I want the fastest and most performant way of doing this.
Now I have the following code and want to ask the "big guys" who have a lot of knowledge about efficiently reading (large) files if I am doing it the right way:
RandomAccessFile fis = new RandomAccessFile(filename, "r");
FileChannel fileChannel = fis.getChannel();
// Don't load the whole file into the memory, therefore read 4096 bytes from position on
MappedByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
byte[] buf = new byte[4096];
StringBuilder sb = new StringBuilder();
while (mappedByteBuffer.hasRemaining()) {
    // Math.min(..) to avoid BufferUnderflowException
    mappedByteBuffer.get(buf, 0, Math.min(4096, mappedByteBuffer.remaining()));
    sb.append(new String(buf));
}
LOGGER.debug(sb.toString()); // Debug purposes
I hope you can help me and give me some advice.
When you are going to view arbitrary files, including potentially large ones, I'd assume there is a possibility that these files are not actually text files, or that you may encounter files with different encodings.
So when you are going to view such files as text on a best-effort basis, you should think about which encoding you want to use and make sure that failures do not harm your operation. The constructor you use with new String(buf) does replace invalid characters, but it is redundant to construct a new String instance and append it to a StringBuilder afterwards.
Generally, you shouldn't take so many detours. Since Java 7, you don't need a RandomAccessFile (or FileInputStream) to get a FileChannel. A straightforward solution would look like this:
// Instead of StandardCharsets.ISO_8859_1 you could also use Charset.defaultCharset()
CharsetDecoder decoder = StandardCharsets.ISO_8859_1.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
    .replaceWith(".");
try (FileChannel fileChannel = FileChannel.open(Paths.get(filename), StandardOpenOption.READ)) {
    // Don't load the whole file into memory; read 4096 bytes from position on
    ByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
    CharBuffer cb = decoder.decode(mappedByteBuffer);
    LOGGER.debug(cb.toString()); // Debug purposes
}
You can operate on the resulting CharBuffer directly or invoke toString() on it to get a String instance (but avoid doing that multiple times, of course). The CharsetDecoder also allows you to reuse a CharBuffer, though that may not have such a big impact on performance. What you should definitely avoid is concatenating all these chunks into one big string.
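If you need more than one 4096-byte window, the same decoder can be reused across chunks. A sketch, assuming a hypothetical process(CharBuffer) consumer; ISO-8859-1 is a safe choice here because a chunk boundary can never split a character in a single-byte encoding:
try (FileChannel fileChannel = FileChannel.open(Paths.get(filename), StandardOpenOption.READ)) {
    long size = fileChannel.size();
    for (long pos = 0; pos < size; pos += 4096) {
        ByteBuffer bb = fileChannel.map(MapMode.READ_ONLY, pos, Math.min(4096, size - pos));
        process(decoder.decode(bb)); // decode(ByteBuffer) resets the decoder first
    }
}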
I am getting a FilterInputStream object as the return type from a function. The file I will be getting as a stream is a log file, so it can be big, and I do not want to read all the data at once; but reading the data in a loop is a tedious job.
I need to split at every newline, i.e. the data in the file is in line-separated format. Using a constant-size byte array with public int read(byte[] b, int off, int len) would give rise to many edge cases, and again, I do not want to read it all at once because it can be large.
Is there an elegant way to do this?
P.S.: I am in particular referring to S3ObjectInputStream, which extends FilterInputStream and has a read() function.
Wrap a BufferedReader around an InputStreamReader around the FilterInputStream and call readLine().
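A minimal sketch of that wrapping (s3ObjectInputStream stands in for whatever FilterInputStream your function returns, and process(line) is a hypothetical per-line handler):
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(s3ObjectInputStream, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        process(line); // handle one log line at a time; the whole file is never in memory
    }
}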
OK, I am sorry, the answer above is right: you can use the BufferedReader class. It has a method called readLine(), which returns a String object instead of a byte array. Just like this:
BufferedReader reader = new BufferedReader(new FileReader(new File("the file path")));
String line = reader.readLine();
while (line != null) {
    // read the file line by line
    line = reader.readLine();
}
reader.close();
I am currently using buffered streams to read and write some files. In between, I do some mathematical processing where a symbol is a byte.
To read :
InputStream input = new FileInputStream(inputname);
input.read(b, off, len); // the read(byte[], int, int) overload
To write :
OutputStream output = new BufferedOutputStream(
        new FileOutputStream(outputname),
        OUTPUTBUFFERSIZE);
output.write((byte) byteinsideaint);
Now I need to add some header data and support short symbols too. I want to use DataInputStream and DataOutputStream to avoid converting other types to bytes myself, and I am wondering what their performance is like.
Do I need to use
DataOutputStream output = new DataOutputStream(
        new BufferedOutputStream(
                new FileOutputStream(outputname),
                OUTPUTBUFFERSIZE));
to keep the advantages of the buffering, or is it good enough to use
DataOutputStream output = new DataOutputStream(
        new FileOutputStream(outputname));
You should add the BufferedOutputStream in between. DataOutputStream does not implement any buffering (which is good: separation of concerns), and its performance will be very poor without buffering of the underlying OutputStream. Even the simplest methods, like writeInt(), can result in four separate disk writes.
As far as I can see, only write(byte[], int, int) and writeUTF(String) write their data as one byte[] chunk. The other methods write primitive values (like int or double) byte by byte.
You absolutely need the BufferedOutputStream in the middle.
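For instance, a sketch of the wrapped-and-buffered variant, declared as DataOutputStream so that the writeShort()/writeInt() methods are visible (headerField and symbol are illustrative names):
try (DataOutputStream output = new DataOutputStream(
        new BufferedOutputStream(
                new FileOutputStream(outputname), OUTPUTBUFFERSIZE))) {
    output.writeInt(headerField); // header data, written via the buffer
    output.writeShort(symbol);    // a short symbol: two writes to the buffer, not the disk
}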
I appreciate your concern about performance, and I have two suggestions:
Shrink your streams with Java compression. A useful article can be found here.
Use composition instead of inheritance (which is recommended practice anyway). Create a Pipe which contains a PipedInputStream and a PipedOutputStream connected to each other, with getInputStream() and getOutputStream() methods. You can't directly pass the Pipe object to something that needs a stream, but you can pass the return value of its getter methods, as sketched below.
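A sketch of such a Pipe class (the class name and accessors are illustrative, not a library API):
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class Pipe {

    private final PipedInputStream in = new PipedInputStream();
    private final PipedOutputStream out;

    public Pipe() throws IOException {
        out = new PipedOutputStream(in); // connect the two ends
    }

    public PipedInputStream getInputStream() {
        return in;
    }

    public PipedOutputStream getOutputStream() {
        return out;
    }
}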
Using java.net and java.io, what is the fastest way to parse HTML from online and load it into a file or to the console? Is BufferedWriter/BufferedReader faster than InputStreamReader/OutputStreamWriter? Are writers and readers faster than output streams and input streams?
I am experiencing serious lag with the following output writer/stream:
URLConnection ii = i.openConnection();
BufferedReader iik = new BufferedReader(new InputStreamReader(ii.getInputStream()));
String op;
while (iik.readLine() != null) {
    op = iik.readLine();
    System.out.println(op);
}
But curiously, I am experiencing close to no lag time with the following code:
URLConnection ii = i.openConnection();
Reader xh = new InputStreamReader(ii.getInputStream());
int r;
Writer xy = new PrintWriter(System.out);
while ((r = xh.read()) != -1) {
    xy.write(r);
}
xh.close();
xy.close();
What is going on here?
Your first snippet is wrong: it reads the next line, tests if it's null, ignores it, then reads the next line without testing if it's null, and prints it.
The second snippet reads the reader char by char and writes each char to the writer (Writer.write(int) writes a single character, not the integer's decimal representation).
Both snippets use the same underlying streams and readers, and, if coded correctly, the first one should probably be a bit faster thanks to buffering. But of course, something will be printed to the screen only when a line is complete. If the server sends a single line of text of 10 MB, you'll have to read the whole 10 MB before anything is printed.
Make sure to close the readers in finally blocks.
Readers/Writers shouldn't be inherently faster than Input/OutputStreams.
That said, going through readLine() and println() probably isn't the optimal way of transferring bytes. In your case, if the file you're loading doesn't contain many newline characters, BufferedReader will have to buffer a lot of data before readLine() returns.
The canonical non-terrible way of transferring data between streams is doing it in chunks by using a buffer:
byte[] buf = new byte[1 << 12];
InputStream in = urlConnection.getInputStream();
int read;
while ((read = in.read(buf)) != -1) {
    System.out.write(buf, 0, read);
}
It might be faster still to use NIO; the code for it is a little less straightforward, and I just use the one found in this blog post.
If you're writing to/from a file, the best method is to use a zero-copy approach, which Java makes available with FileChannel.transferFrom() and transferTo(). Sample code is available in a DeveloperWorks article.
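A sketch of the zero-copy approach, copying a file to an arbitrary output stream (the file name is illustrative; transferTo may transfer fewer bytes than requested, hence the loop):
try (FileChannel source = FileChannel.open(Paths.get("input.html"), StandardOpenOption.READ);
     WritableByteChannel target = Channels.newChannel(System.out)) {
    long position = 0;
    long size = source.size();
    while (position < size) {
        position += source.transferTo(position, size - position, target);
    }
}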
I use BufferedReader's readLine() method to read lines of text from a socket.
There is no obvious way to limit the length of the line read.
I am worried that the source of the data can (maliciously or by mistake) write a lot of data without any line feed character, and this will cause BufferedReader to allocate an unbounded amount of memory.
Is there a way to avoid that? Or do I have to implement a bounded version of readLine() myself?
The simplest way to do this will be to implement your own bounded line reader.
Or even simpler, reuse the code from this BoundedBufferedReader class.
Actually, coding a readLine() that works the same as the standard method is not trivial. Dealing with the 3 kinds of line terminator CORRECTLY requires some pretty careful coding. It is interesting to compare the different approaches of the above link with the Sun version and Apache Harmony version of BufferedReader.
Note: I'm not entirely convinced that either the bounded version or the Apache version is 100% correct. The bounded version assumes that the underlying stream supports mark and reset, which is certainly not always true. The Apache version appears to read-ahead one character if it sees a CR as the last character in the buffer. This would break on MacOS when reading input typed by the user. The Sun version handles this by setting a flag to cause the possible LF after the CR to be skipped on the next read... operation; i.e. no spurious read-ahead.
Another option is Apache Commons' BoundedInputStream:
InputStream bounded = new BoundedInputStream(is, MAX_BYTE_COUNT);
BufferedReader reader = new BufferedReader(new InputStreamReader(bounded));
String line = reader.readLine();
The limit for a String is 2 billion chars. If you want the limit to be smaller, you need to read the data yourself. You can read one char at a time from the buffered stream until the limit or a new line char is reached.
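A sketch of that char-by-char loop, assuming a Reader named reader and a limit maxLen:
StringBuilder sb = new StringBuilder();
int c;
while (sb.length() < maxLen && (c = reader.read()) != -1) {
    if (c == '\n') break; // stop at the line terminator
    sb.append((char) c);  // the while condition enforces the maxLen bound
}
String line = sb.toString();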
Perhaps the easiest solution is to take a slightly different approach. Instead of attempting to prevent a DoS by limiting one particular read, limit the entire amount of raw data read. In this way you don't need to worry about using special code for every single read and loop, so long as the memory allocated is proportionate to incoming data.
You can either meter the Reader, or probably more appropriately, the undecoded Stream or equivalent.
There are a few ways around this:
if the amount of data overall is very small, load the data from the socket into a buffer (a byte array or ByteBuffer, depending on what you prefer), then wrap the BufferedReader around the data in memory (via a ByteArrayInputStream, etc.);
just catch the OutOfMemoryError if it occurs; catching this error is generally not reliable, but in the specific case of catching array allocation failures it is basically safe (though it does not solve the issue of any knock-on effect that one thread allocating large amounts from the heap could have on other threads running in your application, for example);
implement a wrapper InputStream that will only read so many bytes, then insert this between the socket and the BufferedReader (see the sketch after this list);
ditch BufferedReader and split your lines via the regular expressions framework (implement a CharSequence whose chars are pulled from the stream, and then define a regular expression that limits the length of lines); in principle, a CharSequence is supposed to be random access, but for a simple "line splitting" regex, in practice you will probably find that successive chars are always requested, so that you can "cheat" in your implementation.
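For the wrapper-InputStream option, a minimal sketch might look like this (the class name is illustrative; it simply reports end-of-stream once the byte budget is spent):
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class LimitedInputStream extends FilterInputStream {

    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1; // pretend end-of-stream at the limit
        int b = super.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(b, off, (int) Math.min(len, remaining));
        if (n != -1) remaining -= n;
        return n;
    }
}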
In BufferedReader, instead of using String readLine(), use int read(char[] cbuf, int off, int len); you can then use boolean ready() to see if you got it all, and convert it into a String using the constructor String(char[] value, int offset, int count).
If you don't care about the whitespace and you just want a maximum number of characters per line, then the proposal Stephen suggested is really simple:
import java.io.BufferedReader;
import java.io.IOException;

public class BoundedReader extends BufferedReader {

    private final int bufferSize;
    private final char[] buffer;

    BoundedReader(final BufferedReader in, final int bufferSize) {
        super(in);
        this.bufferSize = bufferSize;
        this.buffer = new char[bufferSize];
    }

    @Override
    public String readLine() throws IOException {
        int no;
        /* read up to bufferSize */
        if ((no = this.read(buffer, 0, bufferSize)) == -1) return null;
        String input = new String(buffer, 0, no).trim();
        /* skip the rest of the line */
        while (no >= bufferSize && ready()) {
            if ((no = read(buffer, 0, bufferSize)) == -1) break;
        }
        return input;
    }
}
Edit: this is intended to read lines from a user terminal. It blocks until the next line, and returns a bufferSize-bounded String; any further input on the line is discarded.
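Hypothetical usage, reading bounded lines from standard input:
BoundedReader reader = new BoundedReader(
        new BufferedReader(new InputStreamReader(System.in)), 1024);
String line = reader.readLine(); // at most 1024 chars; the rest of the line is discarded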