CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. It has been mentioned that there are streams for characters, objects, etc. that should be preferred, even though all of them are built on top of byte streams. What is the reason behind this? Does it have anything to do with the security manager or with performance?
Source: Oracle docs
What Oracle is actually saying is: "Please don't reinvent the wheel!"
You should almost never need raw byte streams:
Are you parsing text? Use a character stream, which understands text encoding issues.
Are you parsing XML? Use SAX or some other library.
Are you parsing images? Use the ImageIO class.
Are you copying things from A to B? Use Apache Commons IO's FileUtils.
There are very few situations where you will actually need to use a byte stream directly.
From the text you quoted:
CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. Since xanadu.txt contains character data, the best approach is to use character streams, as discussed in the next section. There are also streams for more complicated data types. Byte streams should only be used for the most primitive I/O.
Usually, you don't want to work with bytes directly. There are higher-level APIs, for example to read text (i.e. character data that has to be decoded from bytes).
It works, but it is very inefficient: it makes two method calls for every single byte it copies.
Instead, you should use a buffer (of several thousand bytes; the best size varies with what exactly you read, among other conditions) and read/write the entire buffer (or as much of it as possible) with every method call.
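A minimal sketch of such a buffered copy (the file names and buffer size here are illustrative, not prescriptive):

import java.io.*;

static void copy(File source, File target) throws IOException {
    try (InputStream in = new FileInputStream(source);
         OutputStream out = new FileOutputStream(target)) {
        byte[] buffer = new byte[8192]; // a few KB is a common default
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes actually read
        }
    }
}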
Related
I'm working on a proprietary TCP protocol. This protocol sends and receives messages with a specific sequence of bytes.
I have to be compliant with this protocol, and I can't change it.
So my input/output looks something like this:
\x01\x08\x00\x01\x00\x00\x01\xFF
\x01 - Message type
\x00\x01 - Length
\x00\x00\x01 - Transaction
\xFF - Body
The sequence of the fields is important, and I want only the values of the fields in my serialization, nothing about the structure of the class.
I'm working on a Java controller that uses this protocol, and I thought I could define the message structures in specific classes and serialize/deserialize them, but I was naive.
First of all I tried ObjectOutputStream, but it outputs the entire structure of the object, when I need only the values, in a specific order.
Someone already faced this problem:
Java - Object to Fixed Byte Array
and solved it with a dedicated Marshaller.
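For illustration, a hand-rolled marshaller of that kind might look like the sketch below; the field names and widths are assumptions based on the example above, not the real protocol spec:

import java.io.*;

static byte[] marshal(byte messageType, short length, int transaction, byte body) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeByte(messageType);                 // 1 byte
    out.writeShort(length);                     // 2 bytes, big-endian
    out.writeByte((transaction >>> 16) & 0xFF); // 3-byte transaction, high byte first
    out.writeShort(transaction & 0xFFFF);
    out.writeByte(body);                        // 1 byte
    return bos.toByteArray();
}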
But I was searching for a more flexible solution.
For text serialization and deserialization I've found:
http://jeyben.github.io/fixedformat4j/
which defines the schema of the line with annotations. But it outputs a String, not a byte[], so 1 is output as "1", which is represented differently depending on the encoding, and often takes more bytes.
What I was searching for is something that, given the order of my class properties, would convert each property into bytes (based on its internal representation) and append them to a byte[].
Do you know some library used for that purpose?
Or a simple way to do that, without coding a serialization algorithm for each of my entities?
Serialization just isn't easy; it sounds from your question like you feel you can just invoke something and out rolls compact, simple, versionable, universal data you can then put on the wire. What you need to fix is to scratch the word 'just' from that sentence. You're going to have to invest some time and care.
As you figured out already, java's baked in serialization has a ton of downsides. Don't use that.
There are various serializers. The popular ones are things like GSON or Jackson, which let you serialize Java objects into JSON. This isn't particularly efficient, and it is string based. Those sound like crucial downsides, but they really aren't; see below.
You can also spend a little more time specifying the exact format and use protobuf, which lets you write a quite lean and simple data protocol (and protobuf is available for many languages, if you eventually want to write a participant in this protocol in something other than Java).
So, those are the good options: Go to JSON via Jackson or GSON, or, use protobuf.
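For instance, a minimal Jackson sketch (the Message class and its fields are hypothetical):

import java.io.IOException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class Message {
    public int type;
    public int transaction;
    public String body;

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static byte[] serialize(Message m) throws IOException {
        return MAPPER.writeValueAsBytes(m);           // UTF-8 JSON bytes
    }

    static Message deserialize(byte[] json) throws IOException {
        return MAPPER.readValue(json, Message.class); // needs a no-args constructor
    }
}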
But JSON is a string.
You can turn a string into bytes trivially using str.getBytes(StandardCharsets.UTF_8). This cannot fail due to charset encoding differences (as long as you also 'decode' in the same fashion: turn the bytes into a string with new String(theBytes, StandardCharsets.UTF_8)). UTF-8 is guaranteed to be available on all JVMs; if it is not there, your JVM is as broken as a JVM that is missing the String class - not something to worry about.
But JSON is inefficient.
Zip it up, of course. You can trivially wrap an InputStream and an OutputStream so that gzip compression is applied, which is simple, available on just about every platform, and fast (it's not the most cutting-edge compression algorithm, but squeezing out the last few bytes is usually not worth it) - and zipped-up JSON can often be more efficient than carefully hand-rolled protobuf, even.
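A sketch of that wrapping, using the JDK's built-in GZIP streams (readAllBytes needs Java 9+):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(raw); // compresses as it writes
    }
    return bos.toByteArray();
}

static byte[] gunzip(byte[] zipped) throws IOException {
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(zipped))) {
        return gz.readAllBytes(); // decompresses as it reads
    }
}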
The one downside is that it's 'slow', but on modern hardware, note that the overhead of encrypting and decrypting this data (which you should obviously be doing!) is usually multiple orders of magnitude more involved. A modern CPU is simply very, very fast - creating JSON and zipping it up is going to take 1% of CPU or less, even if you are shipping the collected works of Shakespeare every second.
If an Arduino running on batteries needs to process this data, go with uncompressed, unencrypted protobuf-based data. If you are Facebook and writing the WhatsApp protocol, the IaaS cost saved by not having to unzip and decode JSON is tiny and pales in comparison to what you spend just running the servers, but at that scale it's worth the development effort.
In just about every other case, just toss gzipped JSON on the line.
Pretty simple question: what's the performance difference between a Byte Stream and a Character Stream?
The reason I ask is because I'm implementing level loading from a file, and initially I decided I would just use a Byte Stream for the purpose, because it's the simplest type, and thus it should perform the best. But then I figured that it might be nice to be able to read and write the level files via a text editor instead of writing a more complex level editor (to start off with). In order for it to be legible by a text editor, I would need to use Character streams instead of Byte streams, so I'm wondering if there's really any performance difference worth mentioning between the two methods? At the moment it doesn't really matter much since level loading is infrequent, but I'd be interested to know for future reference, for instances where I might need to load levels from hard drive on the fly (large levels).
Pretty simple question: what's the performance difference between a Byte Stream and a Character Stream?
I assume you are comparing Input/OutputStream with Reader/Writer streams. If that is the case, the performance is almost the same. Unless you have a very fast drive, the bottleneck will almost certainly be the disk, in which case it doesn't matter too much what you do in Java.
The reason I ask is because I'm implementing level loading from a file, and initially I decided I would just use a Byte Stream for the purpose, because it's the simplest type, and thus it should perform the best. But then I figured that it might be nice to be able to read and write the level files via a text editor instead of writing a more complex level editor (to start off with).
All files are actually streams of bytes. So when you use a Reader/Writer, it uses an encoder to convert bytes to chars and back again. There is nothing stopping you from reading and writing the bytes directly and doing exactly the same thing yourself.
In order for it to be legible by a text editor, I would need to use Character streams instead of Byte streams,
You wouldn't, but it might make it easier. If you only want ASCII encoding, there is no difference. If you want UTF-8 encoding with non-ASCII characters, using chars is likely to be simpler.
so I'm wondering if there's really any performance difference worth mentioning between the two methods?
I would worry about correctness first and performance second.
I might need to load levels from hard drive on the fly (large levels).
Java can read/write text at about 90 MB/s; most hard drives and networks are not that fast. However, if you need to write gigabytes per second and you have a fast SSD, it might make a difference. SSDs can perform at 500 MB/s or more, and in that case I would suggest you use NIO to maximise performance.
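A sketch of such an NIO read loop (the path and buffer size are illustrative):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

static void readLevel() throws IOException {
    try (FileChannel ch = FileChannel.open(Paths.get("level.dat"), StandardOpenOption.READ)) {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // 64 KB direct buffer
        while (ch.read(buf) != -1) {
            buf.flip();   // switch to reading what was just filled
            // ... parse the level data in buf ...
            buf.clear();  // reset for the next read
        }
    }
}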
Java has only one kind of stream: a byte stream. The classes java.io.InputStream and java.io.OutputStream are defined in terms of bytes.
To convert bytes to characters, and eventually Strings, you will always be using the functionality in java.nio.charset. However, for your convenience, Java provides Reader and Writer classes that adapt byte streams into stream-like objects that operate on characters and Strings.
There is a CPU time cost, of course, in conversion. However, the cost is very low. If you manage to write a program that has performance dominated by this cost, you've written a very lean program indeed.
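For example, adapting a byte stream into a character stream with an explicit charset (using the xanadu.txt file from the tutorial quoted above):

import java.io.*;
import java.nio.charset.StandardCharsets;

static void printFile() throws IOException {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("xanadu.txt"), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // each line has been decoded from bytes to chars
        }
    }
}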
I don't know Java, so take this with a pinch of salt.
A character stream typically means each thing you read is decoded into an individual character based on the current locale, which means it's important for internationalised text data which can't be represented with just 128 or 256 different choices. The set of all possible characters is defined in the Unicode system and how you get from individual bytes to characters is defined by the encoding. More information here: http://www.joelonsoftware.com/articles/Unicode.html
A byte stream on the other hand just reads in values from 0 to 255 and doesn't try and interpret them as characters from any particular language. As such, a byte stream should always be somewhat faster. But if you had international characters in there, they'll not display properly unless you know exactly how they were encoded.
For most purposes, human-readable data can be stored in ASCII, which only uses 7 bits of data per character and gives you 128 different characters. This will be readable by any typical text editor, and since ASCII characters are a subset of Unicode and of the UTF-8 encoding, you can read an ASCII file either as bytes or as UTF-8 characters, and the content will be unchanged.
If you ever need to store binary values for more efficient serialisation (e.g. to store the number 123456789 as a 4-byte integer instead of as a 9-byte string) then you'll need to switch to a byte stream, but you also give up human-readability at that point, so the issue becomes somewhat irrelevant.
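A sketch of that trade-off (file names hypothetical):

import java.io.*;
import java.nio.charset.StandardCharsets;

static void writeBoth() throws IOException {
    try (DataOutputStream bin = new DataOutputStream(new FileOutputStream("level.bin"))) {
        bin.writeInt(123456789); // 4 bytes, big-endian, not readable in a text editor
    }
    try (Writer text = new OutputStreamWriter(new FileOutputStream("level.txt"), StandardCharsets.UTF_8)) {
        text.write("123456789"); // 9 bytes of ASCII text, readable in any editor
    }
}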
It's unlikely that the size of the level will ever have much effect on your loading times - a typical hard drive can read well over a hundred megabytes per second. Code whichever way is easiest for you, and only optimise this later if your profiling shows there is a problem.
Let me present my situation.
I have a lot of data stored as bytes in files on a server. I am writing and reading these files using the AIO that is coming in JDK 7; thus, I am using ByteBuffers for the read and write operations.
The question is: once I have performed a read on an AsynchronousFileChannel, I want to transfer the content of the ByteBuffer that was used in the read operation to the client. So I actually want to send the bytes.
What would be the best way to go from here? I don't want to send the ByteBuffer itself, because I have a pool of them that I reuse, so that is not an option. I also want to be able to combine several reads and send the contents of several ByteBuffers at once.
So what do I send? Just a byte[] array? Or do I need some stream? What would be the best solution performance-wise?
I am using RMI for communication.
Thanks in advance.
You can simulate streams over rmi using the RMIIO library, which will allow you to stream arbitrary amounts of bytes via RMI without causing memory problems on either end.
(Disclaimer: I wrote the library.)
Unless there is a very good reason not to, just send the byte array along with sufficient metadata that you can provide reliable service.
The less of the underlying implementation you need to transfer back and forth over RMI, the better. Especially when you work with Java 7 which is not yet generally available.
To use RMI you have to retrieve the contents of the buffer as a byte[], then write it to an ObjectOutputStream (the write happens under the covers). Assuming that you're currently using direct buffers, this means CPU time to create the array in the Java heap, and CPU time to garbage-collect that array once it's been written, and the possibility that the stream will hold onto the reference too long, causing an out-of-memory error.
A better approach, in my opinion, is to open a SocketChannel to the destination and use it to write the buffer's contents. Of course, to make this work you'll need to write additional data describing the size of the buffer, and this will probably evolve into a communication protocol.
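For illustration, both approaches in sketch form: extracting a heap byte[] for RMI, versus writing the buffer straight to a SocketChannel (names are hypothetical):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// copy the readable contents of a (possibly direct) buffer into a heap array for RMI
static byte[] toByteArray(ByteBuffer buf) {
    byte[] payload = new byte[buf.remaining()];
    buf.get(payload);
    return payload;
}

// or skip the copy and write the buffer directly to the destination channel
static void sendBuffer(SocketChannel channel, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
        channel.write(buf); // write() may send fewer bytes than remain, hence the loop
    }
}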
I have an JAX-RS web service that calls a db2 z/os database and returns about 240mb of data in a resultset. I am then creating an OutputStream to send this data to the client by looping through the resultset and adding a few XML tags for my output.
I am confused about whether to use PrintWriter, BufferedWriter, or OutputStreamWriter. I am looking for the fastest way to deliver the data. I also don't want the JVM to hold onto this data any longer than it needs to, so I don't use up its memory.
Any help is appreciated.
You should use BufferedWriter.
Call .flush() frequently (see the sketch below).
Enable gzip for the best compression.
Start thinking about a different way of doing this. Can your data be paginated? Do you need all the data in one request?
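A sketch of that pattern, streaming rows and flushing periodically (the XML shape, ResultSet column, and flush interval are assumptions, not part of the original answer):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;

static void streamRows(ResultSet rs, OutputStream raw) throws IOException, SQLException {
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(raw, StandardCharsets.UTF_8));
    out.write("<rows>");
    int count = 0;
    while (rs.next()) {
        out.write("<row>");
        out.write(rs.getString(1)); // assumes column 1 is already XML-escaped
        out.write("</row>");
        if (++count % 1000 == 0) out.flush(); // push to the client; don't hold it all in memory
    }
    out.write("</rows>");
    out.flush();
}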
If you are sending large binary data, you probably don't want to use XML. When XML is used, binary data is usually represented in base64, which becomes larger than the original binary and uses quite a lot of CPU for the conversion to base64.
If I were you, I'd send the binary separately from the XML. If you are using a web service, an MTOM attachment could help. Otherwise you could send a reference to the binary data in the XML, and let the app download the binary data separately.
As for the fastest way to send binary, if you are using WebLogic, just writing to the response's output stream would be fine. That output stream is most probably buffered, and whatever you do probably won't change the performance much anyway.
Turning on gzip could also help, depending on what you are sending (if you are sending JPEGs or other already-compressed data it won't help a lot, but if you are sending raw text it can help a lot).
One solution (which might not work for you) is to spawn a job / thread that creates a file and then notifies the user when the file is ready to download, in this way you're not tied to the bandwidth of the client connection (and you can even compress the file properly, before the client downloads it)
Some Business Intelligence and data-crunching applications do this, especially if the process takes some time to generate the data.
The maximum output speed will be limited by the network bandwidth, and I am sure any Java OutputStream will be fast enough that you won't notice the difference.
The choice depends on the data to send: if it is text (lines), PrintWriter is easy; if it is a byte array, take an OutputStream.
To avoid holding too much data in the buffers, you should call flush() every x KB or so.
You should never use PrintWriter to output data over a network. First of all, it creates platform-dependent line breaks. Second, it silently catches all I/O exceptions, which makes it hard for you to deal with those exceptions.
And if you're sending 240 MB as XML, then you're definitely doing something wrong. Before you start worrying about which stream class to use, try to reduce the amount of data.
EDIT:
The advice about PrintWriter (and PrintStream) came from a book by Elliotte Rusty Harold. I can't remember which one, but it was a few years ago. I think that ServletResponse.getWriter() was added to the API after that book was written - so it looks like Sun didn't follow Rusty's advice. I still think it was good advice - for the reasons stated above, and because it can tempt implementation authors to violate the API contract in order to get predictable behavior.
I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code/libraries that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumptions are that these are extremely common tasks, and I couldn't find it in the Java API nor my google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
As far as I know, Java primarily has lower-level functions for manipulating files directly. Here is the best I've come up with:
The actions you describe are standard in the Swing world, and for text they come down to manipulating a Document object. These act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds line endings automatically, but other classes in java.io and java.nio do.
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate them as you like; you can search the data in the files, overwrite files, create new ones, append to existing ones....
I am not sure this is exactly what you are asking for but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly. However, you can output the contents of the file, "copy" the part of it you want, and "paste" it into a new file or append it to an existing one.
While writing a byte array to a file is a common task, writing a 32-byte array to a given file at a given offset, just once, is not something you are going to find ready-made in java.io :)
To get started, would the method and comments below look reasonable to you? I bet someone here, maybe even myself, could whip it up quickly.
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) throws IOException {
    // open file (for both reading and writing)
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
        // move to offset
        raf.seek(offset);
        // write bytes
        raf.write(bytes);
    } // close file - handled by try-with-resources
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.