Java: reading a file, performance difference between byte and character streams

Pretty simple question: what's the performance difference between a Byte Stream and a Character Stream?
The reason I ask is because I'm implementing level loading from a file, and initially I decided I would just use a Byte Stream for the purpose, because it's the simplest type, and thus it should perform the best. But then I figured that it might be nice to be able to read and write the level files via a text editor instead of writing a more complex level editor (to start off with). In order for it to be legible by a text editor, I would need to use Character streams instead of Byte streams, so I'm wondering if there's really any performance difference worth mentioning between the two methods? At the moment it doesn't really matter much since level loading is infrequent, but I'd be interested to know for future reference, for instances where I might need to load levels from hard drive on the fly (large levels).

Pretty simple question: what's the performance difference between a Byte Stream and a Character Stream?
I assume you are comparing Input/OutputStream with Reader/Writer streams. If that is the case, the performance is almost the same. Unless you have a very fast drive, the bottleneck will almost certainly be the disk, in which case it doesn't matter too much what you do in Java.
The reason I ask is because I'm implementing level loading from a file, and initially I decided I would just use a Byte Stream for the purpose, because it's the simplest type, and thus it should perform the best. But then I figured that it might be nice to be able to read and write the level files via a text editor instead of writing a more complex level editor (to start off with).
All files are actually a stream of bytes. So when you use Reader/Writer, it uses an encoder to convert bytes to chars and back again. There is nothing stopping you from reading and writing bytes directly, which does exactly the same thing.
In order for it to be legible by a text editor, I would need to use Character streams instead of Byte streams,
You wouldn't, but it might make it easier. If you only want ASCII encoding, there is no difference. If you want UTF-8 encoding with non-ASCII characters, using chars is likely to be simpler.
so I'm wondering if there's really any performance difference worth mentioning between the two methods?
I would worry about correctness first and performance second.
I might need to load levels from hard drive on the fly (large levels).
Java can read/write text at about 90 MB/s; most hard drives and networks are not that fast. However, if you need to write gigabytes per second and you have a fast SSD, then it might make a difference. SSDs can perform at 500 MB/s or more, and in that case I would suggest you use NIO to maximise performance.
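As an illustration, here is a minimal NIO sketch that reads a file through a FileChannel and a direct ByteBuffer rather than a Reader; the file name and buffer size are placeholders.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioRead {
    public static void main(String[] args) throws IOException {
        // Hypothetical level file name
        try (FileChannel channel = FileChannel.open(Paths.get("level1.dat"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // 64 KB direct buffer
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // process the bytes between position and limit here
                buffer.clear();
            }
        }
    }
}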

Java has only one kind of stream: a byte stream. The classes java.io.InputStream and java.io.OutputStream are defined in terms of bytes.
To convert bytes to characters, and eventually Strings, you will always be using the functionality in java.nio.charset. However, for your convenience, Java provides the Reader and Writer classes, which adapt byte streams into stream-like objects that operate on characters and Strings.
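For example, this is the usual way to adapt a raw byte stream into a character stream, naming the charset explicitly (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadAsText {
    public static void main(String[] args) throws IOException {
        // Wrap the raw byte stream in a Reader; the charset drives the byte-to-char decoding.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("level1.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}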
There is a CPU time cost, of course, in conversion. However, the cost is very low. If you manage to write a program that has performance dominated by this cost, you've written a very lean program indeed.

I don't know Java, so take this with a pinch of salt.
A character stream typically means each thing you read is decoded into an individual character based on the current locale, which means it's important for internationalised text data which can't be represented with just 128 or 256 different choices. The set of all possible characters is defined in the Unicode system and how you get from individual bytes to characters is defined by the encoding. More information here: http://www.joelonsoftware.com/articles/Unicode.html
A byte stream on the other hand just reads in values from 0 to 255 and doesn't try and interpret them as characters from any particular language. As such, a byte stream should always be somewhat faster. But if you had international characters in there, they'll not display properly unless you know exactly how they were encoded.
For most purposes, human-readable data can be stored in ASCII, which only uses 7 bits of data per character and gives you 128 different characters. This will be readable by any typical text editor, and since ASCII characters are a subset of Unicode and of the UTF-8 encoding, you can read an ASCII file either as bytes or as UTF-8 characters, and the content will be unchanged.
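A small sketch of that point, assuming a plain ASCII file (the file name is a placeholder): reading it as raw bytes and decoding it as UTF-8 give you the same content either way.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class AsciiBothWays {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("level1.txt"));   // byte view
        String text = new String(raw, StandardCharsets.UTF_8);      // character view
        // For pure ASCII content the round trip is lossless:
        System.out.println(Arrays.equals(raw, text.getBytes(StandardCharsets.UTF_8)));
    }
}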
If you ever need to store binary values for more efficient serialisation (eg. to store the number 123456789 as a 4 byte integer instead of as a 9 byte string) then you'll need to switch to a byte stream, but you also give up human-readability at this point so the issue becomes somewhat irrelevant.
It's unlikely that the size of the level will ever have much effect on your loading times - a typical hard drive can read well over a hundred megabytes per second. Code whichever way is easiest for you, and only optimise this later if your profiling shows there is a problem.

Related

Object to bytes array in Java

I'm working on a proprietary TCP protocol. This protocol sends and receives messages with a specific sequence of bytes.
I have to be compliant with this protocol, and I can't change it.
So my input/output looks something like this:
\x01\x08\x00\x01\x00\x00\x01\xFF
\x01 - Message type
\x00\x01 - Length
\x00\x00\x01 - Transaction
\xFF - Body
The sequence of fields is important, and I want only the values of the fields in my serialization, nothing about the structure of the class.
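For reference, a hand-rolled sketch of what packing fields like these into a byte[] with ByteBuffer could look like; the field widths are guessed from the example above and may not match the real protocol.

import java.nio.ByteBuffer;

public class PackMessage {
    public static void main(String[] args) {
        // Hypothetical field widths, read off the example above:
        // 1-byte type, 2-byte length, 3-byte transaction, variable-length body.
        byte messageType = 0x01;
        short length = 0x0001;
        int transaction = 0x000001;
        byte[] body = { (byte) 0xFF };

        ByteBuffer buf = ByteBuffer.allocate(1 + 2 + 3 + body.length);
        buf.put(messageType);
        buf.putShort(length);                  // big-endian by default
        buf.put((byte) (transaction >>> 16));  // 3-byte transaction id, high byte first
        buf.put((byte) (transaction >>> 8));
        buf.put((byte) transaction);
        buf.put(body);

        byte[] wire = buf.array();             // the exact byte sequence to send
        System.out.println(wire.length + " bytes");
    }
}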
I'm working on a Java controller that uses this protocol, and I thought I could define the message structures in specific classes and serialize/deserialize them, but I was naive.
First of all I tried ObjectOutputStream, but it outputs the entire structure of the object, when I need only the values in a specific order.
Someone already faced this problem:
Java - Object to Fixed Byte Array
and solved it with a dedicated Marshaller.
But I was searching for a more flexible solution.
For text serialization and deserialization I've found:
http://jeyben.github.io/fixedformat4j/
that defines the schema of the line with annotations. But it outputs a String, not a byte[]. So 1 is output as "1", which is represented differently depending on the encoding, and often with more bytes.
What I was searching for is something that, given the order of my class properties, will convert each property into a group of bytes (based on its internal representation) and append them to a byte[].
Do you know some library used for that purpose?
Or a simple way to do that, without coding a serialization algorithm for each of my entities?
Serialization just isn't easy; it sounds from your question like you feel you can just invoke something and out rolls compact, simple, versionable, universal data you can then put on the wire. What you need to fix is to scratch the word 'just' from that sentence. You're going to have to invest some time and care.
As you figured out already, Java's baked-in serialization has a ton of downsides. Don't use that.
There are various serializers. The popular ones are things like GSON or Jackson, which let you serialize Java objects into JSON. This isn't particularly efficient, and it is string based. These sound like crucial downsides, but they really aren't; see below.
You can also spend a little more time specifying the exact format and use protobuf, which lets you write a quite lean and simple data protocol (and protobuf is available for many languages, if you eventually want to write a participant in this protocol in something other than Java).
So, those are the good options: go to JSON via Jackson or GSON, or use protobuf.
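A minimal Jackson sketch, assuming jackson-databind is on the classpath; the Message class is a made-up stand-in for your own message types:

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonExample {
    // Hypothetical message class; Jackson picks up the public fields by default.
    public static class Message {
        public int type;
        public int transaction;
        public String body;
    }

    public static void main(String[] args) throws Exception {
        Message msg = new Message();
        msg.type = 1;
        msg.transaction = 1;
        msg.body = "FF";

        ObjectMapper mapper = new ObjectMapper();
        String json = mapper.writeValueAsString(msg);           // serialize
        Message back = mapper.readValue(json, Message.class);   // deserialize
        System.out.println(json + " -> type=" + back.type);
    }
}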
But JSON is a string.
You can turn a string into bytes trivially using str.getBytes(StandardCharsets.UTF_8). This cannot fail due to charset encoding differences, as long as you also decode in the same fashion: turn the bytes back into a string with new String(theBytes, StandardCharsets.UTF_8). UTF-8 is guaranteed to be available on all JVMs; if it is not there, your JVM is as broken as a JVM that is missing the String class - not something to worry about.
But JSON is inefficient.
Zip it up, of course. You can trivially wrap an InputStream and an OutputStream so that gzip compression is applied, which is simple, available on just about every platform, and fast (it's not the most efficient cutting-edge compression algorithm, but squeezing out the last few bytes is usually not worth it) - and zipped-up JSON can often be more efficient than carefully hand-rolled protobuf, even.
The one downside is that it's 'slow', but on modern hardware, note that the overhead of encrypting and decrypting this data (which you should obviously be doing!) is usually multiple orders of magnitude more involved. A modern CPU is simply very, very fast - creating JSON and zipping it up is going to take 1% of CPU or less even if you are shipping the collected works of Shakespeare every second.
If an Arduino running on batteries needs to process this data, go with uncompressed, unencrypted protobuf-based data. If you are Facebook and writing the WhatsApp protocol, the IaaS costs saved by not having to unzip and decode JSON are tiny and pale in comparison to what you spend just running the servers, but at that scale it's worth the development effort.
In just about every other case, just toss gzipped JSON on the line.
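Putting the last two points together, here is a rough sketch of turning a JSON string into gzipped UTF-8 bytes and back (the JSON payload is made up):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipJson {
    static byte[] compress(String json) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    static String decompress(byte[] data) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8); // readAllBytes: Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = compress("{\"type\":1,\"transaction\":1,\"body\":\"FF\"}");
        System.out.println(decompress(wire));
    }
}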

Fastest way of writing the first 10 000 lines of data file to new file

I want the first ten thousand lines of a hyuuge (.csv) file.
The naive way of
1) creating a reader & writer
2) reading the original file line for line
3) writing the first ten thousand lines to a new file
can't be the fastest, can it?
This will be a common operation in my app so I'm slightly concerned about speed, but also just curious.
Thanks.
There are a few ways of doing fast I/O in Java, but without benchmarking your particular case it's difficult to give a concrete figure or recommendation. Here are a few approaches you can try benchmarking:
Buffered reader/writers with maybe varying buffer sizes
Reading the entire file in memory (if it can be) and doing an in-memory split and writing it all in a single go
Using NIO file API for reading/writing files (look into Channels)
If you only want to read/write 10,000 lines or so:
it will probably take longer to start up a new JVM than to read / write the file,
the read / write time should be a fraction of a second ... doing it the naive way, and
the overall speed up from a copying algorithm is unlikely to be worthwhile.
Having said that, you can do better than reading a line at a time using BufferedReader.readLine() or whatever.
Depending on the character encoding of the file, you will get better performance by doing byte-wise I/O with a BufferedInputStream and BufferedOutputStream with large buffer sizes. Just write a loop to read a byte, conditionally update the line counter and write the byte ... until you have copied the requisite number of lines. (This assumes that you can detect the CR and/or LF characters by examining the bytes. This is true for all character encodings I know about.)
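A rough sketch of that idea, counting '\n' bytes while copying; the file names and buffer sizes are placeholders, and it covers both LF and CRLF line endings:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CopyFirstLines {
    public static void main(String[] args) throws IOException {
        int linesWanted = 10_000;
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("huge.csv"), 1 << 16);
             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("head.csv"), 1 << 16)) {
            int lines = 0;
            int b;
            while (lines < linesWanted && (b = in.read()) != -1) {
                out.write(b);
                if (b == '\n') {   // count line terminators as we copy
                    lines++;
                }
            }
        }
    }
}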
If you use NIO and ByteBuffers, you can further reduce the amount of in-memory copying, though the CR / LF counting logic will be more complicated.
But the first question you should ask is whether it is even worthwhile bothering to optimize this.
Are the lines the same length? If so, you can use RandomAccessFile to read x bytes and then write those bytes to a new file. It may be quite memory intensive, though. I suspect this would be quicker, but it is probably worth benchmarking. This solution would only work for fixed-length lines.
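A rough sketch of that fixed-length-record idea; the record length and file names are made up:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class CopyFixedLengthLines {
    public static void main(String[] args) throws IOException {
        int lineLength = 80;          // hypothetical fixed line length, including the terminator
        int linesWanted = 10_000;
        byte[] chunk = new byte[lineLength * linesWanted];
        try (RandomAccessFile raf = new RandomAccessFile("huge.csv", "r");
             FileOutputStream out = new FileOutputStream("head.csv")) {
            raf.readFully(chunk);     // read exactly the first N lines in one go (throws if the file is shorter)
            out.write(chunk);
        }
    }
}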

Why should IO as Byte be avoided in Java

CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. It has been mentioned that there are streams for characters, objects, etc. that should be preferred, although all are built on the byte stream itself. What is the reason behind this? Does it have anything to do with the security manager or performance-related issues?
Source: Oracle docs
What Oracle is actually saying is, "Please do not reinvent the wheel!"
You should almost never need plain byte streams:
Are you parsing text? Use a character stream, which understands text encoding issues.
Are you parsing XML? Use SAX or some other library.
Are you parsing images? Use the ImageIO class.
Are you copying things from A to B? Use Apache commons-io FileUtils.
There are very few situations where you will actually need to use a byte stream directly.
From the text you quoted:
CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. Since xanadu.txt contains character data, the best approach is to use character streams, as discussed in the next section. There are also streams for more complicated data types. Byte streams should only be used for the most primitive I/O.
Usually, you don't want to work with bytes directly. There are higher-level APIs, for example to read text (i.e. character data that has to be decoded from bytes).
It works, but is very inefficient: it needs 2 method calls for every single byte it copies.
Instead, you should use a buffer (of several thousand bytes, the best size varies by what exactly you read and other conditions) and read/write the entire buffer (or as much as possible) with every method call.
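A minimal sketch of the buffered version of the same copy, using the tutorial's xanadu.txt as input and a placeholder output name:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CopyBuffered {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[8192];   // copy in 8 KB chunks instead of byte by byte
        try (FileInputStream in = new FileInputStream("xanadu.txt");
             FileOutputStream out = new FileOutputStream("copy.txt")) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}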

Find patterns in long string?

I have a 40 KB HTML page and I want to find certain patterns in it.
I can read it with a 1 KB buffer, but I want to avoid the situation where a pattern I'm searching for is split between two buffer reads.
How can I overcome this problem?
This is easy. You take the length of the longest pattern you will look for, then either backtrack the file pointer by that amount or scroll through the file, reading only the delta.
Imagine the longest pattern being 26 bytes.
Read 1k.
Check for all patterns -> nothing.
Drop 1k - 26 bytes from the buffer.
Read 1k - 26 bytes from the stream and add them to your buffer.
Goto 2.
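A rough sketch of those steps, assuming the patterns are plain ASCII strings and the longest one fits comfortably inside the buffer; the file name and pattern are placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OverlapSearch {
    public static void main(String[] args) throws IOException {
        String pattern = "</table>";             // hypothetical pattern to search for
        int overlap = pattern.length() - 1;      // bytes to carry over between reads
        byte[] buffer = new byte[1024];
        int kept = 0;                            // bytes carried over from the previous read

        try (InputStream in = new FileInputStream("page.html")) {
            int read;
            while ((read = in.read(buffer, kept, buffer.length - kept)) != -1) {
                int total = kept + read;
                // Decode with a single-byte charset so byte offsets match char offsets.
                String window = new String(buffer, 0, total, StandardCharsets.ISO_8859_1);
                if (window.contains(pattern)) {
                    System.out.println("pattern found");
                    return;
                }
                // Keep the tail so a match split across two reads is still seen next time.
                kept = Math.min(overlap, total);
                System.arraycopy(buffer, total - kept, buffer, 0, kept);
            }
        }
    }
}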
Edit: Let me clarify: There are two methods to do this, and both have their merits. The one I documented above is best used if you are reading from a stream, meaning a data source that does not support seeking. If, however, your data source does support seeking (like a filesystem file), you can easily do the same with seeks: check for the pattern; if it is not found, seek back by the size of your longest pattern, then continue from there.
If, however, you want to support searching for patterns that are longer than your buffer size, you need a much more clever algorithm. You would need a lookup table of all patterns that are currently "open" when you continue to read more data, which in turn will cost more memory - you get the problem.
That's what the Scanner class is for.
You could take a look at CharBuffer, which implements CharSequence for just this purpose.
Why not use a SAX parser? It is built to handle large files of mark-up. You would only run into problems if you are trying to match across different elements at the same level. However, this is not impossible to handle.

File upload-download in its actual format

I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not saved, and some binary characters are inserted automatically. Also, I'm not able to save the file in its actual format; I have to save it as "filename.ser". I'm using the serialization/deserialization concept of Java.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
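A minimal sketch of the byte-stream route; the file streams here stand in for whatever socket or connection streams you actually have:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RawCopy {
    public static void main(String[] args) throws IOException {
        // File names are placeholders; copying raw bytes keeps the result binary-equal.
        try (InputStream in = new FileInputStream("upload.dat");
             OutputStream out = new FileOutputStream("download.dat")) {
            in.transferTo(out);   // Java 9+: streams every byte through, unmodified
        }
    }
}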
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)
