Why to avoid using ByteStream much in Java

According to the Sun documentation, we shouldn't use byte streams:
"it actually represents a kind of low-level I/O that you should avoid."
What exactly is low-level I/O, and what exactly is the problem with using byte streams?

So the Java docs say:
CopyBytes seems like a normal program, but it actually represents a
kind of low-level I/O that you should avoid. Since xanadu.txt contains
character data, the best approach is to use character streams, as
discussed in the next section. There are also streams for more
complicated data types. Byte streams should only be used for the most
primitive I/O.
The byte streams give you access to the file as it is. Just the bytes. No interpretation of any kind. That means no character set conversion, no handling of ints or floats in binary or ASCII representation, no dealing with byte orders, or any of that. The higher-level streams provide some of these.
Of course, a program that copies a file is actually a pretty good example of something that needs a raw byte stream, because it doesn't need or want to do any kind of interpretation of the data; it just wants to copy it verbatim.
So what they really mean is: use byte streams if you think you need them, but be sure you know what you are doing :)
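To make the "copy it verbatim" case concrete, here is a minimal sketch of a byte-stream copy. The class name CopyBytesSketch and the 8192-byte buffer size are my own choices for illustration, not something taken from the tutorial.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyBytesSketch {
    // Copies every byte from in to out without interpreting it; returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // arbitrary chunk size, an assumption of this sketch
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the source path, args[1] the destination path
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            System.out.println(copy(in, out) + " bytes copied");
        }
    }
}
```

Nothing here cares whether the file holds text, an image, or a .class file, which is exactly why byte streams fit this job.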

The suggestion is in the context of reading a text file that is discussed in the tutorial. For that purpose it is better to use character streams to handle character set translation properly:
The Java platform stores character values using Unicode conventions.
Character stream I/O automatically translates this internal format to
and from the local character set.
A program that uses character streams in place of byte streams
automatically adapts to the local character set and is ready for
internationalization — all without extra effort by the programmer.
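As a rough illustration of that translation, the sketch below decodes the same bytes through a character stream. The class and method names are hypothetical, and I pass the charset explicitly instead of relying on the local default so the result is predictable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharStreamSketch {
    // Decodes the given bytes into text by reading them through a character stream.
    public static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" is "cafe" with an acute accent on the e: 4 characters, 5 bytes in UTF-8
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes");
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length() + " chars");
    }
}
```

Reading those same five bytes one at a time from a byte stream would split the accented letter across two reads, which is the kind of problem the tutorial is warning about.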

Related

In Java, how to copy data from String to char[]/byte[] efficiently?

I need to copy the contents of many large, distinct Strings into a static char array and use that array frequently in a performance-critical job, so it's important to avoid allocating too much new space.
For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.
As we all know, charAt(i) is slower and more cumbersome than indexing with square brackets [i], so I want to use byte[] or char[].
The good news is that there's str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it is deprecated (or about to be?).
So how can we get this job done?

I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
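A minimal sketch of how getChars can fill one reusable array follows. The holder class ReusableCharBuffer, its starting size, and its growth policy are assumptions made for the example.

```java
public class ReusableCharBuffer {
    private char[] buffer = new char[1024]; // starting capacity is arbitrary

    // Copies s into the shared buffer, reallocating only when s does not fit.
    // Returns the number of valid chars now in the buffer.
    public int load(String s) {
        int length = s.length();
        if (length > buffer.length) {
            buffer = new char[Math.max(length, buffer.length * 2)];
        }
        s.getChars(0, length, buffer, 0);
        return length;
    }

    public char[] array() {
        return buffer;
    }

    public static void main(String[] args) {
        ReusableCharBuffer shared = new ReusableCharBuffer();
        int n = shared.load("hello");
        System.out.println(new String(shared.array(), 0, n));
    }
}
```

Only the first n entries are meaningful after each load; anything beyond that is left over from earlier strings.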

A small stocktaking:
String holds Unicode text; it can be normalized (java.text.Normalizer).
int[] code points are Unicode symbols.
char[] holds UTF-16 code units (2 bytes per char); some code points need 2 chars, a surrogate pair.
byte[] is for binary data. Holding Unicode text in UTF-8 is relatively compact when much of the text is ASCII.
Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.
When dealing with Asian scripts, int code points are probably the most practical.
Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is used to classify Unicode blocks and scripts, digits in several scripts, emoji, whatever.
For performance, bytes are often the best choice because they are usually the most compact. Probably UTF-8.
One cannot control memory allocation efficiently here. getBytes should be used with a Charset. Some kind of conversion almost always happens. Newer Java versions can keep a byte array instead of a char array internally for text that fits an encoding like Latin-1 (ISO-8859-1), so even access to an internal char array would not help. And new arrays are created anyway.
What one can do is use fast ByteBuffers.
Alternatively, for linguistic analysis one can use databases, maybe graph databases. At least something which can exploit parallelism.
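One way to read the ByteBuffer suggestion is to encode each string into a single preallocated buffer with a CharsetEncoder. This is only a sketch under my own assumptions: UTF-8 as the target encoding, a fixed capacity chosen by the caller, and a hypothetical class name.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ReusableEncodeSketch {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private final ByteBuffer out;

    public ReusableEncodeSketch(int capacity) {
        out = ByteBuffer.allocate(capacity);
    }

    // Encodes s into the reused buffer and returns that buffer, flipped and ready for reading.
    public ByteBuffer encode(String s) {
        out.clear();
        encoder.reset();
        // CharBuffer.wrap gives a read-only view of the String without copying its chars
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new IllegalArgumentException("buffer too small for input");
        }
        if (result.isError()) {
            throw new IllegalArgumentException("input cannot be encoded: " + result);
        }
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        ReusableEncodeSketch sketch = new ReusableEncodeSketch(64);
        System.out.println(sketch.encode("h\u00e9").remaining() + " bytes");
    }
}
```

The returned buffer is overwritten by the next call, so the caller has to finish with it first. Whether this beats plain getBytes(Charset) is something to measure, not assume.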

You are pretty much restricted to the APIs offered within the String class, and obviously, that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).
In other words: that problem you are talking about "having many large strings, that need to go into arrays" can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create those many large strings upfront!
In other words: if your measurements convince you that you do have a real performance issue, then adapt your design as needed. Maybe, at the place where your strings are "coming in", you can avoid String objects altogether and use something that works better for you later on, performance-wise.
But of course: that will lead to a complex, error prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and it actually sits in the place you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because sometimes you don't have to convert the entire string I suppose you could substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or if it's all the same to you if you store it in char[] then you can use getChars(int,int,char[],int) which is not deprecated.

Both Reader and Stream give the same result , what is the difference? [duplicate]

Today I was asked this question, and I think I answered it very badly. I said that a stream is data that flows, and that a reader is a technique for reading from static data. I know this is an awful answer, so please give me a crisp distinction between the two, with definitions and an example in Java.
Thanks.

An InputStream is byte-oriented. A Reader is character-oriented.
The javadocs are your friend, explaining the difference. Reader, InputStream

As others have said, the use cases for each are slightly different (even though they can often be used interchangeably).
Since readers are for reading characters, they are better when you are dealing with input that is of a textual nature (or data represented as characters). I say better because Readers (in the context of typical usage) are essentially streams with methods that easily facilitate reading character input.

Stream is for reading bytes, Reader is for reading characters. One character may take one byte or more, depending on character set.
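To put numbers on "one byte or more", this small sketch counts the encoded length of single characters in a few charsets. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesPerCharSketch {
    // Returns how many bytes the text occupies when encoded with the given charset.
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String plainA = "A";
        String eAcute = "\u00e9";   // e with acute accent
        String euroSign = "\u20ac"; // euro sign
        System.out.println("A in UTF-8: " + encodedLength(plainA, StandardCharsets.UTF_8));
        System.out.println("e-acute in ISO-8859-1: " + encodedLength(eAcute, StandardCharsets.ISO_8859_1));
        System.out.println("e-acute in UTF-8: " + encodedLength(eAcute, StandardCharsets.UTF_8));
        System.out.println("euro in UTF-8: " + encodedLength(euroSign, StandardCharsets.UTF_8));
        System.out.println("euro in UTF-16BE: " + encodedLength(euroSign, StandardCharsets.UTF_16BE));
    }
}
```

A Reader hides these differences and hands back whole characters; an InputStream hands back the individual bytes.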

Stream classes are byte-oriented: all InputStream classes (buffered and non-buffered) read data byte by byte from the stream, and all OutputStream classes (buffered and non-buffered) write data byte by byte to the stream. Stream classes are useful when you have small data or are dealing with binary files like images.
On the other hand, Reader/Writer classes are character-based. They read or write one character at a time from or into the stream. They extend either java.io.Reader (all character input classes) or java.io.Writer (all character output classes). These classes are useful when you are dealing with a text file or another textual stream. They also come in buffered and non-buffered variants.

Reading different encoding from the same InputStream [duplicate]

I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.

One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails "knowledge" of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, and the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering using it for an application that wants to read stuff byte-wise.
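For the bit-array exercise itself, a byte-wise reader could look roughly like the sketch below. It assumes the file holds ASCII decimal numbers separated by newlines (or any other non-digit bytes), which is my guess at the input format; the class and method names are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.BitSet;

public class BitmapReadSketch {
    // Reads ASCII decimal numbers one byte at a time and sets the bit for each number seen.
    public static BitSet readNumbers(InputStream in, int expectedMaxExclusive) throws IOException {
        BitSet seen = new BitSet(expectedMaxExclusive); // initial size hint only
        int value = 0;
        boolean inNumber = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b >= '0' && b <= '9') {
                value = value * 10 + (b - '0');
                inNumber = true;
            } else if (inNumber) {
                // any non-digit byte ends the current number
                seen.set(value);
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) {
            seen.set(value); // last number without a trailing separator
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input file; 10_000_000 covers all 7-digit numbers
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(readNumbers(in, 10_000_000).cardinality() + " distinct numbers");
        }
    }
}
```

Because the digits are plain ASCII, no character decoding is needed at all. Reading single bytes from an unbuffered FileInputStream is typically slow, which is the price of the no-buffering requirement.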

Java I/O streams; what are the differences?

java.io has many different I/O streams, (FileInputStream, FileOutputStream, FileReader, FileWriter, BufferedStreams... etc.) and I am confused in determining the differences between them. What are some examples where one stream type is preferred over another, and what are the real differences between them?

Streams: one byte at a time. Good for binary data.
Readers/Writers: one character at a time. Good for text data.
Anything "Buffered": many bytes/characters at a time. Good almost all the time.

When learning Java I made this mental scheme about java.io:
Streams
byte oriented stream (8 bit)
good for binary data such as a Java .class file
good for "machine-oriented" data
Readers/Writers
char (utf-16) oriented stream (16 bit)
good for text such as a Java source
good for "human-oriented" data
Buffered
always useful unless proven otherwise

This is a big topic! I would recommend that you begin by reading I/O Streams:
An I/O Stream represents an input
source or an output destination. A
stream can represent many different
kinds of sources and destinations,
including disk files, devices, other
programs, and memory arrays.
Streams support many different kinds
of data, including simple bytes,
primitive data types, localized
characters, and objects. Some streams
simply pass on data; others manipulate
and transform the data in useful ways.

Separate each name into words: each capital is a different word.
File Input Stream is to get Input from a File using a Stream.
File Output Stream is to write Output to a File using a Stream
And so on and so forth
As mmyers wrote :
Streams: one byte at a time.
Readers/Writers: one character at a time.
Buffered*: many bytes/characters at a time.

The specialisations you mention are specific types used to provide a standard interface to a variety of data sources. For example, FileInputStream and ObjectInputStream both extend the abstract InputStream class, but the first reads bytes from a file while the second reads serialized objects from an underlying stream.

Java input and output is defined in terms of an abstract concept called a “stream”, which is a sequence of data.
There are 2 kinds of streams.
Byte streams (8 bit bytes) -> Abstract classes are: InputStream and OutputStream
Character streams (16 bit UNICODE) -> Abstract classes are: Reader and Writer
java.io.* classes use the decorator design pattern. The decorator design pattern attaches
responsibilities to objects at runtime. Decorators are more flexible than inheritance because the inheritance
attaches responsibility to classes at compile time. The java.io.* classes use the decorator pattern to construct
different combinations of behavior at runtime based on some basic classes.
from the book Java/J2EE Job Interview Companion By K.Arulkumaran & A.Sivayini
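As a small illustration of that decorator idea, the sketch below stacks a DataOutputStream on a BufferedOutputStream on an in-memory sink, then reads the data back through the matching input decorators. The record layout (an int, a double, a string) and the class name are invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class DecoratorSketch {
    // Each wrapper adds one behaviour: buffering, then primitive-type encoding.
    public static byte[] write(int id, double price, String name) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(sink))) {
            out.writeInt(id);
            out.writeDouble(price);
            out.writeUTF(name);
        } // closing flushes the buffered layer into the sink
        return sink.toByteArray();
    }

    // Reads the values back in the same order they were written.
    public static String read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)))) {
            return in.readInt() + " " + in.readDouble() + " " + in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(7, 2.5, "pen")));
    }
}
```

Swapping the ByteArrayOutputStream for a FileOutputStream changes the destination without touching the other layers, which is the flexibility the quote describes.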

Byte streams were the main stream type in Java 1.0, used for both character and byte data. Character streams were added in Java 1.1 and now play an important role for text. Byte streams themselves were not deprecated, although some byte-based text methods such as DataInputStream.readLine() were. For example,
BufferedReader gets characters from the source, and its constructor looks like
BufferedReader(Reader inputReader)
Here Reader is an abstract class, and one of its concrete subclasses is InputStreamReader, which converts bytes into characters and can take input from the keyboard (System.in).
BufferedReader: contains an internal buffer that reads characters from the stream. An internal counter keeps track of the next character to be supplied from the buffer through read(). InputStreamReader takes its input as bytes and converts them internally into characters.
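A minimal sketch of that wrapping follows. The helper readLines is hypothetical, and in main I fall back to the platform default charset for keyboard input, which is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class StdinLinesSketch {
    // Wraps a byte stream in an InputStreamReader (bytes to chars) and a BufferedReader (line reading).
    public static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Reads from the keyboard until end of input, then echoes the lines back
        for (String line : readLines(System.in, Charset.defaultCharset())) {
            System.out.println("read: " + line);
        }
    }
}
```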

How do I identify that I am at the last byte of a serialized Java object?

Question
What are the terminating characters/byte sequences (if there are any) in serialized Java objects?
Background
I'm working on a small self-education project where I would like to serialize Java objects and write them to a stream, where they are read and then deserialized. Since I will need to identify the borders between serialized objects, and I can't be sure that the current object is the last one, is there a terminating character that is always there that I can use as my identifier?
I noticed that there is a magic number ACED that allows me to identify the start of the object, so how do I identify the end?
EDIT:
If there is no terminating character, are there any safe terminating characters/sequences that I can use (insert) to identify the end of the object?

In theory you should always be able to find the end of an object; in practice you cannot. As I understand it, the problem is that customised writeObject implementations that don't call either defaultWriteObject or writeFields have a non-standard representation.
I've played about with serialisation in the past. Including creating streams for use when I've been doing unusual things to the ObjectInputStream. It's not pleasant(!).
You can read the details in the spec, and the source is worth a read.

there are none. AFAIK the only requirement is that the deserialiser know when to stop reading, when given a corresponding serialisation. subject to that, the serialiser can write whatever it wants -- in any position not just the last.
if you're old skool dump a 32-bit length field at the beginning and refuse to handle objects bigger than 4 gig.
nu scool, you just make sure your read and your write logic are consistent and don't care about the length.

You can add a terminating object to your object stream. e.g. null or a special String.
However, I suggest that you instead serialize each object to a byte[] with its own ObjectOutputStream and write the length of the byte[] followed by its data. This way each serialized object is independent and you always know where it finishes.
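The length-prefix idea could be sketched like this. The class and method names are hypothetical, and the 4-byte int prefix is my choice for the frame header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class FramedObjectsSketch {
    // Serializes obj on its own, then writes a 4-byte length followed by the serialized bytes.
    public static void writeFramed(DataOutputStream out, Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(obj);
        }
        byte[] payload = bytes.toByteArray();
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads one length-prefixed frame and deserializes exactly those bytes.
    public static Object readFramed(DataInputStream in) throws IOException, ClassNotFoundException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        writeFramed(out, "first");
        writeFramed(out, Integer.valueOf(42));
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(sink.toByteArray()));
        System.out.println(readFramed(in));
        System.out.println(readFramed(in));
    }
}
```

The reader never has to look inside the serialized form to find the boundary: it reads the length, then that many bytes, and the next frame starts right after. As usual, only deserialize data from a source you trust.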

Have you considered applying a record-marking layer similar to HTTP Chunked encoding?
The Chunked encoding is intended to solve a generalization of this scenario: identifying the end of a message of indeterminate length that both itself contains no identifiable end, and is embedded in a longer stream without ending it.
