Best delimiter to safely parse byte arrays from a stream - java

I have a byte stream that returns a sequence of byte arrays, each of which represents a single record.
I would like to parse the stream into a list of individual byte[]s. Currently, I have hacked in a three-byte delimiter so that I can identify the end of each record, but I have concerns.
I see that there is a standard ASCII record separator character:
30 (decimal), 036 (octal), 0x1E (hex), 00011110 (binary): RS, Record Separator
Is it safe to use a byte[] derived from this character as a delimiter if the byte arrays (which were UTF-8 encoded) have been compressed and/or encrypted? My concern is that the compression/encryption output might produce the record separator byte for some other purpose. Please note that the individual byte[] records are compressed/encrypted, rather than the entire stream.
I am working in Java 8 and using Snappy for compression. I haven't picked an encryption library yet, but it would certainly be one of the stronger, standard, private key approaches.

You can't simply declare a byte as a delimiter if you're working with random, unstructured data (which compressed/encrypted data resembles quite closely), because the delimiter can always appear as a regular data byte in such data.
If the size of the data is already known when you start writing, just write the size first and then the data. When reading back, you then know you need to read the size first (e.g. 4 bytes for an int), and then exactly as many bytes as the size indicates.
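A minimal sketch of that length-prefix approach, using DataOutputStream/DataInputStream (the method names here are just illustrative):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
    out.writeInt(record.length); // 4-byte length prefix
    out.write(record);           // then the record bytes
}

static byte[] readRecord(DataInputStream in) throws IOException {
    int length = in.readInt();        // read the 4-byte length prefix
    byte[] record = new byte[length];
    in.readFully(record);             // read exactly 'length' bytes
    return record;
}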
This will obviously not work if you can't tell the size while writing. In that case, you can use an escaping mechanism: select a rarely appearing byte as the escape character, escape all occurrences of that byte in the data, and use a different byte as the end indicator.
e.g.
final static byte ESCAPE = (byte) 0xBC;
final static byte EOF = (byte) 0x00;

OutputStream out = ...
for (byte b : source) {
    if (b == ESCAPE) {
        // escape data bytes that have the value of ESCAPE
        out.write(ESCAPE);
        out.write(ESCAPE);
    } else {
        out.write(b);
    }
}
// write the end-of-record marker: ESCAPE, EOF
out.write(ESCAPE);
out.write(EOF);
Now when reading, once you read the ESCAPE byte, you read the next byte and check for EOF. If it is not EOF, it is an escaped ESCAPE that represents a data byte.
InputStream in = ...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    // in.read() returns an unsigned value 0..255, so mask the signed byte constants before comparing
    if (b == (ESCAPE & 0xFF)) {
        b = in.read();
        if (b == (EOF & 0xFF))
            break;
        buffer.write(b);
    } else {
        buffer.write(b);
    }
}
If the bytes to be written are uniformly randomly distributed, this will increase the stream length by about 1/256. For data domains that are not completely random, you can select the byte that appears least frequently as the escape character (by static data analysis or just an educated guess).
Edit: you can reduce the escaping overhead by using more elaborate logic; e.g. the example can only produce ESCAPE + ESCAPE or ESCAPE + EOF. The other 254 byte values can never follow an ESCAPE in the example, so that could be exploited to encode legal data combinations.

It is completely unsafe; you never know what might turn up in your data. Perhaps you should consider something like protobuf, or a scheme like 'first write the record length, then write the record, then rinse, lather, repeat'?
If you have a length, you don't need a delimiter. Your reading side reads the length, then knows how much to read for the first record, and then knows to read the next length -- all assuming that the lengths themselves are fixed-length.
See the developers' suggestions for streaming a sequence of protobufs.
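For reference, the protobuf Java runtime already supports this length-prefixed pattern via writeDelimitedTo/parseDelimitedFrom. A rough sketch, assuming a generated message type named MyRecord (the type name and streams are placeholders):

// Writing: each message is preceded by its varint-encoded length.
MyRecord record = ...;
record.writeDelimitedTo(outputStream);

// Reading: parseDelimitedFrom returns null once the stream is exhausted.
MyRecord next;
while ((next = MyRecord.parseDelimitedFrom(inputStream)) != null) {
    // process next
}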

Related

What is the best way to get the size of text in bytes in Java?

I have implemented a cryptographic algorithm in Java. Now, I want to measure the size of the message before and after encryption in bytes.
How to get the size of the text in bytes?
For example, if I have a simple text Hi! I am alphanumeric (8÷4=2)
I have tried my best but can't find a good solution.
String temp = "Hi! I am alphanumeric (8÷4=2)";
temp.length();     // this works because in ASCII every char takes one byte
temp.length() * 2; // in Java every char in a String takes two bytes, so multiply by 2
// String.getBytes().length and getBytes("UTF-8").length return the same result
But in my case, after decryption of the message, the characters become a mixture of ASCII and other Unicode characters.
e.g. QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥ죉ὥ
The methods above return either the length or length * 2.
But I want to calculate the actual size in bytes (not the size inside the JVM). For example, the char a takes one byte in general, and a Unicode character such as ™ takes two bytes.
So how to implement this technique in Java?
I want something like the technique used by this website: http://bytesizematters.com/
It gives me 26 bytes for this text QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥ죉ὥ although the length of text is 22.
Be aware: String is for Unicode text (it can mix all kinds of scripts), and a char is a two-byte UTF-16 code unit.
This means that binary data (byte[]) needs an associated encoding/charset before it can be converted to or from a String.
byte[] b = ...
String s = ...
b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
Without an explicit charset for the bytes, the platform default is used, which gives non-portable code.
UTF-8 allows all text, not just some scripts: Greek, Arabic, Japanese, and so on.
However, as a conversion is involved, non-text binary data can get corrupted (it will generally not be legal UTF-8), will cost roughly double the memory, and will be slower because of the conversion.
Hence avoid String for binary data at all costs.
To respond to your question:
You might get away with StandardCharsets.ISO_8859_1, which is a single-byte encoding.
String.getBytes(StandardCharsets.ISO_8859_1).length will then correspond to String.length(), though the String might use double the memory, as a char is two bytes.
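To actually measure the size in bytes, encode with the charset you store or transmit in and take the length of the resulting array; a small sketch:

import java.nio.charset.StandardCharsets;

String temp = "Hi! I am alphanumeric (8÷4=2)";
int utf8Bytes   = temp.getBytes(StandardCharsets.UTF_8).length;      // size when stored/sent as UTF-8
int latin1Bytes = temp.getBytes(StandardCharsets.ISO_8859_1).length; // one byte per char (unmappable chars become '?')
int utf16Bytes  = temp.getBytes(StandardCharsets.UTF_16BE).length;   // two bytes per char (no byte order mark)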
Alternatives to String:
byte[] itself; Arrays provides utility functions, such as Arrays.equals.
ByteArrayInputStream, ByteArrayOutputStream
ByteBuffer can wrap byte[]; can read and write short/int/...
Convert the byte[] to a Base64 String using Base64.getEncoder().encodeToString(bytes).
Converting a byte to some char
The goal is to convert a byte to a visible symbol displayable in a GUI text field, and where the length in chars is the same as the number of original bytes.
For instance, the font Lucida Sans Unicode has, starting at U+2400, symbols representing the ASCII control characters. For bytes with the 8th bit set, one could use Cyrillic, though confusion may arise because of the similarity between Cyrillic е and Latin e.
static char byte2char(byte b) {
    if (b < 0) {            // -128 .. -1: map into the Cyrillic block starting at U+0400
        return (char) (0x400 - b);
    } else if (b < 32) {    // ASCII control characters: map to the Control Pictures block at U+2400
        return (char) (0x2400 + b);
    } else if (b == 127) {  // DEL
        return '\u25C1';
    } else {                // printable ASCII maps to itself
        return (char) b;
    }
}
A char is a UTF-16 code unit, but the values used here also correspond directly to Unicode code points (int).
A byte is signed, hence ranges from -128 to 127.
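A possible usage sketch for the helper above (the wrapping method is made up for illustration):

static String bytesToDisplayable(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (byte b : data) {
        sb.append(byte2char(b)); // one visible symbol per original byte
    }
    return sb.toString();
}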

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes. An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library. The solution must be scalable. Reading the entire file into memory is not a solution. Returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader wrapping a FileInputStream, and keep an instance of the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine().
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
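A minimal sketch of that idea (the file name is a placeholder; note that the InputStreamReader itself may still read ahead, so treat the position as approximate):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

FileInputStream fis = new FileInputStream("data.txt");
BufferedReader bufferedReader =
        new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8), 1); // tiny buffer for precision

String line = bufferedReader.readLine();
long nextLineStart = fis.getChannel().position(); // byte offset of what has been consumed from the file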
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or the saved position if this file has been previously processed
String line;
while ((line = bufferedReader.readLine()) != null) {
    startingPoint += line.getBytes().length; // note: uses the platform default charset
}
This would give you the byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since readLine() strips them.
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8(RandomAccessFile in) throws IOException {
    String rv = null;
    // RandomAccessFile.readLine() maps each byte to the low 8 bits of a char, i.e. it
    // effectively decodes as ISO-8859-1, so re-encode with that charset to recover the
    // original bytes before decoding them as UTF-8.
    String lineBytes = in.readLine();
    if (null != lineBytes) {
        rv = new String(lineBytes.getBytes(StandardCharsets.ISO_8859_1),
                        StandardCharsets.UTF_8);
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML ximpleware implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.
Since it is a BufferedReader it is also capable of handling UTF-8.
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char not a byte. So you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
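A rough sketch of that loop for a UTF-8 file, treating '\n' as the EOL and omitting '\r' handling for brevity:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

static String readLineUtf8(RandomAccessFile file) throws IOException {
    ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
    int b;
    while ((b = file.read()) != -1 && b != '\n') {
        lineBytes.write(b); // collect raw bytes; the file pointer ends up just past the EOL
    }
    return new String(lineBytes.toByteArray(), StandardCharsets.UTF_8);
}

(In this simplified form an empty string is returned both for blank lines and at end of file; a real implementation would distinguish the two.)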
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't convert the byte array (taken from RandomAccessFile.readLine) to a correct string in cases where the file line contained non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself, one that collects the line's data into a byte array directly rather than into a String, and then constructs the desired String from that byte array.
The following code (in Kotlin) completely satisfied my needs.
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true          // \n or end of file
            13 -> {                       // \r: also consume an immediately following \n, if any
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}

Reading chars from a stream of ByteArrays where boundary alignment may be imperfect

I'm working with asynchronous IO on the JVM, wherein I'm occasionally handed a byte array from an incoming socket. Concatenated, these byte arrays form a stream that my overall goal is to split into strings on instances of a given character, be it newline, NUL, or something more esoteric.
I have no guarantee that the boundaries between consecutive byte arrays do not fall partway through a multi-byte character.
Reading through the documentation for java.nio.CharBuffer, I don't see any explicit semantics given as to how trailing partial multibyte characters are handled.
Given a series of ByteBuffers, what's the best way to get (complete) characters out of them, understanding that a character may span the gap between two sequential ByteBuffers?
Use a CharsetDecoder:
final Charset charset = ...
final CharsetDecoder decoder = charset.newDecoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);
I do have this problem in one of my projects, and here is how I deal with it.
Note line 258: if the result is a malformed input sequence then it may be an incomplete read; in that case, I set the last good offset to the last decoded byte, and start again from that offset.
If, on the next read, I fail to read again and the byte offset is the same, then this is a permanent failure (line 215).
Your case is a little different however since you cannot "backtrack"; you'd need to fill a new ByteBuffer with the rest of the previous buffer and the new one and start from there (allocate for oldBuf.remaining() + bufsize and .put() from oldBuf into the new buffer). In my case, my backend is a file, so I can .map() from wherever I want.
So, basically:
if you have an unmappable character, this is a permanent failure (your encoding just cannot handle your byte sequence);
if you have read the full byte sequence successfully, your CharBuffer will have buf.position() characters in it;
if you have a malformed input, it may mean that you have an incomplete byte sequence (for instance, using UTF-8, you have one byte out of a three byte sequence), but you need to confirm that with the next iteration.
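A minimal sketch of the carry-over approach described above (names and buffer handling are placeholders, and error handling is reduced to the three cases just listed):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

class Utf8ChunkDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    private ByteBuffer leftover = ByteBuffer.allocate(0); // undecoded tail of the previous chunk

    // Called once per incoming byte[] chunk; returns the complete characters it yields.
    String onChunk(byte[] chunk) throws CharacterCodingException {
        // Prepend whatever was left over from the previous chunk.
        ByteBuffer input = ByteBuffer.allocate(leftover.remaining() + chunk.length);
        input.put(leftover).put(chunk);
        input.flip();

        // A chunk can never decode to more chars than it has bytes, so this cannot overflow.
        CharBuffer chars = CharBuffer.allocate(input.remaining());
        CoderResult result = decoder.decode(input, chars, false); // false: more input may follow
        if (result.isError()) {
            result.throwException(); // unmappable, or genuinely malformed (not merely incomplete) input
        }
        leftover = input.slice();    // trailing partial multi-byte sequence, if any
        chars.flip();
        return chars.toString();
    }
}

At the real end of the stream you would make a final decode call with endOfInput set to true and then flush() the decoder, so that a trailing incomplete sequence is reported as an error rather than silently dropped.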
Feel free to salvage any code you deem necessary! It's free ;)
FINAL NOTE, since I believe this is important: String's .getBytes(*) methods and constructors from byte arrays have a default CodingErrorAction of REPLACE!

Storing characters as single bytes in java

I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores them in single bytes, and prints them out. How are two-byte characters fitting into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for normal ASCII characters the actual data fits in one byte. So as long as you can assume the input being read is ASCII, this will work fine, as the actual numeric value of each character fits in a single byte.
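A tiny sketch of the pattern being discussed, for input that really is ASCII/ISO-8859-1:

import java.io.IOException;

public class EchoBytes {
    public static void main(String[] args) throws IOException {
        int b;
        while ((b = System.in.read()) != -1) {
            // read() returns 0..255 (or -1 at EOF); the cast treats the byte as an ISO-8859-1 code point
            System.out.print((char) b);
        }
    }
}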

Is there a simple way to append a byte to a StringBuffer and specify the encoding?

Question
What is the simplest way to append a byte to a StringBuffer (i.e. cast a byte to a char) and specify the character encoding used (ASCII, UTF-8, etc)?
Context
I want to append a byte to a stringbuffer. Doing so requires casting the byte to a char:
myStringBuffer.append((char)nextByte);
However, the code above uses the default character encoding for my machine (which is MacRoman). Meanwhile, other components in the system/network require UTF-8. So I need to do something like:
try {
    myStringBuffer.append(new String(new byte[] { nextByte }, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    // handle error
}
Which, frankly, is pretty ugly.
Surely, there's a better way (other than breaking the same code into multiple lines)?
The simple answer is 'no'. What if the byte is the first byte of a multi-byte sequence? Nothing would maintain the state.
If you have all the bytes of a logical character in hand, you can do:
sb.append(new String(bytes, charset));
But if you have one byte of UTF-8, you can't do this at all with stock classes.
It would not be terribly difficult to build a juiced-up StringBuffer that uses java.nio.charset classes to implement byte appending, but it would not be one or two lines of code.
Comments indicate that there's some basic Unicode knowledge needed here.
In UTF-8, 'a' is one byte, 'á' is two bytes, '丧' is three bytes, and '𝌎' is four bytes. The job of CharsetDecoder is to convert these sequences into Unicode characters. Viewed as a sequential operation over bytes, this is obviously a stateful process.
If you create a CharsetDecoder for UTF-8, you can feed it one byte at a time (wrapped in a ByteBuffer) via its decode(ByteBuffer, CharBuffer, boolean) method. The UTF-16 characters will accumulate in the output CharBuffer.
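A rough sketch of that stateful, byte-at-a-time approach (the class and method names are made up for illustration):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class Utf8Appender {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final ByteBuffer in = ByteBuffer.allocate(8);  // room for one partial multi-byte sequence
    private final CharBuffer out = CharBuffer.allocate(4);
    private final StringBuilder sb = new StringBuilder();

    void appendByte(byte b) {
        in.put(b);
        in.flip();
        CoderResult result = decoder.decode(in, out, false); // false: more bytes may follow
        // a real implementation should check result.isError() here
        in.compact();   // keep any undecoded (partial) bytes for the next call
        out.flip();
        sb.append(out); // append whatever complete characters were produced
        out.clear();
    }

    @Override
    public String toString() {
        return sb.toString();
    }
}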
I think the error here is in dealing with bytes at all. You want to deal with strings of characters instead.
Just interpose a reader on the input and output stream to do the mapping between bytes and characters for you. Use the InputStreamReader(InputStream in, CharsetDecoder dec) form of the constructor for the input, though, so that you can detect input encoding errors via an exception. Now you have strings of characters instead of buffers of bytes. Put an OutputStreamWriter on the other end.
Now you no longer have to worry about bytes or encodings. It’s much simpler this way.
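A small sketch of that wiring (socketIn and socketOut are placeholders for the underlying streams):

import java.io.*;
import java.nio.charset.*;

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

BufferedReader reader = new BufferedReader(new InputStreamReader(socketIn, decoder));
Writer writer = new OutputStreamWriter(socketOut, StandardCharsets.UTF_8);

String line;
while ((line = reader.readLine()) != null) { // malformed input surfaces as an IOException here
    writer.write(line);
    writer.write('\n');
}
writer.flush();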
