XMLStreamReader: get character offset : XML from file - java

The Location obtained from XMLStreamReader.getLocation() has a method called getCharacterOffset().
Unfortunately the Javadoc indicates this is an ambiguously named method: it can also return a byte offset (and this appears to be true in practice); unhelpfully, this seems to occur when reading from files (for instance):
The Javadoc states:
Return the byte or character offset into the input source this
location is pointing to. If the input source is a file or a byte
stream then this is the byte offset into that stream, but if the input
source is a character media then the offset is the character offset. (emphasis added)
I really need the character offset; and I'm pretty sure I'm being given the byte offset instead.
The (UTF-8 encoded) XML is contained in a (partially corrupt, 1 GB) file. [Hence the need to use a lower-level API which doesn't complain about the lack of well-formedness until it really has no choice but to.]
Question
What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?
Extra blah blah:
[ I'm pretty sure this is what is going on - when I strip the file apart (using certain known high-level tags) I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy (using 'head'/'tail' in PowerShell, for instance), that tool appears to correctly recognize [or assume] UTF-8 and does a good conversion to UTF-16 as far as I can see. ]

The offset is in units of the underlying Source.
The XMLStreamReader only knows how many units it has read from the Source so the offset is calculated in those units.
A Stream works in units of byte and therefore you end up with a byte offset.
A Reader works in units of char and therefore you end up with an offset in char.
The docs for StreamSource are more explicit about what the term "character media" means.
Maybe try something like
final Source source = new StreamSource(
        new InputStreamReader(new FileInputStream("my.xml"), StandardCharsets.UTF_8));
final XMLStreamReader xmlReader =
        XMLInputFactory.newFactory().createXMLStreamReader(source);

XMLInputFactory.createXMLStreamReader(java.io.InputStream) is a byte stream
XMLInputFactory.createXMLStreamReader(java.io.Reader) is a character stream
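For example, a minimal sketch of the character-stream variant (the file name is an assumption; assumes the usual java.io, java.nio.charset and javax.xml.stream imports):

final Reader reader = new InputStreamReader(
        new FileInputStream("my.xml"), StandardCharsets.UTF_8);
final XMLStreamReader xmlReader =
        XMLInputFactory.newFactory().createXMLStreamReader(reader);
// reader is a character stream, so getLocation().getCharacterOffset()
// should now report an offset in chars rather than bytes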

Related

Does a byte stream encode bytes to characters, or does it only operate on bytes?

We have byte streams and character streams. If you read examples on the internet, you will find that a byte stream operates on bytes and nothing more.
I once read that both kinds of stream decode bytes to characters depending on an encoding - say UTF-8 for a byte stream and UTF-16 for a character stream. If both of them decode bytes to characters, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
And why would we need an encoding in a byte stream?
Some popular websites did not help me.
I once read that both kinds of stream decode bytes to characters depending on an encoding - say UTF-8 for a byte stream and UTF-16 for a character stream. If both of them decode bytes to characters, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
Everything in a typical modern computer has to be represented in bytes: a file holds a sequence of bytes, a network connection lets you send a sequence of bytes, a pointer identifies the location of a byte in memory, and so on. So a byte stream — an InputStream or OutputStream or the like — provides basic processing to let you read or write a sequence of bytes, no matter what kind of data is being represented by those bytes. The data might be text encoded as UTF-8 or UTF-16 or some other encoding, or it might be an image in a GIF or PNG or JPEG or other format, or it might be audio data or video data or a PDF or a Word document or . . . well, you get the idea.
A character stream — a Reader or Writer — provides a higher level of processing specifically for text data, so that you don't need to worry about the specific bytes being used to represent the characters, you just need to worry about the characters themselves. You just need to tell the character stream which character encoding to use (or let it use an appropriate default), and it can handle the rest from there.
But there's one big complication: Java didn't introduce this distinction until version 1.1, and because Java aims for a very high degree of backward-compatibility, there are some classes that survive from version 1.0 that kind of straddle the line. In particular, there is a PrintStream class that extends OutputStream and adds special 'print' methods that take more convenient types, such as String, and handle the character encoding internally. That PrintStream class has been there since version 1.0, and is still in wide use, especially because System.out and System.err are instances of it. (In theory, we should be using PrintWriter instead.)
And why would we need an encoding in a byte stream?
We need a character encoding in whatever layer is converting between character sequences and byte sequences. Normally that layer is separate from the byte stream, but as I mentioned above, there are some holdovers from version 1.0 that handle the conversion themselves, which means they need to know which encoding to use.
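For instance (a sketch; the file name is an assumption), the charset belongs to the converting layer, not to the byte stream underneath it:

// The FileOutputStream only ever sees bytes; the OutputStreamWriter is the
// layer that turns chars into bytes, so that is where the charset goes.
try (Writer w = new OutputStreamWriter(
        new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
    w.write("héllo"); // chars in; UTF-8 bytes out to the underlying stream
}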
It is fundamentally quite a straightforward system, but between the background knowledge it assumes and the way the parts interact, it can be confusing.
Let's put down some fundamental truths/axioms:
an InputStream is fundamentally about reading bytes from somewhere.
an OutputStream is fundamentally about writing bytes to somewhere.
Reader/Writer are the equivalents of those two for chars/String/text.
In the Java world, as long as you handle only String (or its related types like StringBuilder, ...) you don't need to care about encoding. It will always look like UTF-16, but you might as well pretend no encoding happens.
if you only ever handle byte[] (and related types like ByteBuffer) then you also don't need to care about encoding.
the encoding only ever comes into play when you want to cross over from the byte[] world to the String world (or the other way around).
That is why some Writer classes, such as OutputStreamWriter, take a Charset in their constructor. And that's precisely because they sit on one of those borders mentioned in the last point above: an OutputStreamWriter handles both String and (indirectly) byte[], because it is a Writer that writes to an OutputStream, and for that to work it must convert each String written to it into a byte[] that it can forward to the OutputStream.
Other Writers (such as StringWriter) don't transfer data between those two worlds: a StringWriter takes in String and produces String, so no conversion is necessary.
On the other side a ByteArrayInputStream is an InputStream that reads from a byte[], so again: both the input and the output live in "the same world", so no conversion is necessary and thus no Charset parameter exists.
tl;dr the "purity" of InputStream/OutputStream/Reader/Writer exists as long as you look only at those interfaces. When you look at specific implementations some of those will need to convert from the text world to the binary world (or vice versa) and those implementations will need to handle both worlds.
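A compact illustration of those borders (a sketch, not from the original answer):

byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);      // String -> byte[]: charset required
String text  = new String(bytes, StandardCharsets.UTF_8);     // byte[] -> String: charset required
InputStream in = new ByteArrayInputStream(bytes);             // byte[] -> bytes: no charset involved
Reader r = new InputStreamReader(in, StandardCharsets.UTF_8); // bytes -> chars: charset required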

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would:
handle multiple character encodings, including UTF-8, in which different characters may be represented by different numbers of bytes;
rely mostly on a trusted, well-supported library - most ideal would be the standard Java library, second best an Apache or Google library;
be scalable - reading the entire file into memory is not a solution, and returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader over a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to approximate the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine(). (Note that FileReader does not expose its channel, which is why the FileInputStream must be kept separately.)
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance for positional precision.
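A sketch of that arrangement (the file name is an assumption; note that InputStreamReader may still read ahead internally, so the position is only approximate):

FileInputStream fis = new FileInputStream("data.txt");
BufferedReader br = new BufferedReader(
        new InputStreamReader(fis, StandardCharsets.UTF_8), 1); // buffer of size 1
String line = br.readLine();
long nextLinePos = fis.getChannel().position(); // approximate offset of the next line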
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or the saved position if this file was previously processed
String line;
while ((line = bufferedReader.readLine()) != null) {
    // specify the file's charset explicitly; the platform default may differ
    startingPoint += line.getBytes(StandardCharsets.UTF_8).length;
}
This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped (one byte for \n, two for \r\n).
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
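Those two properties are easy to verify (a quick sketch):

// Every byte of a multi-byte UTF-8 character has its high bit set,
// so none of them can collide with LF (0x0A) or CR (0x0D).
for (byte b : "€".getBytes(StandardCharsets.UTF_8)) { // 0xE2 0x82 0xAC
    System.out.printf("%02X high bit set: %b%n", b, (b & 0x80) != 0); // all true
}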
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    String lineBytes = in.readLine(); // each char holds one raw byte (high bits zero)
    if ( null != lineBytes ) {
        // ISO-8859-1 maps chars 0-255 back to the original bytes exactly;
        // the platform default charset would corrupt bytes >= 0x80
        rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
                         StandardCharsets.UTF_8 );
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML implementation from XimpleWare, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
UTF8
UTF_16BE
UTF_16LE
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can get the current line number as you read and later resume at a given line; note, though, that setLineNumber() only resets the counter, it does not reposition the stream, so resuming still means re-reading the preceding lines.
Since it is a BufferedReader over an arbitrary Reader, it can also handle UTF-8 (the encoding is determined by the underlying Reader).
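A sketch of how that might look (the file name is an assumption; reaching line N still means reading the N-1 lines before it):

try (LineNumberReader lnr = new LineNumberReader(new InputStreamReader(
        new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = lnr.readLine()) != null) {
        if (lnr.getLineNumber() > 41) { // resume from line 42, say
            // process(line) ...
        }
    }
}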
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char, not a byte, so you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
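So a position can be saved and restored like this (a sketch; the file name is an assumption):

RandomAccessFile raf = new RandomAccessFile("data.txt", "r");
long mark = raf.getFilePointer(); // byte offset of the next line
String line = raf.readLine();
raf.seek(mark);                   // jump back; the next read starts at the saved offset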
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the result of RandomAccessFile.readLine to a correct string when the line contains non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself that collects the line's data into a byte array directly, rather than a String, and then constructs the desired String from that byte array.
The following code (in Kotlin) completely satisfied my needs.
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n
            13 -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.code) {
                    seek(cur) // lone \r: step back so the next byte is not lost
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}

Reading chars from a stream of ByteArrays where boundary alignment may be imperfect

I'm working with asynchronous IO on the JVM, wherein I'm occasionally handed a byte array from an incoming socket. Concatenated, these byte arrays form a stream which my overall goal is to split into strings by instance of a given character, be it newline, NUL, or something more esoteric.
I do not have any guarantee that the boundaries of these consecutive byte arrays are not part of the way through a multi-byte character.
Reading through the documentation for java.nio.CharBuffer, I don't see any explicit semantics given as to how trailing partial multibyte characters are handled.
Given a series of ByteBuffers, what's the best way to get (complete) characters out of them, understanding that a character may span the gap between two sequential ByteBuffers?
Use a CharsetDecoder:
final Charset charset = ...
final CharsetDecoder decoder = charset.newDecoder()
    .onUnmappableCharacter(CodingErrorAction.REPORT)
    .onMalformedInput(CodingErrorAction.REPORT);
I do have this problem in one of my projects, and here is how I deal with it.
Note line 258: if the result is a malformed input sequence then it may be an incomplete read; in that case, I set the last good offset to the last decoded byte, and start again from that offset.
If, on the next read, I fail to read again and the byte offset is the same, then this is a permanent failure (line 215).
Your case is a little different however since you cannot "backtrack"; you'd need to fill a new ByteBuffer with the rest of the previous buffer and the new one and start from there (allocate for oldBuf.remaining() + bufsize and .put() from oldBuf into the new buffer). In my case, my backend is a file, so I can .map() from wherever I want.
So, basically (a sketch follows the list below):
if you have an unmappable character, this is a permanent failure (your encoding just cannot handle your byte sequence);
if you have read the full byte sequence successfully, your CharBuffer will have buf.position() characters in it;
if you have a malformed input, it may mean that you have an incomplete byte sequence (for instance, using UTF-8, you have one byte out of a three byte sequence), but you need to confirm that with the next iteration.
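Here is a minimal sketch of the buffer-carrying variant described above (class and method names are illustrative, not from the linked project; assumes java.nio.* and java.nio.charset.* imports):

final class ChunkedDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    private ByteBuffer pending = ByteBuffer.allocate(0);

    String decode(ByteBuffer chunk, boolean endOfInput) throws CharacterCodingException {
        // Prepend any bytes the decoder could not consume last time.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.remaining());
        in.put(pending).put(chunk).flip();
        CharBuffer out = CharBuffer.allocate(in.remaining() + 1);
        CoderResult result = decoder.decode(in, out, endOfInput);
        if (result.isError()) {
            result.throwException(); // permanent failure: bytes invalid for this charset
        }
        // With endOfInput == false, an incomplete trailing multi-byte sequence
        // is left unconsumed (UNDERFLOW), so carry it over to the next call.
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in).flip();
        out.flip();
        return out.toString();
    }
}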
Feel free to salvage any code you deem necessary! It's free ;)
FINAL NOTE, since I believe this is important: String's .getBytes(*) methods and constructors from byte arrays have a default CodingErrorAction of REPLACE!

Java regular expression on JPG carving

I'm having a few problems using regular expressions in Java. I'm attempting to search through an ISO file, and carve out any JPG images, if there are any in there.
At the moment, I'm having success with locating EXIF information within the JPG, using the following regular expression:
Pattern imageRegex = Pattern.compile("\\x45\\x78\\x69\\x66"); //Exif regex
This works fine and I can then file carve out the EXIF information.
However, if I use this regex:
Pattern imageRegex = Pattern.compile("\\xff\\xd8\\xff"); //JPG header regex
Java fails to find any matches. I can confirm that there are JPGs present within the ISO file.
I'm reading in 200 bytes of the file at a time into a byte array and then converting that to a string to be regex'd.
Can anyone advise why this is happening, as it's rather confusing?
Or can anyone advise a better way of approaching the issue of file carving JPGs using regular expressions in Java?
Any advice would be greatly appreciated.
I'm reading in 200 bytes of the file at a time into a byte array and then converting that to a string to be regex'd.
Maybe all the JPEG headers are split across the N*200 borders.
Anyway, this is a rather unconventional (and inefficient) way of searching binary data. Why don't you just go through the input stream until you find the header?
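For example, a sketch of scanning the raw bytes for the JPEG SOI marker without converting anything to a String (the file name is an assumption):

byte[] marker = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF };
try (InputStream in = new BufferedInputStream(new FileInputStream("disk.iso"))) {
    long pos = 0;    // offset of the byte we are about to examine
    int matched = 0; // how many marker bytes have matched so far
    int b;
    while ((b = != -1) {
        if ((byte) b == marker[matched]) {
            if (++matched == marker.length) {
                System.out.println("JPEG header at offset " + (pos - marker.length + 1));
                matched = 1; // the trailing 0xFF could begin the next marker
            }
        } else {
            matched = ((byte) b == marker[0]) ? 1 : 0; // restart, reusing a leading 0xFF
        }
        pos++;
    }
}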
If you're reading in a byte array and converting it to a string, it's possible that string encoding issues are biting you in the rear. It so happens that the EXIF pattern you're looking for is all ASCII-compatible:
0x45 0x78 0x69 0x66
  E    x    i    f
but the JPEG header isn't:
0xff 0xd8 0xff
You'd do well to follow Jakub's advice and skip the regular expressions.
Using regex to match binary sequences is rarely appropriate; I wonder if you are well aware of the conceptual differences between binary data and strings in Java (as opposed to, say, C).
A JPEG file is binary data (a sequence of bytes); to use a regex Pattern on it you must have it in Java as a String (a sequence of characters). They are fundamentally different entities, and to convert from one to the other some charset encoding must be specified. Further, when you write the literal \x45 inside a pattern or a literal string, you are not saying (as you seem to believe) "the byte with binary value 0x45" (this would not make sense, because we are not dealing with bytes) but "the character at code point 0x45 in Unicode".
It's true that in several common charset encodings (in particular in UTF-8, and in ISO-8859-1 and its variants) a byte in the "ASCII range" (less than 127) is converted to a code point with that same value. But for other encodings (such as UTF-16), or for other byte values (in the 128-255 range), that's not necessarily true. In particular, it's not true for UTF-8; it is true for ISO-8859-1, but you should not rely on that coincidence.
In your scenario, I'd say that if you specify ISO-8859-1 encoding you will probably get what you expect. But it would still smell bad.
Exercise: try to predict/understand what this code prints:
public static void main(String[] args) throws Exception {
    byte[] b = { 0x30, (byte) 0xb2 };
    String x = new String(b, "ISO-8859-1");
    System.out.println(x.matches(".*\\x30.*"));
    System.out.println(x.matches(".*\\xb2.*"));
    String x2 = new String(b, "UTF-8");
    System.out.println(x2.matches(".*\\x30.*"));
    System.out.println(x2.matches(".*\\xb2.*"));
}
Answer: true true true false
(The 0xb2 byte is not valid UTF-8 on its own, so the UTF-8 decoding replaces it with U+FFFD, and \xb2 no longer matches.)

Why character streams?

I understand that Java character streams wrap byte streams such that the underlying byte stream is interpreted as per the system default or an otherwise specifically defined character set.
My systems default char-set is UTF-8.
If I use a FileReader to read in a text file, everything looks normal as the default char-set is used to interpret the bytes from the underlying InputStreamReader. If I explicitly define an InputStreamReader to read the UTF-8 encoded text file in as UTF-16, everything obviously looks strange. Using a byte stream like FileInputStream and redirecting its output to System.out, everything looks fine.
So, my questions are;
Why is it useful to use a character stream?
Why would I use a character stream instead of directly using a byte stream?
When is it useful to define a specific char-set?
Code that deals with strings should only "think" in terms of text - for example, reading an input source line by line, you don't want to care about the nature of that source.
However, storage is usually byte-oriented - so you need to create a conversion between the byte-oriented view of a source (encapsulated by InputStream) and the character-oriented view of a source (encapsulated by Reader).
So a method which (say) counts the lines of text in an input source should take a Reader parameter. If you want to count the lines of text in two files, one of which is encoded in UTF-8 and one of which is encoded in UTF-16, you'd create an InputStreamReader around a FileInputStream for each file, specifying the appropriate encoding each time.
(Personally I would avoid FileReader completely - the fact that it doesn't let you specify an encoding makes it useless IMO.)
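A sketch of the line-counting idea (the method name and file names are illustrative):

static long countLines(Reader source) throws IOException {
    try (BufferedReader br = new BufferedReader(source)) {
        long lines = 0;
        while (br.readLine() != null) lines++;
        return lines;
    }
}

// The caller decides the encoding at the byte/char border:
long a = countLines(new InputStreamReader(new FileInputStream("utf8.txt"), StandardCharsets.UTF_8));
long b = countLines(new InputStreamReader(new FileInputStream("utf16.txt"), StandardCharsets.UTF_16));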
An InputStream reads bytes, while a Reader reads characters. Because of the way bytes map to characters, you need to specify the character set (or encoding) when you create an InputStreamReader, the default being the platform character set.
When you are reading/writing text which could contain characters greater than 127, use a char stream. When you are reading/writing binary data, use a byte stream.
You can read text as binary if you wish, but unless you make a lot of assumptions it rarely gains you much.
