Woodstox parser doesn't support certain Unicode characters - java

I am new to using Woodstox. I have to read every possible combination of Unicode characters and write them to an XML file. Woodstox fails while reading certain Unicode characters. Can someone tell me how I can either skip such a character when it is encountered, or otherwise write that Unicode character to the XML file?
The exception I get is:
Error on line 1 column 1404735 of 24364002-data-set-results.xml:
SXXP0003: Error reported by XML parser: Character reference "&#xfffe" is an invalid XML character.
Exception is : net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: Character reference "&#xfffe" is an invalid XML character.

I am not familiar with Woodstox either, but I can say that U+FFFE is indeed not a valid character (Unicode reserves it as a noncharacter, and XML forbids it), so it is probably more a problem with the input than with the parser. The byte sequence FFFE most commonly shows up at the start of UTF-16 text as the byte order mark: the BOM character is U+FEFF, and depending on whether the decoder reads its bytes back as FEFF or FFFE, it knows which byte order the encoder used. Not every decoder handles the BOM for you, though.
When used as such, it is always the first two bytes of the file.
So, what you need to check is:
Are you using the correct character encoding (usually either UTF-8 or UTF-16)?
If using UTF-16, does your file start with FFFE or FEFF?
Does Woodstox have a setting that enables automatic detection of the byte order?
Worst case, if your file starts with FFFE or FEFF, simply remove it from the file before you feed it to Woodstox. Make sure that you set the correct byte order in Woodstox, though.
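If you end up having to sanitize the text yourself, one option is to strip every code point that XML 1.0 forbids (U+FFFE among them) before handing the string to the writer. A minimal sketch in plain Java; the class and method names are mine, not a Woodstox API:

    public final class XmlCharFilter {

        /** True if the code point is allowed in XML 1.0. */
        private static boolean isValidXmlChar(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
        }

        /** Drops every code point that XML 1.0 forbids, including U+FFFE. */
        public static String stripInvalidXmlChars(String in) {
            StringBuilder out = new StringBuilder(in.length());
            in.codePoints()
              .filter(XmlCharFilter::isValidXmlChar)
              .forEach(out::appendCodePoint);
            return out.toString();
        }
    }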

Related

How to store EBCDIC (IBM-1047) encoding text in Java String without corrupting it?

My Java program is trying to read a text file (a mainframe VSAM file converted to a flat file). I believe this means the file is encoded in EBCDIC format.
I am using com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE); to open the file,
and String inputLine = inputFileReader.readLine() to read a line and store it in a Java String variable for processing. I have read that text becomes Unicode when stored in a String variable.
How can I ensure that the content is not corrupted when storing in the java string variable?
The charset decoder will map the bytes to their correct Unicode characters for the String, and the encoder will do the reverse on writing.
The only problem is that BufferedReader.readLine drops the line endings, including the EBCDIC end-of-line character NEL (\u0085, which is also a recognized Unicode newline). So on writing, write the NEL yourself, or set the system line separator property.
Nothing is easier than writing a unit test that feeds all 256 EBCDIC byte values through the conversion and back.
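Such a round-trip test could look roughly like this sketch (it assumes the Cp1047 charset is available in your Java runtime):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    public class EbcdicRoundTripTest {
        public static void main(String[] args) {
            Charset ebcdic = Charset.forName("Cp1047"); // IBM-1047

            // All 256 possible byte values.
            byte[] original = new byte[256];
            for (int i = 0; i < 256; i++) {
                original[i] = (byte) i;
            }

            // Decode to a Unicode String, then encode back to bytes.
            String decoded = new String(original, ebcdic);
            byte[] roundTripped = decoded.getBytes(ebcdic);

            // Cp1047 maps all 256 byte values, so this should print true.
            System.out.println("Round trip lossless: "
                    + Arrays.equals(original, roundTripped));
        }
    }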
If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally; Unicode contains all the characters of EBCDIC.
A character set specifies the mapping between a character (codepoint) and one or more bytes. A file is nothing more than a stream of bytes; if you apply the right character set, then the right characters end up in memory.
Say byte 1 maps to 'A' in character set X, while UTF-16 represents 'A' as the two bytes 0 and 65. Reading a file that contains byte 1 using character set X will then make the system read the character 'A', even if that system uses the bytes 0 and 65 to store the character in memory.
However, there is no way to know whether you used the right character set, unless you specifically know what the actual result should be.
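A tiny demonstration of that point, again assuming Cp1047 is available: the very same byte decodes to different characters depending on which charset you apply.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class SameByteTwoCharsets {
        public static void main(String[] args) {
            byte[] data = { (byte) 0xC1 };

            // Byte 0xC1 is 'A' in EBCDIC, but 'Á' in Latin-1.
            System.out.println(new String(data, Charset.forName("Cp1047")));   // A
            System.out.println(new String(data, StandardCharsets.ISO_8859_1)); // Á
        }
    }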

What happens if your input file contains some unsupported character?

I have this text file which might contain some unsupported characters in the Latin1 character set, which is the default character set of my JVM.
What would those characters be turned into when my Java program tries to read from the file?
Concretely, suppose I had a 2-byte-long character in the file: would it be read as a one-byte character (because each character in Latin1 is only 1 byte long)?
Thanks,
I can't use the InputStreamReader option, because the file has to be read with Latin1.
And
I have this text file which might contain some unsupported characters in the Latin1 character set ...
You have contradictory requirements here.
Either the file is LATIN-1 (and there are no "unsupported characters") or it is not LATIN-1. If it is not LATIN-1, you should be trying to find out what character set / encoding it really is, and use that one instead of LATIN-1 to read the file.
As other answers / comments have explained, you can either change the JVM's default character set, or specify a character set explicitly when you open the Reader.
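For example, a sketch of the second option, opening the Reader with an explicit character set (the file name here is made up); the first option, changing the default, is usually done by starting the JVM with the file.encoding system property set:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class ExplicitCharsetRead {
        public static void main(String[] args) throws IOException {
            // The charset is passed explicitly; the JVM default is then irrelevant.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(
                            new FileInputStream("input.txt"),  // hypothetical file
                            StandardCharsets.ISO_8859_1))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }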
I'm having trouble setting the default character set of my JVM.
Please explain what you are trying and what problems you are having.
(and was a bit afraid of messing it up!)
COWARD! :-)
FWIW - if you try to read a data stream in (say) LATIN-1 and the data stream is not actually in LATIN-1, then you can expect the following:
Characters that encode the same in LATIN-1 and the actual character set will be passed unharmed.
Characters that don't encode the same will either be replaced by a character that means "unknown character" (e.g. a question mark) or will be garbled. Which of the two happens depends on whether the byte or byte sequence at issue encodes a valid (but wrong) character, or no character at all.
The net result will be partially garbled text. The garbling may or may not be reversible, depending on exactly what the real character set and characters are. But it is best to avoid "going there" ... by using the RIGHT character set to decode in the first instance.
First of all, you can specify the character set to use when reading a file. See for example: java.io.InputStreamReader
Secondly: yes, if reading with a 1-byte character set, each byte will be mapped to one character.
Thirdly: test it, and you shall see beyond doubt what actually happens!
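Such a test takes a minute to write. Here is a sketch of the "garbled" outcome described above: encode a character that Latin-1 cannot represent, then decode the bytes as Latin-1.

    import java.nio.charset.StandardCharsets;

    public class WrongCharsetDemo {
        public static void main(String[] args) {
            // "\u20AC" is the euro sign; in UTF-8 it is three bytes: E2 82 AC.
            byte[] utf8Bytes = "\u20AC".getBytes(StandardCharsets.UTF_8);

            // Decoding those bytes as Latin-1 never fails, because every byte
            // value maps to some character, but the result is garbled:
            String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
            garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
            // prints: U+00E2 U+0082 U+00AC
            System.out.println();
        }
    }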
If you don't know the charset you'll have to guess it. This is tricky and error prone.
Here is a question regarding this issue:
How can I detect the encoding/codepage of a text file
Check out how you can fool notepad into guessing wrong.
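The one case you can detect cheaply is a byte order mark at the start of the file. A rough sketch that recognizes the three most common BOMs (for brevity it assumes the initial read returns all the bytes that are available):

    import java.io.IOException;
    import java.io.PushbackInputStream;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class BomSniffer {

        /**
         * Guesses the charset from a byte order mark, or returns null if none
         * is found. The stream must be constructed with pushback capacity of
         * at least 3, e.g. new PushbackInputStream(in, 3); any bytes that do
         * not belong to a BOM are pushed back.
         */
        static Charset fromBom(PushbackInputStream in) throws IOException {
            byte[] bom = new byte[3];
            int n = in.read(bom, 0, 3);
            if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB
                    && bom[2] == (byte) 0xBF) {
                return StandardCharsets.UTF_8;            // EF BB BF
            }
            if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
                in.unread(bom, 2, n - 2);                 // keep any extra byte
                return StandardCharsets.UTF_16BE;         // FE FF
            }
            if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
                in.unread(bom, 2, n - 2);
                return StandardCharsets.UTF_16LE;         // FF FE
            }
            if (n > 0) {
                in.unread(bom, 0, n);                     // no BOM: undo the read
            }
            return null;
        }
    }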

Can a file be encoded in multiple charsets in Java?

I'm working on a Java plugin which would allow people to write to and read from a file while specifying a charset encoding they wish to use. However, I am confused as to how I would combine multiple encodings in a single file. For example, suppose that the A characters come from one charset and the B characters from another: would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specific to Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset, since my tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs, for various reasons, so the scope of this question is purely the standard Java packages.
Thanks a lot!
N.S.
You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or use some special separator character or byte range to mark the start and end of each character group. That way you can get the bytes of a specific group and finally decode them using the desired character encoding.
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically all characters mankind is aware of.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
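One possible bookkeeping scheme, as a sketch: prefix each run of bytes with the name of its charset and its byte count. The segment format here is invented purely for illustration, not any standard:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;

    public class MixedEncodingFile {

        // Each segment is written as: charset name, byte count, raw bytes.
        static void writeSegment(DataOutputStream out, String text, Charset cs)
                throws IOException {
            byte[] bytes = text.getBytes(cs);
            out.writeUTF(cs.name());
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        static String readSegment(DataInputStream in) throws IOException {
            Charset cs = Charset.forName(in.readUTF());
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, cs);
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeSegment(out, "AAAAA", Charset.forName("ISO-8859-1"));
                writeSegment(out, "BBBBB", Charset.forName("UTF-16BE"));
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                System.out.println(readSegment(in) + readSegment(in)); // AAAAABBBBB
            }
        }
    }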
I was wondering about this as well, because my client just asked a similar question. Like BalusC mentioned, this is not a Java-specific problem.
After a few back-and-forths, I found that the real question might be about multiple encodings of information, rather than a multiple-encoding file.
For example, say we have an XML string whose content needs to be persisted with 8859-1. The default encoding for XML is UTF-8, and we don't necessarily need to encode the whole XML document as 8859-1: the XML node is just a vehicle for passing the information over to the other system, and only the node's value needs to be persisted with 8859-1. So do we need multiple encodings in one file in this case? Probably not. We can still encode the XML with UTF-8 and pass it over; once the client receives the XML, they read the information out of the UTF-8-encoded file and persist the value of the XML node as 8859-1.

Perform binary search on a file written in UTF format

Is there a way to perform binary search on a file stored in UTF format in sorted order? I am able to perform binary search on a plain text file using RandomAccessFile: first I find the length of the file, then I jump to its middle position using seek, and after jumping there I read the bytes. However, I am not finding this feasible for a file stored in UTF format, since jumping to an arbitrary position need not land on a character boundary. Also, with DataInputStream I am unable to jump to a particular position in the file. Is it possible to do binary search on such a file, and if yes, using which classes?
Yes, it is possible. If you jump into the middle of a file, you will first need to advance to the nearest record separator and then use the text starting after that separator.
Depending on the exact file format you have, a line feed, a TAB character or something similar could be used as the record separator.
Locating the record separator is easy if it is a character with a Unicode number below 32 (which NL, CR and TAB all fulfill). Then you don't need to care about the multibyte UTF-8 encoding while locating the separator, because in UTF-8 a byte below 128 never occurs inside a multi-byte sequence. If it's a wide-character Unicode format, it isn't much more difficult either.
DataInputStream is the wrong class for random access (streaming is more or less the opposite of random access). Have a look at RandomAccessFile instead.
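Here is a sketch of such a search over a file of sorted, '\n'-separated UTF-8 lines. It assumes the file's sort order agrees with String.compareTo (the two can disagree for code points outside the basic multilingual plane), and it relies on the fact that the byte '\n' never occurs inside a multi-byte UTF-8 sequence:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    public class SortedFileSearch {

        /** Reads bytes up to the next '\n' (or EOF) and decodes them as UTF-8. */
        private static String readUtf8Line(RandomAccessFile f) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int b;
            while ((b = f.read()) != -1 && b != '\n') {
                buf.write(b);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }

        /** Returns the line equal to key, or null. low always sits at a line start. */
        public static String search(RandomAccessFile f, String key) throws IOException {
            long low = 0;
            long high = f.length();
            while (high - low > 2) {
                long mid = (low + high) / 2;
                f.seek(mid);
                readUtf8Line(f);                 // skip the line we landed inside
                long next = f.getFilePointer();  // start of the first line after mid
                if (next >= high) {              // no line starts in (mid, high):
                    high = mid + 1;              // the match starts at or before mid
                    continue;
                }
                String line = readUtf8Line(f);
                int cmp = line.compareTo(key);
                if (cmp == 0) {
                    return line;
                } else if (cmp < 0) {
                    low = f.getFilePointer();    // key, if present, starts after this line
                } else {
                    high = mid + 1;              // key, if present, starts at or before mid
                }
            }
            // Finish the tiny remaining window with a linear scan from a line start.
            f.seek(low);
            while (f.getFilePointer() < high) {
                String line = readUtf8Line(f);
                int cmp = line.compareTo(key);
                if (cmp == 0) return line;
                if (cmp > 0) break;
            }
            return null;
        }
    }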

How to write extended ASCII characters (which have an ASCII code > 127) to an XML file using Java?

I read text from different sources, which can contain characters from different languages and extended characters like € ƒ „ … † ® ©. I am then supposed to write it to an XML file. I am using PrintWriter in Java to write whatever string I read to the XML file. For these extended characters, which have an ASCII code greater than 127, I get an illegal-character error in the XML file, so how can I encode them properly while writing the XML?
First, there's no such thing as an ASCII code above 127. ASCII only defines values up to 127. "Extended ASCII" is an ambiguous term, as it's used to describe many different encodings.
Now, as for XML: use whichever XML API you want to write the string, without worrying about the contents (so long as they are representable in XML; of the control characters in the range U+0000 to U+001F, only tab, line feed and carriage return are representable, unfortunately). Don't try to create the XML from scratch yourself; that's what XML APIs are for. Make sure that your XML document uses an encoding that will cope with the characters you need (UTF-8 is normally a good choice, and is often the default), make sure that your Java strings have the right Unicode data in them, and you should be fine.
EDIT: I hadn't actually spotted this bit before:
I am using PrintWriter in Java to write to an XML file
Don't. Please use an XML API. There are plenty around, and you'll have a lot less to worry about. I'd also not recommend using PrintWriter anyway for the most part - suppressing exceptions isn't really a good idea in most cases.
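For example, with the StAX API that ships in the JDK (a sketch; the output file name is made up), the writer takes care of both the character encoding and the escaping of markup characters:

    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class WriteXmlDemo {
        public static void main(String[] args) throws Exception {
            try (FileOutputStream out = new FileOutputStream("out.xml")) {
                XMLStreamWriter w = XMLOutputFactory.newFactory()
                        .createXMLStreamWriter(out, "UTF-8");
                w.writeStartDocument("UTF-8", "1.0");
                w.writeStartElement("text");
                // € ƒ „ … † ® © -- written as ordinary characters; the writer
                // encodes them as UTF-8 and escapes any markup characters.
                w.writeCharacters("\u20AC \u0192 \u201E \u2026 \u2020 \u00AE \u00A9");
                w.writeEndElement();
                w.writeEndDocument();
                w.close();
            }
        }
    }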
Use the &#value; syntax. Space, for example, would be &#32;, and the euro sign would be &#8364;.
