Convert XML document from Latin1 to UTF8 using Java

I am trying to create an XML document (an RSS feed) and have worked out all the kinks in it except for one character-encoding issue. The problem is that I declare a UTF-8 encoding, like so: <?xml version="1.0" encoding="UTF-8"?>, but the document itself is not actually encoded as UTF-8.
I am using the org.apache.ecs.xml package to create all the tags. I then use doc.output(stream) to write the content. This method does not seem to write the output using UTF-8 and I don't know how to make that happen. Until I do, some symbols (the British pound sign is what I first noticed) aren't rendered properly in most readers.
--Updated with more information--
I ended up using a bad solution (as explained in the comments) to fix this problem. The correct answer seems to be: don't use the org.apache.ecs.xml library. Thank you all for the help. Stack Overflow wins again.

The simplest workaround is probably going to be changing your code as follows:
XMLDocument doc = new XMLDocument(1.0, false, Charset.defaultCharset().toString());
I'm guessing they're just using the default encoding to write characters to the stream. So pass the default encoding to the prologue and you should be fine.
I'll agree with other posters that this is probably the least of your worries. Looking at the source repository for ECS, it doesn't appear to have been updated for four years (the "ECS2" repository likewise).
And some self-promotion: if you're looking to build XML documents using a simple interface, the Practical XML library has a builder. It uses the standard JDK serialization mechanism for output.

Any chance you can write to a Writer rather than an OutputStream... that way you could specify the encoding.
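For instance (a sketch of that idea; whether the ECS output() method actually accepts a Writer is an assumption I have not checked):
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Wrap the target stream in a Writer with an explicit charset, then hand the
// Writer (rather than the raw stream) to whatever does the serialization.
static Writer utf8WriterFor(OutputStream stream) {
    return new OutputStreamWriter(stream, StandardCharsets.UTF_8);
}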

Here is a solution my co-worker came up with that I THINK is the correct way to do it but what do I know. Instead of using doc.output(stream) we used:
try {
    IOUtils.write(doc.toString(), stream, "UTF-8");
} catch (IOException e) {
    throw new RuntimeException(e);
}
To be honest I don't completely understand the problem yet, which is why I am having problems in the first place. It seems that #subtenante's solution went through and converted any character that couldn't be represented as ASCII into its Unicode entity. This solution writes to the stream using the UTF-8 encoding, like I originally wanted doc.output to. I don't know the exact difference, just that both solved my problems. Any further comments to help me understand the problem would be appreciated.

I'm not familiar with this package but from the source on the web I suspect it may be broken:
http://kickjava.com/src/org/apache/ecs/xml/XMLDocument.java.htm
contains stuff like
for (int i=0; i<prolog.size(); i++) {
    ConcreteElement e = (ConcreteElement) prolog.elementAt(i);
    e.output(out);
    // XXX really this should use line separator!
    // XXX should also probably check for pretty print
    // XXX also probably have difficulties with encoding
which suggests problems.
We use XOM (http://www.xom.nu), which lets you specify the output encoding on its Serializer, so I would suggest changing packages...
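Roughly like this (a sketch only; here the encoding is supplied when the Serializer is constructed):
import java.io.OutputStream;
import nu.xom.Document;
import nu.xom.Serializer;

// Serialize a XOM document as UTF-8 regardless of the platform default encoding.
static void writeUtf8(Document doc, OutputStream out) throws Exception {
    Serializer serializer = new Serializer(out, "UTF-8");
    serializer.write(doc);
}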

Here is a function I wrote to convert all non-ASCII characters to their corresponding entity. It might help you sanitize some PCDATA content before output.
/**
 * Creates XML entities for non-ASCII characters in the given String.
 */
public static String xmlEntitify(String in) {
    StringBuffer b = new StringBuffer();
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c < 128) {
            b.append(c);
        } else if (c == '\ufeff') {
            // BOM character, just remove it
        } else {
            String cstr = Integer.toHexString(c).toUpperCase();
            while (cstr.length() < 4) {
                cstr = "0" + cstr;
            }
            b.append("&#x");
            b.append(cstr);
            b.append(";");
        }
    }
    return b.toString();
}
Read your input stream into a String, say content, and write xmlEntitify(content) to the output stream.
The output is guaranteed to contain only ASCII characters, so no more encoding problems.
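A usage sketch, assuming you already hold the raw content as a String (the stream handling and the ISO-8859-1 input charset here are placeholders; adjust to however you actually read the data):
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

// Hypothetical wiring: escape everything non-ASCII, then write plain ASCII bytes.
static void copyEntitified(InputStream in, OutputStream out) throws Exception {
    String content = IOUtils.toString(in, StandardCharsets.ISO_8859_1);
    IOUtils.write(xmlEntitify(content), out, StandardCharsets.US_ASCII);
}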
UPDATE
Given the comments, I'll be even bolder: if you are not sanitizing your data, you are asking for trouble. I guess you are at least already replacing the < and & characters in your PCDATA. If not, you definitely should. I have another version of the above method which, instead of the first if, has:
if (c < 128 && c != '&' && c != '<' && c != '>' && c != '"') {
    b.append(c);
}
so that these characters are also converted to their corresponding entities.
This converts all of my PCDATA to Unicode-friendly, ASCII-only strings. I have had no more encoding problems since I started using this technique. I never output XML PCDATA that has not been passed through this method: this is not sweeping the problem under the carpet, it is getting rid of it by being as generic as can be.

Related

Encoding difficulties

I'm having some encoding problems with code I'm working on. An encrypted string is received and decoded with ISO-8859-1. This string is then put into a DB which has UTF-8 encoding. When this string is retrieved it's still ISO-8859-1, and there are no problems. The issue is that I also need to be able to retrieve this string as UTF-8, but I haven't been successful at this.
I've tried to convert the string from ISO to UTF-8 when retrieved from the DB using this method:
private String convertIsoToUtf8(String isoLatin) {
    try {
        return new String(isoLatin.getBytes("ISO-8859-1"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
        return isoLatin;
    }
}
Unfortunately, the special characters are just displayed as question-marks in this case.
Original string: Test æøå
Example output after retrieving from the DB and converting to UTF-8: Test ???
Update: After reading the link provided in the comment, I managed to get it right. Since the DB is already UTF-8 encoded, all I needed to do was this:
return new String(isoLatin.getBytes("UTF-8"));
When you already have a String object it is usually too late to correct any encoding issues, since some information may already have been lost; think of characters that can't be mapped one-to-one onto Java's internal UTF-16 representation.
The correct place to handle character encoding is the moment you get your Strings: when reading input from a file (set the correct encoding on your InputStreamReader), when converting the byte[] you got from decryption, when reading from the database (this should be handled by your JDBC driver), etc.
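For example, a minimal sketch of reading and writing with explicit charsets instead of the platform default (the file names here are placeholders):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Decode the input as ISO-8859-1 and re-encode it as UTF-8,
        // naming both charsets explicitly rather than relying on the default.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream("latin1.txt"), StandardCharsets.ISO_8859_1));
             Writer out = new OutputStreamWriter(
                 new FileOutputStream("utf8.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
    }
}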
Also take care to correctly handle the encoding when doing the reverse. It might seem to work OK most of the time when you use the default encoding, but sooner or later you will run into issues that are difficult or impossible to resolve (as you are now).
P.S.: also keep in mind what tool you are using to display your output: some consoles won't display UTF-16 or UTF-8, so check the encoding settings of the editor you use to view your files, etc. Sometimes your output is correct and just can't be displayed correctly.

JAX-WS: Illegal character ((CTRL-CHAR, code 30)) [duplicate]

I am getting the following exception from web services:
com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 15))
I know the reason behind this: I am getting control characters in the data I want to return, and CTRL-CHARs are not allowed in XML.
I searched for a solution, and in many places I found code to remove the CTRL-CHARs.
The concern is: will I end up losing data if I remove control characters from the data?
I want a clean solution, maybe encoding, instead of removing the control chars.
I would do what OrangeDog suggests. But if you want to solve it in your code, try:
replaceAll("[\\x00-\\x09\\x11\\x12\\x14-\\x1F\\x7F]", "")
\\x12 is the char.
Thanks guys for your input. I am sharing my solution; it might be helpful for others.
The requirement was not to wipe out the CONTROL CHAR: it should remain as-is in the DB, and when the WS sends it across the network the client should be able to get the CONTROL CHAR back. So I implemented it as follows:
Encode strings using URLEncoder in the web service code.
At the client side, decode them using URLDecoder.
Sample code and output below.
Sample code:
System.out.println("NewSfn");
System.out.println(URLEncoder.encode("NewSfn", "UTF-8"));
System.out.println(URLDecoder.decode("NewSfn", "UTF-8"));
Output:
NewSfn
New%0FSfn
NewSfn
So the client will receive the CONTROL CHARs.
EDIT: Stack Exchange is not showing the CONTROL CHAR in the output above; "NewSfn" there is actually New(CONTROL CHAR)Sfn.
This error is being thrown by the Woodstox XML parser. The source code from the InputBootstrapper class looks like this:
protected void reportUnexpectedChar(int i, String msg)
    throws WstxException
{
    char c = (char) i;
    String excMsg;

    // WTF? JDK thinks null char is just fine as?!
    if (Character.isISOControl(c)) {
        excMsg = "Unexpected character (CTRL-CHAR, code "+i+")"+msg;
    } else {
        excMsg = "Unexpected character '"+c+"' (code "+i+")"+msg;
    }
    Location loc = getLocation();
    throw new WstxUnexpectedCharException(excMsg, loc, c);
}
Amusing comment aside, Woodstox is performing some additional validation on top of the JDK parser, and is rejecting ASCII character 15 as invalid.
As to why that character is there, we can't tell you; it's in your data. Similarly, we can't tell you whether removing that character will break anything, since again, it's your data. You can only establish that for yourself.
If you have control characters in your text data then you need to solve that problem at its source.
The most likely causes are incorrect communication encodings (usually between database and app) or not sanitising user input.
I found the same problem when I was passing null values for some of the parameters. I passed empty or placeholder values instead and this error went away.
I'm a bit confused by #ssedano's answer. It seems to me he's trying to match all control chars from the ASCII table, 0x00 to 0x1F, except for 0x0A (new line) and 0x0D (carriage return), plus 0x7F (DEL), so wouldn't the regex be
replaceAll("[\\x00-\\x09\\x0B\\x0C\\x0E-\\x1F\\x7F]", "")

Why does ICU4J return the byte-order-mark when reading an array of bytes into a String?

I read a file into an array of bytes. Then I use ICU4J to detect the file's encoding (I don't know what the encoding might be, these files can have multiple different encodings) and return a Unicode String.
Like so:
byte[] fileContent = // read file into byte array
CharsetDetector cd = new CharsetDetector();
cd.setText(fileContent);
CharsetMatch cm = cd.detect();
String result = cm.getString();
When my file is encoded using UTF-16LE the first character in "result" is the byte-order-mark. I'm not interested in this and because it is specific to the encoding scheme and not really part of the file content I would expect it to be gone.
Yet ICU4J returns it. Why is this happening and is there some way of getting around this problem? The only solution I see is manually checking if the first character in the returned String is the byte order mark and stripping it manually. Is there some cleaner/better way?
I just consulted the docs ... icu-project.org/apiref/icu4j/com/ibm/icu/text/…. They do in fact say that it returns the corresponding Java String, but they do not say anything about removing the BOM. So I'd expect it to be there if it was there in the first place.
To me it is natural that it is also retrieved. I'd expect them to explicitly mention it in the docs if they were trimming out the BOM.
I think the answer is here unicode.org/faq/utf_bom.html#bom1 - "Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol."
I think that's pretty much it. If a BOM is mandatory, you'd have to add it again. Filtering it out if the BOM is prohibited is considered the easy part I guess :)
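A minimal sketch of the manual check the question mentions, applied right after getString() (assuming the BOM, if present, is the single character U+FEFF at position 0):
String result = cm.getString();
// Strip a leading byte-order mark left in place by the detector
if (!result.isEmpty() && result.charAt(0) == '\uFEFF') {
    result = result.substring(1);
}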

java.net.URLConnection.guessContentTypeFromStream and text/plain

All,
I am trying to identify plain text files with Mac line endings and, inside an InputStream, silently convert them to Windows or Linux line endings (the important part is the LF character, really). Specifically, I'm working with several APIs that take InputStreams and are hard-locked to looking for \n as newlines.
Sometimes, I get binary files. Obviously, a file that isn't text-like shouldn't have this substitution done, because the value that happens to correspond to \r obviously can't silently be followed by a \n without mangling things badly.
I am attempting to use java.net.URLConnection.guessContentTypeFromStream and only performing endline conversions if the type is text/plain. Unfortunately, "text/plain" doesn't seem to be in its gamut of return values; all I get is null for my flat text files, and it's possibly not safe to assume all unidentifiable files can be modified.
What better library (preferably in a public Maven repository and open-source) can I use to do this? Alternatively, how can I make guessContentTypeFromStream work for me? I know I'm describing an inherently hazardous application and no solution can be perfect, but should I just treat "null" as likely to be "text/plain" and I simply need to write more code myself to look for evidence that it isn't?
It seems to me that what you're asking is how to determine whether a file is textual or not. Given that, there is a solution here that seems right:
Granted, he is talking about Unix, bash, and Perl, but the concept is the same:
Unless you inspect every byte of the file, you are not going to get this 100%. And there is a big performance hit with inspecting every byte. But after some experiments, I settled on an algorithm that works for me. I examine the first line and declare the file to be binary if I encounter even one non-text byte. It seems a little slack, I know, but I seem to get away with it.
EDIT #1:
Expanding on this type of solution, it seems like a reasonable approach would be to ensure the file contains no non-ASCII characters (unless you're dealing with files that are non-English... that's another problem). This could be done by checking whether the file's contents, read as a String, match this:
// -- uses commons-io
String fileAsString = FileUtils.readFileToString(new File("file-name-here"));
boolean isTextualFile = fileAsString.matches(".*\\p{ASCII}+.*");
EDIT #2
You may want to try this as your regex, or something close to it. Though, I'll admit it could likely use some refining.
".*(?:\\p{Print}|\\p{Space})+.*"

Print string literal unicode as the actual character

In my Java application I have been passed in a string that looks like this:
"\u00a5123"
When printing that string into the console, I get the same string as the output (as expected).
However, I want to print that out by having the unicode converted into the actual yen symbol (\u00a5 -> yen symbol) - how would I go about doing this?
i.e. so it looks like this: "[yen symbol]123"
I wrote a little program:
public static void main(String[] args) {
    System.out.println("\u00a5123");
}
Its output:
¥123
i.e. it output exactly what you stated in your post. I am not sure there is not something else going on. What version of Java are you using?
edit:
In response to your clarification, there are a couple of different techniques. The most straightforward is to look for a "\u" followed by four hex characters, extract that piece, and replace it with the character for that hex code (using the Character class). This of course assumes the string will not have a \u in front of it.
I am not aware of any particular system to parse the String as though it was an encoded Java String.
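For instance, a sketch of that approach using a regex instead of manual scanning (unescape is just an illustrative helper name):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {

    // Replace every \\uXXXX escape (a literal backslash, 'u', and four hex digits)
    // with the character it denotes.
    static String unescape(String in) {
        Matcher m = Pattern.compile("\\\\u([0-9a-fA-F]{4})").matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u00a5123")); // prints the yen symbol followed by 123
    }
}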
As has been mentioned before, these strings will have to be parsed to get the desired result.
Tokenize the string by using \u as separator. For example: \u63A5\u53D7 => { "63A5", "53D7" }
Process these strings as follows:
String hex = "63A5";
int intValue = Integer.parseInt(hex, 16);
System.out.println((char)intValue);
You're probably going to have to write a parser for these, unless you can find one in a third-party library. There is nothing in the JDK to parse these for you; I know because I fairly recently had an idea to use these kinds of escapes as a way to smuggle Unicode through a Latin-1-only database. (I ended up doing something else, btw.)
I will tell you that java.util.Properties escapes and unescapes Unicode characters in this manner when reading and writing files (since the files have to be ASCII). The methods it uses for this are private, so you can't call them, but you could use the JDK source code to inspire your solution.
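As a sketch of an adjacent trick (my own addition, not the answer's suggestion): the public Properties.load() applies the same \uXXXX unescaping when it reads a value, so you can lean on it directly, with the usual caveats about other backslashes or line breaks in the input:
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropertiesUnescape {

    // Feed the escaped text to Properties as a value; load() decodes the escapes for us.
    static String unescapeViaProperties(String in) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader("key=" + in));
        return p.getProperty("key");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(unescapeViaProperties("\\u00a5123")); // yen symbol, then 123
    }
}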
Could replace the above with this:
System.out.println((char)0x63A5);
Here is code to print all of the box-drawing Unicode characters.
public static void printBox()
{
    for (int i = 0x2500; i <= 0x257F; i++)
    {
        System.out.printf("0x%x : %c\n", i, (char) i);
    }
}
