I am using JDK 1.3 for the BlackBerry platform. I am facing a problem when I try to read a Unicode-encoded XML file.
My code:
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.InputStreamReader(new java.io.FileInputStream(path),"UTF16"));
br.readLine();
Error:
sun.io.MalformedInputException: Missing byte-order mark
at sun.io.ByteToCharUnicode.convert(ByteToCharUnicode.java:123)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
at java.io.InputStreamReader.fill(InputStreamReader.java:186)
at java.io.InputStreamReader.read(InputStreamReader.java:249)
at java.io.BufferedReader.fill(BufferedReader.java:139)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
Thanks
Your XML file is missing a byte order mark.
In JDK 1.3, the byte order mark is mandatory if you use UTF-16. Try UTF-16LE or UTF-16BE if you know in advance what the endianness is.
(The BOM is not mandatory in 1.4.2 and above.)
Of course, if your file is not UTF-16 at all, use the correct encoding. Apart from a small set of core encodings, the actual encodings supported are implementation defined, so you'll need to check the docs for your particular JDK.
The encoding of the file is supposed to be declared in the XML declaration at the top of your files, something like:
<?xml version="1.0" encoding="THIS IS THE ENCODING YOU NEED TO USE"?>
If the file is in a single-byte encoding, or in UTF-8 without a BOM, you can try reading the first line as plain US-ASCII; it shouldn't contain any data outside that range. Parse the encoding attribute, then re-open the file with the deduced encoding, as sketched below.
Obviously, this will only work if the actual encoding is supported by your platform.
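A rough sketch of that approach, reusing the path variable from your snippet (it assumes the XML declaration fits on the first line and that the declared encoding is supported by your platform):
java.io.BufferedReader sniff = new java.io.BufferedReader(
        new java.io.InputStreamReader(new java.io.FileInputStream(path), "US-ASCII"));
String firstLine = sniff.readLine();
sniff.close();
// Pull the encoding attribute out of <?xml version="1.0" encoding="..."?>, defaulting to UTF-8.
String encoding = "UTF-8";
int idx = (firstLine == null) ? -1 : firstLine.indexOf("encoding=\"");
if (idx >= 0) {
    int start = idx + "encoding=\"".length();
    encoding = firstLine.substring(start, firstLine.indexOf('"', start));
}
// Re-open the file with the deduced encoding.
java.io.BufferedReader br = new java.io.BufferedReader(
        new java.io.InputStreamReader(new java.io.FileInputStream(path), encoding));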
BTW: JDK 1.3 is ancient. Are you sure that's your version? (It doesn't change anything about the problem anyway, except for the BOM part.)
Try this code:
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.InputStreamReader(new java.io.FileInputStream(path),"Windows-1256"));
br.readLine();
Related
I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to format the MultipartFile, CSVParser is used to parse it, and the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
The observation is that when the CSV files are uploaded with a charset of UTF-8, it works fine. But if the CSV file is in a different encoding (ANSI etc.), German and other non-ASCII characters get decoded to random symbols.
For example, äößü end up as ����
I tried the below to specify the encoding explicitly, but it did not work either.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise? Thank you so much in advance.
What you did, new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8), tells the CSV parser that the content of the input stream is UTF-8 encoded.
Since UTF-8 is (usually) the default encoding, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
If I understand your question correctly, this is not what you intended. Instead, you want to automatically choose the right encoding based on the import file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are some libraries you could use to guess the most probable encoding based on the bytes contained in the file. While they are pretty accurate, they are still guessing, and there is no guarantee that you will get the right encoding in the end.
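For example, with ICU4J's CharsetDetector (com.ibm.icu.text); this is just one option, and only a sketch assuming the icu4j dependency is on the classpath and csvFile is the MultipartFile from your code:
BufferedInputStream in = new BufferedInputStream(csvFile.getInputStream());
CharsetDetector detector = new CharsetDetector();
detector.setText(in);                     // reads a sample; needs mark/reset, which BufferedInputStream provides
CharsetMatch match = detector.detect();   // best guess, e.g. "UTF-8" or "windows-1252"
Reader reader = new InputStreamReader(in, Charset.forName(match.getName()));
// then hand reader to CSVFormat.DEFAULT...parse(reader) as before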
Depending on your use case, it might be easier to just agree with the consumer on a fixed encoding (e.g. they can upload UTF-8 or ANSI, but not both).
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")
I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine.
The C# application can open files saved by either with no problem. The Java application reads files it saved perfectly, but there is a small problem that crops up when reading the files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files with EditPad Lite and they appear to be identical, even when inspecting the encoding, which is UTF-16LE for both. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java
Screen Shot of Data in EditPad Lite
// Workaround: if the first field is one character too long, drop the stray leading character.
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, then you can read it as "UTF-16". If it doesn't, then you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or, write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
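If you can't change the writer, a third option is to strip a leading BOM after reading. A minimal sketch with commons-io, assuming the stray first character really is U+FEFF:
List<String> lines = FileUtils.readLines(file, "UTF-16LE");
if (!lines.isEmpty() && lines.get(0).startsWith("\uFEFF")) {
    // drop the byte order mark that the writer put in front of the first line
    lines.set(0, lines.get(0).substring(1));
}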
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
br = new BufferedReader(fis);
String line = br.readLine();
One of my data processing modules crashed while reading ANSI input. Looking at the string in question using a hex viewer, there was a mysterious 0xA0 byte at the end of it.
Turns out this is
Unicode Character 'NO-BREAK SPACE' (U+00A0).
I tried replacing that:
s = s.replace("\u00A0", "");
But it didn't work.
I then went and printed out what that character is using charAt, and Java reports 65533, or 0xFFFD (Unicode Character 'REPLACEMENT CHARACTER', U+FFFD).
Plugging that into the replace code, I finally got rid of it!
But why do I see an 0xA0 in the file, but Java reads it as 0xFFFD?
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
String line = r.readLine();
while (line != null) {
    // do stuff
    line = r.readLine();
}
U+FFFD is the "Unicode replacement character", which is generally used to represent "some binary data which couldn't be decoded correctly in the encoding you were using". (Sometimes ? is used for this instead, but U+FFFD is generally a better idea, as it's unambiguous.)
Its presence is usually a sign that you've tried to use the wrong encoding. You haven't specified which encoding you were using - or indeed how you were using it - but that's probably the problem. Check the encoding you're using and the encoding of the file. Be aware that "ANSI" isn't an encoding - there are lots of encodings which are known as ANSI encodings, and you'll need to pick the right one for your file.
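If you would rather have the read fail loudly than get silent U+FFFD substitutions, you can configure a strict decoder (a sketch; CharsetDecoder and CodingErrorAction live in java.nio.charset, and UTF-8 here is just a placeholder for whatever encoding you choose):
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
// Now a wrong encoding throws MalformedInputException instead of inserting U+FFFD.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(path), decoder));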
How did you open the file?
If you use InputStreamReader(InputStream, Charset) you can specify the 'true' charset of the file you would like to open. If you do not specify the charset yourself, Java uses the default charset of your platform. On Unix this is often UTF-8, while on Windows it's often ISO-8859-1 or a similar Windows code page.
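For example, if the file really is in the usual Windows "ANSI" code page, something like this decodes the 0xA0 byte as a no-break space instead of U+FFFD (windows-1252 is an assumption; check what your files actually use):
BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), "windows-1252"));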
I create an XML String on the fly (NOT reading from a file). Then I use Cocoon 3 to transform it via FOP to a PDF. Somewhere in the middle Xerces runs. When I use the hardcoded stuff everything works. As soon as I put a German umlaut into the database and enrich my XML with that data, I get:
Caused by: org.apache.cocoon.pipeline.ProcessingException: Can't parse the XML string.
at org.apache.cocoon.sax.component.XMLGenerator$StringGenerator.execute(XMLGenerator.java:326)
at org.apache.cocoon.sax.component.XMLGenerator.execute(XMLGenerator.java:104)
at org.apache.cocoon.pipeline.AbstractPipeline.invokeStarter(AbstractPipeline.java:146)
at org.apache.cocoon.pipeline.AbstractPipeline.execute(AbstractPipeline.java:76)
at de.grobmeier.tab.webapp.modules.documents.InvoicePipeline.generateInvoice(InvoicePipeline.java:74)
... 87 more
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)
I then debugged my app and found out that my "Ä" (which comes from the database) has a byte value of 196, which is 0xC4 in hex. This is what I expected according to this: http://www.utf8-zeichentabelle.de/
I do not know why my code fails.
I then tried to add a BOM manually, like this:
byte[] bom = new byte[3];
bom[0] = (byte) 0xEF;
bom[1] = (byte) 0xBB;
bom[2] = (byte) 0xBF;
String myString = new String(bom) + inputString;
I know this is not exactly good, but I tried it; of course it failed. I then tried to add an XML header in front:
<?xml version="1.0" encoding="UTF-8"?>
Which failed too. Then I combined it. Failed.
After all I tried something like that:
xmlInput = new String(xmlInput.getBytes("UTF8"), "UTF8");
Which in fact does nothing, because it is already UTF-8. Still it fails.
So... any ideas what I am doing wrong and what Xerces is expecting from me?
Thanks
Christian
If your database contains only a single byte (with value 0xC4) then you aren't using UTF-8 encoding.
The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has a code-point value U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column "UTF-8 (hex.)" on UTF8-zeichentabelle.de you'll see that UTF-8 encodes that as 0xC3 84 (two bytes).
Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.
EDIT: Christian found the answer himself; it turned out to be a problem in the Cocoon 3 SAX component (I guess it's the alpha 3 version). If you pass XML as a String into the XMLGenerator class, something goes wrong during SAX parsing, causing this mess.
I looked up the code to find the actual problem in cocoon-sax:
if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());
As you can see, the call to getBytes() creates a byte array using the JRE's default encoding, which then fails to parse: the XML declares itself to be UTF-8, whereas the data is now in your platform's default encoding, likely a Windows code page.
As a workaround, one can use the following:
new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"),
"UTF-8");
This will trigger the right internal actions (as Christian found out by experimenting with the API).
I've opened an issue in Apache's bug tracker.
EDIT 2: The issue is fixed and will be included in an upcoming release.
The C4 you see on that page refers to the Unicode code point, U+00C4. The byte sequence used to represent such a code point in UTF-8 is NOT "\xC4". What you want is what's in the "UTF-8 (hex.)" column, namely "\xC3\x84".
Therefore, your data is not in UTF-8.
You can read about how data is encoded in UTF-8 here.
I'm running Windows 7 with TextPad as the text editor for manually building the XML data file. I was getting the MalformedByteSequenceException. The encoding declared in the XML file was UTF-8. After poking around, I found that my editor had a tool, "Tools ... Convert to DOS". I used it, re-saved the file, and the exception went away and my code ran fine.
I then looked at the default encoding for that file type in my editor. It was ASCII, though when I changed the XML encoding declaration to ASCII, I got a different MalformedByteSequenceException.
So on Windows systems, you might try keeping the XML encoding as UTF-8 but saving the file with the DOS conversion. I did not dig any further into why this works.
When I check my file with Notepad++ it's in ANSI encoding. What am I doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
UPDATE:
This is solved now. The reason JBoss didn't understand my XML wasn't the encoding; it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output it and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards, and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
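If you want to be explicit about the declared encoding rather than relying on the transformer's default, you can set the output property before transforming (UTF-8 here is just an example value):
// Call this before xformer.transform(source, result); the file will then start with <?xml version="1.0" encoding="UTF-8"?>
xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");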
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. This should be enough for applications that rely on BOMs.
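A minimal sketch of that, assuming the same file and text variables as in the question and UTF-8 output:
Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
out.write('\uFEFF');   // the BOM; the UTF-8 encoder writes it as the bytes EF BB BF
out.write(text);
out.close();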
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
If there is no BOM (and Java doesn't output one for UTF-8; it doesn't even recognize it when reading), the text is identical in ANSI and UTF-8 encoding as long as only characters in the ASCII range are used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF-8 handling in Java anyway...)
The IANA registered name is "UTF-8", not "UTF8". However, Java would throw an exception for an unsupported encoding name, so that's probably not the problem.
I suspect that Notepad++ is the problem. Examine the text with a hexdump program, and you should see it properly encoded.
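If you don't have a hex editor handy, a few lines of Java will do (a sketch; file is assumed to point at the file you just wrote):
FileInputStream in = new FileInputStream(file);
byte[] buf = new byte[32];
int n = in.read(buf);
in.close();
for (int i = 0; i < n; i++) {
    System.out.printf("%02X ", buf[i] & 0xFF);   // e.g. a UTF-8 encoded 'ä' shows up as C3 A4
}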
Did you try to write a BOM at the beginning of the file? BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this,
public final static byte[] UTF8_BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}