UTF-16LE encoding and xerces2 Java

I went through a few posts (for example, one explaining that FileReader reads the file as a character stream, and another that the BOM can be treated as whitespace if the document is handed over as a stream of characters), where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from the first post does not seem to work for UTF-16LE. Even though I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck.
I also checked that there are no extra characters (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
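For reference, the "set the encoding on the InputSource" attempt mentioned above looked roughly like this (a sketch only; it did not make the error go away):
// Sketch of setting the encoding on the InputSource explicitly (same variables as above);
// this still produced the prolog error for me.
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    final InputSource source = new InputSource(is);
    source.setEncoding("UTF-16LE");
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(source);
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}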
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.

To get a bit farther, here is some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
// extract the encoding declared in the XML prolog, if any
String declaredEncoding = xml.replaceFirst("(?s).*?<\\?xml[^>]*encoding=[\"']([^\"']+)[\"'].*", "$1");
if (declaredEncoding.equals(xml)) {
    // no encoding attribute found, fall back to the XML default
    declaredEncoding = "UTF-8";
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
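Since the snippet above decodes the bytes as UTF-16LE up front, it may also help to sniff the BOM on the raw bytes before picking a charset; a rough sketch (reusing the bytes array from above, standard charset constants):
// Rough BOM sniffing on the raw bytes before choosing a charset (sketch, not exhaustive).
Charset charset = StandardCharsets.UTF_8; // fallback when no BOM is present
int bomLength = 0;
if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
    charset = StandardCharsets.UTF_8;
    bomLength = 3;
} else if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE) {
    charset = StandardCharsets.UTF_16LE;
    bomLength = 2;
} else if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
    charset = StandardCharsets.UTF_16BE;
    bomLength = 2;
}
String xmlText = new String(bytes, bomLength, bytes.length - bomLength, charset);
LOG.info("BOM says {} ({} BOM bytes skipped)", charset, bomLength);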

Related

Java Spring returning CSV file encoded in UTF-8 with BOM

Apparently, for Excel to open CSV files nicely, the file should have the Byte Order Mark at the start. The download of the CSV is implemented by writing into HttpServletResponse's output stream in the controller, as the data is generated during the request. I get an exception when I try to write the BOM bytes - java.io.CharConversionException: Not an ISO 8859-1 character: [] (even though the encoding I specified is UTF-8).
The controller's method in question
@RequestMapping("/monthly/list")
public List<MonthlyDetailsItem> queryDetailsItems(
        MonthlyDetailsItemQuery query,
        @RequestParam(value = "format", required = false) String format,
        @RequestParam(value = "attachment", required = false, defaultValue = "false") Boolean attachment,
        HttpServletResponse response) throws Exception
{
    // load item list
    List<MonthlyDetailsItem> list = detailsSvc.queryMonthlyDetailsForList(query);
    // adjust format
    format = format != null ? format.toLowerCase() : "json";
    if (!Arrays.asList("json", "csv").contains(format)) format = "json";
    // modify common response headers
    response.setCharacterEncoding("UTF-8");
    if (attachment)
        response.setHeader("Content-Disposition", "attachment;filename=duomenys." + format);
    // build csv
    if ("csv".equals(format)) {
        response.setContentType("text/csv; charset=UTF-8");
        response.getOutputStream().print("\ufeff");
        response.getOutputStream().write(buildMonthlyDetailsItemCsv(list).getBytes("UTF-8"));
        return null;
    }
    return list;
}
I have just come across this same problem. The solution that works for me is to get the output stream from the response object and write to it as follows:
// first create an array for the Byte Order Mark
final byte[] bom = new byte[] { (byte) 239, (byte) 187, (byte) 191 };
try (OutputStream os = response.getOutputStream()) {
    os.write(bom);
    final PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(data);
    w.flush();
    w.close();
} catch (IOException e) {
    // log it
}
So UTF-8 is specified on the OutputStreamWriter.
As an addendum to this, I should add that the same application needs to allow users to upload files, and these may or may not have BOMs. This may be dealt with by using the class org.apache.commons.io.input.BOMInputStream, then using that to construct an org.apache.commons.csv.CSVParser.
The BOMInputStream includes a method hasBOM() to detect if the file has a BOM or not.
One gotcha that I first fell into was that the hasBOM() method reads (obviously!) from the underlying stream, so the way to deal with this is to mark the stream first and then, if the test shows there is no BOM, reset it. The code I use for this looks like the following:
try (InputStream is = uploadFile.getInputStream();
        BufferedInputStream buffIs = new BufferedInputStream(is);
        BOMInputStream bomIn = new BOMInputStream(buffIs)) {
    buffIs.mark(LOOKAHEAD_LENGTH);
    // this should allow us to deal with csv's with or without BOMs
    final boolean hasBOM = bomIn.hasBOM();
    final BufferedReader buffReadr = new BufferedReader(
            new InputStreamReader(hasBOM ? bomIn : buffIs, StandardCharsets.UTF_8));
    // if this stream does not have a BOM, then we must reset the stream as the test
    // for a BOM will have consumed some bytes
    if (!hasBOM) {
        buffIs.reset();
    }
    // collect the validated entity details
    final CSVParser parser = CSVParser.parse(buffReadr,
            CSVFormat.DEFAULT.withFirstRecordAsHeader());
    // Do stuff with the parser
    ...
// Catch and clean up
Hope this helps someone.
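As a small follow-up (a sketch only, assuming commons-io is on the classpath): BOMInputStream can also be told which BOMs to look for and can report the matching charset name, which would cover UTF-16 uploads as well:
// Sketch: let BOMInputStream recognise UTF-8 and UTF-16 BOMs and report the charset
// (reusing uploadFile from the snippet above).
try (BOMInputStream bomIn = new BOMInputStream(uploadFile.getInputStream(),
        ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE)) {
    // getBOMCharsetName() returns null when no BOM was found, so fall back to UTF-8
    String charsetName = bomIn.hasBOM() ? bomIn.getBOMCharsetName() : StandardCharsets.UTF_8.name();
    Reader reader = new InputStreamReader(bomIn, charsetName);
    // hand the reader to CSVParser as in the snippet above
}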
It doesn't make much sense: the BOM is for UTF-16; there is no byte order with UTF-8. The encoding you've set with setCharacterEncoding is used for getWriter, not for getOutputStream.
UPDATE:
OK, try this:
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.print("\uFEFF");
out.print(buildMonthlyDetailsItemCsv(list));
return null;
}
I'm assuming that method buildMonthlyDetailsItemCsv returns a String.

How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

My application downloads XML files that happen to be encoded either in UTF-8 or in ISO-8859-1 (the software that generates those files is crappy, so it does that). I'm from Germany, so we use umlauts (ä, ü, ö), and it really makes a difference how those files are encoded.
I know that the XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However, I have to set the encoding on my FileInputStream already, which is before I get to call .getInputEncoding(). So far I'm just using a BufferedReader to read the XML file, search for the entry that specifies the encoding, and then instantiate my PullParser afterwards.
private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
        int start = firstLine.indexOf("encoding=") + 10; // +10 to actually start after encoding="
        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));
        // now set the encoding to the reader to be used for parsing afterwards
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a different way to do this? Can I take advantage of the .getInputEncoding method? Right now the method seems kinda useless to me because how does my encoding matter if I've already had to set it before being able to check for it.
If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that it can be wrong; it can disagree with the actual encoding.
If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.
ICU CharsetDetector:
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
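A short usage note (a sketch, assuming ICU4J): the CharsetMatch carries a charset name and a confidence value, which can then be used to open a correctly decoded reader over the same bytes:
// Sketch: use the ICU detection result to decode byteData (names/confidence come from ICU4J)
if (match != null && match.getConfidence() > 50) {
    String detected = match.getName();   // e.g. "UTF-8" or "ISO-8859-1"
    Reader reader = new InputStreamReader(new ByteArrayInputStream(byteData), detected);
    // hand the reader to the XML parser
}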
jChardet:
// Initialize the nsDetector()
int lang = (argv.length == 2) ? Integer.parseInt(argv[1]) : nsPSMDetector.ALL;
nsDetector det = new nsDetector(lang);
// Set an observer...
// The Notify() will be called when a matching charset is found.
det.Init(new nsICharsetDetectionObserver() {
    public void Notify(String charset) {
        HtmlCharsetDetector.found = true;
        System.out.println("CHARSET = " + charset);
    }
});
URL url = new URL(argv[0]);
BufferedInputStream imp = new BufferedInputStream(url.openStream());
byte[] buf = new byte[1024];
int len;
boolean done = false;
boolean isAscii = true;
while ((len = imp.read(buf, 0, buf.length)) != -1) {
    // Check if the stream is only ascii.
    if (isAscii)
        isAscii = det.isAscii(buf, len);
    // DoIt if non-ascii and not done yet.
    if (!isAscii && !done)
        done = det.DoIt(buf, len, false);
}
det.DataEnd();
if (isAscii) {
    System.out.println("CHARSET = ASCII");
    found = true;
}
You may also be able to get the correct character set from the Content-Type header, if your server sends it correctly.
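If the files come over HTTP, a rough sketch of pulling the charset out of the Content-Type header with java.net.URLConnection (the URL and the exact header layout here are just examples):
// Sketch: read the charset parameter from the Content-Type header, if the server sends one
URLConnection connection = new URL("http://example.com/data.xml").openConnection();
String contentType = connection.getContentType();   // e.g. "text/xml; charset=ISO-8859-1"
String charset = "UTF-8";                            // fallback if nothing is declared
if (contentType != null) {
    for (String part : contentType.split(";")) {
        part = part.trim();
        if (part.toLowerCase().startsWith("charset=")) {
            charset = part.substring("charset=".length());
        }
    }
}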

How to write to a StyledDocument with a specific charset?

For a NetBeans plugin I want to change the content of a file (which is opened in the NetBeans editor) to a specific String with a specific charset. In order to achieve that, I open the file (a DataObject) with an EditorCookie and then change the content by inserting a different string into the StyledDocument of my data object.
However, I have the feeling that the file is always saved as UTF-8, even if I write a byte order mark into the file. Am I doing something wrong?
This is my code:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
NbDocument.runAtomic(cookie.getDocument(), () -> {
    try {
        StyledDocument document = cookie.openDocument();
        document.remove(0, document.getLength());
        document.insertString(0, utf16be, null);
        cookie.saveDocument();
    } catch (BadLocationException | IOException ex) {
        Exceptions.printStackTrace(ex);
    }
});
I have also tried this approach, which doesn't work either:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
NbDocument.runAtomic(cookie.getDocument(), () -> {
    try {
        StyledDocument doc = cookie.openDocument();
        String utf16be = "\uFEFFHello World!";
        InputStream is = new ByteArrayInputStream(utf16be.getBytes(StandardCharsets.UTF_16BE));
        FileObject fileObject = dataObject.getPrimaryFile();
        String mimePath = fileObject.getMIMEType();
        Lookup lookup = MimeLookup.getLookup(MimePath.parse(mimePath));
        EditorKit kit = lookup.lookup(EditorKit.class);
        try {
            kit.read(is, doc, doc.getLength());
        } catch (IOException | BadLocationException ex) {
            Exceptions.printStackTrace(ex);
        } finally {
            is.close();
        }
        cookie.saveDocument();
    } catch (Exception ex) {
        Exceptions.printStackTrace(ex);
    }
});
Your problem is probably here:
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
This won't do what you think it does. It will convert your string to a byte array using the UTF-16 big endian encoding and then create a String from those bytes using the JRE's default encoding.
So, here's the catch:
A String has no encoding.
The fact that in Java this is a sequence of chars does not matter. Substitute 'char' for 'carrier pigeons', the net effect will be the same.
If you want to write a String to a byte stream with a given encoding, you need to specify the encoding you need on the Writer object you create. Similarly, if you want to read a byte stream into a String using a given encoding, it is the Reader which you need to configure to use the encoding you want.
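For illustration, a minimal sketch of choosing the encoding on the Writer rather than on the String (the file name is a placeholder):
// Minimal sketch: the encoding is chosen on the Writer, not on the String itself
try (Writer writer = new OutputStreamWriter(
        new FileOutputStream("out.txt"), StandardCharsets.UTF_16BE)) {
    writer.write("Hello World!");   // the text is encoded as UTF-16BE only at this point
}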
But your StyledDocument object's method is .insertString(); you should insert your String object as is. Don't transform it the way you do, since that is misguided, as explained above.

Want to throw exception when encounter special UTF-8 characters in an XML file

I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Now our business application has a set of components which are developed by different teams and do not use the same libraries for parsing XML. My component uses JAXB while some other component uses SAX, and so forth. Now when the XML file has special characters like "ä", "ë" or "é" (characters with umlauts or accents), JAXB parses it properly, but the other components (sub-apps) cannot parse them properly and throw exceptions.
Due to business needs I cannot change the programming of the other components, but I have to put a restriction/validation in my application to make sure that the XML (data-load) file does not contain any such characters.
What is the best approach to make sure that the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or give an error) right there, before I start parsing the XML file using JAXB?
If your customer sends you an XML file with a header where the encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. - Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value, refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
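A minimal sketch of that byte-level check (the file name is a placeholder):
// Sketch: reject the file as soon as a byte outside the US-ASCII range shows up
try (InputStream in = new BufferedInputStream(new FileInputStream("data-load.xml"))) {
    int b;
    while ((b = in.read()) != -1) {
        if (b > 127) {   // read() returns 0..255, so >127 means non-ASCII (a negative byte value)
            throw new IOException("Non-ASCII byte 0x" + Integer.toHexString(b) + " found");
        }
    }
}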
Yes, my answer would be the same as what laune mentions...
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            // TODO - check range for each character
            // according to the wikipedia table http://en.wikipedia.org/wiki/UTF-8
            // if it's a valid utf-8 character
            // if it's not in range, then isValid = false;
            // and you can break here...
        }
        in.close();
        return isValid;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet...
You should be able to wrap the XML input in a java.io.Reader in which you specify the actual encoding and then process that normally. Java will leverage the encoding specified in the XML for an InputStream, but when a Reader is used, the encoding of the Reader will be used.
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of them are always converted incorrectly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text with Acrobat Reader XI as well.
Well, aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter instead, wrapping a FileOutputStream and specifying UTF-8 as the encoding (as it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
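What that might look like (a sketch, reusing newname and outputString from the snippet above):
// Sketch: write the extracted text as UTF-8, closing the writer automatically
try (Writer out = new OutputStreamWriter(new FileOutputStream(newname), StandardCharsets.UTF_8)) {
    out.write(outputString);
}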
