javax.xml.transform.TransformerFactory Unicode issue - Java

We are not able to transform Unicode characters properly. We give input in XML format, but after the transformation we do not get the original string back.
This is the code I'm using:
StringCarrier OStringCarrier = new StringCarrier();
String SXmlFileData = "<export_candidate_response><criteria><output><lastname>Bhagavath</lastname><firstname>ガネーシュ</firstname></output></criteria></export_candidate_response>";
String SResult = "";
try
{
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer(new StreamSource(SXslFileName));
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF8");
    OutputStream xmlResult = new ByteArrayOutputStream();
    StreamResult outResult = new StreamResult(xmlResult);
    transformer.transform(new StreamSource(
            new ByteArrayInputStream(SXmlFileData.getBytes("UTF8"))), outResult);
    // note: toString() with no argument decodes the bytes with the platform default charset
    SResult = outResult.getOutputStream().toString();
}
catch (TransformerConfigurationException OException)
{
    // Exception has been thrown
    OException.printStackTrace();
    return OStringCarrier;
}
catch (TransformerException OException)
{
    // Exception has been thrown
    OException.printStackTrace();
    return OStringCarrier;
}
catch (Exception OException)
{
    // Exception has been thrown
    OException.printStackTrace();
    return OStringCarrier;
}
This is the output I'm getting: ガãƒ?ーシュ in place of ガネーシュ

This is the output I'm getting: ガãƒ?ーシュ in place of ガネーシュ
That tells you that somewhere in this process, data in UTF-8 is being read by a piece of software that thinks it is reading Latin-1. What it doesn't tell you is where in the process this is happening. So you need to divide-and-conquer - you need to find the last point at which the data is correct.
Start by establishing whether the problem is before the transformation or after it. That's very easy if you're using an XSLT 2.0 processor: you can use string-to-codepoints() to see what sequence of characters the XSLT processor has been given. It's a bit trickier with a 1.0 processor, but you can use substring($in, $n, 1) to extract the nth character, and that should give you a clue.
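You can do the same kind of check on the Java side, before the data ever reaches the transformer. Here is a minimal sketch; the helper name dumpCodePoints is mine, not part of the question's code, and it simply prints each code point of a string so you can see whether the literal survived compilation and source-file encoding intact:
static void dumpCodePoints(String s) {
    // if the string is intact, ネ appears as U+30CD; a run like
    // U+00E3 U+0083 ... means the text was corrupted before the transform
    for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
        int cp = s.codePointAt(i);
        System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp)));
    }
}
Calling dumpCodePoints(SXmlFileData) just before the transform tells you whether the input string is already damaged at that point.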
My suspicion is that it's the input. Firstly, putting non-ASCII characters in a Java string literal is always a bit dangerous, because the round trip to a source repository can easily corrupt the code if you're not very careful about everything being configured correctly. Secondly, if the string is correct, it would be much safer to read it using a StringReader, rather than converting it to a byte stream. Try:
transformer.transform(new StreamSource(
        new StringReader(SXmlFileData)), outResult);
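If the input turns out to be fine, note that ByteArrayOutputStream.toString() with no argument decodes the bytes with the platform default charset, which is another spot where UTF-8 can end up being read as something else. Here is a minimal sketch along the lines of the question's code with both ends made explicit; the output-side change is my addition, not something the answer above identifies as the cause:
ByteArrayOutputStream xmlResult = new ByteArrayOutputStream();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
// feed the transformer characters directly, no manual encode/decode of the input
transformer.transform(new StreamSource(new StringReader(SXmlFileData)),
        new StreamResult(xmlResult));
// decode the result bytes explicitly instead of relying on the default charset
String SResult = new String(xmlResult.toByteArray(), StandardCharsets.UTF_8);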

Related

UTF-16LE encoding and xerces2 Java

I went through a few posts (e.g. "FileReader reads the file as a character stream" and "can be treated as whitespace if the document is handed as a stream of characters") where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from 1 does not seem to apply to UTF-16LE. Even though I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream which will hand over bytes, not chars. I also tried to set the Encoding of the InputSource object, but with no luck.
I also checked that there are no extra chars (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.
To get a bit further, do some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
String declaredEncoding = xml.replaceFirst("(?s).*?<\\?xml[^>]*encoding=[\"']([^\"']+)[\"'].*", "$1");
if (declaredEncoding.equals(xml)) {
    // no encoding attribute found, fall back to the XML default
    declaredEncoding = "UTF-8";
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
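If the file really is UTF-16LE with a BOM, another option (my sketch, not part of the answer above) is to decode it yourself and hand the parser a character stream. Java's UTF-16 charset uses the BOM to pick the byte order and strips it, and an InputSource built from a Reader means the parser no longer has to guess the encoding:
try (Reader reader = new InputStreamReader(
        Files.newInputStream(filename.toPath()), StandardCharsets.UTF_16)) {
    // the UTF-16 decoder consumes the BOM and detects little-endian from it
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(reader));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}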

Apache FOP losing elements

I'm trying to print a PDF using Apache FOP + XSL as follows (in is a JDOM Document):
private void transformPDF(Document in, StreamSource template, OutputStream out) throws ConvertirXMLException {
    try {
        Driver driver = new Driver();
        driver.setOutputStream(out);
        driver.setRenderer(1); // select the PDF renderer
        Transformer transformer = TransformerFactory.newInstance().newTransformer(template);
        transformer.transform(new JDOMSource(in), new SAXResult(driver.getContentHandler()));
    } catch (Exception e) {
        throw new ConvertirXMLException(e.toString());
    }
}
It follows FOP's documentation, but it's not working correctly. In debug mode I can see that in has the correct content, but when FOP transforms the data to PDF, some elements are lost.
I have tested the in data and my template in an XSL editor where you can debug and run transformations, and there it works fine (no data is lost), so I'm a little bit lost... Any idea?

No return value after moving XSLT transformation to other project

I have the following piece of code to transform an XML document into an HTML page. It's nothing special, except that I use the Saxon factory for the transformation because I need a function from XSLT 2.0.
try {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    StreamResult output = new StreamResult(bos);
    StreamSource input = new StreamSource(new FileReader(TransformationTest.class.getResource(sourceFileName).getFile()));
    StreamSource stylesheet = new StreamSource(TransformationTest.class.getResourceAsStream(STYLESHEET));
    Transformer tr = TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl", null).newTransformer(stylesheet);
    tr.transform(input, output);
    return bos.toString();
} catch (TransformerException e) {
    throw new RuntimeException("XSL-Transform: Error while transforming the document.");
} catch (FileNotFoundException e) {
    throw new RuntimeException("XSL-Transform: Error while reading the source file.");
}
The code works fine in a separate project with Saxon added as a dependency. I tried it with different inputs and every time I got the right output.
Now I have added it to a bigger project which has many other libraries on the classpath, and suddenly the code stops working. The output is just blank: no exceptions, no error messages, just a blank string is returned. I'm not an expert on what happens inside the transformer, so I need a bit of help here. What exactly could be the reason for this behaviour, and how can I fix it?

How apply CDATA to transformer parameter with jdom

I have tried to surround the parameters sExtraParameter, sExtraParameter2 and sExtraParameter3 with a <![CDATA[ ]]> wrapper in order to get "pretty-printed" Latin characters. But every time I check the XML output, it still shows badly parsed characters.
So, is there another way to apply CDATA to these parameters?
public static Element xslTransformJDOM(File xmlFile, String xslStyleSheet,
        String sExtraParameter, String sExtraParameterValue,
        String sExtraParameter2, String sExtraParameterValue2,
        String sExtraParameter3, String sExtraParameterValue3)
        throws JDOMException, TransformerConfigurationException, FileNotFoundException, IOException {
    try {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(new StreamSource(xslStyleSheet));
        transformer.setParameter(sExtraParameter, sExtraParameterValue);
        transformer.setParameter(sExtraParameter2, sExtraParameterValue2);
        transformer.setParameter(sExtraParameter3, sExtraParameterValue3);
        JDOMResult out = new JDOMResult();
        transformer.transform(new StreamSource(xmlFile), out);
        Element result = out.getDocument().detachRootElement();
        setSize(new XMLOutputter().outputString(result).length());
        return result;
    } catch (TransformerException e) {
        throw new JDOMException("XSLT Transformation failed", e);
    }
}
Edit:
I am taking over a project from my boss, which is why I cannot show the entire code here.
Maybe I have missed the question, but the API (http://docs.oracle.com/javaee/1.4/api/javax/xml/transform/Transformer.html#setParameter(java.lang.String, java.lang.Object)) for setParameter does not restrict the value to any particular type:
value - The value object. This can be any valid Java object. It is up to the processor to provide the proper object coersion or to simply pass the object on for use in an extension.
This could then vary by implementation, assuming you are using JDOM.
There may be a CDATA XML node type that would then be processed correctly. Maybe: http://www.jdom.org/docs/apidocs/org/jdom2/CDATA.html
You could also think about setting the serializer to some form of whitespace preservation: http://www.jdom.org/docs/apidocs.1.1/org/jdom/output/Format.TextMode.html
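Both ideas might look roughly like this inside the question's xslTransformJDOM method; the variable names are the question's, and whether the XSLT processor keeps a CDATA parameter value as CDATA is implementation-dependent, so treat this as something to try rather than a guaranteed fix:
// Option 1: pass a JDOM CDATA node (org.jdom2.CDATA) instead of a plain String
transformer.setParameter(sExtraParameter, new CDATA(sExtraParameterValue));

// Option 2: leave the parameters alone and control the serializer instead:
// preserve whitespace and state the output encoding explicitly
XMLOutputter outputter = new XMLOutputter(
        Format.getPrettyFormat()
              .setTextMode(Format.TextMode.PRESERVE)
              .setEncoding("UTF-8"));
String serialized = outputter.outputString(result);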

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of the characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others. An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
InputStream is = null;
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: I forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use OutputStreamWriter instead wrapping a FileOutputStream, and specify UTF-8 as an encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
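A minimal sketch of the writing side, reusing the newname and outputString variables from the question's code; try-with-resources closes the writer even if write() throws, and the encoding is stated explicitly instead of relying on the platform default:
// writing only, kept separate from the Tika extraction step
try (Writer writer = Files.newBufferedWriter(Paths.get(newname), StandardCharsets.UTF_8)) {
    writer.write(outputString.replace("\n", "\r\n"));
}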
