I'm having problems parsing an XML string using XmlBeans. The problem itself occurs in a J2EE application, where the string is received from external systems, but I have replicated it in a small test project.
The only workaround I have found is to let XmlBeans parse a File instead of a String, but that is not an option in the J2EE application. Besides, I really want to understand what the problem actually is, because I want to solve it.
Source of test class:
public class TestXmlSpy {

    public static void main(String[] args) throws IOException {
        InputStreamReader reader = new InputStreamReader(new FileInputStream("d:\\temp\\IE734.xml"), "UTF-8");
        BufferedReader r = new BufferedReader(reader);
        String xml = "";
        String str;
        while ((str = r.readLine()) != null) {
            xml = xml + str;
        }
        xml = xml.trim();
        System.out.println("Ready reading XML");

        XmlOptions options = new XmlOptions();
        options.setCharacterEncoding("UTF-8");
        try {
            XmlObject xmlObject = XmlObject.Factory.parse(new File("D:\\temp\\IE734.xml"), options);
            System.out.println("Ready parsing File");
            XmlObject.Factory.parse(xml, options);
            System.out.println("Ready parsing String");
        } catch (XmlException e) {
            e.printStackTrace();
        }
    }
}
The XML file validates perfectly against the XSDs I'm using. Parsing it as a File also works fine and gives me a parsed XmlObject to work with. However, parsing the XML string produces the stack trace below. I've inspected the string in the debugger and don't really see anything wrong with it at first sight, certainly not at row 1, column 1, where the SAX parser seems to be failing if I'm interpreting the error correctly.
Stacktrace:
Ready reading XML
Ready parsing File
org.apache.xmlbeans.XmlException: error: Unexpected element: CDATA
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3511)
at org.apache.xmlbeans.impl.store.Locale.parse(Locale.java:713)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:697)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:684)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:208)
at org.apache.xmlbeans.XmlObject$Factory.parse(XmlObject.java:658)
at xmlspy.TestXmlSpy.main(TestXmlSpy.java:37)
Caused by: org.xml.sax.SAXParseException; systemId: file:; lineNumber: 1; columnNumber: 1; Unexpected element: CDATA
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
... 6 more
This is an encoding problem. I used the code below, which worked for me:
File xmlFile = new File("./data/file.xml");
FileDocument fileDoc = FileDocument.Factory.parse(xmlFile);
The exception is caused by the length of the XML file: if you add or remove one character from the file, the parser will succeed.
The problem occurs within the third-party PiccoloLexer library that XMLBeans relies on. It was fixed in revision 959082, but the fix has not been applied to the xbean 2.5 jar.
Related links:
What does the org.apache.xmlbeans.XmlException with a message of "Unexpected element: CDATA" mean?
XMLBeans - Problem with XML files if length is exactly 8193 bytes
Issue reported on the XMLBeans Jira
I went through a few posts, like "FileReader reads the file as a character stream and can be treated as whitespace if the document is handed as a stream of characters", where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from 1 does not seem to work for UTF-16LE, even though I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck.
I also checked that there are no extra chars (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.
To get a bit further, some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
// Note: '?' must be escaped in the regex, and strings must be compared with
// equals() rather than ==; a Matcher makes the extraction explicit.
String declaredEncoding = "UTF-8"; // default when no encoding is declared
Matcher m = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
if (m.find()) {
    declaredEncoding = m.group(1);
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
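One more detail worth checking while gathering info: the character 49135 / 0xBFEF from the earlier error is exactly what you get when the bytes EF BF are decoded as a little-endian UTF-16 code unit, and EF BF BD is the UTF-8 encoding of the replacement character U+FFFD. So the file may not start with a real BOM at all. A small diagnostic sketch of mine (an interpretation, not a fix) that classifies the leading bytes:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {

    // Classify the leading bytes of a file. EF BF BD is not a BOM at all:
    // it is the UTF-8 encoding of the replacement character U+FFFD, which
    // usually means the original BOM was already destroyed upstream.
    static String sniff(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8 BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE BOM";
        }
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBF && (head[2] & 0xFF) == 0xBD) {
            return "UTF-8 replacement character U+FFFD (mangled BOM?)";
        }
        return "no BOM";
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        byte[] head = new byte[Math.min(4, bytes.length)];
        System.arraycopy(bytes, 0, head, 0, head.length);
        System.out.println(sniff(head));
    }
}
```

If this reports U+FFFD, the file was corrupted before it ever reached the parser, and no encoding setting on the reading side can repair it.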
I'm trying to validate a UBL invoice XML file against the UBL invoice XSD, which imports several other XSD files. I have read the docs and various SO questions describing the same problem, but I still have not found a fix for mine.
I created a basic validate method like so:
private Schema validate(InputStream[] schemas, LSResourceResolver lsResourceResolver) {
    Schema schema = null;
    // Convert InputStream[] to StreamSource[]
    StreamSource[] schemaStreamSources = new StreamSource[schemas.length];
    for (int index = 0; index < schemas.length; index++) {
        schemaStreamSources[index] = new StreamSource(schemas[index]);
    }
    // Create a compiled Schema object.
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    schemaFactory.setResourceResolver(lsResourceResolver);
    try {
        schema = schemaFactory.newSchema(schemaStreamSources);
    } catch (SAXException ex) {
        Logger.getLogger(getClass().getName()).log(Level.SEVERE, "getCompiledSchema", ex);
    }
    return schema;
}
Then I create streams for all of the XSD files referenced by the schemas and run the validate method like so:
InputStream invoiceXSD = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/main/UBL-Invoice-2.0.xsd");
InputStream commonExt = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/UBL-CommonExtensionComponents-2.0.xsd");
InputStream unqualified = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/UnqualifiedDataTypeSchemaModule-2.0.xsd");
InputStream commonBasic = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/UBL-CommonBasicComponents-2.0.xsd");
InputStream qualified = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/UBL-QualifiedDatatypes-2.0.xsd");
InputStream extContent = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/UBL-ExtensionContentDatatype-2.0.xsd");
InputStream code1 = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/CodeList_CurrencyCode_ISO_7_04.xsd");
InputStream code2 = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/CodeList_LanguageCode_ISO_7_04.xsd");
InputStream code3 = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/CodeList_MIMEMediaTypeCode_IANA_7_04.xsd");
InputStream code4 = this.getClass().getClassLoader().getResourceAsStream("schemas/ubl20/common/CodeList_UnitCode_UNECE_7_04.xsd");
InputStream[] streams = {extContent, code1, code2, code3,code4, unqualified,qualified, commonBasic, commonExt, invoiceXSD};
validate(streams, null);
They are in the correct folder, and when I remove the main invoiceXSD everything compiles fine, which suggests the other imports resolve as expected. Only the link between UBL-Invoice-2.0.xsd and UBL-CommonExtensionComponents-2.0.xsd (which actually contains the declaration referenced in the error, but apparently can't be found) somehow does not work.
I'm lost at this point. If someone has any experience with this and could point me in the right direction, that would be greatly appreciated. I've read about implementing a custom LSResourceResolver, which I might go ahead and try; I'm just clueless as to why my current implementation fails. Full stack trace:
org.xml.sax.SAXParseException; lineNumber: 44; columnNumber: 72; src-resolve: Cannot resolve the name 'ext:UBLExtensions' to a(n) 'element declaration' component.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:396)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.reportSchemaErr(XSDHandler.java:4156)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.reportSchemaError(XSDHandler.java:4139)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getGlobalDecl(XSDHandler.java:1745)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDElementTraverser.traverseLocal(XSDElementTraverser.java:170)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.traverseLocalElements(XSDHandler.java:3612)
at com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:636)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:613)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:572)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:538)
at com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:255)
at Main.validate(Main.java:49)
at Main.<init>(Main.java:35)
at Main.main(Main.java:19)
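For reference, a custom LSResourceResolver along the lines mentioned above might look roughly like this. This is an untested sketch of mine; the classpath prefix and the last-path-segment lookup are assumptions based on the resource layout shown above:

```java
import java.io.InputStream;

import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

class ClasspathResourceResolver implements LSResourceResolver {

    @Override
    public LSInput resolveResource(String type, String namespaceURI,
                                   String publicId, String systemId, String baseURI) {
        if (systemId == null) {
            return null;
        }
        // Assumption: every imported XSD lives under schemas/ubl20/common/ and
        // is referenced by its file name (the last path segment of systemId).
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        InputStream is = getClass().getClassLoader()
                .getResourceAsStream("schemas/ubl20/common/" + name);
        if (is == null) {
            return null; // fall back to the parser's default resolution
        }
        try {
            DOMImplementationLS ls = (DOMImplementationLS) DOMImplementationRegistry
                    .newInstance().getDOMImplementation("LS");
            LSInput input = ls.createLSInput();
            input.setByteStream(is);
            input.setPublicId(publicId);
            input.setSystemId(systemId);
            input.setBaseURI(baseURI);
            return input;
        } catch (Exception e) {
            return null;
        }
    }
}
```

It would then be passed as the second argument: validate(streams, new ClasspathResourceResolver()). Another point worth checking: a StreamSource built from a raw InputStream has no system ID, so relative schemaLocation imports cannot be resolved against it; the StreamSource(InputStream, String systemId) constructor lets you supply one.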
As SVG is regular XML and the ImageTranscoder.transcode() API accepts org.w3c.dom.Document (the respective TranscoderInput constructor accepts org.w3c.dom.Document), one would expect that loading and parsing the file with the stock Java XML parser would work:
TranscoderInput input = new TranscoderInput(loadSvgDocument(new FileInputStream(svgFile)));
BufferedImageTranscoder t = new BufferedImageTranscoder();
t.transcode(input, null);
where the loadSvgDocument() method is defined as:
Document loadSvgDocument(InputStream is) {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // using stock Java 8 XML parser
    Document document;
    try {
        DocumentBuilder db = dbf.newDocumentBuilder();
        document = db.parse(is);
    } catch (...) {...}
    return document;
}
It does not work. I am getting strange casting exceptions.
Exception in thread "main" java.lang.ClassCastException: org.apache.batik.dom.GenericElement cannot be cast to org.w3c.dom.svg.SVGSVGElement
at org.apache.batik.anim.dom.SVGOMDocument.getRootElement(SVGOMDocument.java:235)
at org.apache.batik.transcoder.SVGAbstractTranscoder.transcode(SVGAbstractTranscoder.java:193)
at org.apache.batik.transcoder.image.ImageTranscoder.transcode(ImageTranscoder.java:92)
at org.apache.batik.transcoder.XMLAbstractTranscoder.transcode(XMLAbstractTranscoder.java:142)
at org.apache.batik.transcoder.SVGAbstractTranscoder.transcode(SVGAbstractTranscoder.java:156)
Note: BufferedImageTranscoder is my own class, created as per the Batik blueprints, extending ImageTranscoder, which in turn extends the SVGAbstractTranscoder mentioned in the stack trace above.
Unfortunately, I cannot use Batik's own parser, SAXSVGDocumentFactory:
String parser = XMLResourceDescriptor.getXMLParserClassName();
SAXSVGDocumentFactory f = new SAXSVGDocumentFactory(parser);
svgDocument = (SVGDocument) f.createDocument(..);
I am trying to render Papirus SVG icons, but they all have <svg ... version="1">, and SAXSVGDocumentFactory does not like that: it fails in createDocument(..) with Unsupport SVG version '1'. (They probably meant "unsupported".)
Exception in thread "main" java.lang.RuntimeException: Unsupport SVG version '1'
at org.apache.batik.anim.dom.SAXSVGDocumentFactory.getDOMImplementation(SAXSVGDocumentFactory.java:327)
at org.apache.batik.dom.util.SAXDocumentFactory.startElement(SAXDocumentFactory.java:640)
. . .
at org.apache.batik.anim.dom.SAXSVGDocumentFactory.createDocument(SAXSVGDocumentFactory.java:225)
Changing version="1" to version="1.0" in the file itself fixes the problem, and the icon renders nicely. But there are hundreds (thousands) of icons, and fixing them all is tedious; I would effectively be creating a fork of their project. That is not a way forward for me. Much easier is to make the fix at run time, using the DOM API:
Element svgDocumentNode = svgDocument.getDocumentElement();
String svgVersion = svgDocumentNode.getAttribute("version");
if (svgVersion.equals("1")) {
    svgDocumentNode.setAttribute("version", "1.0");
}
But that can only be done with the stock Java XML parser; Batik's XML parser blows up too early, before this code can be reached and before the Document is even created. And when I use the stock XML parser and make the version fix, the Batik transcoder (rasterizer) rejects the document. So I have hit a wall here.
Is there a converter between a stock-parser-produced org.w3c.dom.Document and a Batik-compatible org.w3c.dom.svg.SVGDocument?
OK, I found a solution bypassing the problem. Luckily, the class SAXSVGDocumentFactory can easily be subclassed and the critical method getDOMImplementation() overridden:
protected Document loadSvgDocument(InputStream is) {
    String parser = XMLResourceDescriptor.getXMLParserClassName();
    SAXSVGDocumentFactory f = new LenientSaxSvgDocumentFactory(parser);
    SVGDocument svgDocument;
    try {
        svgDocument = (SVGDocument) f.createDocument("aaa", is);
    } catch (...) {
        ...
    }
    return svgDocument;
}

static class LenientSaxSvgDocumentFactory extends SAXSVGDocumentFactory {

    public LenientSaxSvgDocumentFactory(String parser) {
        super(parser);
    }

    @Override
    public DOMImplementation getDOMImplementation(String ver) {
        // code is mostly a rip-off from the original Apache Batik 1.9 code;
        // only the condition was extended to also accept the "1" string
        if (ver == null || ver.length() == 0
                || ver.equals("1.0") || ver.equals("1.1") || ver.equals("1")) {
            return SVGDOMImplementation.getDOMImplementation();
        } else if (ver.equals("1.2")) {
            return SVG12DOMImplementation.getDOMImplementation();
        }
        throw new RuntimeException("Unsupported SVG version '" + ver + "'");
    }
}
This time I got lucky; the main question remains, however: is there a converter from a stock-parser-produced org.w3c.dom.Document to a Batik-compatible org.w3c.dom.svg.SVGDocument?
I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Now, our business application has a set of components developed by different teams, and they do not all use the same libraries for parsing XML. My component uses JAXB, while other components use SAX and so forth. When the XML file contains special characters like "ä", "ë" or "é" (characters with umlauts or accents), JAXB parses them properly, but the other components (sub-apps) cannot parse them and throw exceptions.
Due to business needs I cannot change the other components, but I have to add a restriction/validation in my application to make sure the XML (data-load) file does not contain any such characters.
What is the best approach to ensure the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or report an error) before I start parsing the XML file with JAXB?
If your customer sends you an XML file whose header declares an encoding that does not match the file contents, you might as well give up trying to do anything meaningful with that file. Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value, refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
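A minimal sketch of that byte-level check (my own wording of the approach above; note that InputStream.read() already returns the byte as an unsigned int in 0-255, so "negative byte" becomes a simple > 127 comparison):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AsciiCheck {

    // Returns true if the file contains only US-ASCII bytes (< 0x80).
    // Any byte >= 0x80 belongs to a multi-byte character such as "ä".
    static boolean isPureAscii(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b > 127) {
                    return false; // non-ASCII byte found
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isPureAscii(Paths.get(args[0])));
    }
}
```

If the file passes this check, it is plain US-ASCII and decodes identically under UTF-8 and ISO 8859-1, so it is safe for all the downstream components.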
Yes, my answer would be the same as laune mentions...
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            // check the range for each character according to the Wikipedia
            // table http://en.wikipedia.org/wiki/UTF-8 - here we restrict
            // to US-ASCII, in line with laune's answer
            if (ch > 127) {
                isValid = false;
                break;
            }
        }
        in.close();
        return isValid;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet...
You should be able to wrap the XML input in a java.io.Reader on which you specify the actual encoding, and then process that normally. Java will use the encoding declared in the XML when given an InputStream, but when a Reader is used, the encoding of the Reader is used instead.
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}
I am trying to join two PostScript files into one with ghost4j 0.5.0 as follows:
final PSDocument[] psDocuments = new PSDocument[2];
psDocuments[0] = new PSDocument();
psDocuments[0].load("1.ps");
psDocuments[1] = new PSDocument();
psDocuments[1].load("2.ps");
psDocuments[0].append(psDocuments[1]);
psDocuments[0].write("3.ps");
During this simplified process I get the following exception for the "append" line above:
org.ghost4j.document.DocumentException: java.lang.ClassCastException:
org.apache.xmlgraphics.ps.dsc.events.UnparsedDSCComment cannot be cast to
org.apache.xmlgraphics.ps.dsc.events.DSCCommentPage
So far I have not managed to find out what the problem is here - maybe some kind of problem within one of the PostScript files?
Help would be appreciated.
EDIT:
I tested with the Ghostscript command line tool:
gswin32.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pswrite -sOutputFile="test.ps" --filename "1.ps" "2.ps"
which results in a document where 1.ps and 2.ps are merged into one(!) page (i.e., overlaid).
When I remove --filename, the resulting document is a PostScript file with two pages, as expected.
The exception occurs because one of the 2 documents does not follow the Adobe Document Structuring Convention (DSC), which is mandatory if you want to use the Document append method.
Use the SafeAppenderModifier instead. There is an example here: http://www.ghost4j.org/highlevelapisamples.html (Append a PDF document to a PostScript document)
I think something is wrong in the document or in the XMLGraphics library, as it seems it cannot parse a part of it.
Here is the code in ghost4j that I think is failing (link):
DSCParser parser = new DSCParser(bais);
Object tP = parser.nextDSCComment(DSCConstants.PAGES);
while (tP instanceof DSCAtend) {
    tP = parser.nextDSCComment(DSCConstants.PAGES);
}
DSCCommentPages pages = (DSCCommentPages) tP;
And here you can see why XMLGraphics may be responsible (link):
private DSCComment parseDSCComment(String name, String value) {
    DSCComment parsed = DSCCommentFactory.createDSCCommentFor(name);
    if (parsed != null) {
        try {
            parsed.parseValue(value);
            return parsed;
        } catch (Exception e) {
            // ignore and fall back to unparsed DSC comment
        }
    }
    UnparsedDSCComment unparsed = new UnparsedDSCComment(name);
    unparsed.parseValue(value);
    return unparsed;
}
It seems parsed.parseValue(value) threw an exception; it was swallowed by the catch block, and the method returned an unparsed version that ghost4j did not expect.