Translating SAX exceptions - java

I have some Java code that validates XML against an XSD. I am using a modified version of the Error Handler found here: http://www.ibm.com/developerworks/xml/library/x-javaxmlvalidapi.html to catch and log ALL exceptions while validating.
The errors are very terse. They look something like this:
http://www.w3.org/TR/xml-schema-1#cvc-complex-type.2.4.a?s:cID&{"http://www.myschema.com/schema":txn}
Other messages such as
http://www.w3.org/TR/xml-schema-1#cvc-complex-type.2.4.a?s:attributes&{"http://www.myschema.com/schema":sequence}
are even more cryptic.
Is there an easy way to get a clear and intelligible message out of SAX explaining what went wrong here? I think in the first error it was expecting txn and instead found the element cID. BUT... I don't know all the possible errors that might be generated by SAX so I'd rather not try to manually create a translation table.
The eventual users of this output are mostly non-technical, so I need to be able to generate simple and clear messages such as "element txn was out of sequence".
If it helps, here's the code (more or less) that's used for validation:
Source schema1 = new StreamSource(new File("resources/schema1.xsd"));
Source schema2 = new StreamSource(new File("resources/schema2.xsd"));
Source[] sources = {schema1,schema2};
validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(sources).newValidator();
ErrorHandler lenient = new ForgivingErrorHandler();
validator.setErrorHandler(lenient);
Elsewhere...
StreamSource xmlSource = new StreamSource(new StringReader(XMLData) );
try
{
validator.validate(xmlSource);
}
catch (SAXException e)
{
logger.error("XML Validation Error: ",e);
}

Well, it seems I had to add xsi:schemaLocation="http://www.mycompany.com/schema resources/schema1.xsd " to the XML document, because s:http://www.mycompany.com/schema is the default namespace: xmlns="s:http://www.mycompany.com/schema". Of course, I don't have access to modify the tool that generates the XML, so the following ugly hack was necessary:
xmlDataStr = xmlDataStr.replace("<rootNode ", "<rootNode xsi:schemaLocation=\"http://www.mycompany.com/schema resources/schema1.xsd \" ");
...of course now I'm getting double validation errors! A clear and intelligible one such as:
cvc-complex-type.2.4.a: Invalid content was found starting with element 's:cID'. One of '{"http://www.mycompany.ca/schema":tdr}' is expected.
Immediately followed by:
http://www.w3.org/TR/xml-schema-1#cvc-complex-type.2.4.a?s:cID&{"http://www.mycompany.com/schema":tdr}
The double-error is annoying but at least the first one is usable...
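For reference, here is a minimal sketch of an error handler that records every validation problem together with its line and column instead of stopping at the first one. The class names CollectingErrorHandler and ValidationDemo are hypothetical (the ForgivingErrorHandler from the question is not shown), and the sketch assumes the JDK's built-in validator. It does not translate anything; it only prefixes whatever text the parser produces with a location.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler: keeps going after recoverable errors and records each one.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }

    @Override public void warning(SAXParseException e) { record("Warning", e); }
    @Override public void error(SAXParseException e) { record("Error", e); }

    @Override public void fatalError(SAXParseException e) throws SAXException {
        record("Fatal error", e);
        throw e; // the document is not well-formed, so parsing cannot continue
    }
}

class ValidationDemo {
    // Validates xml against xsd (both passed as strings) and returns all recorded messages.
    static List<String> validate(String xsd, String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            Validator validator = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)))
                    .newValidator();
            validator.setErrorHandler(handler);
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            // Fatal errors were already recorded by the handler; note anything else here.
            if (handler.messages.isEmpty()) {
                handler.messages.add("Validation could not run: " + e.getMessage());
            }
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        // Made-up schema and document, loosely modelled on the txn/cID example above.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='root'><xs:complexType><xs:sequence>"
                + "<xs:element name='txn' type='xs:string'/>"
                + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        validate(xsd, "<root><cID>1</cID></root>").forEach(System.out::println);
    }
}
```

Returning the messages as a list rather than logging them directly leaves it up to the caller to decide how to present them to non-technical users.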

Related

Pretty-print and incomplete XML

We have a logging system where we log payloads on demand for troubleshooting and in non-production environments. However, due to a column size constraint, we truncate the XML if it is longer than 5000 characters.
The XML is not pretty-print formatted and is a continuous string.
When the XML is truncated, it is hard to format it so that the data in it is easy to check. Usually, I use Java's DocumentBuilderFactory to format complete XML, but that fails when used against incomplete XML.
I would like to have a solution that can format an incomplete XML instead of throwing an error.
Following the approach Michael Kay outlined in the answer I linked to in a comment, you can run an identity Transformer with indentation over a StreamSource and catch any parse exception. The code looks like this:
String xml = "<root><section><p>Paragraph 1.</p><p>Paragraph 2."; //"<root><section><p>Paragraph 1.</p><p>Paragraph 2.</p></section></root>";
Transformer identityTransformer = TransformerFactory.newInstance().newTransformer();
identityTransformer.setOutputProperty("indent", "yes");
StringWriter resultWriter = new StringWriter();
StreamResult resultStream = new StreamResult(resultWriter);
try {
identityTransformer.transform(new StreamSource(new StringReader(xml)), resultStream);
}
catch (TransformerException e) {
System.out.println(e.getMessageAndLocation());
System.out.println(resultWriter.toString());
}
and, at least for that example, the output gets as far as the last p element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<section>
<p>Paragraph 1.</p>
<p
So some information at the end is lost, but up to that incomplete element the code at least breaks the long one-liner of the input into several lines.
Note: I used Saxon 10 HE as the default Transformer. If you use the JRE's built-in one or Xalan, you will need to set identityTransformer.setOutputProperty("{http://xml.apache.org/xalan}indent-amount", "2"); as otherwise you get line breaks but no indentation.
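The same idea can be wrapped in a small helper, sketched here under a few assumptions: the class name LenientPrettyPrinter is made up, and the Xalan indent-amount property is set inside its own try block because it is not certain that every Transformer implementation accepts it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: indents as much of the input as the parser can read.
class LenientPrettyPrinter {
    static String format(String xml) {
        StringWriter out = new StringWriter();
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.INDENT, "yes");
            try {
                // Needed for the JRE's transformer and Xalan to get real indentation.
                identity.setOutputProperty("{http://xml.apache.org/xalan}indent-amount", "2");
            } catch (IllegalArgumentException ignored) {
                // Assumption: an implementation that rejects this property can do without it.
            }
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        } catch (TransformerException e) {
            // Expected for truncated input: keep whatever was written before the failure.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("<root><section><p>Paragraph 1.</p><p>Paragraph 2."));
    }
}
```

How much of the truncated tail survives depends on the Transformer implementation and on when it flushes its output, so the result for incomplete input may differ from the Saxon output shown above.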

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app that converts Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (color, font, text size, hyperlinks, etc.).
I want the final docx to keep the same formatting as the original after the words are converted from Unicode to another font.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            // Strings are immutable, so this call cannot change string2;
            // the text written back below is the original text.
            data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thank you for asking.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The exception itself already carries good debugging information.
A good first step to disambiguate the error is to not replace the exception's own message with a custom one.
Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get.
Printing the stack trace with the printStackTrace method is also often useful for pinpointing where the error lies.
Share your findings from the above method calls so that we can help you debug the issue further.
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
To retain the formatting of the contents, your solution also needs to be aware of the formatting of the doc files and take care of it.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) format of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this in one of two ways: programmatically, for which you need to become versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), though this method is no less complex than writing your own version of MS Word; or with MS Word's built-in tools, as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives you a cursory overview of how to achieve what you want, but depending on your inclination and available time, you may want to use your discretion before choosing one path over the other.
Hope it helps!

XSL - Exclude access to ACCESS_EXTERNAL_STYLESHEET

I'm looking for information and a solution regarding an issue I have with my implementation of SAXParser for performing XSL transformations.
To improve the quality of our project, the SonarQube sensitivity has been raised, and a new error then appeared for my implementation.
SonarQube is asking me to set some properties to an empty value in order to exclude the possibility of an attack based on those values.
The problem: while I can set the ACCESS_EXTERNAL_DTD and ACCESS_EXTERNAL_SCHEMA properties to empty without trouble, ACCESS_EXTERNAL_STYLESHEET does not seem to be a valid property for SAXParser. Without it set, SonarQube doesn't remove the blocker error, as the property seems to be mandatory for XSL transformations.
SAXParser saxParser = saxParserFactory.newSAXParser();
saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, ""); // Works
saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, ""); // Works
saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, ""); // Doesn't work: throws org.xml.sax.SAXNotRecognizedException
What should I do?
I'm using Saxon-HE:9.8.0-8.
Thank you in advance
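One possible direction, offered as a sketch and not as an answer given in this thread: in JAXP, ACCESS_EXTERNAL_STYLESHEET is documented as an attribute of TransformerFactory, not as a parser property, which would explain why SAXParser does not recognize it. The example below sets it on the factory. The class name HardenedTransformerFactory is made up, and the sketch has only the JDK's built-in factory in mind; whether Saxon-HE 9.8 accepts these attributes is an assumption that would need checking.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper that creates a TransformerFactory with external access disabled.
class HardenedTransformerFactory {
    static TransformerFactory create() {
        TransformerFactory factory = TransformerFactory.newInstance();
        // An empty string means no protocol is allowed for the given kind of external reference.
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return factory;
    }

    public static void main(String[] args) {
        System.out.println(create().getClass().getName());
    }
}
```

If the factory in use does not recognize an attribute, setAttribute throws IllegalArgumentException, so a failure here would show up immediately and not go unnoticed.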

SAX XML parser - localized error messages

The web application I am currently working on validates user-supplied XML files against an XSD stored on a server. The problem is that if the XML fails validation, the error messages should be in Russian. I have my parser working: it gives error messages, but only in English.
String parserClass = "org.apache.xerces.parsers.SAXParser";
String validationFeature = "http://xml.org/sax/features/validation";
String schemaFeature = "http://apache.org/xml/features/validation/schema";
XMLReader reader = null;
reader = XMLReaderFactory.createXMLReader(parserClass);
reader.setFeature(validationFeature,true);
reader.setFeature(schemaFeature,true);
BatchContentHandler contentHandler = new BatchContentHandler(reader);
reader.setContentHandler(contentHandler);
BatchErrorHandler errorHandler = new BatchErrorHandler(reader);
reader.setErrorHandler(errorHandler);
reader.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
reader.parse(new InputSource(new ByteArrayInputStream(streamedXML)));
It works fine - error messages are in English.
After reading the post "Locale specific messages in Xerces 2.11.0 (Java)" and also https://www.java.net//node/699069, I added these lines:
Locale l = new Locale("ru", "RU");
reader.setProperty("http://apache.org/xml/properties/locale", l);
I also added an XMLSchemaMessages_RU.properties file to the jar. Now I get a NullPointerException. Any ideas or hints? Thanks in advance!
I found this about http://apache.org/xml/properties/locale:
Desc: The locale to use for reporting errors and warnings. When the value of this property is null the platform default returned from java.util.Locale.getDefault() will be used.
Type: java.util.Locale
Access: read-write
Since: Xerces-J 2.10.0
Note: If no messages are available for the specified locale the platform default will be used. If the platform default is not English and no messages are available for this locale then messages will be reported in English.
I also found an example that creates a Locale object for the Russian language with this code:
Locale dLocale = new Locale.Builder().setLanguage("ru").setScript("Cyrl").build();
I don't know whether this will help. Give it a try and let me know how it goes!
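One thing that may be worth checking is the name of the properties file. Assuming Xerces loads its messages through java.util.ResourceBundle (not confirmed in this thread), the bundle names that get looked up use a lowercase language code, with the uppercase country code only as an extra suffix. The sketch below prints the names that the default ResourceBundle lookup would try for a Russian locale. The class name BundleNameCheck is made up, and the package prefix of the real base name is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical helper: lists the bundle names ResourceBundle would try, most specific first.
class BundleNameCheck {
    static List<String> candidateNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        // Equivalent to new Locale("ru", "RU") from the question.
        Locale russian = new Locale.Builder().setLanguage("ru").setRegion("RU").build();
        // Prints [XMLSchemaMessages_ru_RU, XMLSchemaMessages_ru, XMLSchemaMessages]
        System.out.println(candidateNames("XMLSchemaMessages", russian));
    }
}
```

Under that assumption, a file named XMLSchemaMessages_RU.properties would not match any candidate, because entry names inside a jar are case-sensitive; XMLSchemaMessages_ru.properties or XMLSchemaMessages_ru_RU.properties would.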

XML Validation: Am I Doing It Right?

I was just wondering if someone could give my XML validation code a once over to see if I'm doing it right. Here's the portion of code that is giving me the trouble...
SAXParserFactory factory = SAXParserFactory.newInstance();
SchemaFactory schemaFactory = SchemaFactory
.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
// *** CODE FAILS ON THE BELOW LINE **/
factory.setSchema(schemaFactory
.newSchema(new Source[] { new StreamSource(schemaStream) }));
SAXParser parser = factory.newSAXParser();
SAXReader reader = new SAXReader(parser.getXMLReader());
reader.setValidation(false);
reader.setErrorHandler(new ResultProducingErrorHandler());
reader.read(content);
Whenever I run the above code, I get an error along the lines of:
src-resolve: Cannot resolve the name 'ns:myStructure' to a(n) 'type definition' component.
The elements mentioned in the error messages are all ones that are imported into the schema via calls to <xs:import />. The schema seems to validate OK via the W3C XML Schema Validator.
Do I have to include each of these schemas individually, or is Java smart enough to go off and fetch these extra schemas too? I tried adding them to the array passed to the newSchema call, but that didn't make any difference.
I don't think I can give out the link to the schema, so I'm really just looking for a yes or no regarding if my code looks at least acceptable.
Ensure that the xs:import statements point to paths that are reachable from the current directory of your application. The current directory may not be what you think it is.
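A related point, sketched here as an assumption about the cause: a StreamSource built from a bare InputStream has no system ID, so a relative schemaLocation in an xs:import has no base URI of its own to resolve against. Giving the source a system ID that points at the main schema file lets the imports resolve relative to that file. The class name ImportAwareSchemaLoader is made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper: loads a schema so that relative xs:import locations can be resolved.
class ImportAwareSchemaLoader {
    static Schema load(Path mainXsd) throws IOException, SAXException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try (InputStream in = Files.newInputStream(mainXsd)) {
            // The system ID is the base URI against which schemaLocation values are resolved.
            StreamSource source = new StreamSource(in, mainXsd.toUri().toString());
            return factory.newSchema(source);
        }
    }
}
```

With this in place, an import such as schemaLocation="types.xsd" is looked up next to the main schema file, wherever the application's current directory happens to be.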
