I have a project which consists of reading 1000 XML files, each defining a rule of processing the different types of data. The consequence is that the application takes a few seconds to load the XML files when it starts. It's an Android mobile app so the CPU isn't very powerful.
Is there a way to create static objects at compilation time by reading these XML files? If I could pre-process the XML into static objects that already contain the parsed data, the app should start a lot faster. The drawback that the XML files can't change at runtime is acceptable.
This is a generic question - I am not bound to use any specific method or library. Anything that allows me to pre-parse the XML will do. But as comments asked for my current runtime-parsing implementation, I provide it in the following paragraphs which uses the DOM parser shipped with Java.
The current implementation:
The XML processing class simply creates an object by reading each XML file. It is used like this:
lst.add(new XMLData(new FileInputStream(new File("assets/001.xml"))));
lst.add(new XMLData(new FileInputStream(new File("assets/002.xml"))));
....
Where XMLData is the object that reads the XML file and keeps the relevant information. lst is a List of such objects.
The XMLData class looks like this:
class XMLData {
public XMLData(InputStream xmlAsset) throws IOException, SAXException {
DocumentBuilder dBuilder;
try {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dBuilder = dbFactory.newDocumentBuilder();
} catch (ParserConfigurationException e) {
// TODO: if schema has problems (e.g. defined twice).
// all XML well-formedness were checked before shipping them
e.printStackTrace(); // shouldn't happen
return;
}
Document xml = dBuilder.parse(xmlAsset);
I have a few SonarQube vulnerabilities and one of them caught my eye.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder db = dbf.newDocumentBuilder();
dom = db.parse(sIn);
} catch (ParserConfigurationException pce) {
log.error("ERROR-pce***************************"+pce.getMessage(),pce);
throw pce;
} catch (SAXException se) {
log.error("ERROR-se**********************"+se.getMessage(),se);
throw se;
} catch (IOException ioe) {
log.error("ERROR-ioe*********************"+ioe.getMessage(),ioe);
throw ioe;
}
As you can see in my code, I create a new DocumentBuilder and then parse this input stream:
InputStream sIn = new ByteArrayInputStream(contenidoXml.getBytes(StandardCharsets.UTF_8));
The Sonar "solution" is to do one of the following things:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// to be compliant, completely disable DOCTYPE declaration:
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
// or completely disable external entities declarations:
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
// or prohibit the use of all protocols by external entities:
factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
This is legacy code and I am quite lost here. Could someone explain the differences between the three solutions, and which one is most likely to have zero impact on the code? (We have to update several different classes, and the last time this was deployed SonarQube didn't even exist in my company.)
Refer this for general XXE information
Disable DOCTYPE declaration - Disables processing of DOCTYPE. With the Xerces feature shown above, a document that contains a DOCTYPE declaration is rejected with a fatal error. Documents without one are unaffected, but DTD validation is no longer possible, so only a well-formedness check would be done.
Disable external entities - External entities declared in the document will not be dereferenced. The values of those entities (if used in the document) would be null or, depending on the underlying parser, could cause a parse exception to be thrown.
Prohibit use of protocols - The parser will not use any protocol to access an external DTD or schema. All external DTDs and schemas should be available locally through a SYSTEM identifier or registered with the parser.
The method you choose would be largely dependent on the document you are parsing. If it uses schema, disabling DOCTYPE could be a good solution. If the document is guaranteed not to use external entities, disabling both DOCTYPE and external entities could be a better approach
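To get a feel for the impact of the first option before touching the legacy classes, something like the following could be run against a few representative inputs. It is only a minimal sketch: the class name HardenedDomParser and the sample documents are made up for illustration, and it assumes the Xerces-based parser bundled with the JDK, which recognises this feature URI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper; only the feature URI comes from the question.
class HardenedDomParser {

    // Returns true if the XML parses while DOCTYPE declarations are disallowed.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.newDocumentBuilder()
                   .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException rejected) {
            // The parser reports a DOCTYPE as a fatal error (a SAXParseException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Feature not supported by this parser, or an I/O problem.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<order><id>1</id></order>";
        String withDoctype =
                "<!DOCTYPE order [<!ENTITY x SYSTEM \"file:///etc/hostname\">]>"
                + "<order>&x;</order>";
        System.out.println("plain document accepted: " + parses(plain));
        System.out.println("DOCTYPE document accepted: " + parses(withDoctype));
    }
}
```

If none of your real documents carry a DOCTYPE, this option should be invisible to the rest of the code. If some do, they will start failing, and one of the other two options is probably the safer choice.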
Application Background:
Basically, I am building an application that parses an XML document using a SAX parser. For every incoming tag I would like to know its datatype and other information, so I use the XSD associated with that XML file to look these up. Hence, I parse the XSD file and store all the information in a HashMap, so that whenever a tag arrives I can pass that XML tag as the key and obtain the value (the information collected for it during XSD parsing).
Problem I am facing:
As of now, I am able to parse my XSD using the DocumentBuilderFactory. But during the collection of elements, I am able to get only one type of element and store it in my NODELIST such as elements with tag name "xs:element". My XSD also has some other element type such as "xs:complexType", xs:any etc. I would like to read all of them and store them into a single NODELIST which I can later loop and push to HASHMAP. However I am unable to add any additional elements to my NODELIST after adding one type to it:
Below code will add tags with the xs:element
NodeList list = doc.getElementsByTagName("xs:element");
How can I add the tags with xs:complexType and xs:any to the same NODELIST?
Is this a good way to find the datatype and other attributes from the XSD, or is there a better approach? As I may need to hit the HashMap many times, once for every tag in the XML, will there be a performance issue?
Is DocumentBuilderFactory a good approach for parsing the XSD, or are there better libraries for XSD parsing? I looked into Xerces2 but could not find any good example, so I got stuck and posted the question here.
Following is my code for parsing the XSD using DocumentBuilderFactory:
public class DOMParser {
private static Map<String, Element> xmlTags = new HashMap<String, Element>();
public static void main(String[] args) throws URISyntaxException, SAXException, IOException, ParserConfigurationException {
String xsdPath1 = Paths.get(Xerces2Parser.class.getClassLoader().getResource("test.xsd").toURI()).toFile().getAbsolutePath();
String filePath1 = Path.of(xsdPath1).toString();
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File(filePath1));
NodeList list = doc.getElementsByTagName("xs:element");
System.out.println(list.getLength());
// How to add the xs:complexType to same list as above
// list.add(doc.getElementsByTagName("xs:complexType"));
// list = doc.getElementsByTagName("xs:complexType");
// Loop and add data to Map for future lookups
for (int i = 0; i < list.getLength(); i++) {
Element element = (Element) list.item(i);
if (element.hasAttributes()) {
xmlTags.put(element.getAttribute("name"), element);
}
}
}
}
I don't know what you are trying to achieve (you have described the code you are writing, not the problem it is designed to solve) but what you are doing seems misguided. Trying to get useful information out of an XSD schema by parsing it at the XML level is really hard work, and it's clear from the questions you are asking that you haven't appreciated the complexities of what you are attempting.
It's hard to advise you on the low-level detail of maintaining hash maps and node lists when we don't understand what you are trying to achieve. What information are you trying to extract from the schema, and why?
There are a number of ways of getting information out of a schema at a higher level. Xerces has a Java API for accessing a compiled schema. Saxon has an XML representation of compiled schemas called SCM (the difference from raw XSD is that all the work of expanding xs:include and xs:import, expanding attribute groups, model groups, and substitution groups etc has been done for you). Saxon also has an XPath API (a set of extension functions) for accessing compiled schema information.
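On the narrower question of merging node lists: org.w3c.dom.NodeList is a read-only view with only item() and getLength(), so nothing can be added to it. If you do stay at the DOM level, one option is to copy the matches into an ordinary List. The following is only a sketch under that assumption. The class name SchemaNodeCollector and the sample schema are invented for illustration, and it matches on the XSD namespace rather than on the "xs:" prefix, which a schema author is free to change.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any library.
class SchemaNodeCollector {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Copies every element with one of the given local names into one list.
    static List<Element> collect(Document doc, String... localNames) {
        List<Element> result = new ArrayList<>();
        for (String localName : localNames) {
            NodeList nodes = doc.getElementsByTagNameNS(XSD_NS, localName);
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add((Element) nodes.item(i));
            }
        }
        return result;
    }

    // Namespace awareness must be switched on for getElementsByTagNameNS to match.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='order' type='OrderType'/>"
                + "<xs:complexType name='OrderType'><xs:sequence>"
                + "<xs:element name='id' type='xs:int'/>"
                + "</xs:sequence></xs:complexType></xs:schema>";
        for (Element e : collect(parse(xsd), "element", "complexType")) {
            System.out.println(e.getLocalName() + " name=" + e.getAttribute("name"));
        }
    }
}
```

This still leaves you with raw, unexpanded schema markup, so the caveats above about xs:include, groups, and substitution groups apply unchanged.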
We are implementing a portal that handles requests for modifying and generating Microsoft Office 2007 documents (docx). The back-end is implemented in Java and uses Apache POI as the API for manipulating the contents of the docx files. The back-end is accessed through REST API calls coming from a front-end written in JavaScript.
The back-end acts like a Document Server that handles about 15 different docx documents which act as templates and contain tokens that need to be replaced with actual values. The requests coming from the front-end are actually a token value map that the back-end needs to replace in the templates and generate a new document, for each request. The workflow is as follows:
receive request from front-end: token-value map
read template document as XWPFDocument object
parse and replace text in all XWPFParagraph/XWPFTable elements of the XWPFDocument
write the modified XWPFDocument to a different file path
I am currently trying to implement a caching mechanism, because going to the disk and reading the files for each request is a real performance issue. I would need to treat each template document as a Prototype and return a clone for each request that the back-end receives, something similar to this:
XWPFDocument theDocument = documentCache.clone(documentConfiguration.getInputType());
The clone method is currently implemented as follows:
public XWPFDocument clone(DocumentDictionary.DocumentType type){
if(PACKAGE_MAP.isEmpty())
getPackages();
XWPFDocument document = null;
try {
document = new XWPFDocument(PACKAGE_MAP.get(type));
}catch(IOException exception){
logger.error("Unable to clone document for input type {}", type);
}
return document;
}
This implementation does not yield the desired results. The first request is processed as expected, but the second request fails when writing the document with the error:
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
The exception above does not replicate in the case of reading the document fresh at each request.
Looking at the Apache POI API, the clone() methods of XWPFDocument and ZipPackage, which are used in the reading/writing process, are protected, so I cannot use the basic cloning functionality offered by the language. The issue seems to come from the fact that the same ZipPackage is shared between reading and writing the document.
Has anyone been able to implement such a mechanism using Java and Apache POI?
You can pre-load the byte[] of each template type and use it as the base for each clone, wrapping it in a new ByteArrayInputStream every time.
You can also load a template only when it is first requested, so that getPackages() becomes getPackage(DocumentDictionary.DocumentType type) and loads a single type instead of looping over all of them.
The XWPFDocument is modified in place at runtime, so when you change its paragraphs, tables, or runs, you are editing the cached template itself. That is why each request needs a freshly loaded copy.
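private void getPackages() {
for (DocumentDictionary.DocumentType type : DocumentDictionary.DocumentType.values()) {
try {
// PACKAGE_MAP now maps each type to the raw bytes of its template
PACKAGE_MAP.put(type, FileUtils.readFileToByteArray(new File(getTemplateFromType(type))));
} catch (IOException exception) {
logger.error("Unable to read template for input type {}", type);
}
}
}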
private String getTemplateFromType(DocumentDictionary.DocumentType type) {
switch(type) {
case TYPE_1:
return "/path/to/template/type_1.docx";
...
}
}
public XWPFDocument clone(DocumentDictionary.DocumentType type) {
if(PACKAGE_MAP.isEmpty())
getPackages();
XWPFDocument document = null;
try {
document = new XWPFDocument(new ByteArrayInputStream(PACKAGE_MAP.get(type)));
} catch(IOException exception) {
logger.error("Unable to clone document for input type {}", type);
}
return document;
}
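The idea behind the code above can be tried without POI on the classpath. What follows is a minimal sketch: TemplateByteCache and its string keys are hypothetical stand-ins for PACKAGE_MAP and DocumentType. It shows that every stream handed out starts at the beginning of the cached bytes and is consumed independently of the others.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for PACKAGE_MAP; keys would be DocumentType in the real code.
class TemplateByteCache {
    private final Map<String, byte[]> bytesByKey = new ConcurrentHashMap<>();

    // Stores a private copy so later changes to the caller's array do not leak in.
    void put(String key, byte[] templateBytes) {
        bytesByKey.put(key, templateBytes.clone());
    }

    // Each call returns a new stream positioned at the start of the cached bytes.
    ByteArrayInputStream openFresh(String key) {
        byte[] bytes = bytesByKey.get(key);
        if (bytes == null) {
            throw new IllegalArgumentException("No template cached for " + key);
        }
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) {
        TemplateByteCache cache = new TemplateByteCache();
        cache.put("TYPE_1", new byte[] {10, 20, 30});

        ByteArrayInputStream first = cache.openFresh("TYPE_1");
        first.read();                       // consume one byte of the first stream
        ByteArrayInputStream second = cache.openFresh("TYPE_1");

        System.out.println("first has " + first.available() + " bytes left");
        System.out.println("second has " + second.available() + " bytes left");
    }
}
```

In the real back-end each fresh stream would be passed to new XWPFDocument(InputStream), so every request works on its own document and its own package rather than on a shared one.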
I am learning about XML in Java and every time I want to use a Document object I have to write:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
I know how it works further, but what actually happens in those 3 lines? Why do I need a DocumentBuilderFactory and then a DocumentBuilder to build a Document?
Update: Could you give me an example where I shouldn't write the first 2 lines exactly the same? I don't see the point of instantiating 2 more objects for a new Document. What is their effective role?
1) Factory (creates something) can create a DocumentBuilder
Obtain a new instance of a DocumentBuilderFactory. This static method
creates a new factory instance.
2)
Creates a new instance of a DocumentBuilder using the currently
configured parameters.
3)
Parse the content of the given file as an XML document and return a
new DOM Document object. An IllegalArgumentException is thrown if the
File is null null.
Source
This is how the library is built. Without the factory you will not be able to create a new DocumentBuilder object and thus will not be able to parse a file.
The approach you use for the XML parsing is known as the Document Object Model (DOM) approach (note: it is not the only one available) and is part of the Java API for XML Processing (JAXP). Quoting:
Designed to be flexible, JAXP allows you to use any XML-compliant
parser from within your application
To allow the programmer to use any XML parser, the system needs to avoid using a specific implementation. To be able to do that it decides the implementation during runtime using a design pattern known as the Factory pattern which (quoting) "...deals with the problem of creating objects (products) without specifying the exact class of object that will be created."
So when you use DocumentBuilder dBuilder = dbFactory.newDocumentBuilder(); the returned instance is not actually a DocumentBuilder (it couldn't be - this is an abstract class) but an instance of another class that extends DocumentBuilder. You could print the actual class at runtime to verify that.
// returns com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl in my system
System.out.println( dbFactory.getClass().getName() );
// returns com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl in my system
System.out.println( dBuilder.getClass().getName() );
Examples where you wouldn't need the first two lines would be cases where you use a specific parsing implementation directly (and thus introduce a third-party dependency in your project).
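As for when the first two lines would not be written exactly the same: the factory is also where the parser gets configured, and each builder it creates reflects the settings in force at that moment. Here is a small sketch of that, in which the class BuilderConfigDemo and the sample XML are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class.
class BuilderConfigDemo {

    // The factory carries the configuration; the builder it returns honours it.
    static DocumentBuilder newBuilder(boolean namespaceAware) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        try {
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and reports the namespace URI of the root element.
    static String rootNamespace(DocumentBuilder builder, String xml) {
        try {
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNamespaceURI();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a:b xmlns:a='urn:x'/>";
        System.out.println(rootNamespace(newBuilder(true), xml));
        System.out.println(rootNamespace(newBuilder(false), xml));
    }
}
```

The same three steps produce differently behaving parsers purely because of what was set on the factory in between, which is the role those two extra objects play.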
I hope this helps
Right from the javadocs:
DocumentBuilderFactory.newInstance()
Obtain a new instance of a DocumentBuilderFactory. This static method
creates a new factory instance. This method uses the following ordered
lookup procedure to determine the DocumentBuilderFactory
implementation class to load:
Use the javax.xml.parsers.DocumentBuilderFactory system property.
Use the properties file "lib/jaxp.properties" in the JRE directory. This configuration file is in standard java.util.Properties format
and contains the fully qualified name of the implementation class
with the key being the system property defined above. The
jaxp.properties file is read only once by the JAXP implementation and
it's values are then cached for future use. If the file does not
exist when the first attempt is made to read from it, no further
attempts are made to check for its existence. It is not possible to
change the value of any property in jaxp.properties after it has been
read for the first time.
Use the Services API (as detailed in the JAR specification), if available, to determine the classname. The Services API will look for
a classname in the file
META-INF/services/javax.xml.parsers.DocumentBuilderFactory in jars
available to the runtime.
Platform default DocumentBuilderFactory instance.
Once an application has obtained a reference to a
DocumentBuilderFactory it can use the factory to configure and obtain
parser instances.
DocumentBuilderFactory.newDocumentBuilder()
Creates a new instance of a DocumentBuilder using the currently
configured parameters.
Returns: A new instance of a DocumentBuilder.
Throws: ParserConfigurationException - if a DocumentBuilder cannot be
created which satisfies the configuration requested.
DocumentBuilder.parse()
Parse the content of the given file as an XML document and return a
new DOM Document object. An IllegalArgumentException is thrown if the
File is null null.
Parameters: f - The file containing the XML to parse.
Returns: A new DOM Document object.
Throws: IOException - If any IO errors occur. SAXException - If any
parse errors occur. IllegalArgumentException - When f is null
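The first step of the lookup procedure quoted above can be observed with a short experiment. This is only a sketch: FactoryLookupDemo and the class name "no.such.FactoryImpl" are invented, and the exact default implementation class printed depends on your JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;

// Hypothetical demo class.
class FactoryLookupDemo {
    static final String KEY = "javax.xml.parsers.DocumentBuilderFactory";

    // Name of the implementation class chosen by the lookup procedure.
    static String resolvedFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Points the system property at the given class name and reports
    // whether newInstance() then fails; the old value is restored afterwards.
    static boolean failsWithOverride(String className) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, className);
        try {
            DocumentBuilderFactory.newInstance();
            return false;
        } catch (FactoryConfigurationError e) {
            return true;
        } finally {
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default: " + resolvedFactoryClass());
        System.out.println("bogus override fails: " + failsWithOverride("no.such.FactoryImpl"));
    }
}
```

Because the system property is consulted first, a value that names a class that cannot be loaded makes newInstance() fail instead of silently falling back to the platform default.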
I created a static DOM document object as shown below, using the javax.xml.parsers.* and org.w3c.dom.* APIs:
DocumentBuilderFactory docBldrFactry = DocumentBuilderFactory.newInstance();
docBldrObj = docBldrFactry.newDocumentBuilder();
File file = new File(fileDirectory);
// Parse the XML file and return a DOM document object
document = docBldrObj.parse(file);
//FYI, document is declared as private static org.w3c.dom.Document document elsewhere.
Once it has been created as above, if this static DOM document object is shared by several threads, and all of them only read (traverse) the document, is that thread-safe?
I assume it is since read should not modify this shared state, but not sure whether internally there is some magic about it which I don't know.
Thanks
I solved the problem by writing my own simple document structure, i.e. cloning the DOM document into a structure that is thread-safe for read operations.
FYI, for my own purpose, when cloning the document, I don't clone everything but the information based on my need (COMMENT_NODE, TEXT_NODE, ELEMENT_NODE, attributes).
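A rough sketch of what such a structure could look like is given below. It is not the code I actually used: the class SnapshotNode and its field names are invented for illustration. It copies only element, text, and comment nodes plus attributes, and relies on final fields and unmodifiable collections so that the finished tree can be read from several threads without locking.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical immutable copy of a DOM subtree.
final class SnapshotNode {
    final short type;                       // Node.ELEMENT_NODE, TEXT_NODE or COMMENT_NODE
    final String name;                      // tag name, "#text" or "#comment"
    final String text;                      // null for elements
    final Map<String, String> attributes;   // empty for non-elements
    final List<SnapshotNode> children;

    private SnapshotNode(Node node, Map<String, String> attrs, List<SnapshotNode> kids) {
        this.type = node.getNodeType();
        this.name = node.getNodeName();
        this.text = node.getNodeValue();
        this.attributes = Collections.unmodifiableMap(attrs);
        this.children = Collections.unmodifiableList(kids);
    }

    // Recursively copies the node; other node types are skipped.
    static SnapshotNode copyOf(Node node) {
        Map<String, String> attrs = new LinkedHashMap<>();
        NamedNodeMap domAttrs = node.getAttributes();   // null unless node is an element
        if (domAttrs != null) {
            for (int i = 0; i < domAttrs.getLength(); i++) {
                attrs.put(domAttrs.item(i).getNodeName(), domAttrs.item(i).getNodeValue());
            }
        }
        List<SnapshotNode> kids = new ArrayList<>();
        NodeList domKids = node.getChildNodes();
        for (int i = 0; i < domKids.getLength(); i++) {
            short t = domKids.item(i).getNodeType();
            if (t == Node.ELEMENT_NODE || t == Node.TEXT_NODE || t == Node.COMMENT_NODE) {
                kids.add(copyOf(domKids.item(i)));
            }
        }
        return new SnapshotNode(node, attrs, kids);
    }

    // Convenience for trying it out: parse a string and snapshot the root element.
    static SnapshotNode parse(String xml) {
        try {
            return copyOf(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The copy is built on a single thread and then published, for example through a static final field, after which readers only ever see immutable data. The original DOM document can be discarded.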