I'm going to create a class that should unmarshal very large XML files.
I've implemented general unmarshalling:
public XMLProcessor(XMLFile file) throws JAXBException, IOException, SAXException {
JAXBContext jc = JAXBContext.newInstance(Customers.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
File xml = new File(file.getFile().getOriginalFilename());
file.getFile().transferTo(xml);
this.customers = (Customers) unmarshaller.unmarshal(xml);
}
It works fine, but it took more than a minute to process an XML file with 1 million customers.
Can I improve performance by creating multiple threads and unmarshalling several parts of the XML file concurrently?
How should I split my XML file into parts?
Could you show me some sample code for my case?
Although I cannot provide a complete solution yet, I'd like to share with you the approach that I am currently implementing on a similar problem. My XML file structure is like:
<products>
<product id ="p1">
<variant id="v1"></variant>
<variant id="v2"></variant>
</product>
<product id ="p2">
<variant id="v3"></variant>
<variant id="v4"></variant>
</product>
</products>
products and variants may be quite complex, with a lot of attributes, lists etc.
My current approach is to use SAX to extract the XML stream of a single product entity and then hand it over to a new unmarshaller thread (with standard multi-threading measures, such as limiting the maximum thread count).
However, I am still not 100% sure whether SAX generates too much overhead (which could eat up the multi-threading benefit). If it does, I'll try to read the XML stream directly, reacting to the open/close tags for "". As this won't be XML-conformant, it is my measure of last resort.
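A rough sketch of the split-then-dispatch idea, using StAX instead of SAX and only JDK classes. The class name ProductSplitter, the helper countVariants and the thread count are made up for illustration; countVariants merely stands in for the per-product JAXB unmarshalling, which is not shown here.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class ProductSplitter {

    // Reads the document sequentially and passes each <product> subtree,
    // re-serialized as a standalone XML string, to the callback.
    static void split(Reader xml, Consumer<String> onProduct) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(xml);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()
                    || !"product".equals(event.asStartElement().getName().getLocalPart())) {
                continue;
            }
            StringWriter fragment = new StringWriter();
            XMLEventWriter writer = outputFactory.createXMLEventWriter(fragment);
            int depth = 0;
            while (true) {
                if (event.isStartElement()) depth++;
                if (event.isEndElement()) depth--;
                writer.add(event);
                if (depth == 0) break; // the matching </product> has been written
                event = reader.nextEvent();
            }
            writer.flush();
            writer.close();
            onProduct.accept(fragment.toString());
        }
        reader.close();
    }

    // Stand-in for the per-product work (JAXB unmarshalling in the real case):
    // parses one fragment and counts its <variant> elements.
    static int countVariants(String productXml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(productXml)))
                .getElementsByTagName("variant").getLength();
    }

    // Splits on the calling thread and processes fragments on a fixed-size pool.
    static List<Integer> process(String xml, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            split(new StringReader(xml), fragment -> {
                Callable<Integer> task = () -> countVariants(fragment);
                pending.add(pool.submit(task));
            });
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get()); // results come back in document order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this pays off depends on how expensive the per-product work is compared to the splitting itself, which stays single-threaded. For a truly huge file the pending list should also be bounded (for example by draining futures as you go), which this sketch does not do.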
I have gone through many Stack Overflow pages and other websites to decide on a parser that fits my requirements.
I need to read big, nested XML files in Java, so a DOM parser would not be a good fit. My XML looks like this (snippet):
<products>
<product>
<productCode></productCode>
<Code>3002191</Code>
<anotherCode></anotherCode>
<entityName>entityName value</entityName>
<entityName2>entityName value</entityName2>
<Type>value</Type>
<List>1</List>
<SecondCode>124</SecondCode>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>045</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>1219</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
</product>
<product>
..
</product>
</products>
Requirement: I need to store the product information in a list of HashMaps for further processing with other XML files. At first I thought of using the StAX API for this. But the docInfo element contains a countries element, so there can be multiple documents for many countries, and I can't move backward in the parse to save another document (one with the same document info but a different country). Please let me know if this is clear enough.
Please let me know which parser would handle this situation well. I do not have any schemas for this XML.
Thanks a lot.
To parse a large amount of XML, the best option is to use SAX:
https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
You implement the ContentHandler interface and you can put the logic you need when parsing the docInfo and subsequent countries.
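As a minimal sketch of such a handler (extending DefaultHandler, which implements ContentHandler): it buffers the fields of the current docInfo and, when the element closes, emits one map per country, so no backward parsing is needed. The class name DocInfoHandler and the decision to ignore product-level fields are assumptions made to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: produces one row per (docInfo, country) pair.
class DocInfoHandler extends DefaultHandler {
    final List<Map<String, String>> rows = new ArrayList<>();

    private final StringBuilder text = new StringBuilder();
    private Map<String, String> docInfo; // non-null only while inside <docInfo>
    private List<String> countries;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if ("docInfo".equals(qName)) {
            docInfo = new LinkedHashMap<>();
            countries = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (docInfo == null) {
            return; // product-level fields are ignored in this sketch
        }
        String value = text.toString().trim();
        switch (qName) {
            case "country":
                countries.add(value);
                break;
            case "countries":
                break; // container element, nothing to store
            case "docInfo":
                if (countries.isEmpty()) {
                    rows.add(docInfo);
                }
                for (String country : countries) {
                    Map<String, String> row = new LinkedHashMap<>(docInfo);
                    row.put("country", country);
                    rows.add(row);
                }
                docInfo = null;
                break;
            default:
                docInfo.put(qName, value);
        }
    }

    // Convenience entry point that parses an XML string.
    static List<Map<String, String>> parse(String xml) {
        try {
            DocInfoHandler handler = new DocInfoHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.rows;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a real file you would pass a File or InputStream to the SAXParser instead of a string, and extend the handler to carry the product-level fields into each row if you need them.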
I'm writing a game in java using Netbeans.
I want to be able to save the game to an XML file in the beginning of each round, and to be able to load any saved game in the beginning of the game.
The XML file should eventually include the current state of the game when saved (players, names, sum of money, etc.).
I read over the internet and understood that I need to create a content tree of all the classes of the game using DOM and then Marshall the tree into an XML file, using JAXB.
I have no idea where to start from, how to create the context tree, and so on.
Any help or good tutorial would be helpful (couldn't find anything good).
This is a broad question, however if you do not know how to read/write XML from java beans using jaxb take a look at this tutorial http://www.vogella.com/tutorials/JAXB/article.html. I would start there before trying to make a game.
JAXB enables you to easily convert instances of Java objects to/from XML. With JAXB you will never need to interact directly with DOM.
JAXBContext jc = JAXBContext.newInstance(Game.class);
File file = new File("gameData.xml");
Unmarshaller unmarshaller = jc.createUnmarshaller();
Game game = (Game) unmarshaller.unmarshal(file);
Marshaller marshaller = jc.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.marshal(game, file);
JAXB is configuration by exception, meaning you only need to annotate your model where you want the XML representation to differ from the default. The following will help you get started:
https://wiki.eclipse.org/EclipseLink/Examples/MOXy/GettingStarted
I have a JMS messaging app that reads from and writes to MQ queues. The message data is a string in XML format (minus the normal header markers such as the XML version). I'm looking for the best way to read in, write out, and validate against an XSD schema; however, the examples I'm coming across all talk about working with files.
Is there any way (or are there tutorials) to take an XML string, read it in and validate it, and likewise validate and write out an XML string I create, all without writing to disk?
Would appreciate any pointers.
Check out the javax.xml.validation APIs in Java SE, in particular the Validator class which is used to validate XML content against an XML schema:
http://download-llnw.oracle.com/javase/6/docs/api/javax/xml/validation/package-summary.html
Use a StringReader on the strings, pass the reader to the JAXB methods to read the contents.
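For the validation part, something along these lines should work entirely in memory. It is only a sketch: the class name InMemoryValidation and the tiny inline schema are invented for the example, and in practice the XSD would be loaded from wherever you keep it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class InMemoryValidation {

    // Made-up schema for illustration: <deal><amount>decimal</amount></deal>
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"deal\"><xs:complexType><xs:sequence>"
          + "<xs:element name=\"amount\" type=\"xs:decimal\"/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // A Schema is thread-safe and can be cached; a Validator is not.
    static Schema loadSchema(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("The schema itself is invalid", e);
        }
    }

    // Returns true if the XML string is well-formed and valid against the schema.
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // with no ErrorHandler set, validation errors are thrown
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same Schema object can also be set on a JAXB Unmarshaller or Marshaller via setSchema, so that reading and writing validate as they go.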
Thanks, folks. I managed to sort this one out with the following.
Marshall:
JAXBContext jaxbContext = JAXUtility.getContext(packageLocation);
StringWriter sw = new StringWriter();
Marshaller m = jaxbContext.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
if (o instanceof UnadvisedDeal) { m.marshal((UnadvisedDeal) o, sw); }
UnMarshall:
JAXBContext jaxbContext = JAXUtility.getContext(packageLocation);
Unmarshaller um = jaxbContext.createUnmarshaller();
ud = (UnadvisedDeal) um.unmarshal(new StringReader(sw.toString()));
Thanks for the help, though.
I'm writing the xsd and the code to validate, so I have great control here.
I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate:
<foo>
<name>Harold</name>
<bar>Alpha</bar>
<baz>Mercury</baz>
<!-- ... more general info that applies to all foos ... -->
<bar-config>
<!-- the content here is specific to the bar named "Alpha" -->
</bar-config>
<baz-config>
<!-- the content here is specific to the baz named "Mercury" -->
</baz-config>
</foo>
In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config.
The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be.
Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor.
Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas.
Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.
I finally figured this out.
First of all, in the foo schema, the bar-config and baz-config elements have a type which includes an any element, like this:
<sequence>
<any minOccurs="0" maxOccurs="1"
processContents="lax" namespace="##any" />
</sequence>
In the xml, then, you must specify the proper namespace using the xmlns attribute on the child element of bar-config or baz-config, like this:
<bar-config>
<config xmlns="http://www.example.org/bar/Alpha">
... config xml here ...
</config>
</bar-config>
Then, your XML schema file for bar Alpha will have a target namespace of http://www.example.org/bar/Alpha and will define the root element config.
If your XML file has namespace declarations and schema locations for both of the schema files, this is sufficient for the editor to do all of the validating (at least good enough for Eclipse).
So far, we have satisfied the requirement that the xml author may write the xml in such a way that it is validated in the editor.
Now, we need the consumer to be able to validate. In my case, I'm using Java.
If by some chance, you know the schema files that you will need to use to validate ahead of time, then you simply create a single Schema object and validate as usual, like this:
Schema schema = factory().newSchema(new Source[] {
new StreamSource(stream("foo.xsd")),
new StreamSource(stream("Alpha.xsd")),
new StreamSource(stream("Mercury.xsd")),
});
In this case, however, we don't know which xsd files to use until we have parsed the main document. So, the general procedure is to:
Validate the xml using only the main (foo) schema
Determine the schema to use to validate the portion of the document
Find the node that is the root of the portion to validate using a separate schema
Import that node into a brand new document
Validate the brand new document using the other schema file
Caveat: it appears that the document must be built namespace-aware in order for this to work.
Here's some code (this was ripped from various places of my code, so there might be some errors introduced by the copy-and-paste):
// Contains the filename of the xml file
String filename;
// Load the xml data using a namespace-aware builder (the method
// 'stream' simply opens an input stream on a file)
Document document;
DocumentBuilderFactory docBuilderFactory =
DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true);
document = docBuilderFactory.newDocumentBuilder().parse(stream(filename));
// Create the schema factory
SchemaFactory sFactory = SchemaFactory.newInstance(
XMLConstants.W3C_XML_SCHEMA_NS_URI);
// Load the main schema
Schema schema = sFactory.newSchema(
new StreamSource(stream("foo.xsd")));
// Validate using main schema
schema.newValidator().validate(new DOMSource(document));
// Get the node that is the root for the portion you want to validate
// using another schema
Node node= getSpecialNode(document);
// Build a Document from that node
Document subDocument = docBuilderFactory.newDocumentBuilder().newDocument();
subDocument.appendChild(subDocument.importNode(node, true));
// Determine the schema to use using your own logic
Schema subSchema = parseAndDetermineSchema(document);
// Validate using other schema
subSchema.newValidator().validate(new DOMSource(subDocument));
Take a look at NVDL (Namespace-based Validation Dispatching Language) - http://www.nvdl.org/
It is designed to do what you want to do (validate parts of an XML document that have their own namespaces and schemas).
There is a tutorial here - http://www.dpawson.co.uk/nvdl/ - and a Java implementation here - http://jnvdl.sourceforge.net/
Hope that helps!
Kevin
You need to define a target namespace for each separately validated portion of the instance document. Then you define a master schema that uses <xsd:import> to reference the schema documents for these components (<xsd:include> only works for schema documents that share the master schema's target namespace).
The limitation with this approach is that you can't let the individual components define the schemas that should be used to validate them. But it's a bad idea in general to let a document tell you how to validate it (i.e., validation should be something that your application controls).
You can also use a "resource resolver" to allow XML authors to specify their own schema file, at least to some extent; for example: https://stackoverflow.com/a/41225329/32453. At the end of the day, you want a fully compliant XML file that can be validated with normal tools anyway :)
We serialize/deserialize XML using XStream... and just got an OutOfMemory exception.
Firstly I don't understand why we're getting the error as we have 500MB allocated to the server.
Question is - what changes should we make to stay out of trouble? We want to ensure this implementation scales.
Currently we have ~60K objects, each ~50 bytes. We load the 60K POJO's in memory, and serialize them to a String which we send to a web service using HttpClient. When receiving, we get the entire String, then convert to POJO's. The XML/object hierarchy is like:
<root>
<meta>
<date>10/10/2009</date>
<type>abc</type>
</meta>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
I gather the best approach is to not store the POJO's in memory and not write the contents to a single String. Instead we should write the individual <data> POJO's to a stream. XStream supports this but seems like the <meta> element wouldn't be supported. Data would need to be in form:
<root>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
So what approach is easiest to stream the entire tree?
You definitely want to avoid serializing your POJOs into a humongous String and then writing that String out. Use the XStream APIs to serialize the POJOs directly to your OutputStream. I ran into the same situation earlier this year when I found that I was generating 200-300 MB XML documents and getting OutOfMemoryErrors. It was very easy to make the switch.
And ditto of course for the reading side. Don't read the XML into a String and ask XStream to deserialize from that String: deserialize directly from the InputStream.
You mention a second issue regarding not being able to serialize the <meta> element and the <data> elements. I don't think this is an XStream problem or limitation as I routinely serialize much more complex structures on the order of:
<myobject>
<item>foo</item>
<anotheritem>foo</anotheritem>
<alist>
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
...
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
</alist>
<anotherlist>
<anotherlistitem>
<valA>A</valA>
<valB>B</valB>
<valC>C</valC>
...
</anotherlistitem>
...
</anotherlist>
</myobject>
I've successfully serialized and deserialized nested lists too.
Not sure what the problem is here...you've found your answer on that webpage.
The example code on the link you provided suggests:
Writer someWriter = new FileWriter("filename.xml");
ObjectOutputStream out = xstream.createObjectOutputStream(someWriter, "root");
out.writeObject(dataObject);
// iterate over your objects...
out.close();
and for reading nearly identical but with Reader for Writer and Input for Output:
Reader someReader = new FileReader("filename.xml");
ObjectInputStream in = xstream.createObjectInputStream(someReader);
DataObject foo = (DataObject)in.readObject();
// do some stuff here while there's more objects...
in.close();
I'd suggest using tools like Visual VM or Eclipse Memory Analyzer to make sure you don't have a memory leak/problem.
Also, how do you know each object is 50 bytes? That doesn't sound likely.
Use XMLStreamWriter (or XStream) to serialize it; you can write whatever you want with it. If you have the option of getting the input stream instead of the entire string, use a SAXParser. It is event-based and, although the implementation may be a little clumsy, you will be able to read any XML that is thrown at you, even if the XML is huge (I have parsed XML files of 2 GB and more with SAXParser).
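A minimal sketch of the writing side with XMLStreamWriter, producing the <root>/<meta>/<data> layout from the question without ever holding the whole document in memory. The class name StreamingExport and the Iterator<String> input are assumptions; in the real application the iterator would walk your POJOs and the OutputStream would be the HTTP request body.

```java
import java.io.OutputStream;
import java.util.Iterator;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, not part of any library.
class StreamingExport {

    // Writes <meta> once, then one <data> element per item, straight to the stream.
    static void write(OutputStream out, String date, String type, Iterator<String> fields) {
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("root");

            w.writeStartElement("meta");
            writeLeaf(w, "date", date);
            writeLeaf(w, "type", type);
            w.writeEndElement(); // </meta>

            while (fields.hasNext()) {
                w.writeStartElement("data");
                writeLeaf(w, "field", fields.next());
                w.writeEndElement(); // </data>
            }

            w.writeEndElement(); // </root>
            w.writeEndDocument();
            w.flush();
            w.close(); // does not close the underlying OutputStream
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes <name>value</name>; the writer escapes special characters in value.
    private static void writeLeaf(XMLStreamWriter w, String name, String value)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(value);
        w.writeEndElement();
    }
}
```

Because only one item is in flight at a time, memory use stays flat no matter how many <data> elements are written.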
Just as a side note, you should pass the binary data, not the string, to an XML parser. The parser reads the encoding of the bytes that follow from the XML declaration at the beginning of the document:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
A String has already been decoded using some encoding. It's better practice to let the XML parser read the original stream than to create a String from it first.
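To illustrate the point, here is a small sketch (the class name EncodingDemo and the sample document are made up). The bytes are ISO-8859-1 and say so in their declaration; the parser decodes them correctly, while decoding them to a String with the wrong charset beforehand loses the accented character.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, not part of any library.
class EncodingDemo {

    // Parses raw bytes; the parser picks the charset from the XML declaration.
    static String rootText(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(in).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José"; the bytes are encoded as declared in the document.
        byte[] bytes = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Correct: hand the bytes to the parser and let it honor the declaration.
        System.out.println(rootText(new ByteArrayInputStream(bytes)));

        // Lossy: guessing UTF-8 while building a String replaces the accented
        // character before any parser gets to see the declaration.
        String guessed = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(guessed.contains("\u00e9"));
    }
}
```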