Parse XML file in java parsing new line problem

XMLInputFactory factory = XMLInputFactory.newInstance();
Reader fileReader = new FileReader(xmlFileName);
XMLEventReader reader = factory.createXMLEventReader(fileReader);
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
StartElement element = (StartElement) event;
//process start element
}
if (event.isEndElement()) {
currentParent = (DefaultMutableTreeNode)(currentParent.getParent());
//process end element
}
if (event.isCharacters()) {
Characters characters = (Characters) event;
String text = characters.getData();
if(text.startsWith("\n") || text.startsWith("\r\n") || text.startsWith("\r")) {
continue;
}
//process characters element
}
}
My problem with the above code is that, while processing the XML file, new lines are reported as character nodes. I was hoping there is a flag to ignore them. Please let me know if this is possible.

As per the XML specification, whitespace must be preserved unless told otherwise:
An XML processor MUST always pass all characters in a document that are not markup through to the application.

Try reading through this question for some pointers on what to do.
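As far as I know, the standard XMLInputFactory properties do not include a switch for dropping whitespace, but a filtered reader gets close. Here is a minimal sketch, assuming whitespace-only text nodes carry no meaning in your document. The class name WhitespaceFilterDemo and the method nonBlankText are made up for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class WhitespaceFilterDemo {

    // Collects the text of every Characters event that is not whitespace-only.
    static List<String> nonBlankText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader raw = factory.createXMLEventReader(new StringReader(xml));

        // Reject Characters events that consist solely of whitespace;
        // every other event passes through unchanged.
        EventFilter skipBlank = new EventFilter() {
            @Override
            public boolean accept(XMLEvent event) {
                return !(event.isCharacters() && event.asCharacters().isWhiteSpace());
            }
        };
        XMLEventReader reader = factory.createFilteredReader(raw, skipBlank);

        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters()) {
                texts.add(event.asCharacters().getData());
            }
        }
        reader.close();
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(nonBlankText("<a>\n  <b>hello</b>\n  <c>world</c>\n</a>"));
    }
}
```

Unlike a startsWith("\n") check, this only drops text that is entirely whitespace, so real content that happens to begin with a line break is kept.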

Related

How to feed the Java stax parser with chunks of strings and not InputStreams?

I know the Java StAX parser works with InputStreams. However, I need to manually push chunks of strings to the parser instead of an InputStream.
Would that be possible?

You can use a ByteArrayInputStream wrapping around your String chunks' bytes.
Quick example
XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader reader = factory.createXMLStreamReader(
new ByteArrayInputStream("<start></start>".getBytes("UTF-8"))
);
while (reader.hasNext()) {
int event = reader.next();
switch (event) {
case (XMLStreamConstants.START_ELEMENT): {
System.out.println(reader.getLocalName());
break;
}
}
}
Output
start
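If the document arrives as several chunks rather than one String, the same idea can be extended by chaining the chunks with java.io.SequenceInputStream. This is only a sketch, assuming all chunks are available (or can be supplied lazily through the Enumeration) when the parser asks for more input. ChunkedInputDemo and elementNames are hypothetical names:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ChunkedInputDemo {

    // Parses XML delivered as several String chunks and returns the
    // local names of all start elements in document order.
    static List<String> elementNames(List<String> chunks) throws XMLStreamException {
        List<InputStream> streams = new ArrayList<>();
        for (String chunk : chunks) {
            streams.add(new ByteArrayInputStream(chunk.getBytes(StandardCharsets.UTF_8)));
        }
        // SequenceInputStream reads each stream in turn as if they were one.
        InputStream joined = new SequenceInputStream(Collections.enumeration(streams));

        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(joined, "UTF-8");
        List<String> names = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
        reader.close();
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        // The tag name "item" is deliberately split across two chunks.
        System.out.println(elementNames(Arrays.asList("<start><it", "em/></start>")));
    }
}
```

Note that this is still pull parsing: the reader blocks on the stream, so it does not give you a true push interface.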

Java Modify XML

I want to read an XML file in Java and then update certain elements in that file with new values. My file is > 200mb and performance is important, so the DOM model cannot be used.
I feel that a StaX Parser is the solution, but there is no decent literature on using Java StaX to read and then write XML back to the same file.
(For reference I have been using the java tutorial and this helpful tutorial to get what I have so far)
I am using Java 7, but there doesn't seem to be any updates to the XML parsing API since...a long time ago. So this probably isn't relevant.
Currently I have this:
public static String readValueFromXML(final File xmlFile, final String value) throws FileNotFoundException, XMLStreamException
{
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(new FileReader(xmlFile));
String found = "";
boolean read = false;
while (reader.hasNext())
{
XMLEvent event = reader.nextEvent();
if (event.isStartElement() &&
event.asStartElement().getName().getLocalPart().equals(value))
{
read = true;
}
if (event.isCharacters() && read)
{
found = event.asCharacters().getData();
break;
}
}
return found;
}
This reads the XML file and returns the value of the selected element. However, I have another method, updateXMLFile(final File xmlFile, final String value), which I want to use in conjunction with it.
So my question is threefold:
Is there a StAX implementation for editing XML?
Will XPath be any help? Can it be used without converting my file to a Document?
(More generally) Why doesn't Java have a better XML API?

There are two things you may want to look at. The first is to use JAXB to bind the XML to POJOs which you can then have your way with and serialize the structure back to XML when needed.
The second is a JDBC driver for XML. Several are available for a fee; I am not sure whether there are any open source ones. In my experience JAXB is the better choice. If the XML file is too large to handle efficiently with JAXB, I think you need to look at using a database as a replacement for the XML file.

This is my approach, which reads events from the file using StaX and writes them to another file. The values are updated as the loop passes over the correctly named elements.
public void read(String key, String value)
{
try (FileReader fReader = new FileReader(inputFile); FileWriter fWriter = new FileWriter(outputFile))
{
XMLEventFactory factory = XMLEventFactory.newInstance();
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(fReader);
XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(fWriter);
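// Declared outside the loop so the flag survives from the START_ELEMENT
// event to the following CHARACTERS event.
boolean update = false;
while (reader.hasNext())
{
XMLEvent event = reader.nextEvent();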
if (event.isStartElement() && event.asStartElement().getName().getLocalPart().equals(key))
{
update = true;
}
else if (event.isCharacters() && update)
{
Characters characters = factory.createCharacters(value);
event = characters;
update = false;
}
writer.add(event);
}
}
catch (XMLStreamException | FactoryConfigurationError | IOException e)
{
e.printStackTrace();
}
}
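To end up with the changes in the original file, one option is to write to a temporary file and then move it over the original. The following is a minimal sketch under that assumption, not a definitive implementation. InPlaceUpdateDemo and updateElementText are illustrative names, and elements without any text (such as an empty tag) are left untouched:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class InPlaceUpdateDemo {

    // Copies xmlFile event by event into a temporary file, replacing the text
    // of every element whose local name equals key, then swaps the files.
    static void updateElementText(Path xmlFile, String key, String value)
            throws IOException, XMLStreamException {
        // Same directory as the original, so the final move stays on one file system.
        Path temp = Files.createTempFile(xmlFile.toAbsolutePath().getParent(), "update", ".xml");
        XMLEventFactory events = XMLEventFactory.newInstance();
        try (InputStream in = Files.newInputStream(xmlFile);
             OutputStream out = Files.newOutputStream(temp)) {
            XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out, "UTF-8");
            boolean inTarget = false;  // true directly after a matching start tag
            boolean replaced = false;  // true once the new text has been written
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    inTarget = event.asStartElement().getName().getLocalPart().equals(key);
                    replaced = false;
                } else if (event.isEndElement()) {
                    inTarget = false;
                } else if (event.isCharacters() && inTarget) {
                    if (replaced) {
                        continue; // drop further chunks of the old text
                    }
                    event = events.createCharacters(value);
                    replaced = true;
                }
                writer.add(event);
            }
            writer.close();
            reader.close();
        }
        Files.move(temp, xmlFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("demo", ".xml");
        Files.write(file, "<root><a>1</a><b>2</b></root>".getBytes("UTF-8"));
        updateElementText(file, "a", "X");
        System.out.println(new String(Files.readAllBytes(file), "UTF-8"));
    }
}
```

Memory use stays roughly constant regardless of file size, since only one event is held at a time. The cost is temporary disk space for a second copy of the file.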

How to find unclosed tags in XML with Java?

I have some XML documents with errors in - sometimes end tags are missing - and I want to find the places where this happens and fix them (manually).
I've used XOM to parse the documents and it handily says "missing end tag" at the right times, and tells me the name of the element, but doesn't guide me very well to where the problem is in the file.
I could write my own parser that helps to do this, but I wonder if there's already a solution? I don't want automatic tidying, as I want to make sure end tags are inserted in the right place. I just want to know the line number of the start tag.

I think it is simple and can be done without any third-party library. Java has the standard interface
javax.xml.stream.XMLEventReader, which throws an XMLStreamException when it finds a missing end tag. Then call e.getLocation().getLineNumber() to get the line number.
A slightly more complicated sample:
InputStream is = new FileInputStream("test.xml");
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inputFactory.createXMLEventReader(is, "utf-8");
Stack<StartElement> stack = new Stack<StartElement>();
while (eventReader.hasNext()) {
try {
XMLEvent event = eventReader.nextEvent();
if (event.isStartElement()) {
StartElement startElement = event.asStartElement();
System.out.println("processing element: " + startElement.getName().getLocalPart());
stack.push(startElement);
}
if(event.isEndElement()){
stack.pop();
}
}catch(XMLStreamException e){
System.out.println("error in line: " +e.getLocation().getLineNumber());
StartElement se = stack.pop();
System.out.println("non-closed tag:" + se.getName().getLocalPart() + " " + se.getLocation().getLineNumber());
throw e;
}
}

XMLEventReader can help you fix your problem.
Look at the following article:
http://tutorials.jenkov.com/java-xml/stax-xmleventreader.html

Java XMLStreamReader is giving unwanted string

I am reading the following XML:
<application>
<client>website</client>
<register>
<name>
<first>Tommy</first>
<second>Jay</second>
</name>
<address>
<firstLine>line1</firstLine>
<secondLine>line2</secondLine>
<city>city1</city>
<county>county1</county>
<postcode>YY12 9UY</postcode>
</address>
</register>
</application>
Anyway when I read it with the xmlStreamReader as below
public XMLElementALT getNextElement()
{
element = new XMLElementALT();
int event;
try
{
event = reader.next();
}
catch (XMLStreamException ex)
{
return null;
}
if (event == XMLStreamConstants.START_ELEMENT)
{
element.setTag(reader.getLocalName());
}
else if (event == XMLStreamConstants.CHARACTERS)
{
element.setAttribute(reader.getText());
}
else if (event == XMLStreamConstants.END_ELEMENT)
{
element.setEndTag(reader.getLocalName());
}
else if (event == XMLStreamConstants.END_DOCUMENT)
{
element.setFinished();
}
return element;
}
This all goes well! However, after reading a start tag, the next event I get is XMLStreamConstants.CHARACTERS, reporting the attribute "\n " which is the whitespace between that tag and the next one. How can I remove this? I want the next event to be XMLStreamConstants.START_ELEMENT. I know I could put my XML all on one line, but I like to keep the indentation so that I can see the structure. I also have an XSD that the XML validates against successfully; is there something I can do in the XSD to make it remove the whitespace?
Thanks

You can ignore CHARACTERS events that contain only whitespace, either within your getNextElement method or by using a filter when you create the reader:
XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader rawReader = factory.createXMLStreamReader(...);
XMLStreamReader filteredReader = factory.createFilteredReader(rawReader,
new StreamFilter() {
public boolean accept(XMLStreamReader r) {
return !r.isWhiteSpace();
}
});
The isWhiteSpace method returns true if the current event is a CHARACTERS event consisting entirely of whitespace. It returns false if it's not a CHARACTERS event, or if it is CHARACTERS but not all white space.
However, it is important to note that an XMLStreamReader is not guaranteed to return all the text content of an element in one single CHARACTERS event, it is allowed to give you several separate blocks of characters which you must concatenate together yourself.
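To illustrate the concatenation, here is a small sketch that appends every CHARACTERS and CDATA event inside a given element to a StringBuilder. It assumes the target element contains only text and is not nested inside an element of the same name. TextAccumulatorDemo and textOf are hypothetical names:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextAccumulatorDemo {

    // Returns the full text of the first element with the given local name,
    // or null if no such element exists.
    static String textOf(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        StringBuilder text = null; // non-null while inside the target element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(elementName)) {
                text = new StringBuilder();
            } else if (text != null
                    && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                // The parser may deliver the text in several pieces; append each one.
                text.append(reader.getText());
            } else if (text != null
                    && event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals(elementName)) {
                reader.close();
                return text.toString();
            }
        }
        reader.close();
        return null;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<name><first>Tommy</first><second>Jay &amp; Co</second></name>";
        System.out.println(textOf(xml, "second"));
    }
}
```

For text-only elements, XMLStreamReader.getElementText() does this accumulation for you, and setting XMLInputFactory.IS_COALESCING to true asks the parser to merge adjacent character data into a single event.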

stax - get xml node as string

xml looks like so:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
I'm using stax to process one "<statement>" at a time and I got that working. I need to get that entire statement node as a string so I can create "123.xml" and "456.xml" or maybe even load it into a database table indexed by account.
using this approach: http://www.devx.com/Java/Article/30298/1954
I'm looking to do something like this:
String statementXml = staxXmlReader.getNodeByName("statement");
//load statementXml into database

I had a similar task, and although the original question is more than a year old, I couldn't find a satisfying answer. The most interesting answer so far was Blaise Doughan's, but I couldn't get it running on the XML I am expecting (maybe some parameters for the underlying parser could change that?). Here is the XML, very simplified:
<many-many-tags>
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
</many-many-tags>
My solution:
public static String readElementBody(XMLEventReader eventReader)
throws XMLStreamException {
StringWriter buf = new StringWriter(1024);
int depth = 0;
while (eventReader.hasNext()) {
// peek event
XMLEvent xmlEvent = eventReader.peek();
if (xmlEvent.isStartElement()) {
++depth;
}
else if (xmlEvent.isEndElement()) {
--depth;
// reached END_ELEMENT tag?
// break loop, leave event in stream
if (depth < 0)
break;
}
// consume event
xmlEvent = eventReader.nextEvent();
// print out event
xmlEvent.writeAsEncodedUnicode(buf);
}
return buf.getBuffer().toString();
}
Usage example:
XMLEventReader eventReader = ...;
while (eventReader.hasNext()) {
XMLEvent xmlEvent = eventReader.nextEvent();
if (xmlEvent.isStartElement()) {
StartElement elem = xmlEvent.asStartElement();
String name = elem.getName().getLocalPart();
if ("description".equals(name)) {
String xmlFragment = readElementBody(eventReader);
// do something with it...
System.out.println("'" + xmlFragment + "'");
}
}
else if (xmlEvent.isEndElement()) {
// ...
}
}
Note that the extracted XML fragment will contain the complete extracted body content, including white space and comments. Filtering those on demand, or making the buffer size parametrizable have been left out for code brevity:
'
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
'

You can use StAX for this. You just need to advance the XMLStreamReader to the start element for statement. Check the account attribute to get the file name. Then use the javax.xml.transform APIs to transform the StAXSource to a StreamResult wrapping a File. This will advance the XMLStreamReader and then just repeat this process.
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
File file = new File("out" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}

StAX is a low-level access API, and it has neither lookups nor methods that access content recursively. But what are you actually trying to do? And why are you considering StAX?
Beyond using a tree model (DOM, XOM, JDOM, Dom4j), which would work well with XPath, the best choice when dealing with data is usually a data binding library like JAXB. With it you can pass a StAX or SAX reader and ask it to bind XML data into Java beans, so that instead of messing with XML you process Java objects. This is often more convenient, and it is usually quite performant.
The only trick with larger files is that you do not want to bind the whole thing at once, but rather bind each sub-tree (in your case, one 'statement' at a time).
This is most easily done by iterating with a StAX XMLStreamReader, then using JAXB to bind.

I've been googling and this seems painfully difficult.
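Given my XML, I think it might just be simpler to:
StringBuilder buffer = new StringBuilder();
// fileName is a placeholder for the path of the input file
try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
String line;
while ((line = in.readLine()) != null) {
buffer.append(line).append('\n');
if (line.trim().equals(STMT_END_TAG)) {
parse(buffer.toString());
buffer.setLength(0);
}
}
}
private void parse(String statement) {
// saxParser.parse(new InputSource(new StringReader(statement)), handler);
// do stuff
// save string
}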

Why not just use xpath for this?
You could have a fairly simple xpath to get all 'statement' nodes.
Like so:
//statement
EDIT #1: If possible, take a look at dom4j. You could read the String and get all 'statement' nodes fairly simply.
EDIT #2: Using dom4j, this is how you would do it:
(from their cookbook)
String text = "your xml here";
Document document = DocumentHelper.parseText(text);
public void bar(Document document) {
List list = document.selectNodes( "//statement" );
// loop through node data
}

I had a similar problem and found a solution.
I used the solution proposed by #t0r0X, but it does not work well with the current implementation in Java 11: the method xmlEvent.writeAsEncodedUnicode creates an invalid string representation of the start element (in the StartElementEvent class) in the resulting XML fragment. I had to modify it, and it now seems to work well, which I could immediately verify by parsing the fragment with DOM and JaxBMarshaller into specific data containers.
In my case I had a huge structure
<Orders>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
...
</Orders>
in a file of several hundred megabytes (many repeating "SyncOrder" structures), so using DOM would lead to large memory consumption and slow evaluation. Therefore I used StAX to split the huge XML into smaller XML pieces, which I analyzed with DOM, using the JAXB elements generated from the XSD definition of the element SyncOrder (I had this infrastructure from the web service, which uses the same structure, but that is not important).
The code below shows where the XML fragment is created and could be used; I used it directly in further processing...
private static <T> List<T> unmarshallMultipleSyncOrderXmlData(
InputStream aOrdersXmlContainingSyncOrderItems,
Function<SyncOrderType, T> aConversionFunction) throws XMLStreamException, ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory locDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
locDocumentBuilderFactory.setNamespaceAware(true);
DocumentBuilder locDocBuilder = locDocumentBuilderFactory.newDocumentBuilder();
List<T> locResult = new ArrayList<>();
XMLInputFactory locFactory = XMLInputFactory.newFactory();
XMLEventReader locReader = locFactory.createXMLEventReader(aOrdersXmlContainingSyncOrderItems);
boolean locIsInSyncOrder = false;
QName locSyncOrderElementQName = null;
StringWriter locXmlTextBuffer = new StringWriter();
int locDepth = 0;
while (locReader.hasNext()) {
XMLEvent locEvent = locReader.nextEvent();
if (locEvent.isStartElement()) {
if (locDepth == 0 && Objects.equals(locEvent.asStartElement().getName().getLocalPart(), "Orders")) {
locDepth++;
} else {
if (locDepth <= 0)
throw new IllegalStateException("An invalid XML stream was passed into the function. "
+ "Expected the element 'Orders' as the root element of the document, but found '"
+ locEvent.asStartElement().getName().getLocalPart() + "'.");
locDepth++;
if (locSyncOrderElementQName == null) {
/* First element after the "Orders" has passed, so we retrieve
* the name of the element with the namespace prefix: */
locSyncOrderElementQName = locEvent.asStartElement().getName();
}
if(Objects.equals(locEvent.asStartElement().getName(), locSyncOrderElementQName)) {
locIsInSyncOrder = true;
}
}
} else if (locEvent.isEndElement()) {
locDepth--;
if(locDepth == 1 && Objects.equals(locEvent.asEndElement().getName(), locSyncOrderElementQName)) {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
/* at this moment the call of locXmlTextBuffer.toString() gets the complete fragment
* of XML containing the valid SyncOrder element, but I have continued to other processing,
* which immediately validates that the produced XML fragment is valid and passes the values
* to communication object: */
Document locDocument = locDocBuilder.parse(new ByteArrayInputStream(locXmlTextBuffer.toString().getBytes()));
SyncOrderType locItem = unmarshallSyncOrderDomNodeToCo(locDocument);
locResult.add(aConversionFunction.apply(locItem));
locXmlTextBuffer = new StringWriter();
locIsInSyncOrder = false;
}
}
if (locIsInSyncOrder) {
if (locEvent.isStartElement()) {
/* here replaced the standard implementation of startElement's method writeAsEncodedUnicode: */
locXmlTextBuffer.write(startElementToString(locEvent.asStartElement()));
} else {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
}
}
}
return locResult;
}
private static String startElementToString(StartElement aStartElement) {
StringBuilder locStartElementBuffer = new StringBuilder();
// open element
locStartElementBuffer.append("<");
String locNameAsString = null;
if ("".equals(aStartElement.getName().getNamespaceURI())) {
locNameAsString = aStartElement.getName().getLocalPart();
} else if (aStartElement.getName().getPrefix() != null
&& !"".equals(aStartElement.getName().getPrefix())) {
locNameAsString = aStartElement.getName().getPrefix()
+ ":" + aStartElement.getName().getLocalPart();
} else {
locNameAsString = aStartElement.getName().getLocalPart();
}
locStartElementBuffer.append(locNameAsString);
// add any attributes
Iterator<Attribute> locAttributeIterator = aStartElement.getAttributes();
Attribute attr;
while (locAttributeIterator.hasNext()) {
attr = locAttributeIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(attr));
}
// add any namespaces
Iterator<Namespace> locNamespaceIterator = aStartElement.getNamespaces();
Namespace locNamespace;
while (locNamespaceIterator.hasNext()) {
locNamespace = locNamespaceIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(locNamespace));
}
// close start tag
locStartElementBuffer.append(">");
// return StartElement as a String
return locStartElementBuffer.toString();
}
private static String attributeToString(Attribute aAttr) {
if( aAttr.getName().getPrefix() != null && aAttr.getName().getPrefix().length() > 0 )
return aAttr.getName().getPrefix() + ":" + aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
else
return aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
}
public static SyncOrderType unmarshallSyncOrderDomNodeToCo(
Node aSyncOrderItemNode) {
Source locSource = new DOMSource(aSyncOrderItemNode);
Object locUnmarshalledObject = getMarshallerAndUnmarshaller().unmarshal(locSource);
SyncOrderType locCo = ((JAXBElement<SyncOrderType>) locUnmarshalledObject).getValue();
return locCo;
}
