Parse file containing XML Fragments in Java

Parse file containing XML Fragments in Java - java

I inherited an "XML" license file containing no root element, but rather two XML fragments (<XmlCreated> and <Product>) so when I try to parse the file, I (expectantly) get an error about a document that is not-well-formed.
I need to get both the XmlCreated and Product tags.
Sample XML file:
<?xml version="1.0"?>
<XmlCreated>May 11 2009</XmlCreated>
<!-- License Key file Attributes -->
<Product image ="LicenseKeyFile">
<!-- MyCompany -->
<Manufacturer ID="7f">
<SerialNumber>21072832521007</SerialNumber>
<ChassisId>72060034465DE1C3</ChassisId>
<RtspMaxUsers>500</RtspMaxUsers>
<MaxChannels>8</MaxChannels>
</Manufacturer>
</Product>
Here is the current code that I use to attempt to load the XML. It does not work, but I've used it before as a starting point for well-formed XML.
public static void main(String[] args) {
try {
File file = new File("C:\\path\\LicenseFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
} catch (Exception e) {
e.printStackTrace();
}
}
At the db.parse(file) line, I get the following Exception:
[Fatal Error] LicenseFile.xml:6:2: The markup in the document following the root element must be well-formed.
org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.mycompany.licensesigning.LicenseSigner.main(LicenseSigner.java:20)
How would I go about parsing this frustrating file?

If you know this document is always going to be non-well formed... make it so. Add a new dummy <root> tag after the <?xml...>and </root> after the last of the data.

You're going to need to create two separate Document objects by breaking the file up into smaller pieces and parsing those pieces individually (or alternatively reconstructing them into a larger document by adding a tag which encloses both of them).
If you can rely on the structure of the file it should be easy to read the file into a string and then search for substrings like <Product and </Product> and then use those markers to create a string you can pass into a document builder.

How about implementing a simple wrapper around InputStream that wraps the input from the file with a root-level tag, and using that as the input to DocumentBuilder.parse()?
If the expected input is small enough to load into memory, read into a string, wrap it with a dummy start/end tag and then use:
DocumentBuilder.parse(new InputSource(new StringReader(string)))

I'd probably create a SequenceInputStream where you sandwich the real stream with two ByteArrayInputStreams that return some dummy root start tag, and end tag.
Then i'd use use the parse method that takes a stream rather than a file name.

I agree with Jim Garrison to some extent, use an InputStream or StreamReader and wrap the input in the required tags, its a simple and easy method. Main problem i can forsee is you'll have to have some checks for valid and invalid formatting (if you want to be able to use the method for both valid and invalid data), if the formatting is invalid (because of root level tags missing) wrap the input with the tags, if its valid then don't wrap the input. If the input is invalid for some other reason, you can also alter the input to correct the formatting issues.
Also, its probably better to store the ipnut in a collection of strings (of some sort) rather than a string itself, this will mean that you wont have as much of a limit to your input size. Make each string one line from the file. You should end up with a logical and easy to follow structure which mwill make it easier to allow for corrections of other formatting issues in the future.
Hardest part about that is figuring out what has caused the invalid formatting. In your case just check for root level tags, if the tags exist and are formatted correctly, dont wrap, If not, wrap.

Related

how to efficiently parse XML with multiple roots in java? [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.

A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.

The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.

IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

How to parse invalid (bad / not well-formed) XML?

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.

A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.

The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.

IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

XML Exception while Parsing from String to DOM error

In my code, I am converting DOM object to String object and replace few strings and again writing to DOM object.
While doing so, am traversing many files, code works first time while for 2nd file, while writing to DOM object , java exception thrown.
I know root cause, as two root element for xml has getting generated , however don't have idea to eliminate
Java exception occurred:
org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
# Code
doc = docBuilder.parse(fXmlFile);
TramsformerObj.transform(DOMSource(doc),StreamResult(writerObj));
% Replace String
try
StrObj = writerObj.getBuffer.toString.replaceAll(Sourcestr,Replacestr);
catch ME
disp(ME.message)
end
% Convert String to DOM object
try
isObj.setCharacterStream(java.io.StringReader(StrObj));
docm = docBuilder.parse(isObj);
catch ME
disp(ME.message)
end
PS:- Note all object/ variables are available in memory and accesible. This code is of MATLAB, am invoking all packages of JAVA in MATLAB

I think that the LSinput isObj (if that's what it is) should be recreated for each
new file before you set the character stream and pass the isObj to the parse method.
Also, recreate the LSParser docBuilder
isObj = new InputSource
docm = DOMImplementationLS.createLSParser(...);
isObj.setCharacterStream(java.io.StringReader(StrObj));
docm = docBuilder.parse(isObj);

Another answer to this is using XML within XML will break you need to not parse the inner XML and use the CDATA tag see here:Check This
<?xml version='1.0'?>
<sometag></sometag>
<![CDATA[
<?xml version='1.0'?>
<nonParsedTag></nonParsedTag>
]]>

Java parsing xml file with appended data

I've xml file, which looks like this:
<Header>
<Type>TestType</Type>
<Owner>Me</Owner>
</Header>
ĺß™¸Ű;?źÉćĂˇţ¬=ńgăűßEĹ¶áCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂŢŘö¤xi¦Ö†5ÚPMáx^š‡âő
Those funny letters are binary coded data.
I've a trouble with parsing it. All I want to do is read values of Type and Owner nodes and data after Header. That data can be big. It's basically xml with data appended after it. Header always starts with and ends with . Number of child nodes in it can change
I tried just simple parsing:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(f);
and what I got was:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.

In order to be processed by an XML parser a file must be well formed and optionally valid (The latter requires testing against a "schema" describing the expected tag format).
In this case your document is not well formed:
$ xmllint --noout File1.xml
File1.xml:5: parser error : Extra content at the end of the document
ĺß™¸Ű;?źÉćĂˇţ¬=ńgăűßEĹ¶áCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂ
^
I would suggest finding some way to strip away the offending characters and then process the properly formatted XML. For example assuming the XML is in the first 4 files of the file:
head -n 4 File1.xml | xmllint --noout -

You could try a SAX parser instead which does not read in the whole document. Just read in elements/attributes until you have what you want, then stop.
But this is not a well formed XML file. If possible, fix it by putting the (encoded) binary data into its own element.

Parsing pseudo XML file in Java

I’m trying to parse text from a file that comes in a pseudo XML format. I can get a DOM document out of it when it comes in the following structure:
<product>
<product_id>234567</product_id>
<description>abc</description>
</product>
The problem I’m running into happens when the structure is similar to the following:
<product>
<product_id>234567</product_id>
<description>abc</description>
<quantity 1:2>
<version>1.1</version>
</quantity 1:2>
<version>1.2</version>
<quantity 2:2>
</quantity 2:2>
</product>
It generates the following exception due to the space in <quantity 1:2>:
org.xml.sax.SAXParseException:[Fatal Error] :1:167: Element type " quantity " must be followed by either attribute specifications, ">" or "/>"
I can get around this by replacing the space with an underscore. The problem is the structure can be vary in size and include several child nodes with the same format (<node 1:x>) and the file can contain hundreds of structures to parse. Is there a class available that will parse text like this a return a tree-like object?

Your file is not an XML at all, and SAX is for XML (Simple API for XML). You should re-think your structure so you can do something like:
<quantity myAttr="1.2">
<version>1.2</version>
</quantity>
<quantity myAttr="1.x">
<version>1.1</version>
</quantity>
<version>1.0</version>
Or something like that.

Preprocess the file and change elements with that x:y form to <element value="x:y"/> then your DOM/SAX parsers will not choke.
I would suggest using a regular expression to help but that way leads to madness.

It generates the following exception due to the space in <quantity 1:2>
This is not the root cause of the error, the root cause is, as people have already mentioned, your file format is not valid XML. A valid XML tag would look like <quantity attr1="val1" attr2="val2>.
It sounds like you have no control over the file format. In this case I think the easiest way is to preprocess your file into valid XML then have DOM/SAX parser to parse it:
FileInputStream file = new FileInputStream("pseudo.pxml");
ByteArrayOutputStream temp = new ByteArrayOutputStream();
int c = -1;
while ((c=file.read()) >= 0){
temp.write(c);
}
String xml = new String(temp.toByteArray());
xml = xml.replaceAll("([^:\s]+:[^:\s]+)", "value=\"\\1\"");
ByteArrayInputStream xmlIn = new ByteArrayInputStream(xml.getBytes());
/* use xmlIn for your XML parsers */
Note that I did not test this code nor is it optimized; just wanted to give you an idea.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Parse file containing XML Fragments in Java - java

If you know this document is always going to be non-well formed... make it so. Add a new dummy <root> tag after the <?xml...>and </root> after the last of the data.

I'd probably create a SequenceInputStream where you sandwich the real stream with two ByteArrayInputStreams that return some dummy root start tag, and end tag. Then i'd use use the parse method that takes a stream rather than a file name.

Related

how to efficiently parse XML with multiple roots in java? [duplicate]

How to parse invalid (bad / not well-formed) XML?

XML Exception while Parsing from String to DOM error

Java parsing xml file with appended data

Parsing pseudo XML file in Java

Categories

Resources