XML Exception while Parsing from String to DOM error - java

In my code, I convert a DOM object to a String, replace a few substrings, and then parse the result back into a DOM object.
I do this for many files. The code works for the first file, but for the second file a Java exception is thrown when the string is parsed back into a DOM object.
I know the root cause: two root elements are being generated for the XML. However, I don't know how to eliminate the problem.
Java exception occurred:
org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
# Code
doc = docBuilder.parse(fXmlFile);
TramsformerObj.transform(DOMSource(doc),StreamResult(writerObj));
% Replace String
try
StrObj = writerObj.getBuffer.toString.replaceAll(Sourcestr,Replacestr);
catch ME
disp(ME.message)
end
% Convert String to DOM object
try
isObj.setCharacterStream(java.io.StringReader(StrObj));
docm = docBuilder.parse(isObj);
catch ME
disp(ME.message)
end
PS: All objects and variables are available in memory and accessible. This is MATLAB code; I am calling the Java packages from MATLAB.

I think the LSInput isObj (if that's what it is) should be recreated for each
new file before you set the character stream and pass isObj to the parse method.
Also, recreate the LSParser docBuilder:
isObj = new InputSource();
docBuilder = DOMImplementationLS.createLSParser(...);
isObj.setCharacterStream(java.io.StringReader(StrObj));
docm = docBuilder.parse(isObj);
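The same idea in plain Java (the question calls these classes from MATLAB). This is only a minimal sketch, assuming the failure comes from a StringWriter that is reused across files, so that the second document, with its own <?xml ...?> declaration, ends up appended after the first. RoundTrip and replaceInDocument are hypothetical names, not part of the original code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RoundTrip {
    /** Serializes doc, replaces a literal substring, and parses the result again. */
    static Document replaceInDocument(Document doc, String target, String replacement)
            throws Exception {
        // Fresh writer per document: a reused StringWriter still holds the
        // previous file's text, so a second "<?xml ...?>" would land mid-buffer.
        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));

        // Literal replacement (String.replace), not a regex like replaceAll.
        String text = writer.toString().replace(target, replacement);

        // Fresh builder and InputSource per document as well.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(text)));
    }
}
```

Calling replaceInDocument once per file keeps each serialized string limited to a single document. If regular-expression matching is really needed, replaceAll can be swapped back in.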

Another possible cause is XML embedded within XML, which will break the parser. The inner XML must not be parsed; wrap it in a CDATA section inside the outer element:
<?xml version='1.0'?>
<sometag>
<![CDATA[
<?xml version='1.0'?>
<nonParsedTag></nonParsedTag>
]]>
</sometag>

Related

Java parsing xml file with appended data

I have an XML file which looks like this:
<Header>
<Type>TestType</Type>
<Owner>Me</Owner>
</Header>
ĺß™¸Ű;?źÉćáţ¬=ńgăűßEŶáCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂŢŘö¤xi¦Ö†5ÚPMáx^š‡âő
Those funny characters are binary data.
I am having trouble parsing it. All I want to do is read the values of the Type and Owner nodes and the data after the Header. That data can be big. It's basically XML with data appended after it. The header always starts with <Header> and ends with </Header>. The number of child nodes in it can change.
I tried just simple parsing:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(f);
and what I got was:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
In order to be processed by an XML parser a file must be well formed and optionally valid (The latter requires testing against a "schema" describing the expected tag format).
In this case your document is not well formed:
$ xmllint --noout File1.xml
File1.xml:5: parser error : Extra content at the end of the document
ĺß™¸Ű;?źÉćáţ¬=ńgăűßEŶáCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂ
^
I would suggest finding some way to strip away the offending characters and then processing the properly formatted XML. For example, assuming the XML is in the first 4 lines of the file:
head -n 4 File1.xml | xmllint --noout -
You could try a SAX parser instead which does not read in the whole document. Just read in elements/attributes until you have what you want, then stop.
But this is not a well formed XML file. If possible, fix it by putting the (encoded) binary data into its own element.
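If the file format cannot be changed, the stripping idea can also be done in Java. The following is a rough sketch, assuming the header always ends with the literal bytes </Header> and that the file fits in memory (for very large payloads one would stream instead). HeaderSplitter and its method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class HeaderSplitter {
    private static final byte[] END_TAG = "</Header>".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index just past the first "</Header>", or -1 if it is absent. */
    static int headerEnd(byte[] data) {
        outer:
        for (int i = 0; i + END_TAG.length <= data.length; i++) {
            for (int j = 0; j < END_TAG.length; j++) {
                if (data[i + j] != END_TAG[j]) {
                    continue outer;
                }
            }
            return i + END_TAG.length;
        }
        return -1;
    }

    /** Parses only the XML header; the parser never sees the binary tail. */
    static Document parseHeader(byte[] data) throws Exception {
        int end = headerEnd(data);
        if (end < 0) {
            throw new IllegalArgumentException("no </Header> found");
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(data, 0, end));
    }

    /** Returns everything after the header, including any line break that follows it. */
    static byte[] payload(byte[] data) {
        int end = headerEnd(data);
        if (end < 0) {
            throw new IllegalArgumentException("no </Header> found");
        }
        return Arrays.copyOfRange(data, end, data.length);
    }
}
```

The bytes would typically come from Files.readAllBytes. Note that a newline between </Header> and the binary data ends up at the start of the payload and may need to be skipped.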

Getting Premature end of file Exception

I'm trying to parse an existing XHTML file in order to add additional body content to it. I am using the following code.
First I read the body with Jsoup, and then I try to put it into the XHTML file:
Document doc = Jsoup.parse(readFile, "UTF-8");
Elements content = doc.getElementsByTag("body");
try {
Document document=null;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Create the builder and parse the file
document = (Document)factory.newDocumentBuilder().parse(finalFile);
//document.getElementsByTagName("body")append(content.toString());
//document=parserXML(finalFile);
document.getElementsByTag("body").append(content.toString());
} catch (SAXException e) {
System.out.println("SAXException>>>>>>");
e.printStackTrace();
} catch (ParserConfigurationException e) {
System.out.println("in parser configuration Exception block>>>>>>");
e.printStackTrace();
}
But I am getting the following exception:
[Fatal Error] ResultParsedFile.html:1:1: Premature end of file.
org.xml.sax.SAXParseException: Premature end of file.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.converter.typeconverter.EmailTypeConverter.readHTML(EmailTypeConverter.java:101)
at com.converter.typeconverter.EmailTypeConverter.callTika(EmailTypeConverter.java:64)
at com.converter.master.ApplicationMain.main(ApplicationMain.java:64)
Please help me solve this issue.
Thanks in advance...
If you get this error at the first position of the file (which the 1:1 indicates) it means that the file is empty.
Maybe you start reading the file before the source has closed it?
In case you use an input stream (which is not the case here), this can happen when you re-use a stream that you have already used to reach the end of the file. You need to create a new stream from the input file in order to reset it from the start of the file.
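Both points (empty file, already-consumed stream) can be guarded against before the parser is ever called. A small sketch, with SafeXmlLoader as a hypothetical helper name:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class SafeXmlLoader {
    static Document load(Path file) throws Exception {
        // An empty file is what produces "Premature end of file" at 1:1,
        // so report it with a clearer message before parsing.
        if (Files.size(file) == 0) {
            throw new IOException("XML file is empty: " + file);
        }
        // Open a new stream for every parse; a stream that has already been
        // read to the end would look like an empty document to the parser.
        try (InputStream in = Files.newInputStream(file)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }
}
```

If another part of the program writes the file, it should close (or at least flush) its output before load is called.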
The message indicates that you have a badly formed XML file. Usually when I've gotten this message I had an opening tag with no matching end tag. I think you'll also get this on an empty file.
I recently experienced this error. It turned out that one of my .hbm.xml files was being generated empty; the error came from the application context XML, which referred to that hbm file.
1. The XML is not readable.
2. To rectify the XML, one option is to drag and drop it into a spreadsheet, where the error is highlighted more clearly. After making the suggested corrections the XML will finally load into the spreadsheet, and XML that loads successfully there should not hit any parsing issue.

Specifying DTD to be used by DocumentBuilders for XML parsing?

I am currently writing a tool, using Java 1.6, that brings together a number of XML files. All of the files validate to the DocBook 4.5 DTD (I have checked this using xmllint and specifying the DocBook 4.5 DTD as the --dtdvalid parameter), but not all of them include the DOCTYPE declaration.
I load each XML file into the DOM to perform the required manipulation like so:
private Document fileToDocument( File input ) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setIgnoringElementContentWhitespace(false);
factory.setIgnoringComments(false);
factory.setValidating(false);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse( input );
}
For the most part this has worked quite well: I can use the returned object to navigate the tree, perform the required manipulations, and then write the document back out. Where I am encountering problems is with files which:
Do not include the DOCTYPE declaration, and
Do include entities defined in the DTD (for example &mdash;).
Where this is the case an exception is thrown from the builder.parse(...) call with the message:
[Fatal Error] :5:15: The entity "mdash" was referenced, but not declared.
Fair enough, it isn't declared. What I would ideally do in this instance is set the DocumentBuilderFactory to always use the DocBook 4.5 DTD regardless of whether one is specified in the file.
I did try validation using the DocBook 4.5 schema but found that this produced a number of unrelated errors with the XML. It seems like the schema might not be functionally equivalent to the DTD, at least for this version of the DocBook specification.
The other option I can think of is to read the file in, try and detect whether a doctype was set or not, and then set one if none was found prior to actually parsing the XML into the DOM.
So my question is: is there a smarter way that I have not seen to tell the parser to use a specific DTD, or to ensure that parsing proceeds even though the entities do not resolve (not just the &mdash; example but any entity in the XML; there are a large number of potential ones)?
Could using an EntityResolver2 and implementing EntityResolver2.getExternalSubset() help?
... This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered, but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end the prolog of a document that could not otherwise be valid. ...
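A rough sketch of what that could look like. It assumes the parser behind DocumentBuilder honors EntityResolver2 (the Xerces-based parser bundled with the JDK is expected to, but this is worth verifying on your setup). DefaultDtdResolver is a hypothetical class, and the one-line SUBSET string is only a stand-in for the real DocBook 4.5 DTD, which would be returned from a local copy instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.ext.EntityResolver2;

class DefaultDtdResolver implements EntityResolver2 {
    // Stand-in external subset declaring a single entity (assumption for the demo).
    private static final String SUBSET = "<!ENTITY mdash \"&#8212;\">";

    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        // Invoked when the document has no DOCTYPE (or no external subset).
        return new InputSource(new StringReader(SUBSET));
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
                                     String baseURI, String systemId) {
        return null; // fall back to the parser's default resolution
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        return null; // fall back to the parser's default resolution
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new DefaultDtdResolver());
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Files that already carry a DOCTYPE with an external subset are left alone, since getExternalSubset is not consulted for them.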

org.xml.sax.SAXParseException: Content is not allowed in prolog

I have a Java based web service client connected to Java web service (implemented on the Axis1 framework).
I am getting following exception in my log file:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
at org.apache.ws.axis.security.WSDoAllReceiver.invoke(WSDoAllReceiver.java:114)
at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
at org.apache.axis.client.AxisClient.invoke(AxisClient.java:198)
at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
at org.apache.axis.client.Call.invoke(Call.java:2767)
at org.apache.axis.client.Call.invoke(Call.java:2443)
at org.apache.axis.client.Call.invoke(Call.java:2366)
at org.apache.axis.client.Call.invoke(Call.java:1812)
This is often caused by a white space before the XML declaration, but it could be any text, like a dash or any character. I say often caused by white space because people assume white space is always ignorable, but that's not the case here.
Another common cause is a UTF-8 BOM (byte order mark). A BOM is allowed before the XML declaration, but it can be mistaken for content if the document is handed to the XML parser as a stream of characters rather than as a stream of bytes.
The same can happen if schema files (.xsd) are used to validate the xml file and one of the schema files has an UTF-8 BOM.
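If the XML has to be passed to the parser through a Reader, the BOM can be dropped from the byte stream first. A minimal JDK-only sketch (BomSkipper is a made-up name; it handles the UTF-8 BOM only):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class BomSkipper {
    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() may return fewer.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pin;
    }
}
```

Usage would be along the lines of new InputStreamReader(BomSkipper.skipUtf8Bom(stream), StandardCharsets.UTF_8). Handing the raw byte stream to the parser and letting it detect the encoding is usually simpler where that is possible.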
In addition to Yuriy Zubarev's post:
This also happens when you pass a nonexistent XML file to the parser. For example, you pass
new File("C:/temp/abc")
when only the file C:/temp/abc.xml exists on your file system.
In either case,
builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
document = builder.parse(new File("C:/temp/abc"));
or
DOMParser parser = new DOMParser();
parser.parse("file:C:/temp/abc");
All give the same error message.
Very disappointing bug, because the following trace
javax.servlet.ServletException
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
...
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
... 40 more
doesn't say anything about the file name being incorrect or the file not existing. In my case I had an absolutely correct XML file and had to spend 2 days determining the real problem.
Try adding a space between the encoding="UTF-8" string in the prolog and the terminating ?>. In XML the prolog designates this bracket-question mark delimited element at the start of the document (while the tag prolog in stackoverflow refers to the programming language).
Added: Is that dash in front of your prolog part of the document? That would be the error there, having data in front of the prolog, -<?xml version="1.0" encoding="UTF-8"?>.
I had the same problem (and solved it) while trying to parse an XML document with freemarker.
I had no spaces before the header of XML file.
The problem occurs when and only when the file encoding and the XML encoding attribute are different. (ex: UTF-8 file with UTF-16 attribute in header).
So I had two ways of solving the problem:
changing the encoding of the file itself
changing the header UTF-16 to UTF-8
It means the XML is malformed or the response body is not an XML document at all.
Just spent 4 hours tracking down a similar problem in a WSDL. Turns out the WSDL used an XSD which imports another namespace XSD. This imported XSD contained the following:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.xyz.com/Services/CommonTypes" elementFormDefault="qualified"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:CommonTypes="http://www.xyz.com/Services/CommonTypes">
<include schemaLocation=""></include>
<complexType name="RequestType">
<....
Note the empty include element! This was the root of my woes. I guess this is a variation on Egor's file not found problem above.
+1 to disappointing error reporting.
My answer probably won't help you, but it helps with this problem in general.
When you see this kind of exception, try opening your XML file in a hex editor; sometimes you will see additional bytes at the beginning of the file which a text editor doesn't show.
Delete them and your xml will be parsed.
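When no hex editor is at hand, the first few bytes can also be printed from Java. A small sketch (HeadDump and hexHead are hypothetical names):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class HeadDump {
    /** Returns the first `count` bytes of the file as space-separated hex pairs. */
    static String hexHead(Path file, int count) throws IOException {
        byte[] buf = new byte[count];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < count) {
                int r = in.read(buf, n, count - n);
                if (r < 0) {
                    break; // file is shorter than count
                }
                n += r;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", buf[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

A clean UTF-8 file that starts with an XML declaration begins with 3C 3F 78 6D 6C (the bytes of <?xml). A leading EF BB BF is a UTF-8 BOM, and anything else in front of 3C deserves a closer look.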
In my case, removing the 'encoding="UTF-8"' attribute altogether worked.
It looks like a character set encoding issue, maybe because your file isn't really in UTF-8.
For the same issue, I removed the following lines:
File file = new File("c:\\file.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
It works fine now. I am not sure why UTF-8 causes the problem. To my surprise, it also works fine for UTF-8 files.
I am using 32-bit Windows 7 and the NetBeans IDE with Java jdk1.6.0_13. No idea how it works.
Sometimes it's the code, not the XML
The following code,
Document doc = dBuilder.parse(new InputSource(new StringReader("file.xml")));
will also result in this error,
[Fatal Error] :1:1: Content is not allowed in prolog.org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
because it's attempting to parse the string literal, "file.xml" (not the contents of the file.xml file) and failing because "file.xml" as a string is not well-formed XML.
Fix: Remove StringReader():
Document doc = dBuilder.parse(new InputSource("file.xml"));
Similarly, dirty buffer problems can leave residual junk ahead of the actual XML. If you've carefully checked your XML and are still getting this error, log the exact contents being passed to the parser; sometimes what's actually being (tried to be) parsed is surprising.
First clean project, then rebuild project. I was also facing the same issue. Everything came alright after this.
If all else fails, open the file in binary mode to make sure there are no funny characters at the beginning of the file (the 3 non-printable characters that identify the file as UTF-8). We did this and found some, so we converted the file from UTF-8 to ASCII and it worked.
As Mike Sokolov has already pointed out, one possible reason is the presence of some characters (such as whitespace) before the <?xml tag.
If your input XML is read as a String (as opposed to a byte array), you can
replace your input string using the code below to make sure that all unnecessary
characters before the <?xml tag are removed.
inputXML=inputXML.substring(inputXML.indexOf("<?xml"));
You need to be sure that the input xml starts with the xml tag though.
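A slightly more defensive variant of the same idea, since substring(indexOf(...)) throws an exception when the marker is missing. This is only a sketch; PrologCleaner is a made-up name, and falling back to the first < is an assumption about what is wanted when there is no XML declaration.

```java
class PrologCleaner {
    /** Drops everything before the XML declaration, or before the first '<' if there is none. */
    static String dropLeadingJunk(String xml) {
        int start = xml.indexOf("<?xml");
        if (start < 0) {
            // No declaration: assume the document starts at the first tag.
            start = xml.indexOf('<');
        }
        if (start < 0) {
            // No markup at all, e.g. a JSON or plain-text response.
            throw new IllegalArgumentException("input contains no XML markup");
        }
        return xml.substring(start);
    }
}
```

The IllegalArgumentException case is useful for catching responses that are not XML in the first place, which would otherwise surface as the same prolog error.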
To fix the BOM issue on Unix / Linux systems:
1. Check if there's an unwanted BOM character:
hexdump -C myfile.xml | more
An unwanted BOM character will appear at the start of the file as ...<?xml>
2. Alternatively, do file myfile.xml. A file with a BOM character will appear as: myfile.xml: XML 1.0 document text, UTF-8 Unicode (with BOM) text
3. Fix a single file with: tail -c +4 myfile.xml > temp.xml && mv temp.xml myfile.xml
4. Repeat 1 or 2 to check the file has been sanitised. Probably also sensible to do view myfile.xml to check contents have stayed.
Here's a bash script to sanitise a whole folder of XML files:
#!/usr/bin/env bash
# This script is to sanitise XML files to remove any BOM characters
# $'...' makes bash expand the \x escapes into the three raw BOM bytes
has_bom() { head -c3 "$1" | LC_ALL=C grep -qe $'\xef\xbb\xbf'; }
for filename in *.xml ; do
    if has_bom "${filename}"; then
        tail -c +4 "${filename}" > temp.xml
        mv temp.xml "${filename}"
    fi
done
What I have tried (did not work)
In my case the web.xml in my application had an extra space. Even after I deleted it, it did not work.
I was playing with logging.properties and web.xml in my Tomcat, but even after I reverted the changes the error persisted.
Solution
To be specific, I had tried adding
org.apache.catalina.filters.ExpiresFilter.level = FINE
(see "Tomcat expire filter is not working correctly").
I followed the instructions found here and i got the same error.
I tried several things to solve it (i.e. changing the encoding, typing the XML file rather than copy-pasting it, etc.) in Notepad and XML Notepad, but nothing worked.
The problem got solved when I edited and saved my XML file in Notepad++ (encoding --> utf-8 without BOM)
In my case I got this error because the API I used could return the data either in XML or in JSON format. When I tested it using a browser, it defaulted to the XML format, but when I invoked the same call from a Java application, the API returned the JSON formatted response, that naturally triggered a parsing error.
Just an additional thought on this one for the future. Getting this bug could be the case that one simply hits the delete key or some other key randomly when they have an XML window as the active display and are not paying attention. This has happened to me before with the struts.xml file in my web application. Clumsy elbows ...
I was also getting the same kind of error,
XML reader error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,2] Message: Reference is not allowed in prolog.
when my application was creating an XML response for a RESTful web service call.
While creating the XML string I replaced &lt; and &gt; with < and >, and then the error went away and I got a proper response. I am not sure why it worked, but it worked.
sample:
String body = "<ns:addNumbersResponse xmlns:ns=\"http://java.duke.org\"><ns:return>"
+sum
+"</ns:return></ns:addNumbersResponse>";
I had the same issue.
First I downloaded the XML file to my local desktop, and I got Content is not allowed in prolog while importing the file into the portal server. The file looked fine to me visually, but somehow it was corrupted.
So I downloaded the same file again, tried the import again, and it worked.
We had the same problem recently and it turned out to be the case of a bad URL and consequently a standard 403 HTTP response (which obviously isn't the valid XML the client was looking for). I'm going to share the detail in case someone within the same context run into this problem:
This was a Spring based web application in which a "JaxWsPortProxyFactoryBean" bean was configured to expose a proxy for a remote port.
<bean id="ourPortJaxProxyService"
class="org.springframework.remoting.jaxws.JaxWsPortProxyFactoryBean"
p:serviceInterface="com.amir.OurServiceSoapPortWs"
p:wsdlDocumentUrl="${END_POINT_BASE_URL}/OurService?wsdl"
p:namespaceUri="http://amir.com/jaxws" p:serviceName="OurService"
p:portName="OurSoapPort" />
The "END_POINT_BASE_URL" is an environment variable configured in "setenv.sh" of the Tomcat instance that hosts the web application. The content of the file is something like this:
export END_POINT_BASE_URL="http://localhost:9001/BusinessAppServices"
#export END_POINT_BASE_URL="http://localhost:8765/BusinessAppServices"
The missing ";" after each line caused the malformed URL and thus the bad response. That is, instead of "BusinessAppServices/OurService?wsdl" the URL had a CR before "/". "TCP/IP Monitor" was quite handy while troubleshooting the problem.
For all those that get this error:
WARNING: Catalina.start using conf/server.xml: Content is not allowed in prolog.
Not very informative.. but what this actually means is that there is garbage in your conf/server.xml file.
I have seen this exact error in other XML files.. this error can be caused by making changes with a text editor which introduces the garbage.
The way you can verify whether or not you have garbage in the file is to open it with a "HEX Editor" If you see any character before this string
"<?xml version="1.0" encoding="UTF-8"?>"
like this would be garbage
"‰ŠŒ<?xml version="1.0" encoding="UTF-8"?>"
that is your problem....
The Solution is to use a good HEX Editor.. One that will allow you to save files with differing types of encoding..
Then just save it as UTF-8.
Some systems that use XML files may need it saved as UTF NO BOM
Which means with "NO Byte Order Mark"
Hope this helps someone out there!!
For me, a Build->Clean fixed everything!
I had the same problem with some XML files. I solved it by reading the file with ANSI encoding (Windows-1252) and writing a new file with UTF-8 encoding, using a small Python script. I tried using Notepad++ but had no success:
import os
import sys
path = os.path.dirname(__file__)
file_name = 'my_input_file.xml'
if __name__ == "__main__":
    with open(os.path.join(path, './' + file_name), 'r', encoding='cp1252') as f1:
        lines = f1.read()
    f2 = open(os.path.join(path, './' + 'my_output_file.xml'), 'w', encoding='utf-8')
    f2.write(lines)
    f2.close()
I faced a similar problem too. The reason was a garbage character at the beginning of the file.
Fix: Open the file in a text editor (tested with Sublime Text), remove any indentation at the start of the file, copy and paste the whole content into a new file, and save it. That's it! The new file parsed without any errors.
I took Dineshkumar's code and modified it to validate my XML file correctly:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.apache.log4j.Logger;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
// ParserErrorHandler and BusinessException are the poster's own classes (not shown here).
public class Myclass {
    private static final Logger LOGGER = Logger.getLogger(Myclass.class);
    /**
     * Validate XML file against Schemas XSD in pathEsquema directory
     * @param pathEsquema directory that contains XSD Schemas to validate
     * @param pathFileXML XML file to validate
     * @throws BusinessException if it throws any Exception
     */
    public static void validarXML(String pathEsquema, String pathFileXML)
            throws BusinessException {
        String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
        String nameFileXSD = "file.xsd";
        String MY_SCHEMA1 = pathEsquema + nameFileXSD;
        ParserErrorHandler parserErrorHandler;
        try {
            SchemaFactory schemaFactory = SchemaFactory.newInstance(W3C_XML_SCHEMA);
            Source[] source = {
                new StreamSource(new File(MY_SCHEMA1))
            };
            Schema schemaGrammar = schemaFactory.newSchema(source);
            Validator schemaValidator = schemaGrammar.newValidator();
            schemaValidator.setErrorHandler(
                parserErrorHandler = new ParserErrorHandler());
            /** validate xml instance against the grammar. */
            File file = new File(pathFileXML);
            InputStream isS = new FileInputStream(file);
            Reader reader = new InputStreamReader(isS, "UTF-8");
            schemaValidator.validate(new StreamSource(reader));
            if (parserErrorHandler.getErrorHandler().isEmpty() &&
                    parserErrorHandler.getFatalErrorHandler().isEmpty()) {
                if (!parserErrorHandler.getWarningHandler().isEmpty()) {
                    LOGGER.info(
                        String.format("WARNING validate XML:[%s] Descripcion:[%s]",
                            pathFileXML, parserErrorHandler.getWarningHandler()));
                } else {
                    LOGGER.info(
                        String.format("OK validate XML:[%s]",
                            pathFileXML));
                }
            } else {
                throw new BusinessException(
                    String.format("Error validate XML:[%s], FatalError:[%s], Error:[%s]",
                        pathFileXML,
                        parserErrorHandler.getFatalErrorHandler(),
                        parserErrorHandler.getErrorHandler()));
            }
        }
        catch (SAXParseException e) {
            throw new BusinessException(String.format("Error validate XML:[%s], SAXParseException:[%s]",
                pathFileXML, e.getMessage()), e);
        }
        catch (SAXException e) {
            throw new BusinessException(String.format("Error validate XML:[%s], SAXException:[%s]",
                pathFileXML, e.getMessage()), e);
        }
        catch (IOException e) {
            throw new BusinessException(String.format("Error validate XML:[%s], IOException:[%s]",
                pathFileXML, e.getMessage()), e);
        }
    }
}
Set your document to form like this:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
%children%
</root>
I had the same issue with Spring's MarshallingMessageConverter, caused by my own pre-processing code.
In case someone needs the reason: I was reading the bytes with BytesMessage#readBytes and forgot that reading is a one-way operation.
You cannot read the same message twice.
Try with BOMInputStream in apache.commons.io:
public static <T> T getContent(Class<T> instance, SchemaType schemaType, InputStream stream) throws JAXBException, SAXException, IOException {
JAXBContext context = JAXBContext.newInstance(instance);
Unmarshaller unmarshaller = context.createUnmarshaller();
Reader reader = new InputStreamReader(new BOMInputStream(stream), "UTF-8");
JAXBElement<T> entry = unmarshaller.unmarshal(new StreamSource(reader), instance);
return entry.getValue();
}
I was having the same problem while parsing the info.plist file on my Mac. The problem was fixed with the following command, which converts the file to XML:
plutil -convert xml1 info.plist
Hope that helps someone.

Parse file containing XML Fragments in Java

I inherited an "XML" license file containing no root element, but rather two XML fragments (<XmlCreated> and <Product>), so when I try to parse the file, I (as expected) get an error saying the document is not well-formed.
I need to get both the XmlCreated and Product tags.
Sample XML file:
<?xml version="1.0"?>
<XmlCreated>May 11 2009</XmlCreated>
<!-- License Key file Attributes -->
<Product image ="LicenseKeyFile">
<!-- MyCompany -->
<Manufacturer ID="7f">
<SerialNumber>21072832521007</SerialNumber>
<ChassisId>72060034465DE1C3</ChassisId>
<RtspMaxUsers>500</RtspMaxUsers>
<MaxChannels>8</MaxChannels>
</Manufacturer>
</Product>
Here is the current code that I use to attempt to load the XML. It does not work, but I've used it before as a starting point for well-formed XML.
public static void main(String[] args) {
try {
File file = new File("C:\\path\\LicenseFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
} catch (Exception e) {
e.printStackTrace();
}
}
At the db.parse(file) line, I get the following Exception:
[Fatal Error] LicenseFile.xml:6:2: The markup in the document following the root element must be well-formed.
org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.mycompany.licensesigning.LicenseSigner.main(LicenseSigner.java:20)
How would I go about parsing this frustrating file?
If you know this document is always going to be non-well-formed, make it well-formed. Add a new dummy <root> tag after the <?xml...> and a </root> after the last of the data.
You're going to need to create two separate Document objects by breaking the file up into smaller pieces and parsing those pieces individually (or alternatively reconstructing them into a larger document by adding a tag which encloses both of them).
If you can rely on the structure of the file it should be easy to read the file into a string and then search for substrings like <Product and </Product> and then use those markers to create a string you can pass into a document builder.
How about implementing a simple wrapper around InputStream that wraps the input from the file with a root-level tag, and using that as the input to DocumentBuilder.parse()?
If the expected input is small enough to load into memory, read into a string, wrap it with a dummy start/end tag and then use:
DocumentBuilder.parse(new InputSource(new StringReader(string)))
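A minimal sketch of that string-wrapping approach, assuming the file fits in memory. One detail matters: the XML declaration is only legal at the very start of a document, so it has to be removed before the dummy root is put in front of the text. FragmentParser and the element name root are arbitrary choices for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FragmentParser {
    /** Wraps a sequence of top-level fragments in a dummy root element and parses it. */
    static Document parseFragments(String text) throws Exception {
        // Strip a leading <?xml ...?> declaration, if present; inside <root>
        // it would be rejected as an illegal processing instruction.
        String body = text.replaceFirst("^\\s*<\\?xml[^>]*\\?>", "");
        String wrapped = "<root>" + body + "</root>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
    }
}
```

After parsing, the XmlCreated and Product elements are ordinary children of the dummy root and can be fetched with getElementsByTagName.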
I'd probably create a SequenceInputStream where you sandwich the real stream with two ByteArrayInputStreams that return some dummy root start tag, and end tag.
Then I'd use the parse method that takes a stream rather than a file name.
I agree with Jim Garrison to some extent: use an InputStream or StreamReader and wrap the input in the required tags; it's a simple and easy method. The main problem I can foresee is that you'll need some checks for valid and invalid formatting (if you want to be able to use the method for both valid and invalid data): if the formatting is invalid (because the root-level tags are missing), wrap the input with the tags; if it's valid, don't wrap the input. If the input is invalid for some other reason, you can also alter the input to correct the formatting issues.
Also, it's probably better to store the input in a collection of strings (of some sort) rather than in a single string; this means you won't have as much of a limit on your input size. Make each string one line from the file. You should end up with a logical and easy-to-follow structure, which will make it easier to allow for corrections of other formatting issues in the future.
The hardest part is figuring out what has caused the invalid formatting. In your case just check for root-level tags: if the tags exist and are formatted correctly, don't wrap; if not, wrap.
