org.xml.sax.SAXParseException: Content is not allowed in prolog - java

I have a Java based web service client connected to Java web service (implemented on the Axis1 framework).
I am getting following exception in my log file:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
at org.apache.ws.axis.security.WSDoAllReceiver.invoke(WSDoAllReceiver.java:114)
at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
at org.apache.axis.client.AxisClient.invoke(AxisClient.java:198)
at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
at org.apache.axis.client.Call.invoke(Call.java:2767)
at org.apache.axis.client.Call.invoke(Call.java:2443)
at org.apache.axis.client.Call.invoke(Call.java:2366)
at org.apache.axis.client.Call.invoke(Call.java:1812)

This is often caused by a white space before the XML declaration, but it could be any text, like a dash or any character. I say often caused by white space because people assume white space is always ignorable, but that's not the case here.
Another thing that often happens is a UTF-8 BOM (byte order mark), which is allowed before the XML declaration can be treated as whitespace if the document is handed as a stream of characters to an XML parser rather than as a stream of bytes.
The same can happen if schema files (.xsd) are used to validate the xml file and one of the schema files has an UTF-8 BOM.

Actually in addition to Yuriy Zubarev's Post
When you pass a nonexistent xml file to parser. For example you pass
new File("C:/temp/abc")
when only C:/temp/abc.xml file exists on your file system
In either case
builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
document = builder.parse(new File("C:/temp/abc"));
or
DOMParser parser = new DOMParser();
parser.parse("file:C:/temp/abc");
All give the same error message.
Very disappointing bug, because the following trace
javax.servlet.ServletException
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
...
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
... 40 more
doesn't say anything about the fact of 'file name is incorrect' or 'such a file does not exist'. In my case I had absolutely correct xml file and had to spent 2 days to determine the real problem.

Try adding a space between the encoding="UTF-8" string in the prolog and the terminating ?>. In XML the prolog designates this bracket-question mark delimited element at the start of the document (while the tag prolog in stackoverflow refers to the programming language).
Added: Is that dash in front of your prolog part of the document? That would be the error there, having data in front of the prolog, -<?xml version="1.0" encoding="UTF-8"?>.

I had the same problem (and solved it) while trying to parse an XML document with freemarker.
I had no spaces before the header of XML file.
The problem occurs when and only when the file encoding and the XML encoding attribute are different. (ex: UTF-8 file with UTF-16 attribute in header).
So I had two ways of solving the problem:
changing the encoding of the file itself
changing the header UTF-16 to UTF-8

It means XML is malformed or the response body is not XML document at all.

Just spent 4 hours tracking down a similar problem in a WSDL. Turns out the WSDL used an XSD which imports another namespace XSD. This imported XSD contained the following:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.xyz.com/Services/CommonTypes" elementFormDefault="qualified"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:CommonTypes="http://www.xyz.com/Services/CommonTypes">
<include schemaLocation=""></include>
<complexType name="RequestType">
<....
Note the empty include element! This was the root of my woes. I guess this is a variation on Egor's file not found problem above.
+1 to disappointing error reporting.

My answer wouldn't help you probably, but it help with this problem generally.
When you see this kind of exception you should try to open your xml file in any Hex Editor and sometime you can see additional bytes at the beginning of the file which text-editor doesn't show.
Delete them and your xml will be parsed.

In my case, removing the 'encoding="UTF-8"' attribute altogether worked.
It looks like a character set encoding issue, maybe because your file isn't really in UTF-8.

For the same issues, I have removed the following line,
File file = new File("c:\\file.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
It is working fine. Not so sure why that UTF-8 gives problem. To keep me in shock, it works fine for UTF-8 also.
Am using Windows-7 32 bit and Netbeans IDE with Java *jdk1.6.0_13*. No idea how it works.

Sometimes it's the code, not the XML
The following code,
Document doc = dBuilder.parse(new InputSource(new StringReader("file.xml")));
will also result in this error,
[Fatal Error] :1:1: Content is not allowed in prolog.org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
because it's attempting to parse the string literal, "file.xml" (not the contents of the file.xml file) and failing because "file.xml" as a string is not well-formed XML.
Fix: Remove StringReader():
Document doc = dBuilder.parse(new InputSource("file.xml"));
Similarly, dirty buffer problems can leave residual junk ahead of the actual XML. If you've carefully checked your XML and are still getting this error, log the exact contents being passed to the parser; sometimes what's actually being (tried to be) parsed is surprising.

First clean project, then rebuild project. I was also facing the same issue. Everything came alright after this.

If all else fails, open the file in binary to make sure there are no funny characters [3 non printable characters at the beginning of the file that identify the file as utf-8] at the beginning of the file. We did this and found some. so we converted the file from utf-8 to ascii and it worked.

As Mike Sokolov has already pointed it out, one of the possible reasons is presence of some character/s (such as a whitespace) before the tag.
If your input XML is being read as a String (as opposed to byte array) then you
can use replace your input string with the below code to make sure that all 'un-necessary'
characters before the xml tag are wiped off.
inputXML=inputXML.substring(inputXML.indexOf("<?xml"));
You need to be sure that the input xml starts with the xml tag though.

To fix the BOM issue on Unix / Linux systems:
Check if there's an unwanted BOM character:
hexdump -C myfile.xml | more
An unwanted BOM character will appear at the start of the file as ...<?xml>
Alternatively, do file myfile.xml. A file with a BOM character will appear as: myfile.xml: XML 1.0 document text, UTF-8 Unicode (with BOM) text
Fix a single file with: tail -c +4 myfile.xml > temp.xml && mv temp.xml myfile.xml
Repeat 1 or 2 to check the file has been sanitised. Probably also sensible to do view myfile.xml to check contents have stayed.
Here's a bash script to sanitise a whole folder of XML files:
#!/usr/bin/env bash
# This script is to sanitise XML files to remove any BOM characters
has_bom() { head -c3 "$1" | LC_ALL=C grep -qe '\xef\xbb\xbf'; }
for filename in *.xml ; do
if has_bom ${filename}; then
tail -c +4 ${filename} > temp.xml
mv temp.xml ${filename}
fi
done

What i have tried [Did not work]
In my case the web.xml in my application had extra space. Even after i deleted ; it did not work!.
I was playing with logging.properties and web.xml in my tomcat, but even after i reverted the error persists!.
Solution
To be specific i tried do adding
org.apache.catalina.filters.ExpiresFilter.level = FINE
Tomcat expire filter is not working correctly

I followed the instructions found here and i got the same error.
I tried several things to solve it (ie changing the encoding, typing the XML file rather than copy-pasting it ect) in Notepad and XML Notepad but nothing worked.
The problem got solved when I edited and saved my XML file in Notepad++ (encoding --> utf-8 without BOM)

In my case I got this error because the API I used could return the data either in XML or in JSON format. When I tested it using a browser, it defaulted to the XML format, but when I invoked the same call from a Java application, the API returned the JSON formatted response, that naturally triggered a parsing error.

Just an additional thought on this one for the future. Getting this bug could be the case that one simply hits the delete key or some other key randomly when they have an XML window as the active display and are not paying attention. This has happened to me before with the struts.xml file in my web application. Clumsy elbows ...

I was also getting the same
XML reader error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,2] Message: Reference is not allowed in prolog.
, when my application was creating a XML response for a RestFull Webservice call.
While creating the XML format String I replaced the &lt and &gt with < and > then the error went off, and I was getting proper response. Not sure how it worked but it worked.
sample:
String body = "<ns:addNumbersResponse xmlns:ns=\"http://java.duke.org\"><ns:return>"
+sum
+"</ns:return></ns:addNumbersResponse>";

I had the same issue.
First I downloaded the XML file to local desktop and I got Content is not allowed in prolog during the importing file to portal server. Even visually file was looking good to me but somehow it's was corrupted.
So I re-download the same file and tried the same and it worked.

We had the same problem recently and it turned out to be the case of a bad URL and consequently a standard 403 HTTP response (which obviously isn't the valid XML the client was looking for). I'm going to share the detail in case someone within the same context run into this problem:
This was a Spring based web application in which a "JaxWsPortProxyFactoryBean" bean was configured to expose a proxy for a remote port.
<bean id="ourPortJaxProxyService"
class="org.springframework.remoting.jaxws.JaxWsPortProxyFactoryBean"
p:serviceInterface="com.amir.OurServiceSoapPortWs"
p:wsdlDocumentUrl="${END_POINT_BASE_URL}/OurService?wsdl"
p:namespaceUri="http://amir.com/jaxws" p:serviceName="OurService"
p:portName="OurSoapPort" />
The "END_POINT_BASE_URL" is an environment variable configured in "setenv.sh" of the Tomcat instance that hosts the web application. The content of the file is something like this:
export END_POINT_BASE_URL="http://localhost:9001/BusinessAppServices"
#export END_POINT_BASE_URL="http://localhost:8765/BusinessAppServices"
The missing ";" after each line caused the malformed URL and thus the bad response. That is, instead of "BusinessAppServices/OurService?wsdl" the URL had a CR before "/". "TCP/IP Monitor" was quite handy while troubleshooting the problem.

For all those that get this error:
WARNING: Catalina.start using conf/server.xml: Content is not allowed in prolog.
Not very informative.. but what this actually means is that there is garbage in your conf/server.xml file.
I have seen this exact error in other XML files.. this error can be caused by making changes with a text editor which introduces the garbage.
The way you can verify whether or not you have garbage in the file is to open it with a "HEX Editor" If you see any character before this string
"<?xml version="1.0" encoding="UTF-8"?>"
like this would be garbage
"‰ŠŒ<?xml version="1.0" encoding="UTF-8"?>"
that is your problem....
The Solution is to use a good HEX Editor.. One that will allow you to save files with differing types of encoding..
Then just save it as UTF-8.
Some systems that use XML files may need it saved as UTF NO BOM
Which means with "NO Byte Order Mark"
Hope this helps someone out there!!

For me, a Build->Clean fixed everything!

I had the same problem with some XML files, I solved reading the file with ANSI encoding (Windows-1252) and writing a file with UTF-8 encoding with a small script in Python. I tried use Notepad++ but I didn't have success:
import os
import sys
path = os.path.dirname(__file__)
file_name = 'my_input_file.xml'
if __name__ == "__main__":
with open(os.path.join(path, './' + file_name), 'r', encoding='cp1252') as f1:
lines = f1.read()
f2 = open(os.path.join(path, './' + 'my_output_file.xml'), 'w', encoding='utf-8')
f2.write(lines)
f2.close()

Even I had faced a similar problem. Reason was some garbage character at the beginning of the file.
Fix : Just open the file in a text editor(tested on Sublime text) remove any indent if any in the file and copy paste all the content of the file in a new file and save it. Thats it!. When I ran the new file it ran without any parsing errors.

I took code of Dineshkumar and modified to Validate my XML file correctly:
import org.apache.log4j.Logger;
public class Myclass{
private static final Logger LOGGER = Logger.getLogger(Myclass.class);
/**
* Validate XML file against Schemas XSD in pathEsquema directory
* #param pathEsquema directory that contains XSD Schemas to validate
* #param pathFileXML XML file to validate
* #throws BusinessException if it throws any Exception
*/
public static void validarXML(String pathEsquema, String pathFileXML)
throws BusinessException{
String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
String nameFileXSD = "file.xsd";
String MY_SCHEMA1 = pathEsquema+nameFileXSD);
ParserErrorHandler parserErrorHandler;
try{
SchemaFactory schemaFactory = SchemaFactory.newInstance(W3C_XML_SCHEMA);
Source [] source = {
new StreamSource(new File(MY_SCHEMA1))
};
Schema schemaGrammar = schemaFactory.newSchema(source);
Validator schemaValidator = schemaGrammar.newValidator();
schemaValidator.setErrorHandler(
parserErrorHandler= new ParserErrorHandler());
/** validate xml instance against the grammar. */
File file = new File(pathFileXML);
InputStream isS= new FileInputStream(file);
Reader reader = new InputStreamReader(isS,"UTF-8");
schemaValidator.validate(new StreamSource(reader));
if(parserErrorHandler.getErrorHandler().isEmpty()&&
parserErrorHandler.getFatalErrorHandler().isEmpty()){
if(!parserErrorHandler.getWarningHandler().isEmpty()){
LOGGER.info(
String.format("WARNING validate XML:[%s] Descripcion:[%s]",
pathFileXML,parserErrorHandler.getWarningHandler()));
}else{
LOGGER.info(
String.format("OK validate XML:[%s]",
pathFileXML));
}
}else{
throw new BusinessException(
String.format("Error validate XML:[%s], FatalError:[%s], Error:[%s]",
pathFileXML,
parserErrorHandler.getFatalErrorHandler(),
parserErrorHandler.getErrorHandler()));
}
}
catch(SAXParseException e){
throw new BusinessException(String.format("Error validate XML:[%s], SAXParseException:[%s]",
pathFileXML,e.getMessage()),e);
}
catch (SAXException e){
throw new BusinessException(String.format("Error validate XML:[%s], SAXException:[%s]",
pathFileXML,e.getMessage()),e);
}
catch (IOException e) {
throw new BusinessException(String.format("Error validate XML:[%s],
IOException:[%s]",pathFileXML,e.getMessage()),e);
}
}
}

Set your document to form like this:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
%children%
</root>

I had the same issue with spring
MarshallingMessageConverter
and by pre-proccess code.
Mayby someone will need reason:
BytesMessage #readBytes - reading bytes.. and i forgot that reading is one direction operation.
You can not read twice.

Try with BOMInputStream in apache.commons.io:
public static <T> T getContent(Class<T> instance, SchemaType schemaType, InputStream stream) throws JAXBException, SAXException, IOException {
JAXBContext context = JAXBContext.newInstance(instance);
Unmarshaller unmarshaller = context.createUnmarshaller();
Reader reader = new InputStreamReader(new BOMInputStream(stream), "UTF-8");
JAXBElement<T> entry = unmarshaller.unmarshal(new StreamSource(reader), instance);
return entry.getValue();
}

I was having the same problem while parsing the info.plist file in my mac. However, the problem was fixed using the following command which turned the file into an XML.
plutil -convert xml1 info.plist
Hope that helps someone.

Related

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?
How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.
I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».
Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.
I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.
You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

How to parse invalid (bad / not well-formed) XML?

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌​}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

SAX xml parser's missing line number

I am receiving a XML parsing exception related to UTF-8, and this is the message:
Invalid byte 2 of 4-byte UTF-8 sequence.
[Feb 23 13:19:01.937 PST 2015][main][SEVERE][com.accelovation.nlp.util.xml.XMLUtil$XMLDocument:<init>] SAX Exceptoin :org.xml.sax.SAXParseException;
I am trying to debug, but it requires to modify compiler options to generate line number attributes. I can't set a break point and Eclipse reminds me:
Unable to install breakpoint in org.apache.exerces.jaxp.DocumentBuiderImpl due to missing line number attributes. Modify compiler options to generate line number attributes.
How should I modify compiler options to generate numbers? In my Eclipse compiler options, I already checked "Add line numbers to generated class files".
Add more details of how the XML file is parsed, where the parameter is a File object passed to this function:
Document document = null;
DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
if (resolver != null) {
docBuilder.setEntityResolver(resolver);
}
document = docBuilder.parse(file);
It's difficult to generate accurate line numbers for encoding errors, because if the file is incorrectly encoded, then detecting line boundaries is unreliable. I don't think using Eclipse to run Xerces in debugging mode is going to help you much.
I've heard it said that emacs is good on diagnostics for encoding errors. Try opening your file in emacs and see what it says. Alternatively, the most common cause of the this error is that the file is actually encoded in iso-8859-1 rather than utf-8; so try changing the XML declaration and seeing if that works.

java reads a weird character at the beginning of the file which doesn't exist

I have a simple xml file on my hard drive.
When I open it with notepad++ this is what I see:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>
But when I read it using a FileInputStream I get:
?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...
I'm using JAXB to parse xml's and it throws an exception of "content not allowed in prolog" because of that "?" sign.
What is this extra "?" sign? why is it there and how do I get rid of it?
That extra character is a byte order mark, a special Unicode character code which lets the XML parser know what the byte order (little endian or big endian) of the bytes in the file is.
Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that a bug in the XML parser).
As a workaround, make sure that the program that produces this XML leaves off the BOM.
Check the encoding of the file, I've seen a similar thing, openeing the file in most editors and it looked fine, turned out it was encoded with UTF-8 without BOM (or with, I can't recall off the top of my head). Notepad++ should be ok to switch between the two.
You can use Notepad++ to see show all symbols from the View > Show Symbols > Show All Characters menu. It would show you the extra bytes present in the beginning. There is a possibility that it is the byte order mark. If the extra bytes are indeed byte order mark, this approach would not help. In that case, you will need to download a hex editor or if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file in terms of hex codes, look for the first two characters. Do they have one of the codes mentioned at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
If they indeed are byte order mark or if you are unable to determine the cause of the error, just try this:
From the menu select, Encoding > Encoding in UTF-8 without BOM, and then save the file.
(On Linux, one can use command line tools to check what's the in the beginning. e.g. xxd -g1 filename | head or od -t cx1 filename | head.)
You might be having a newline. Delete that.
Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.
this is not a jaxb problem, the problem resides in the way you use to read the xml ... try using an inputstream
...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...
Next to the FileInputStream a ByteArrayInputStream worked also with me:
JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);
=> No unmarshaling error anymore.

"Content is not allowed in prolog" when parsing perfectly valid XML on GAE

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
<ListDomainsResult>
<DomainName>Audio</DomainName>
<DomainName>Course</DomainName>
<DomainName>DocumentContents</DomainName>
<DomainName>LectureSet</DomainName>
<DomainName>MetaData</DomainName>
<DomainName>Professors</DomainName>
<DomainName>Tag</DomainName>
</ListDomainsResult>
<ResponseMetadata>
<RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
<BoxUsage>0.0000071759</BoxUsage>
</ResponseMetadata>
</ListDomainsResponse>
I pass in this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!
The encoding in your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML document type declaration. i.e you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
There are some special characters called byte order markers that could be in the buffer.
Before passing the buffer to the Parser do this...
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
I had issue while inspecting the xml file in notepad++ and saving the file, though I had the top utf-8 xml tag as <?xml version="1.0" encoding="utf-8"?>
Got fixed by saving the file in notpad++ with Encoding(Tab) > Encode in UTF-8:selected (was Encode in UTF-8-BOM)
This error message is always caused by the invalid XML content in the beginning element. For example, extra small dot “.” in the beginning of XML element.
Any characters before the “<?xml….” will cause above “org.xml.sax.SAXParseException: Content is not allowed in prolog” error message.
A small dot “.” before the “<?xml….
To fix it, just delete all those weird characters before the “<?xml“.
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/
I catched the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM
I was facing the same issue. In my case XML files were generated from c# program and feeded into AS400 for further processing. After some analysis identified that I was using UTF8 encoding while generating XML files whereas javac(in AS400) uses "UTF8 without BOM".
So, had to write extra code similar to mentioned below:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it
In my xml file, the header looked like this:
<?xml version="1.0" encoding="utf-16"? />
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.
Removing the xml declaration solved it
<?xml version='1.0' encoding='utf-8'?>
Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.
I was facing the same problem called "Content is not allowed in prolog" in my xml file.
Solution
Initially my root folder was '#Filename'.
When i removed the first character '#' ,the error got resolved.
No need of removing the #filename...
Try in this way..
Instead of passing a File or URL object to the unmarshaller method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));
In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4);
while (true) {
int earlyChar = test.read();
System.out.println(earlyChar);
if (earlyChar == 60) {
test.reset();
break;
} else {
test.mark(4);
}
}
FWIW, the bytes I was seeing are (in decimal): 239, 187, 191.
I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.
In my instance of the problem, the solution was to replace german umlauts (äöü) with their HTML-equivalents...
bellow are cause above “org.xml.sax.SAXParseException: Content is not allowed in prolog” exception.
First check the file path of schema.xsd and file.xml.
The encoding in your XML and XSD (or DTD) should be same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
if anything comes before the XML document type declaration.i.e: hello<?xml version='1.0' encoding='utf-16'?>
I zipped the xml in a Mac OS and sent it to a Windows machine, the default compression changes these files so the encoding sent this message.
Happened to me with #JsmListener with Spring Boot when listening to IBM MQ. My method received String parameter and got this exception when I tried to deserialize it using JAXB.
It seemed that that the string I got was a result of byte[].toString(). It was a list of comma separated numbers.
I solved it by changing the parameter type to byte[] and then created a String from it:
#JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
var msg = new String(msgBytes);

Categories