SAX xml parser's missing line number

SAX xml parser's missing line number - java

I am receiving a XML parsing exception related to UTF-8, and this is the message:
Invalid byte 2 of 4-byte UTF-8 sequence.
[Feb 23 13:19:01.937 PST 2015][main][SEVERE][com.accelovation.nlp.util.xml.XMLUtil$XMLDocument:<init>] SAX Exceptoin :org.xml.sax.SAXParseException;
I am trying to debug, but it requires to modify compiler options to generate line number attributes. I can't set a break point and Eclipse reminds me:
Unable to install breakpoint in org.apache.exerces.jaxp.DocumentBuiderImpl due to missing line number attributes. Modify compiler options to generate line number attributes.
How should I modify compiler options to generate numbers? In my Eclipse compiler options, I already checked "Add line numbers to generated class files".
Add more details of how the XML file is parsed, where the parameter is a File object passed to this function:
Document document = null;
DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
if (resolver != null) {
docBuilder.setEntityResolver(resolver);
}
document = docBuilder.parse(file);

It's difficult to generate accurate line numbers for encoding errors, because if the file is incorrectly encoded, then detecting line boundaries is unreliable. I don't think using Eclipse to run Xerces in debugging mode is going to help you much.
I've heard it said that emacs is good on diagnostics for encoding errors. Try opening your file in emacs and see what it says. Alternatively, the most common cause of the this error is that the file is actually encoded in iso-8859-1 rather than utf-8; so try changing the XML declaration and seeing if that works.

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. Color, Font, size of the text, Hyperlinks..etc. ).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
data.uniToShree(string2);
r.setText(string2,0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx");
document.write(out);
out.close();
System.out.println("Output.docx written successully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}

Thank you for ask-an-answer.
I have worked using POI some years ago, but over excel-workbooks, but still I’ll try to help you reach the root cause of your error.
The Java compiler is smart enough to suggest good debugging information in itself!
A good first step to disambiguate the error is to not overwrite the exception message provided to you via the compiler complain.
Try printing the results of e.getLocalizedMessage()or e.getMessage() and see what you get.
Getting the stack trace using printStackTrace method is also useful oftentimes to pinpoint where your error lies!
Share your findings from the above method calls to further help you help debug the issue.
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word which defined the doc files and their extension (.docx) follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML Namespace packages[1].
You can obtain the XML(HTML) format of the doc-file you want quite easily (see steps in [1] or code in link [2]) and even apply different schemas or possibly your own schema definitions based on the definitions provided by MS's namespaces, either programmatically, for which you need to get versed with XML, XSL and XSLT concepts (w3schools[3] is a good starting point) but this method is no less complex than writing your own version of MS-Word; or using MS-Word's inbuilt tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want to, but depending on your inclination and time availability, you may want to use your discretion before you decide to head onto one path than the other.
Hope it helps!

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?

How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.

I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.

I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».

Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.

You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

java reads a weird character at the beginning of the file which doesn't exist

I have a simple xml file on my hard drive.
When I open it with notepad++ this is what I see:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>
But when I read it using a FileInputStream I get:
?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...
I'm using JAXB to parse xml's and it throws an exception of "content not allowed in prolog" because of that "?" sign.
What is this extra "?" sign? why is it there and how do I get rid of it?

That extra character is a byte order mark, a special Unicode character code which lets the XML parser know what the byte order (little endian or big endian) of the bytes in the file is.
Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that a bug in the XML parser).
As a workaround, make sure that the program that produces this XML leaves off the BOM.

Check the encoding of the file, I've seen a similar thing, openeing the file in most editors and it looked fine, turned out it was encoded with UTF-8 without BOM (or with, I can't recall off the top of my head). Notepad++ should be ok to switch between the two.

You can use Notepad++ to see show all symbols from the View > Show Symbols > Show All Characters menu. It would show you the extra bytes present in the beginning. There is a possibility that it is the byte order mark. If the extra bytes are indeed byte order mark, this approach would not help. In that case, you will need to download a hex editor or if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file in terms of hex codes, look for the first two characters. Do they have one of the codes mentioned at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
If they indeed are byte order mark or if you are unable to determine the cause of the error, just try this:
From the menu select, Encoding > Encoding in UTF-8 without BOM, and then save the file.
(On Linux, one can use command line tools to check what's the in the beginning. e.g. xxd -g1 filename | head or od -t cx1 filename | head.)

You might be having a newline. Delete that.
Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.

this is not a jaxb problem, the problem resides in the way you use to read the xml ... try using an inputstream
...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...

Next to the FileInputStream a ByteArrayInputStream worked also with me:
JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);
=> No unmarshaling error anymore.

Digester: The element type "user" must be terminated by the matching end-tag "</user>"

I'm using Digester to parse a xml file and I get the following error:
May 3, 2011 6:41:25 PM org.apache.commons.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 2336608 column 3: The element type "user" must be terminated by the matching end-tag "</user>".
org.xml.sax.SAXParseException: The element type "user" must be terminated by the matching end-tag "</user>".
However 2336608 is the last line of my text file. I guess I'm opening a tag and I never close it. Do you know how can I find it and fix it, in big text files ?
thanks

Write another script which scans each file of the line and whenever it finds an open <user> tag, increments a counter and prints
line number 1234 <user> opened (1 open total)
and whenever it finds a close </user> tag, decrements the counter prints
line number 4546 </user> closed (0 open total)
Since you have one more opening tag than closing tag, the final output of this script will tell you that 1 tag was left open. However, assuming that your XML model does not allow for nested <user> tags, then you can assume the problemsome declaration is wherever you see the output of line number ... <user> opened (2 open total).

$ grep -Hin "</\?user>" Text.xml will print out every line with either or . If they're not nested, then you should be able to inspect that output fand find the missing close tag (when immediately follows . A script do do the same:
https://gist.github.com/953837
This assumes that the open and close tags are on different lines.

Use tidy -xml -e <your-xml-file>. http://tidy.sourceforge.net/
Tidy is a great little tool for validating HTML, and in XML mode (-xml above) it will validate XML as well.
It prints out line and column numbers for parse errors.
Most of the major package managers (apt, port, etc.) will have pre-built packages for it.

I think there is no need to start scripting for detecting xml errors.
You can use the w3 xml validator for instance
http://www.w3schools.com/xml/xml_validator.asp
I just pasted a 15 mb xml in there and I managed to fix it quite easily. You can also input the xml as a url if you have the possibility to upload it somewhere. Java reported the error in some place which seemed fine, but this tool localized the actual error, and after correcting that, java didn't error anymore.
There are many types of xml errors, and are not all related to the nested structure, so it is best to just use a well known tool for this. For instance, my error was an argument error(I was missing a ") but java detected a nesting problem.

org.xml.sax.SAXParseException: Content is not allowed in prolog

I have a Java based web service client connected to Java web service (implemented on the Axis1 framework).
I am getting following exception in my log file:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
at org.apache.ws.axis.security.WSDoAllReceiver.invoke(WSDoAllReceiver.java:114)
at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
at org.apache.axis.client.AxisClient.invoke(AxisClient.java:198)
at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
at org.apache.axis.client.Call.invoke(Call.java:2767)
at org.apache.axis.client.Call.invoke(Call.java:2443)
at org.apache.axis.client.Call.invoke(Call.java:2366)
at org.apache.axis.client.Call.invoke(Call.java:1812)

This is often caused by a white space before the XML declaration, but it could be any text, like a dash or any character. I say often caused by white space because people assume white space is always ignorable, but that's not the case here.
Another thing that often happens is a UTF-8 BOM (byte order mark), which is allowed before the XML declaration can be treated as whitespace if the document is handed as a stream of characters to an XML parser rather than as a stream of bytes.
The same can happen if schema files (.xsd) are used to validate the xml file and one of the schema files has an UTF-8 BOM.

Actually in addition to Yuriy Zubarev's Post
When you pass a nonexistent xml file to parser. For example you pass
new File("C:/temp/abc")
when only C:/temp/abc.xml file exists on your file system
In either case
builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
document = builder.parse(new File("C:/temp/abc"));
or
DOMParser parser = new DOMParser();
parser.parse("file:C:/temp/abc");
All give the same error message.
Very disappointing bug, because the following trace
javax.servlet.ServletException
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
...
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
... 40 more
doesn't say anything about the fact of 'file name is incorrect' or 'such a file does not exist'. In my case I had absolutely correct xml file and had to spent 2 days to determine the real problem.

Try adding a space between the encoding="UTF-8" string in the prolog and the terminating ?>. In XML the prolog designates this bracket-question mark delimited element at the start of the document (while the tag prolog in stackoverflow refers to the programming language).
Added: Is that dash in front of your prolog part of the document? That would be the error there, having data in front of the prolog, -<?xml version="1.0" encoding="UTF-8"?>.

I had the same problem (and solved it) while trying to parse an XML document with freemarker.
I had no spaces before the header of XML file.
The problem occurs when and only when the file encoding and the XML encoding attribute are different. (ex: UTF-8 file with UTF-16 attribute in header).
So I had two ways of solving the problem:
changing the encoding of the file itself
changing the header UTF-16 to UTF-8

It means XML is malformed or the response body is not XML document at all.

Just spent 4 hours tracking down a similar problem in a WSDL. Turns out the WSDL used an XSD which imports another namespace XSD. This imported XSD contained the following:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.xyz.com/Services/CommonTypes" elementFormDefault="qualified"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:CommonTypes="http://www.xyz.com/Services/CommonTypes">
<include schemaLocation=""></include>
<complexType name="RequestType">
<....
Note the empty include element! This was the root of my woes. I guess this is a variation on Egor's file not found problem above.
+1 to disappointing error reporting.

My answer wouldn't help you probably, but it help with this problem generally.
When you see this kind of exception you should try to open your xml file in any Hex Editor and sometime you can see additional bytes at the beginning of the file which text-editor doesn't show.
Delete them and your xml will be parsed.

In my case, removing the 'encoding="UTF-8"' attribute altogether worked.
It looks like a character set encoding issue, maybe because your file isn't really in UTF-8.

For the same issues, I have removed the following line,
File file = new File("c:\\file.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
It is working fine. Not so sure why that UTF-8 gives problem. To keep me in shock, it works fine for UTF-8 also.
Am using Windows-7 32 bit and Netbeans IDE with Java *jdk1.6.0_13*. No idea how it works.

Sometimes it's the code, not the XML
The following code,
Document doc = dBuilder.parse(new InputSource(new StringReader("file.xml")));
will also result in this error,
[Fatal Error] :1:1: Content is not allowed in prolog.org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
because it's attempting to parse the string literal, "file.xml" (not the contents of the file.xml file) and failing because "file.xml" as a string is not well-formed XML.
Fix: Remove StringReader():
Document doc = dBuilder.parse(new InputSource("file.xml"));
Similarly, dirty buffer problems can leave residual junk ahead of the actual XML. If you've carefully checked your XML and are still getting this error, log the exact contents being passed to the parser; sometimes what's actually being (tried to be) parsed is surprising.

First clean project, then rebuild project. I was also facing the same issue. Everything came alright after this.

If all else fails, open the file in binary to make sure there are no funny characters [3 non printable characters at the beginning of the file that identify the file as utf-8] at the beginning of the file. We did this and found some. so we converted the file from utf-8 to ascii and it worked.

As Mike Sokolov has already pointed it out, one of the possible reasons is presence of some character/s (such as a whitespace) before the tag.
If your input XML is being read as a String (as opposed to byte array) then you
can use replace your input string with the below code to make sure that all 'un-necessary'
characters before the xml tag are wiped off.
inputXML=inputXML.substring(inputXML.indexOf("<?xml"));
You need to be sure that the input xml starts with the xml tag though.

To fix the BOM issue on Unix / Linux systems:
Check if there's an unwanted BOM character:
hexdump -C myfile.xml | more
An unwanted BOM character will appear at the start of the file as ...<?xml>
Alternatively, do file myfile.xml. A file with a BOM character will appear as: myfile.xml: XML 1.0 document text, UTF-8 Unicode (with BOM) text
Fix a single file with: tail -c +4 myfile.xml > temp.xml && mv temp.xml myfile.xml
Repeat 1 or 2 to check the file has been sanitised. Probably also sensible to do view myfile.xml to check contents have stayed.
Here's a bash script to sanitise a whole folder of XML files:
#!/usr/bin/env bash
# This script is to sanitise XML files to remove any BOM characters
has_bom() { head -c3 "$1" | LC_ALL=C grep -qe '\xef\xbb\xbf'; }
for filename in *.xml ; do
if has_bom ${filename}; then
tail -c +4 ${filename} > temp.xml
mv temp.xml ${filename}
fi
done

What i have tried [Did not work]
In my case the web.xml in my application had extra space. Even after i deleted ; it did not work!.
I was playing with logging.properties and web.xml in my tomcat, but even after i reverted the error persists!.
Solution
To be specific i tried do adding
org.apache.catalina.filters.ExpiresFilter.level = FINE
Tomcat expire filter is not working correctly

I followed the instructions found here and i got the same error.
I tried several things to solve it (ie changing the encoding, typing the XML file rather than copy-pasting it ect) in Notepad and XML Notepad but nothing worked.
The problem got solved when I edited and saved my XML file in Notepad++ (encoding --> utf-8 without BOM)

In my case I got this error because the API I used could return the data either in XML or in JSON format. When I tested it using a browser, it defaulted to the XML format, but when I invoked the same call from a Java application, the API returned the JSON formatted response, that naturally triggered a parsing error.

Just an additional thought on this one for the future. Getting this bug could be the case that one simply hits the delete key or some other key randomly when they have an XML window as the active display and are not paying attention. This has happened to me before with the struts.xml file in my web application. Clumsy elbows ...

I was also getting the same
XML reader error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,2] Message: Reference is not allowed in prolog.
, when my application was creating a XML response for a RestFull Webservice call.
While creating the XML format String I replaced the &lt and &gt with < and > then the error went off, and I was getting proper response. Not sure how it worked but it worked.
sample:
String body = "<ns:addNumbersResponse xmlns:ns=\"http://java.duke.org\"><ns:return>"
+sum
+"</ns:return></ns:addNumbersResponse>";

I had the same issue.
First I downloaded the XML file to local desktop and I got Content is not allowed in prolog during the importing file to portal server. Even visually file was looking good to me but somehow it's was corrupted.
So I re-download the same file and tried the same and it worked.

We had the same problem recently and it turned out to be the case of a bad URL and consequently a standard 403 HTTP response (which obviously isn't the valid XML the client was looking for). I'm going to share the detail in case someone within the same context run into this problem:
This was a Spring based web application in which a "JaxWsPortProxyFactoryBean" bean was configured to expose a proxy for a remote port.
<bean id="ourPortJaxProxyService"
class="org.springframework.remoting.jaxws.JaxWsPortProxyFactoryBean"
p:serviceInterface="com.amir.OurServiceSoapPortWs"
p:wsdlDocumentUrl="${END_POINT_BASE_URL}/OurService?wsdl"
p:namespaceUri="http://amir.com/jaxws" p:serviceName="OurService"
p:portName="OurSoapPort" />
The "END_POINT_BASE_URL" is an environment variable configured in "setenv.sh" of the Tomcat instance that hosts the web application. The content of the file is something like this:
export END_POINT_BASE_URL="http://localhost:9001/BusinessAppServices"
#export END_POINT_BASE_URL="http://localhost:8765/BusinessAppServices"
The missing ";" after each line caused the malformed URL and thus the bad response. That is, instead of "BusinessAppServices/OurService?wsdl" the URL had a CR before "/". "TCP/IP Monitor" was quite handy while troubleshooting the problem.

For all those that get this error:
WARNING: Catalina.start using conf/server.xml: Content is not allowed in prolog.
Not very informative.. but what this actually means is that there is garbage in your conf/server.xml file.
I have seen this exact error in other XML files.. this error can be caused by making changes with a text editor which introduces the garbage.
The way you can verify whether or not you have garbage in the file is to open it with a "HEX Editor" If you see any character before this string
"<?xml version="1.0" encoding="UTF-8"?>"
like this would be garbage
"‰ŠŒ<?xml version="1.0" encoding="UTF-8"?>"
that is your problem....
The Solution is to use a good HEX Editor.. One that will allow you to save files with differing types of encoding..
Then just save it as UTF-8.
Some systems that use XML files may need it saved as UTF NO BOM
Which means with "NO Byte Order Mark"
Hope this helps someone out there!!

For me, a Build->Clean fixed everything!

I had the same problem with some XML files, I solved reading the file with ANSI encoding (Windows-1252) and writing a file with UTF-8 encoding with a small script in Python. I tried use Notepad++ but I didn't have success:
import os
import sys
path = os.path.dirname(__file__)
file_name = 'my_input_file.xml'
if __name__ == "__main__":
with open(os.path.join(path, './' + file_name), 'r', encoding='cp1252') as f1:
lines = f1.read()
f2 = open(os.path.join(path, './' + 'my_output_file.xml'), 'w', encoding='utf-8')
f2.write(lines)
f2.close()

Even I had faced a similar problem. Reason was some garbage character at the beginning of the file.
Fix : Just open the file in a text editor(tested on Sublime text) remove any indent if any in the file and copy paste all the content of the file in a new file and save it. Thats it!. When I ran the new file it ran without any parsing errors.

I took code of Dineshkumar and modified to Validate my XML file correctly:
import org.apache.log4j.Logger;
public class Myclass{
private static final Logger LOGGER = Logger.getLogger(Myclass.class);
/**
* Validate XML file against Schemas XSD in pathEsquema directory
* #param pathEsquema directory that contains XSD Schemas to validate
* #param pathFileXML XML file to validate
* #throws BusinessException if it throws any Exception
*/
public static void validarXML(String pathEsquema, String pathFileXML)
throws BusinessException{
String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
String nameFileXSD = "file.xsd";
String MY_SCHEMA1 = pathEsquema+nameFileXSD);
ParserErrorHandler parserErrorHandler;
try{
SchemaFactory schemaFactory = SchemaFactory.newInstance(W3C_XML_SCHEMA);
Source [] source = {
new StreamSource(new File(MY_SCHEMA1))
};
Schema schemaGrammar = schemaFactory.newSchema(source);
Validator schemaValidator = schemaGrammar.newValidator();
schemaValidator.setErrorHandler(
parserErrorHandler= new ParserErrorHandler());
/** validate xml instance against the grammar. */
File file = new File(pathFileXML);
InputStream isS= new FileInputStream(file);
Reader reader = new InputStreamReader(isS,"UTF-8");
schemaValidator.validate(new StreamSource(reader));
if(parserErrorHandler.getErrorHandler().isEmpty()&&
parserErrorHandler.getFatalErrorHandler().isEmpty()){
if(!parserErrorHandler.getWarningHandler().isEmpty()){
LOGGER.info(
String.format("WARNING validate XML:[%s] Descripcion:[%s]",
pathFileXML,parserErrorHandler.getWarningHandler()));
}else{
LOGGER.info(
String.format("OK validate XML:[%s]",
pathFileXML));
}
}else{
throw new BusinessException(
String.format("Error validate XML:[%s], FatalError:[%s], Error:[%s]",
pathFileXML,
parserErrorHandler.getFatalErrorHandler(),
parserErrorHandler.getErrorHandler()));
}
}
catch(SAXParseException e){
throw new BusinessException(String.format("Error validate XML:[%s], SAXParseException:[%s]",
pathFileXML,e.getMessage()),e);
}
catch (SAXException e){
throw new BusinessException(String.format("Error validate XML:[%s], SAXException:[%s]",
pathFileXML,e.getMessage()),e);
}
catch (IOException e) {
throw new BusinessException(String.format("Error validate XML:[%s],
IOException:[%s]",pathFileXML,e.getMessage()),e);
}
}
}

Set your document to form like this:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
%children%
</root>

I had the same issue with spring
MarshallingMessageConverter
and by pre-proccess code.
Mayby someone will need reason:
BytesMessage #readBytes - reading bytes.. and i forgot that reading is one direction operation.
You can not read twice.

Try with BOMInputStream in apache.commons.io:
public static <T> T getContent(Class<T> instance, SchemaType schemaType, InputStream stream) throws JAXBException, SAXException, IOException {
JAXBContext context = JAXBContext.newInstance(instance);
Unmarshaller unmarshaller = context.createUnmarshaller();
Reader reader = new InputStreamReader(new BOMInputStream(stream), "UTF-8");
JAXBElement<T> entry = unmarshaller.unmarshal(new StreamSource(reader), instance);
return entry.getValue();
}

I was having the same problem while parsing the info.plist file in my mac. However, the problem was fixed using the following command which turned the file into an XML.
plutil -convert xml1 info.plist
Hope that helps someone.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

SAX xml parser's missing line number - java

Related

Replacing text in XWPFParagraph without changing format of the docx file

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

java reads a weird character at the beginning of the file which doesn't exist

Digester: The element type "user" must be terminated by the matching end-tag "</user>"

org.xml.sax.SAXParseException: Content is not allowed in prolog

Categories

Resources