Reading special characters in XML with org.w3c.dom

I'm reading an XML file with characters like "ñ". When I use
...
Node c = nodeList.item( j);
c.getFirstChild().getNodeValue();
...
to read this:
<ID>1Ññ</ID>
I get:
1Ã‘Ã±
Any idea?
The xml file starts with the following line
<?xml version="1.0" encoding="ISO-8859-1"?>

You have a problem with your character encoding.
The character sequence Ã‘Ã± clearly shows that UTF-8 encoded characters are being decoded with a different character encoding (presumably ISO-8859-1).
Please check that the encodings are correct throughout your application.
Start with the parse() method of DocumentBuilder: use the overload that takes an InputSource, and create the InputSource with a Reader that uses the encoding the file is actually stored in (the declaration says ISO-8859-1, but if the bytes are really UTF-8, as the output suggests, use UTF-8).
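A minimal sketch of that last step, assuming you know the encoding the bytes are really stored in. The class and method names (ExplicitEncodingParse, parse, rootText) are made up for illustration and are not part of any library:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ExplicitEncodingParse {

    /** Parses XML bytes, decoding them with the given charset rather than the declared one. */
    static Document parse(InputStream in, Charset actualEncoding) {
        try (Reader reader = new InputStreamReader(in, actualEncoding)) {
            // With a character stream the parser does not decode bytes itself,
            // so the encoding named in the XML declaration is not used.
            InputSource source = new InputSource(reader);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Convenience helper: returns the text content of the document element. */
    static String rootText(byte[] xmlBytes, Charset actualEncoding) {
        Document doc = parse(new ByteArrayInputStream(xmlBytes), actualEncoding);
        return doc.getDocumentElement().getTextContent();
    }
}
```

If the value still looks wrong after this, the damage may be happening where the string is printed or stored rather than where it is parsed.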

Related

How to enable non-IANA encodings when using javax.xml.stream.XMLStreamReader

I'm using javax.xml.stream.XMLStreamReader to parse XML documents. Unfortunately, some of the documents I'm parsing use non-IANA encoding names, like "macroman" and "ms-ansi". For example:
<?xml version="1.0" encoding="macroman"?>
<foo />
This causes the parse to blow up with an exception:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,42]
Message: Invalid encoding name "macroman".
Is there any way to provide a custom encoding handler to my XMLStreamReader so that I can augment it with support for the encodings I need?

You could wrap the input stream with a transformer that replaces the non-standard charset with the equivalent charset that XMLStreamReader does understand.
See Filter (search and replace) array of bytes in an InputStream
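One possible variation on that idea, as a rough sketch rather than a definitive implementation: instead of patching bytes in the stream, decode the document yourself and hand the parser characters with the encoding pseudo-attribute removed. The class name, the alias table, and the choice of "x-MacRoman" and "windows-1252" as the Java equivalents are my assumptions. The sketch also assumes an ASCII-compatible encoding, no byte order mark, and a document small enough to read into memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class AliasedEncodingReader {

    // Hypothetical alias table: declared label -> Java charset name. Adjust to your data.
    private static final Map<String, String> ALIASES = Map.of(
            "macroman", "x-MacRoman",
            "ms-ansi", "windows-1252");

    // Matches the encoding pseudo-attribute inside an XML declaration.
    private static final Pattern ENCODING =
            Pattern.compile("\\s+encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Decodes the document ourselves, then gives the parser characters instead of bytes. */
    static XMLStreamReader open(InputStream in) {
        try {
            byte[] bytes = in.readAllBytes();
            // Assumption: ASCII-compatible encoding, so the declaration reads fine as Latin-1.
            String head = new String(bytes, 0, Math.min(bytes.length, 200),
                    StandardCharsets.ISO_8859_1);
            int declEnd = head.startsWith("<?xml") ? head.indexOf("?>") : -1;
            String declaration = declEnd >= 0 ? head.substring(0, declEnd) : "";

            Charset charset = StandardCharsets.UTF_8; // XML default when nothing is declared
            Matcher m = ENCODING.matcher(declaration);
            if (m.find()) {
                String declared = m.group(1).toLowerCase(Locale.ROOT);
                charset = Charset.forName(ALIASES.getOrDefault(declared, declared));
            }

            // Drop the encoding pseudo-attribute: the text is already decoded.
            String body = new String(bytes, charset).substring(declaration.length());
            String text = m.replaceFirst("") + body;
            return XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not open XML stream", e);
        }
    }

    /** Convenience helper: returns the text of the first element. */
    static String rootText(InputStream in) {
        try {
            XMLStreamReader reader = open(in);
            while (reader.next() != XMLStreamConstants.START_ELEMENT) {
                // skip comments, processing instructions, whitespace
            }
            return reader.getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }
}
```

For large documents a streaming filter, as in the linked question, avoids holding the whole file in memory.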

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to parse the XML below, which is fetched from a database, using a Java method, but I am getting an error.
Code used to parse the XML:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
    log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
    // loop through the <data> in the XML
    Element dataTags = (Element) nodes.item(i);
    String name = getChildTagValue(dataTags, "name");
    String value = getChildTagValue(dataTags, "value");
    log(Level.INFO, "UserData/Value=" + name + "/" + value);
    myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads that it's caused by some special characters in the XML.
How to fix this issue?

How to fix this issue?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

Open the XML in Notepad.
Make sure you don't have extra space at the beginning or end of the document.
Select File -> Save As.
Select Save as type -> All files.
Enter the file name as abcd.xml.
Select Encoding -> UTF-8, then click Save.

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
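If it's anything other than UTF-8, just replace the encoding with the correct one. Note that is.setEncoding(...) has no effect once a Reader is supplied; the charset passed to InputStreamReader is what matters.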

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
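If the XML is already a String, another option is to skip the byte conversion entirely, so no charset is involved at all. A small sketch (the class name StringXmlParse is made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StringXmlParse {

    /** Parses XML held in a String without converting it to bytes first. */
    static Document parse(String xml) {
        try {
            // A character stream needs no decoding, so the platform default
            // charset used by String.getBytes() never comes into play.
            InputSource source = new InputSource(new StringReader(xml));
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

This only helps if the String itself was decoded correctly when it was read from the database or the network.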

I had the same problem in my JSF application, which had a comment line containing some special characters in the XHTML page. When I compared it with the previous version in Eclipse, it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.

I had this problem, but the file was in UTF-8; it was just that somehow one character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8, it will report the position of the offending input so that you can find it.
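The same check can be sketched in Java with a strict CharsetDecoder, which is handy when iconv is not available. The class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    /** Returns the offset of the first byte that is not valid UTF-8, or -1 if all bytes decode. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // the decoder stops just before the bad sequence
            }
            if (result.isUnderflow()) {
                return -1; // all input was consumed without errors
            }
            out.clear(); // overflow: discard the decoded chars and keep going
        }
    }
}
```

Feeding it the result of Files.readAllBytes(...) gives a byte offset you can look up in a hex editor.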

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error occurs when you try to load a Jasper report file with the extension .jasper
For example:
c://reports//EmployeeReport.jasper
You should load the report file with the extension .jrxml instead
For example:
c://reports//EmployeeReport.jrxml
See problem screenshot: https://i.stack.imgur.com/D5SzR.png
See solution screenshot: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some XML in a file, and reading it into a DOM document failed due to a special character. I then used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I met the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».

If, like me, you understand character encoding principles, have read Joel's article (which, amusingly, contains wrong characters itself), and still can't figure out what is going on (spoiler alert: I'm a Mac user), then your solution can be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it ran OK, so UTF errors made no sense, given that our build system never complained about it... until I remembered that I had accidentally unplugged my computer a few days earlier with IntelliJ IDEA and the whole stack running (Java/Tomcat/Hibernate).
My Mac did a brilliant job of pretending nothing had happened and I carried on business as usual, but the underlying file system was left corrupted somehow. I wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was a missing "-Dfile.encoding=UTF8" argument in JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.

You may have a dependency that needs to be removed,
such as the following:
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
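Note that InputSource.getEncoding() only returns whatever was set earlier with setEncoding(); it does not inspect the stream. A sketch of a simpler route, assuming the document carries a byte order mark or a correct encoding declaration: pass the raw bytes to the parser and let it do the detection. The class name DeclaredEncodingParse is made up for illustration.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingParse {

    /** Parses raw bytes and leaves the choice of decoder to the parser. */
    static Document parse(InputStream rawBytes) {
        try {
            // No Reader and no setEncoding(): the parser looks at the byte order
            // mark and the XML declaration to decide how to decode the bytes.
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(rawBytes);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

This does not help when the declaration is wrong about the bytes; in that case the real encoding has to be supplied explicitly.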

UTF-8 in clobval query and sax parser

I am using the Oracle query below to retrieve data from an Oracle database. My column type is XMLTYPE:
"select a.xmlrecord.getClobVal() xmlrecord "
        + "from " + tablename + " a"
I am using getClobVal() because getStringVal() has a limitation: it cannot return more than 4000 characters in Oracle.
Currently I extract the data from the database and send it directly to a SAX parser. Below is the code I'm using:
while (orset.next()){
Reader reader = new BufferedReader(orset.getCharacterStream("xmlrecord")); // to retrieve getClob
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
sp.parse(is, handler);
}
The problem is that we are unable to retrieve UTF-8 characters even though I am setting the encoding to UTF-8 in my code.
Kindly assist.

Your reader is a character stream, not a byte stream. Encodings are ignored for character streams and only have an effect on byte streams, so if you want the encoding setting to apply, give the InputSource a byte stream instead of a character stream.
I am quoting two sources below:
Class InputSource
The SAX parser will use the InputSource object to determine how to
read XML input. If there is a character stream available, the parser
will read that stream directly, disregarding any text encoding
declaration found in that stream. If there is no character stream, but
there is a byte stream, the parser will use that byte stream, using
the encoding specified in the InputSource or else (if no encoding is
specified) autodetecting the character encoding using an algorithm
such as the one in the XML specification. If neither a character
stream nor a byte stream is available, the parser will attempt to open
a URI connection to the resource identified by the system identifier.
setEncoding
This method has no effect when the application provides a character
stream.
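To illustrate the quoted rules, here is a small sketch that parses the same kind of input both ways. The class and method names are made up for illustration; only InputSource, SAXParserFactory and DefaultHandler come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InputSourceEncodingDemo {

    /** Parses the source with SAX and returns all character data concatenated. */
    static String text(InputSource source) {
        StringBuilder sb = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    sb.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return sb.toString();
    }

    /** Character stream: the characters are used as-is, setEncoding changes nothing. */
    static String fromCharacters(String xml) {
        InputSource source = new InputSource(new StringReader(xml));
        source.setEncoding("ISO-8859-1"); // no effect on a character stream
        return text(source);
    }

    /** Byte stream: setEncoding tells the parser how to decode the bytes. */
    static String fromBytes(byte[] xml, String encoding) {
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        source.setEncoding(encoding);
        return text(source);
    }
}
```

So with a Reader obtained from getCharacterStream, any decoding has already happened in the JDBC driver, and a wrong character at that point is not something the parser can repair.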

UTF-8 works fine with a character-stream result set.
The code above did return the UTF-8 characters; the problem was that the Windows machine did not support the character set.
We finally installed a package for Arabic characters (UTF-8) on the Windows PC and the issue was resolved.

java reads a weird character at the beginning of the file which doesn't exist

I have a simple xml file on my hard drive.
When I open it with notepad++ this is what I see:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>
But when I read it using a FileInputStream I get:
?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...
I'm using JAXB to parse xml's and it throws an exception of "content not allowed in prolog" because of that "?" sign.
What is this extra "?" sign? why is it there and how do I get rid of it?

That extra character is a byte order mark, a special Unicode character code which lets the XML parser know what the byte order (little endian or big endian) of the bytes in the file is.
Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that as a bug in the XML parser.)
As a workaround, make sure that the program that produces this XML leaves off the BOM.
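If you cannot change the producer, the BOM can also be dropped on the reading side. A minimal sketch for the UTF-8 case only (bytes EF BB BF); the class and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 byte order mark, if there is one. */
    static InputStream skipUtf8Bom(InputStream in) {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        try {
            byte[] head = new byte[3];
            int n = pushback.readNBytes(head, 0, 3);
            boolean isBom = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!isBom && n > 0) {
                pushback.unread(head, 0, n); // not a BOM: put the bytes back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pushback;
    }

    /** Convenience helper for data that is already in memory. */
    static byte[] withoutBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The stream returned by skipUtf8Bom(new FileInputStream(...)) can then be handed to the unmarshaller in place of the original stream.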

Check the encoding of the file. I've seen a similar thing: opening the file in most editors, it looked fine, but it turned out it was encoded as UTF-8 without BOM (or with, I can't recall off the top of my head). Notepad++ should let you switch between the two.

You can use Notepad++ to show all symbols via the View > Show Symbol > Show All Characters menu. It would show you any extra bytes present at the beginning. There is a possibility that it is the byte order mark. If the extra bytes are indeed a byte order mark, this approach will not help. In that case, you will need to download a hex editor or, if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file in terms of hex codes, look at the first few bytes. Do they match one of the codes listed at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
If they are indeed a byte order mark, or if you are unable to determine the cause of the error, just try this:
From the menu, select Encoding > Encode in UTF-8 without BOM, and then save the file.
(On Linux, one can use command-line tools to check what's at the beginning, e.g. xxd -g1 filename | head or od -t cx1 filename | head.)

You might be having a newline. Delete that.
Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.

This is not a JAXB problem; the problem lies in the way you read the XML. Try using an InputStream:
...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...

Besides the FileInputStream, a ByteArrayInputStream also worked for me:
JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);
=> No unmarshalling error anymore.

"Content is not allowed in prolog" when parsing perfectly valid XML on GAE

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
<ListDomainsResult>
<DomainName>Audio</DomainName>
<DomainName>Course</DomainName>
<DomainName>DocumentContents</DomainName>
<DomainName>LectureSet</DomainName>
<DomainName>MetaData</DomainName>
<DomainName>Professors</DomainName>
<DomainName>Tag</DomainName>
</ListDomainsResult>
<ResponseMetadata>
<RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
<BoxUsage>0.0000071759</BoxUsage>
</ResponseMetadata>
</ListDomainsResponse>
I pass in this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!

The encoding in your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML declaration, i.e. you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
There are some special characters called byte order marks that could be in the buffer.
Before passing the buffer to the parser, do this:
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
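The regex above drops every non-word character before the first "<". If the only suspect is a decoded byte order mark (the character U+FEFF), a narrower check is possible. A small sketch, with a made-up class name:

```java
class LeadingBom {

    /** Removes a single leading U+FEFF (a decoded byte order mark), if present. */
    static String strip(String xml) {
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }
}
```

Unlike the regex, this leaves other leading junk in place, so a stray "helloworld" would still produce the prolog error rather than being silently discarded.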

I had this issue after inspecting the XML file in Notepad++ and saving it, even though the file had the UTF-8 XML declaration at the top: <?xml version="1.0" encoding="utf-8"?>
It was fixed by saving the file in Notepad++ with Encoding (menu) > Encode in UTF-8 selected (it was Encode in UTF-8-BOM).

This error message is usually caused by invalid content before the first XML element, for example a stray small dot “.” at the beginning of the document.
Any characters before the “<?xml…” will cause the “org.xml.sax.SAXParseException: Content is not allowed in prolog” error message.
To fix it, just delete all those stray characters before the “<?xml”.
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/

I got the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM

I was facing the same issue. In my case, the XML files were generated by a C# program and fed into an AS400 for further processing. After some analysis, I identified that I was using UTF-8 encoding (with BOM) while generating the XML files, whereas Java (on the AS400) uses "UTF-8 without BOM".
So I had to write extra code similar to the following:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it

In my xml file, the header looked like this:
<?xml version="1.0" encoding="utf-16"?>
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.

Removing the xml declaration solved it
<?xml version='1.0' encoding='utf-8'?>

Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.

I was facing the same "Content is not allowed in prolog" problem with my XML file.
Solution
Initially my root folder was named '#Filename'.
When I removed the first character '#', the error was resolved.
There is no need to rename the '#Filename' folder, though.
Try it this way:
Instead of passing a File or URL object to the unmarshal method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));

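In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4); // remember the position before the first character
int earlyChar;
while ((earlyChar = test.read()) != -1) { // stop at end of input instead of looping forever
    System.out.println(earlyChar); // debug output: code of each leading character
    if (earlyChar == '<') {
        test.reset(); // rewind so the parser still sees the '<'
        break;
    }
    test.mark(4); // move the mark past the skipped character
}
FWIW, the bytes I was seeing are (in decimal): 239, 187, 191, i.e. the UTF-8 byte order mark (EF BB BF).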

I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.

In my instance of the problem, the solution was to replace German umlauts (äöü) with their HTML equivalents...

Below are possible causes of the “org.xml.sax.SAXParseException: Content is not allowed in prolog” exception.
First, check the file paths of schema.xsd and file.xml.
The encoding in your XML and XSD (or DTD) should be the same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
Also check whether anything comes before the XML declaration, e.g.: hello<?xml version='1.0' encoding='utf-16'?>

I zipped the XML on macOS and sent it to a Windows machine; the default compression changed the files, and the resulting encoding problem produced this message.

This happened to me with @JmsListener in Spring Boot when listening to IBM MQ. My method received a String parameter, and I got this exception when I tried to deserialize it using JAXB.
It seemed that the string I got was the result of byte[].toString(). It was a list of comma-separated numbers.
I solved it by changing the parameter type to byte[] and then creating a String from it:
@JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
    var msg = new String(msgBytes);
    // ... deserialize msg with JAXB
}
