How to get XML declaration header? Android - java

I'm writing an RSS reader app for Android, and I need to know the encoding of the XML (windows-1251 or UTF-8) before I start parsing it. The encoding is given in the XML declaration header, i.e. <?xml version="1.0" encoding="UTF-8"?>. How can I get this header before parsing? I use the android.sax implementation of the SAX parser and pass the encoding as a string parameter to InputStreamReader.
I found a related question:
SAX Parser doesn't recognize windows-1255 encoding - but the solution there is to convert cp-1251 to UTF-8, which is too cumbersome and resource-hungry. I think there must be a better solution, as I only need the encoding value from the header <?xml version="1.0" encoding="UTF-8"?>. But I can't manage to get this header from the XML; the parser starts from the <rss> tag. How should I get it?

I would recommend switching to Android's officially supported XmlPullParser; the encoding support issue should then go away.
Here is the Android doc on it.
Do not take this lightly, as the SAX parser does not work well on Android v3.0+.

Well, the question was pretty obvious :) Here is the code that worked, based on Squonk's comment:
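byte[] data = new byte[50];
try {
    // Remember the current position so the stream can be rewound afterwards.
    bs.mark(60);
    int read = bs.read(data, 0, data.length);
    if (read > 0) {
        // Decode only the bytes that were actually read.
        String value = new String(data, 0, read, "UTF-8").toLowerCase();
        if (value.contains("utf-8"))
            return "UTF-8";
        else if (value.contains("1251"))
            return "windows-1251";
    }
} catch (IOException e) {
    Log.d("debug", "Exception: " + e);
}
// Reached when no known encoding was found or the read failed.
return "XML not found";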
Then just reset bs (BufferedInputStream) and work with it in any needed charset.
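A slightly more general variant of the same idea, as a minimal sketch: instead of searching for two known names, it pulls whatever value the encoding pseudo-attribute holds. The class name XmlEncodingSniffer, the 128-byte peek limit and the fallback parameter are my own assumptions, and it only works for ASCII-compatible encodings (a UTF-16 document would not match).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of android.sax or any library.
class XmlEncodingSniffer {
    // Matches the encoding pseudo-attribute inside an XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Peeks at the start of the stream, rewinds it, and returns the declared encoding or the fallback. */
    static String sniffEncoding(BufferedInputStream in, String fallback) {
        final int limit = 128; // assumed to be enough for a typical declaration
        try {
            in.mark(limit);
            byte[] head = new byte[limit];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < limit) {
                int n = in.read(head, total, limit - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            in.reset(); // the parser will see the stream from the beginning
            // ISO-8859-1 maps each byte to one char, so this decoding cannot fail.
            String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = ENCODING.matcher(text);
            return m.find() ? m.group(1) : fallback;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"windows-1251\"?><rss/>"
                .getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(xml));
        System.out.println(sniffEncoding(in, "UTF-8"));
    }
}
```

The returned name can then be passed to InputStreamReader as before. Falling back to UTF-8 when no declaration is found follows the XML default.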


Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to fetch the XML below from a database using a Java method, but I am getting an error.
Code used to parse the XML:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
Open the XML in Notepad.
Make sure you don't have extra spaces at the beginning or end of the document.
Select File -> Save As.
Select Save as type -> All files.
Enter the file name as abcd.xml.
Select Encoding -> UTF-8 and click Save.
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If the data is in anything other than UTF-8, just change the encoding argument to the correct one.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
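If the XML is already a String, another option is to skip the byte conversion altogether and hand the parser a character stream. A minimal sketch, assuming the standard JAXP DOM parser; the class name XmlStringParser and the sample element are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class XmlStringParser {
    /** Parses XML held in a String without going through bytes. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A Reader supplies already-decoded characters, so the platform
            // default charset never gets a chance to interfere.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<offer>MyMobile Blue \u00A344.99</offer>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This sidesteps the question of which charset getBytes() uses, because no bytes are produced in the first place.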
I had the same problem in my JSF application, where an XHTML page had a comment line containing some special characters. When I compared it with the previous version in Eclipse, it had this comment:
//Some �  special characters found
I removed those characters and the page loaded fine. The error usually comes from the XML files themselves, so compare them with a working version.
I had this problem, but the file was in UTF-8; it was just that somehow one character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8, it will report the position of the offending byte sequence so that you can find it.
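The same check can be done from Java if iconv is not at hand. This is only a sketch using the standard CharsetDecoder in strict mode; the class and method names are my own. For instance, a £ stored as the single byte 0xA3 (as ISO-8859-1 would do) is not valid UTF-8 and gets reported.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Utf8Check {
    /** Returns the offset of the first byte that is not valid UTF-8, or -1 if everything decodes. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // One char per input byte is always enough room for decoded UTF-8.
        CharBuffer out = CharBuffer.allocate(data.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On error the input buffer is positioned at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        byte[] latin1 = {'a', (byte) 0xA3, 'b'}; // "a£b" encoded as ISO-8859-1
        byte[] utf8 = "a\u00A3b".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstInvalidOffset(latin1));
        System.out.println(firstInvalidOffset(utf8));
    }
}
```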
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error occurs when you try to load a Jasper report file with the extension .jasper
For example:
"c://reports//EmployeeReport.jasper"
Instead, you should load the Jasper report file with the extension .jrxml
For example:
"c://reports//EmployeeReport.jrxml"
See problem screenshot: https://i.stack.imgur.com/D5SzR.png
See solution screenshot: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I ran into the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».
For those who, like me, understand character encoding principles, have also read Joel's article (which is funny, as it contains wrong characters itself) and still can't figure out what the heck is going on (spoiler alert: I'm a Mac user), the solution can be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it was running OK, so it made no sense to have UTF errors given that our build system never complained about it... until I remembered that I had accidentally unplugged my computer a few days earlier with IntelliJ IDEA and the whole stack running (Java/Tomcat/Hibernate).
My Mac did a brilliant job of pretending nothing had happened and I carried on business as usual, but the underlying file system was left corrupted somehow. I wasted the whole day trying to figure this one out. I hope it helps somebody.
I had the same issue. In my case, the "-Dfile.encoding=UTF8" argument was missing from JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.
You have a dependency that needs to be removed, such as the following:
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

Converting decode utf-8 string to file

I am trying to save an image that I receive from an Android device. Android sends it as a UTF-8 encoded string, and below is the code I am using to save it.
String test = java.net.URLDecoder.decode(image_base64, "UTF-8");
byte[] data = Base64.decodeBase64(test.getBytes());
FileOutputStream stream = null;
try {
    stream = new FileOutputStream("/var/lib/easy-tomcat7/webapps/test/test1.bmp");
    stream.write(data);
    stream.flush();
    test1 += "success";
} catch (IOException e) {
    test1 = "failure";
    e.getMessage();
} finally {
    test1 += "finally";
    // stream is still null if the FileOutputStream constructor threw
    if (stream != null) {
        stream.close();
    }
}
The file is created, but it is corrupted. I did a lot of research on this but can't work out why it is happening. Please help me solve this issue.
I assume you are using Base64 from Apache Commons Codec.
Note that you are dealing with multiple different kinds of encodings:
URL encoding
Base64 encoding
UTF-8 character encoding
Those are three totally different things, and you should understand all of them to understand what is happening exactly.
Check how exactly the image is encoded that you get from the Android device. Your code is assuming that you are getting it as URL-encoded Base64 data, using the UTF-8 character set. Is that indeed how the Android device is sending the data? You will have to check that with whoever wrote the Android application.
What does the string image_base64 contain? Is it valid, URL-encoded Base64 data?
You shouldn't call getBytes() on the string before you pass it to Base64.decodeBase64 - that will convert the string into a byte array using the default character encoding of the system you're running it on. Just do this instead:
byte[] data = Base64.decodeBase64(test);
To make matters worse, there are several variants of Base64 encoding (as you can see on the Wikipedia page about Base64). It may be the case that whatever variant the Android app used is different from what the Base64 class is using.
Also specify the encoding for getBytes():
Base64.decodeBase64(test.getBytes("utf-8"));
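To make the two decoding layers explicit, here is a rough sketch that swaps in java.util.Base64 (Java 8+) for the Commons Codec class used in the question. It assumes the device really sends URL-encoded standard Base64; the class name ImagePayloadDecoder is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for illustration only.
class ImagePayloadDecoder {
    /** Undoes URL encoding first, then Base64, and returns the raw image bytes. */
    static byte[] decode(String urlEncodedBase64) {
        try {
            // Layer 1: URL encoding (%2B -> '+', %2F -> '/', %3D -> '=').
            String base64 = URLDecoder.decode(urlEncodedBase64, "UTF-8");
            // Layer 2: Base64 text -> bytes. The basic decoder throws
            // IllegalArgumentException on characters outside the alphabet,
            // which surfaces a damaged payload instead of writing a corrupt file.
            return Base64.getDecoder().decode(base64);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = decode("%2B%2F%2F%2B");
        System.out.println(Arrays.toString(data));
    }
}
```

One thing worth checking, though it is only a guess about this particular setup: if the web framework has already URL-decoded the request parameter, decoding it a second time turns every '+' into a space. The strict decoder above would then fail loudly rather than produce a silently corrupted image.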

Special characters in Text node not getting parsed by SAX's characters() method

I'm making an android application in which I'm parsing an XML using SAX parser.
In the XML there is tag:
<title>Deals & Dealmakers: Technology, media and communications M&A </title>
As you can see, it contains some special characters like &.
The issue is that I'm using SAX's callback method:
@Override
public void characters(char[] ch, int start, int length) throws SAXException{}
Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals & Dealmakers: Technology, media and communications M&A
But it is only getting "Deals ".
How can I solve this issue?
One issue might be because of the way I'm passing the XML to the SAX parser. Do I need to change the encoding or format?
Currently, I'm passing the XML as InputStream & using the below code:
HttpResponse httpResponse = utils.sendRequestAndGetHTTPResponse(URL);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
HttpEntity entity = httpResponse.getEntity();
InputStream in = entity.getContent();
parseResponse(in);
}
// Inside parseResponse method:
try {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xmlReader = sp.getXMLReader();
MyHandler handler = new MyHandler();
xmlReader.setContentHandler(handler);
xmlReader.parse(new InputSource(in));
} catch (Exception e) {
}
Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals & Dealmakers: Technology, media and communications M&A But it is only getting "Deals ".
You seem to be assuming that you'll get the whole text in one call. There's no guarantee of that. I strongly suspect that your characters method will be called multiple times for the same text node, which is valid for the parser to do. You need to make sure your code handles that.
From the documentation:
SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
There may be a feature you can set to ensure you get all the data in one go; I'm not sure.
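A common way to handle the multiple calls is to append each chunk to a buffer and only read it in endElement. A minimal sketch with the standard JAXP SAX parser; the TitleHandler class and the parseTitle helper are names I made up, and the sample wraps the title in an item element purely for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
class TitleHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting afresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            title = text.toString(); // the complete text is only known here
        }
    }

    /** Parses the given XML and returns the text of the last title element. */
    static String parseTitle(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.title;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTitle(
                "<item><title>Deals &amp; Dealmakers: M&amp;A </title></item>"));
    }
}
```

Resetting the buffer in startElement is enough for simple leaf elements like this one; mixed content would need a stack of buffers.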
I guess UTF-8 is exactly the problem. In the file you are parsing, the encoding is defined as ISO-8859-1,
so just try the following code:
InputSource is = new InputSource(yourInputStream);
is.setEncoding("ISO-8859-1");
xmlReader.parse(is);
Hope this helps.

SAX Parser doesn't recognize windows-1255 encoding

I'm working on an RSS parser for Android
(upgrading a parser I found on the internet).
From what I know, the SAX parser recognizes the encoding automatically from the XML declaration, but when I try to parse a feed that declares windows-1255 encoding, it fails to parse it and throws an exception.
I tried a few things:
final InputSource source = new InputSource(feed);
Reader isr = new InputStreamReader(feed);
source.setCharacterStream(isr);
I even tried telling it the specific encoding:
source.setEncoding("Windows-1255");
I tried to look at the locator:
@Override
public void setDocumentLocator(Locator locator) {
}
And it recognizes the encoding as UTF-16.
Please help me solve this annoying problem!
Sorry for the mess with the code snippets; the code button refuses to work for some reason.
Chances are the platform itself doesn't know about the "windows-1255" encoding. After all, it's a Windows-based encoding - I wouldn't want to rely on it being available on any other platforms, particularly mobile ones where things are generally cut down to the "must-have" options.
You need to set the encoding on the InputStreamReader.
Reader isr = new InputStreamReader(feed, "windows-1255");
final InputSource source = new InputSource(isr);
From the Javadoc, the logic for reading from an InputSource goes something like this:
Is there a character stream? If there is, use that (this is what happens if you use a Reader such as InputStreamReader).
Otherwise:
No character stream? Use the byte stream (InputStream).
Is there an encoding set on the InputSource? Use that.
No encoding set? Try to detect the encoding from the XML file itself.
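The last branch of that list can be seen in a small desktop-Java sketch: a byte stream with no Reader and no setEncoding() call, so the parser has to read the declaration itself. ISO-8859-1 is used here only because every JVM supports it; whether a given Android parser knows windows-1255 is a separate question that this example does not settle. The class name is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class DeclaredEncodingDemo {
    /** Parses raw bytes and returns the text of the root element. */
    static String rootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Byte stream only: no character stream and no setEncoding(),
            // so the parser falls back to the encoding named in the declaration.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            Document doc = builder.parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00E9</t>";
        // Encode the document the way its declaration says it is encoded.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(bytes));
    }
}
```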

"Content is not allowed in prolog" when parsing perfectly valid XML on GAE

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
<ListDomainsResult>
<DomainName>Audio</DomainName>
<DomainName>Course</DomainName>
<DomainName>DocumentContents</DomainName>
<DomainName>LectureSet</DomainName>
<DomainName>MetaData</DomainName>
<DomainName>Professors</DomainName>
<DomainName>Tag</DomainName>
</ListDomainsResult>
<ResponseMetadata>
<RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
<BoxUsage>0.0000071759</BoxUsage>
</ResponseMetadata>
</ListDomainsResponse>
I pass in this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!
The encoding in your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML declaration, i.e. you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
There are some special characters called byte order marks that could be in the buffer.
Before passing the buffer to the parser, do this:
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
I had this issue after inspecting the XML file in Notepad++ and saving it, even though the file had the UTF-8 XML declaration at the top: <?xml version="1.0" encoding="utf-8"?>
It was fixed by saving the file in Notepad++ with Encoding (tab) > Encode in UTF-8 selected (it was Encode in UTF-8-BOM).
This error message is typically caused by invalid content at the very beginning of the XML document, for example a stray dot "." before the first element.
Any characters before the "<?xml..." will cause the above "org.xml.sax.SAXParseException: Content is not allowed in prolog" error message.
To fix it, just delete all those stray characters before the "<?xml".
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/
I hit the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM.
I was facing the same issue. In my case, the XML files were generated by a C# program and fed into an AS400 for further processing. After some analysis, I found that I was using the UTF8 encoding when generating the XML files, whereas Java (on the AS400) expects "UTF8 without BOM".
So I had to write extra code similar to the following:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it
In my xml file, the header looked like this:
<?xml version="1.0" encoding="utf-16"? />
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.
Removing the xml declaration solved it
<?xml version='1.0' encoding='utf-8'?>
Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.
I was facing the same "Content is not allowed in prolog" problem with my XML file.
Solution
Initially my root folder was '#Filename'.
When I removed the first character '#', the error was resolved.
There is no need to remove the # from the file name.
Try it this way:
Instead of passing a File or URL object to the unmarshaller method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));
In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
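BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4);
while (true) {
    int earlyChar = test.read();
    System.out.println(earlyChar);
    if (earlyChar == -1) {
        break; // end of stream reached without finding '<'
    }
    if (earlyChar == 60) { // 60 is '<': rewind so the reader starts at it
        test.reset();
        break;
    } else {
        test.mark(4); // move the mark past the skipped character
    }
}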
FWIW, the bytes I was seeing are (in decimal): 239, 187, 191.
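Those three bytes (EF BB BF in hex) are the UTF-8 byte order mark. If the whole document is already in memory, a simpler alternative is to drop exactly that prefix instead of scanning for '<'. A small sketch; the class name BomStripper is my own, and it assumes the document fits in a byte array:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
class BomStripper {
    /** Returns the data without a leading UTF-8 BOM; input without a BOM is returned unchanged. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF   // 239
                && (data[1] & 0xFF) == 0xBB   // 187
                && (data[2] & 0xFF) == 0xBF;  // 191
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));
    }
}
```

The cleaned array can then be wrapped in a ByteArrayInputStream and passed to the parser.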
I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.
In my instance of the problem, the solution was to replace German umlauts (äöü) with their HTML equivalents...
Below are possible causes of the "org.xml.sax.SAXParseException: Content is not allowed in prolog" exception.
First check the file paths of schema.xsd and file.xml.
The encoding in your XML and XSD (or DTD) should be the same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
Check whether anything comes before the XML declaration, e.g.: hello<?xml version='1.0' encoding='utf-16'?>
I zipped the XML on macOS and sent it to a Windows machine; the default compression changed the files, so the encoding problem produced this message.
This happened to me with @JmsListener in Spring Boot when listening to IBM MQ. My method received a String parameter, and I got this exception when I tried to deserialize it using JAXB.
It seemed that the string I got was the result of byte[].toString(). It was a list of comma-separated numbers.
I solved it by changing the parameter type to byte[] and then creating a String from it:
@JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
var msg = new String(msgBytes);
