I'm writing an RSS reader app for Android, and I need to know the encoding of the XML (windows-1251 or UTF-8) before I start parsing it. The encoding is given in the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>. How can I read this declaration before parsing? I use the android.sax implementation of the SAX parser and pass the encoding as a string parameter to InputStreamReader.
I found a related question:
SAX Parser doesn't recognize windows-1255 encoding - but the solution there is to convert cp-1251 to UTF-8, which is cumbersome and resource-intensive. I think there must be a better solution, as I only need the encoding value from the declaration <?xml version="1.0" encoding="UTF-8"?>. But I can't manage to get this declaration from the XML; the parser starts from the <rss> tag. How should I get it?
I would recommend switching to Android's officially supported XmlPullParser, and the encoding support issue should go away.
Here is the Android doc on it.
Do not take this lightly, as the SAX parser does not work well on Android v3.0+.
Well, the question was pretty obvious :) Here is the code that worked, based on Squonk's comment:
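byte[] data = new byte[50];
try {
    // Remember the current position so the stream can be rewound after peeking.
    bs.mark(60);
    int read = bs.read(data, 0, data.length);
    if (read > 0) {
        // Decode only the bytes actually read; the declaration itself is plain ASCII.
        String value = new String(data, 0, read, "UTF-8");
        if (value.toLowerCase().contains("utf-8"))
            return "UTF-8";
        else if (value.contains("1251"))
            return "windows-1251";
    }
} catch (IOException e) {
    Log.d("debug", "Exception: " + e);
    return "XML not found";
}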
Then just reset bs (BufferedInputStream) and work with it in any needed charset.
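For anyone who wants something a bit more general than matching "utf-8" or "1251", here is a minimal sketch that pulls whatever encoding name is written in the declaration. The class name XmlEncodingSniffer, the 128-byte peek size and the fallback parameter are my own assumptions, not part of the code above. It only works for ASCII-compatible encodings (UTF-8, windows-1251 and similar); a UTF-16 file would simply yield the fallback.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns the encoding declared in the given leading bytes, or the fallback if none is found. */
    public static String sniff(byte[] head, int length, String fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding cannot fail.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : fallback;
    }

    /** Peeks at the start of the stream, then rewinds it so a parser can read from the first byte. */
    public static String sniff(BufferedInputStream in, String fallback) throws IOException {
        byte[] head = new byte[128];
        in.mark(head.length);
        int total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until the buffer is full or EOF.
        while (total < head.length && (n = in.read(head, total, head.length - total)) > 0) {
            total += n;
        }
        in.reset();
        return sniff(head, total, fallback);
    }

    public static void main(String[] args) throws IOException {
        byte[] feed = "<?xml version=\"1.0\" encoding=\"windows-1251\"?><rss/>"
                .getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(feed));
        System.out.println(sniff(in, "UTF-8"));
    }
}
```

After the call, the stream is back at its first byte, so it can be wrapped in an InputStreamReader with the returned charset name.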
I have an XML file with some SOAP tags that I want to ignore.
I was parsing the XML file with a pull parser, but it stopped working once those SOAP tags came along.
The XML file looks something like:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getAllUsersListResponse xmlns:ns1="http://webservice.business.ese.wiccore.myent.com/">
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
and inside the tag <User> come all the tags that I want to parse (and I know how with pull-parser) and then
</User>]]></return>
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
until
</User>]]></return>
</ns1:getAllUsersListResponse>
</soap:Body>
</soap:Envelope>
The thing is, I know how to parse normal tags, but I don't want to parse these SOAP tags; I want to IGNORE them! Does anyone know how to achieve this?
Not being overly familiar with pull-parsing (I'm typically a SAX guy), I'm probably not the most authoritative source on such things, but here goes...
I believe most (if not all) Java pull parsers should expose CDATA sections using a specific CDATA node (I believe in StAX, for example, the relevant event type is XMLStreamConstants.CDATA). As such, you'll want to parse your document and pull out that CDATA section (inside the SOAP <return> element) and extract its contents.
The contents of that section are the document you are interested in, so then you'd want to in turn run a new pull-parse over the contents you just extracted.
I'm sorry I can't be more help. Hopefully there will be someone else out there that can flesh the details out a bit more for you.
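To flesh the two-pass idea out a little, here is a rough sketch using StAX from the JDK (javax.xml.stream). Note that StAX is not part of Android, so treat this as an illustration of the approach rather than drop-in Android code; with XmlPullParser the analogous call would be nextText(). The class name SoapCdataExtractor and both method names are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SoapCdataExtractor {
    /** First pass: collects the text (including CDATA) of every <return> element. */
    public static List<String> extractPayloads(String soapXml) {
        List<String> payloads = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(soapXml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "return".equals(r.getLocalName())) {
                    // getElementText() concatenates character and CDATA events up to </return>.
                    payloads.add(r.getElementText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed SOAP envelope", e);
        }
        return payloads;
    }

    /** Second pass: parses one extracted payload and returns the text of the first matching element. */
    public static String elementText(String payloadXml, String localName) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(payloadXml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(r.getLocalName())) {
                    return r.getElementText();
                }
            }
            return null; // no such element in this payload
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed payload", e);
        }
    }

    public static void main(String[] args) {
        String soap = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body><return><![CDATA[<User><name>amit</name></User>]]></return>"
                + "</soap:Body></soap:Envelope>";
        for (String payload : extractPayloads(soap)) {
            System.out.println(elementText(payload, "name"));
        }
    }
}
```

Each <return> is handled as its own small document, so the SOAP wrapper never reaches the code that parses <User>.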
EDIT: in response to comments, you can achieve this using SAX as follows (exception handling omitted for brevity):
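import java.io.InputStream;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

class MyParsingApp extends DefaultHandler2 // see note 1
{
    private boolean inCdata, parsingSubDocument;
    private String subDocument;

    public static void main (String args[]) throws Exception
    {
        InputStream stream = ... // see note 2
        XMLReader reader = XMLReaderFactory.createXMLReader(); // see note 3
        MyParsingApp handler = new MyParsingApp ( );
        reader.setContentHandler (handler);
        // startCDATA/endCDATA are LexicalHandler callbacks, so the handler must be registered for them as well
        reader.setProperty ("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse (new InputSource(stream));
        // main is static, so the fields are reached through the handler instance
        handler.parsingSubDocument = true;
        reader.parse (new InputSource(new StringReader(handler.subDocument)));
        ...
    }

    public MyParsingApp ( )
    {
        inCdata = parsingSubDocument = false;
        subDocument = "";
    }

    @Override
    public void startCDATA() throws SAXException
    {
        inCdata = true;
    }

    @Override
    public void endCDATA() throws SAXException
    {
        inCdata = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException
    {
        // Only text inside a CDATA section belongs to the embedded document
        if (inCdata)
            subDocument += new String(ch, start, length); // see note 4
    }
}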
Some important notes:
1. Normally you would use a separate class as your content handler, probably one for the "main" document (including SOAP elements), and one for your "target" document (in the CDATA section). I've not done so here just to keep it as short as possible.
2. I'm not sure what format your XML is in, but I'm assuming it's in an InputStream here. The InputSource class will happily use an InputStream, a Reader or a String specifying a filename to read from. Use whatever suits you best.
3. You will need to use a SAX2 reader to be able to handle CDATA content. Your default SAX reader may or may not be SAX2 compliant. As such, you may need to (for example) manually create an instance of a particular SAX2 parser. You can find a list of some SAX2 parsers here, if that's the case.
4. There are probably more efficient ways of doing this too (StringBuffer/StringBuilder might be options). Again, I'm just doing it this way for simplicity.
5. I've not actually tested this code. Your mileage may vary.
If you've not used SAX before, it's probably also worth running through the SAX Quickstart Guide.
My application uses XForms for the view, and XForms generates output XML containing the answers given by the user. If we include the following line
<fr:xforms-inspector xmlns:fr="http://orbeon.org/oxf/xml/form-runner"/>
in the code, we can see the generated output on the screen. So if the user types amit for the username, it appears in the generated XML.
I want to get this generated XML in my Java class so that I can save it in the database, parse it, and split its contents. I have tried the following code, but I am not able to get the generated XML.
BufferedReader requestData = new BufferedReader(new InputStreamReader(request.getInputStream()));
StringBuffer stringBuffer = new StringBuffer();
String line;
try{
while ((line = requestData.readLine()) != null) {
stringBuffer.append(line);
}
} catch (Exception e){}
return stringBuffer.toString();
}
Please let me know what I am doing wrong.
Assuming that you'd like to have Java code inside a servlet or JSP that receives XML posted to the servlet or JSP through an XForms submission, then I would recommend you parse it using an XML parser rather than doing this by hand. Doing this with Dom4j is quite simple; for instance to get the content of the root element (assuming that all you receive is an element with some text in it):
Document queryDocument = xmlReader.read(request.getInputStream());
String query = queryDocument.getRootElement().getStringValue();
And for reference, see the full source of an example this is taken from.
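If you'd rather not pull in Dom4j, roughly the same thing can be sketched with the DOM parser that ships with the JDK. This is a swapped-in alternative to the Dom4j snippet above, not what the linked example uses; the class name XFormsSubmissionReader and the method name rootText are just placeholders for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XFormsSubmissionReader {
    /** Parses the posted XML and returns the text content of its root element. */
    public static String rootText(InputStream body) {
        try {
            // The parser reads the raw bytes and works out the encoding itself.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(body);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Request body is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for request.getInputStream() in a servlet.
        InputStream posted = new ByteArrayInputStream(
                "<username>amit</username>".getBytes(StandardCharsets.UTF_8));
        System.out.println(rootText(posted));
    }
}
```

In a servlet you would pass request.getInputStream() directly, without wrapping it in a reader first.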
I have a web application that can display a generated PDF file to the user using the following Java code on the server:
@Path("MyDocument.pdf/")
@GET
@Produces({"application/pdf"})
public StreamingOutput getPDF() throws Exception {
return new StreamingOutput() {
public void write(OutputStream output) throws IOException, WebApplicationException {
try {
PdfGenerator generator = new PdfGenerator(getEntity());
generator.generatePDF(output);
} catch (Exception e) {
logger.error("Error getting PDF file.", e);
throw new WebApplicationException(e);
}
}
};
}
This code takes advantage of the fact that I only need so much data from the front end in order to generate the PDF, so it can easily be done using a GET function.
However, I now want to return a PDF in a more dynamic way, and need a lot more information from the front end in order to generate it. In other areas, I'm sending similar amounts of data and persisting it to the data store using a PUT and @FormParam parameters, such as:
@PUT
@Consumes({"application/x-www-form-urlencoded"})
public void put(@FormParam("name") String name,
@FormParam("details") String details,
@FormParam("moreDetails") String moreDetails...
So, because of the amount of data I need to pass from the front end, I can't use a GET function with just query parameters.
I'm using Dojo on the front end, and none of the Dojo interactions really know what to do with a PDF returned from a PUT operation.
I'd like to not have to do this in two steps (persist the data sent in the PUT, and then request the PDF) simply because the PDF is more "transient" in this use case, and I don't want the data taking up space in my data store.
Is there a way to do this, or am I thinking about things all wrong?
Thanks.
I can't quite understand what you need to accomplish. It looks like you want to submit some data, persist it, and then return a PDF as the result? This should be straightforward and doesn't need to be two steps: just submit, and in the submit handler save the data and return the PDF.
Is this your problem? Can you clarify?
P.S.
Ok, you need to do the following in your servlet:
response.setHeader("Content-disposition",
"attachment; filename=" +
"Example.pdf" );
response.setContentType( "application/pdf" );
Set the "content-length" on the response, otherwise the Acrobat Reader plugin may not work properly, e.g. response.setContentLength(bos.size());
If you provide the output in a JSP, you can do this:
<%@ page contentType="application/pdf" %>
I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
<ListDomainsResult>
<DomainName>Audio</DomainName>
<DomainName>Course</DomainName>
<DomainName>DocumentContents</DomainName>
<DomainName>LectureSet</DomainName>
<DomainName>MetaData</DomainName>
<DomainName>Professors</DomainName>
<DomainName>Tag</DomainName>
</ListDomainsResult>
<ResponseMetadata>
<RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
<BoxUsage>0.0000071759</BoxUsage>
</ResponseMetadata>
</ListDomainsResponse>
I pass in this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!
The encodings of your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML declaration, i.e. you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
There are some special characters called byte order marks (BOMs) that could be in the buffer.
Before passing the buffer to the parser, do this...
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
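The regex above removes any run of non-word characters in front of the first "<". If the only thing you need to get rid of is a BOM, a narrower sketch like the following may be enough. The class and method names (BomStripper, stripUtf8Bom, stripBom) are my own; the byte values EF BB BF and the character U+FEFF are the standard UTF-8 BOM and its decoded form.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomStripper {
    /** Drops a leading UTF-8 BOM (bytes EF BB BF) from raw bytes, if present. */
    public static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    /** Drops a leading U+FEFF from already-decoded text, if present. */
    public static String stripBom(String xml) {
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));
    }
}
```

Unlike trim() plus the regex, this leaves everything else in the document untouched, so it will not hide other kinds of junk before the declaration.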
I had this issue after inspecting the XML file in Notepad++ and saving it, even though the file started with the UTF-8 declaration <?xml version="1.0" encoding="utf-8"?>
It was fixed by saving the file in Notepad++ with Encoding (tab) > Encode in UTF-8 selected (it was Encode in UTF-8-BOM).
This error is typically caused by invalid content at the very beginning of the XML document, for example an extra small dot “.” before the first element.
Any characters before the “<?xml….” will cause the “org.xml.sax.SAXParseException: Content is not allowed in prolog” error message.
To fix it, just delete all those stray characters before the “<?xml“.
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/
I ran into the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM.
I was facing the same issue. In my case the XML files were generated by a C# program and fed into an AS400 for further processing. After some analysis, I found that I was using UTF8 encoding when generating the XML files, whereas javac (on the AS400) uses "UTF8 without BOM".
So I had to write extra code similar to the following:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it
In my xml file, the header looked like this:
<?xml version="1.0" encoding="utf-16"?>
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.
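One way to avoid having to guess the charset at all is to hand the raw bytes to the parser and let it read the BOM and the declaration itself. The following is only a sketch with the JDK's DOM parser (the class name ParseFromBytes and the method rootName are made up); the same idea should apply to whatever deserializer you use, as long as it accepts an InputStream instead of a String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseFromBytes {
    /** Parses XML straight from bytes so the parser can detect the encoding on its own. */
    public static String rootName(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Java's UTF_16 charset writes a big-endian BOM in front of the encoded text.
        byte[] utf16 = "<?xml version=\"1.0\" encoding=\"utf-16\"?><order/>"
                .getBytes(StandardCharsets.UTF_16);
        System.out.println(rootName(utf16));
    }
}
```

With Files.readAllBytes(...) you already have the bytes, so the intermediate new String(data, ...) step can be skipped entirely.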
Removing the XML declaration solved it:
<?xml version='1.0' encoding='utf-8'?>
Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.
I was facing the same problem ("Content is not allowed in prolog") with my XML file.
Solution
Initially my root folder was '#Filename'.
When I removed the first character '#', the error was resolved.
There is no need to remove the # from the file name...
Try it this way:
Instead of passing a File or URL object to the unmarshal method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));
In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
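BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4);
while (true) {
    int earlyChar = test.read();
    System.out.println(earlyChar);
    if (earlyChar == -1) {
        break; // end of stream reached without finding '<'
    } else if (earlyChar == 60) { // 60 is '<'
        test.reset(); // rewind so the parser sees the '<' first
        break;
    } else {
        test.mark(4);
    }
}
FWIW, the bytes I was seeing are (in decimal): 239, 187, 191, i.e. the UTF-8 BOM.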
I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.
In my instance of the problem, the solution was to replace German umlauts (äöü) with their HTML equivalents...
Below are possible causes of the “org.xml.sax.SAXParseException: Content is not allowed in prolog” exception.
First check the file paths of schema.xsd and file.xml.
The encoding of your XML and XSD (or DTD) should be the same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
Anything that comes before the XML declaration also causes it, e.g. hello<?xml version='1.0' encoding='utf-16'?>
I zipped the XML on macOS and sent it to a Windows machine; the default compression altered the files, and the resulting encoding problem produced this message.
This happened to me with @JmsListener in Spring Boot when listening to IBM MQ. My method received a String parameter, and I got this exception when I tried to deserialize it using JAXB.
It seemed that the string I got was the result of byte[].toString(). It was a list of comma-separated numbers.
I solved it by changing the parameter type to byte[] and then creating a String from it:
@JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
var msg = new String(msgBytes);