I'm trying to parse an XML document with the following code:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL("http://www.cinemark.com.br/mobile/xml/films/").openStream());
But I get the following error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at Programacao.main(Programacao.java:53)
If you access the URL, you can see there are some Portuguese characters, and in the response I could see the first line of the XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
So I tried doing this:
URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream ism = url.openStream();
InputSource is = new InputSource(ism);
is.setEncoding("iso-8859-1");
Document doc = db.parse(is.getByteStream());
But I still got the EXACT same error.
How can I parse the XML using a different encoding?
Also, how can I know whether the XML is really in the encoding declared in the file?
I'm using JDK 1.7.0_51 on Fedora Linux 20
Thanks
SOLUTION
What I did to solve the problem, based on Seelenvirtuose's answer:
URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream ism = url.openStream();
GZIPInputStream gis = new GZIPInputStream(ism);
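Reader decoder = new InputStreamReader(gis, StandardCharsets.ISO_8859_1); // decode with the charset declared in the XML, not the platform default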
InputSource is = new InputSource(decoder);
Document doc = db.parse(is);
The difference in behavior is as follows:
When accessing the URL in a browser, after some time it displays:
<?xml version="1.0" encoding="iso-8859-1"?>
<cinemark>
<films>
<film ...>...</film>
...
</films>
</cinemark>
However, when simply running curl (for example), then you get an output similar to:
‹ ¬YMsÛ6½ûW`xôT¨Oªc) [...]
So, what is actually happening? This is called HTTP compression. When you run the following command
curl -o films.zip http://www.cinemark.com.br/mobile/xml/films/
you will get a file called films.zip that contains a single file called films, which in turn contains the expected XML document.
So, what you should do is: treat the response stream as a compressed stream, decompress it, and parse the result.
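The decompress-then-parse step can be sketched as follows. This is a minimal sketch, not the code from the answer: the class GzipAwareXml and its methods are hypothetical names, and it assumes the compression is gzip (as the GZIPInputStream in the solution above suggests), detected by the two gzip magic bytes 0x1f 0x8b. It hands the parser a byte stream rather than a Reader, so the parser can pick up encoding="iso-8859-1" from the XML declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from the original post.
final class GzipAwareXml {

    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic bytes 0x1f 0x8b. */
    static InputStream unwrapIfGzipped(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int b1 = in.read();
        int b2 = in.read();
        // Push the peeked bytes back in reverse order so they are read again as b1, b2.
        if (b2 != -1) in.unread(b2);
        if (b1 != -1) in.unread(b1);
        return (b1 == 0x1f && b2 == 0x8b) ? new GZIPInputStream(in) : in;
    }

    /** Parses plain or gzip-compressed XML; the parser reads the encoding from the XML declaration. */
    static Document parse(InputStream raw) throws Exception {
        try (InputStream in = unwrapIfGzipped(raw)) {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(in));
        }
    }

    /** Demo helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for the compressed HTTP response body.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><film>Programa\u00e7\u00e3o</film>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(new ByteArrayInputStream(gzip(xml)));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With a real connection you would pass url.openStream() to parse(...) instead of the in-memory stream. Checking the magic bytes means the same code keeps working if the server ever sends the document uncompressed.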
Related
I get the following error due to Latin text in my XML.
Invalid byte 2 of 2-byte UTF-8 sequence: XML saved as String variable
My XML is written to a String variable (I don't import a file).
I tried to set encoding to "UTF-8", but I might have done it wrong.
Can you help please?
My code:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
InputStream inputStream = new ByteArrayInputStream(GET_XML.getBytes());
Document doc = dBuilder.parse(inputStream);
doc.getDocumentElement().normalize();
You are seeing this error because you are feeding the parser XML that contains ISO-8859-1 (aka Latin-1) characters without a proper XML declaration:
<?xml version='1.0' encoding='ISO-8859-1' standalone='no' ?>
You have two options: either correct it by supplying the XML with the above declaration,
or force UTF-8 during byte conversion:
new ByteArrayInputStream(GET_XML.getBytes(StandardCharsets.UTF_8));
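Since the XML here already lives in a String, a third route is to skip the byte conversion altogether and give the parser a character stream. The following is a minimal sketch under that assumption; StringXmlEncoding and its method names are hypothetical. It relies on the documented behavior of InputSource: when a character stream is supplied, the parser reads it directly and disregards the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from the original post.
final class StringXmlEncoding {

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Character stream: the text is already decoded, so no charset guessing is involved. */
    static Document parseChars(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Byte stream: the bytes must match the declared encoding (UTF-8 when none is declared). */
    static Document parseUtf8Bytes(String xml) throws Exception {
        return newBuilder().parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<city>S\u00e3o Paulo</city>"; // stands in for GET_XML
        System.out.println(parseChars(xml).getDocumentElement().getTextContent());
        System.out.println(parseUtf8Bytes(xml).getDocumentElement().getTextContent());
    }
}
```

Both calls return the same text for a String without an encoding declaration. If the String declares a different encoding, such as ISO-8859-1, only the StringReader variant is safe, because UTF-8 bytes would then be decoded as Latin-1.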
I'm trying to transform an XML file into a document like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse("C:/xml/41111208890622000144550010000000011000003066-nfe.xml");
Document document = db.parse(new InputSource(new StringReader("C:/xml/41111208890622000144550010000000011000003066-nfe.xml")));
but it is giving the error message:
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
Does anyone know what to do?
You're currently creating a reader containing the string
"C:/xml/41111208890622000144550010000000011000003066-nfe.xml"
and asking the DocumentBuilder to parse that as if it were XML, when it's clearly not. (I'm referring to the second parse call, which I suspect is the one in your actual code. The code you've provided wouldn't compile as you've declared document twice.)
You can create a FileInputStream or perhaps an InputStreamReader wrapped around it:
String filename = "C:/xml/41111208890622000144550010000000011000003066-nfe.xml";
try (FileInputStream input = new FileInputStream(filename))
{
Document document = db.parse(new InputSource(input));
}
(I prefer to use a stream directly, and let the parser detect the encoding.)
Now this call:
Document document = db.parse("C:/xml/...");
would nearly work and may actually work, using DocumentBuilder.parse(String) - it depends on whether parse is happy to handle a filename as a URI. (I've seen some XML APIs that are fine with that, and some that aren't.) If it doesn't work, try using the file:// scheme:
Document document = db.parse("file:///C:/xml/...");
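If you would rather not build the URI string by hand, two alternatives are DocumentBuilder.parse(File) and File.toURI(). Here is a minimal sketch of both; the class ParseXmlFile and the temporary sample file are made up for the demo so that it runs without the original C:/xml/... file.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; the names are illustrative, not from the original post.
final class ParseXmlFile {

    /** Lets DocumentBuilder open the file itself; the parser detects the encoding. */
    static Document parseFile(File file) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(file);
    }

    /** Same result via parse(String), with the file URI produced by File.toURI(). */
    static Document parseUri(File file) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(file.toURI().toString());
    }

    public static void main(String[] args) throws Exception {
        // A temporary file stands in for the real XML file.
        Path tmp = Files.createTempFile("sample", ".xml");
        try {
            Files.write(tmp, "<nfe><id>1</id></nfe>".getBytes(StandardCharsets.UTF_8));
            System.out.println(parseFile(tmp.toFile()).getDocumentElement().getNodeName()); // prints "nfe"
            System.out.println(parseUri(tmp.toFile()).getDocumentElement().getNodeName());  // prints "nfe"
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```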
I have a problem converting a UTF-8 document correctly into HTML.
When the code is run inside a glassfish server on Solaris, it produces invalid results.
When run on my test environment on Windows, the results are valid.
Both environments are using Java 1.6.0_27
I am including the code for both - I don't see any relevant difference.
The output from the app server test is below (if the UTF-8 goes through this process):
<CityName>Ð Ð¾Ð²Ð¾Ñ Ð¸Ð±Ð¸Ñ€Ñ Ðº</CityName>
<CityName>ÐовоÑибиÑÑк</CityName>
Note that the second has several entity encodings, whereas the first is pure UTF-8. The encodings are wrong.
First - the app server code (logger is a log4j Logger - it shows good results on test13, bad on test14):
logger.info("===test13:"+descr+req);
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document doc = domBuilder.parse(new InputSource(new StringReader(req)));
logger.info("===test14:"+XmlHelper.domToString(doc));
Then, the test code - "out" shows good data in both cases:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
String inmsg = TestRvcMessage.readUtfFile(fname);
inmsg = inmsg.trim();
out.println("INPUT:"+inmsg);
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document doc = domBuilder.parse(new InputSource(new StringReader(inmsg)));
out.println("OUTPUT:"+xmlHelper.domToString(doc));
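The post does not show XmlHelper.domToString, so the cause cannot be determined from the code above. For comparison, here is a minimal sketch of what such a helper could look like if it serializes the DOM straight into a StringWriter, so that no step depends on the platform default charset (which can differ between a Solaris server and a Windows workstation). The class DomToStringSketch and its methods are hypothetical, not the poster's code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the XmlHelper class that the post does not show.
final class DomToStringSketch {

    /** Parses XML held in a String through a Reader, as in the post. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        return domFactory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes to characters only; no byte[] or default-charset conversion is involved. */
    static String domToString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The city name from the post, written with Unicode escapes so the source file stays ASCII.
        String city = "\u041D\u043E\u0432\u043E\u0441\u0438\u0431\u0438\u0440\u0441\u043A";
        String result = domToString(parse("<CityName>" + city + "</CityName>"));
        System.out.println(result.contains(city)); // prints "true"
    }
}
```

If a helper like this round-trips the text correctly on the server, the remaining suspects are the places where the String is turned into bytes, such as the log appender's encoding.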
I'm building an Android app that communicates with a web server and am struggling with the following scenario:
Given ONE line of XML in a String, e.g.:
"<test one="1" two="2" />"
I would like to extract the values into a HashMap so that:
map.get("one") = "1"
map.get("two") = "2"
I can already do this with a full XML document using the SAX parser, but it complains when I try to give it just the above string, with a MalformedURLException: Protocol not found
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
Document doc = null;
builder = factory.newDocumentBuilder();
doc = builder.parse("<test one=\"1\" two=\"2\" />"); //here
I realize some regex could do this, but I'd really rather do it properly.
The same behaviour can be found at http://metacpan.org/pod/XML::Simple#XMLin which is what the web server uses.
Can anyone help? Thanks :D
DocumentBuilder.parse(String) treats the string as a URL. Try this instead:
Document doc = builder.parse(new InputSource(new StringReader(text)));
(where text contains the XML, of course).
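To get from there to the HashMap the question asks for, you can walk the attributes of the root element. This is a minimal sketch; the class AttributesToMap and the method rootAttributes are hypothetical names, and it only looks at the document element, which is all the one-line example needs.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from the original post.
final class AttributesToMap {

    /** Parses XML held in a String and copies the root element's attributes into a map. */
    static Map<String, String> rootAttributes(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Wrap the String in a Reader so it is treated as content, not as a URL.
        Document doc = builder.parse(new InputSource(new StringReader(text)));

        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            map.put(attr.getNodeName(), attr.getNodeValue());
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> map = rootAttributes("<test one=\"1\" two=\"2\" />");
        System.out.println(map.get("one")); // prints "1"
        System.out.println(map.get("two")); // prints "2"
    }
}
```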
I have a Java application that sends XML requests and receives XML responses. I first receive the response as a String, write it to a file, and store that file in the file system. Later, when parsing the XML response, I read the file back from the file system and use some of its data for further business logic.
File file = new File("log\\XMLMessage\\LastXMLResponse.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
Now I am thinking of distributing this Java application via Java Web Start (JWS), and as far as I know I cannot keep this file inside the jar, since the file is modified on a regular basis.
What do you suggest I do? Can I parse the String directly (with no need to store the response in a file)?
Document doc = db.parse(xmlMessage);
or where can I keep this file? I don't want to show this file to the user of my application.
Take the String, make a StringReader from the String, make an InputSource from the StringReader, then call parse on your DocumentBuilder.
Yes, you can parse the string directly; there is no need to store it in a file.
Try this:
String xml = "<xml></xml>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
System.out.println(doc);