Android java XML junk after document element - java

I'm using SAX to read and parse XML documents, and it works fine except for one particular site, where Eclipse tells me "junk after document element" and I get no data returned:
http://www.zachblume.com/apis/rhyme.php?format=xml&word=example
The site is not mine; I'm just trying to get some data from it.

Yes, that's not an XML document. It's trying to include more than one root element:
<?xml version="1.0"?>
<word>ampal</word>
<word>ample</word>
<word>hampel</word>
<word>hample</word>
<word>lampl</word>
<word>pampel</word>
<word>sample</word>
The parser regards everything after <word>ampal</word> as junk, because by that time it has read a complete document. Hence the complaint about "junk after document element".
An XML document can have only one root element, although that root may contain several children. For example:
<?xml version="1.0"?>
<words>
<word>ampal</word>
<word>ample</word>
<word>hampel</word>
<word>hample</word>
<word>lampl</word>
<word>pampel</word>
<word>sample</word>
</words>

The page does not contain XML. It contains an XML snippet at best:
<?xml version="1.0"?>
<word>ampal</word>
<word>ample</word>
<word>hampel</word>
<word>hample</word>
<word>lampl</word>
<word>pampel</word>
<word>sample</word>
This is incorrect since there is no document element. SAX interprets the first <word> as the document element, and correctly reports "junk after document element" since for all it knows, the document element ends on line 1.
To get around the error, do not treat this document as XML. Download it as text, remove the XML declaration (<?xml version="1.0"?>) and then wrap it in a fake document element before you try to process it.
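The workaround above can be sketched as follows. This is a minimal illustration, assuming the response has already been downloaded into a string; the class name RhymeFragmentParser, the helper names, and the wrapper element <words> are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RhymeFragmentParser {

    /** Strips a leading XML declaration, if any, and wraps the rest in one root element. */
    public static String wrapInRoot(String raw) {
        String body = raw.trim().replaceFirst("^<\\?xml[^>]*\\?>", "");
        return "<words>" + body + "</words>"; // <words> is an arbitrary, made-up root
    }

    /** Parses the wrapped text with SAX and collects the text of every <word>. */
    public static List<String> parseWords(String raw) {
        List<String> words = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("word".equals(qName)) {
                    words.add(text.toString().trim());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wrapInRoot(raw))), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Text is not well formed even after wrapping", e);
        }
        return words;
    }

    public static void main(String[] args) {
        String raw = "<?xml version=\"1.0\"?>\n<word>ampal</word>\n<word>ample</word>\n<word>sample</word>";
        System.out.println(parseWords(raw)); // prints [ampal, ample, sample]
    }
}
```

The declaration has to be removed before wrapping because an XML declaration is only allowed at the very start of a document.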

Related

When using Citrus <TestMessage> child tag of <payload> element I am getting the following error

cvc-complex-type.2.4.a: Invalid content was found starting with element 'TestMessage'. One of '{WC[##other:"http://www.citrusframework.org/schema/testcase"]}' is expected.
(A screenshot of the XML DSL file was attached to the original question.)
This is because your <TestMessage> element does not use a namespace, which is not allowed inside the <payload> element. Either give it a proper XML namespace or use the <data> element instead of <payload>.

JDOM2 xpath finding nodes within a different namespace

I'm attempting to use JDOM2 to extract the information I care about from an XML document. How do I get a tag within a tag?
I have been only partially successful. While I have been able to use XPath to extract <record> tags, the XPath query to extract the title, description and other data within the record tags has been returning null.
To extract the <record> tags I use the following XPath query: "//oai:record", where "oai" is a namespace prefix I made up in order to use XPath.
You can see the XML document I'm parsing here, and I've put a sample below: http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&set=cwp&metadataPrefix=oai_dc
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.pnp/cph.3a02293</identifier>
<datestamp>2009-05-27T07:22:37Z</datestamp>
<setSpec>cwp</setSpec>
<setSpec>lcphotos</setSpec>
</header>
<metadata>
<oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Jubal A. Early</dc:title>
<dc:description>This record contains unverified, old data from caption card.</dc:description>
<dc:date>[between 1860 and 1880]</dc:date>
<dc:type>image</dc:type>
<dc:type>still image</dc:type>
<dc:identifier>http://hdl.loc.gov/loc.pnp/cph.3a02293</dc:identifier>
<dc:language>eng</dc:language>
<dc:rights>No known restrictions on publication.</dc:rights>
</oai_dc:dc>
</metadata>
</record>
If you look in the larger document you will see that there is never a "xmlns" attribute listed on any of the tags. There is also the matter of there being three different namespaces in the document ("none/oai", "oai_dc", "dc").
What is happening is that the xpath is matching nothing, and evaluateFirst(parent) is returning null.
Here is some of my code to extract the title, date, description etc. out of the record element.
XPathFactory xpf = XPathFactory.instance();
XPathExpression<Element> xpath = xpf.compile("//dc:title",
Filters.element(), null,
namespaceList.toArray(new Namespace[namespaceList.size()]));
Element tag = xpath.evaluateFirst(parent);
if(tag != null)
{
return Option.fromString(tag.getText());
}
return Option.none();
Any thoughts would be appreciated! Thanks.
In your XML, the dc prefix is mapped to the namespace URI http://purl.org/dc/elements/1.1/, so make sure the prefix mapping you declare for the XPath uses that same URI. This is the part of your XML where the namespace prefixes are declared:
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
An XML parser only sees the namespaces explicitly declared in the XML. It won't try to open the namespace URL, since a namespace name is not necessarily a URL. For example, the following URI, which I found in a recent SO question, is also acceptable as a namespace: uuid:ebfd9-45-48-a9eb-42d
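To illustrate the prefix-to-URI binding, here is a small sketch that uses the JDK's own javax.xml.xpath API rather than JDOM2, so that it runs without extra libraries. The class name DcTitleLookup and the trimmed sample record are made up for the example. In JDOM2 the corresponding step would be to pass Namespace.getNamespace("dc", "http://purl.org/dc/elements/1.1/") among the namespaces given to xpf.compile(...).

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DcTitleLookup {
    static final String DC_URI = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of the first dc:title in the document, or "" if nothing matches. */
    public static String firstTitle(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required, or elements carry no namespace URI
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix used in the XPath expression to the URI declared in the XML.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_URI : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return DC_URI.equals(uri) ? "dc" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return DC_URI.equals(uri)
                            ? Collections.singletonList("dc").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate("//dc:title", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record xmlns=\"http://www.openarchives.org/OAI/2.0/\"><metadata>"
                + "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Jubal A. Early</dc:title></oai_dc:dc></metadata></record>";
        System.out.println(firstTitle(xml)); // prints Jubal A. Early
    }
}
```

The prefix in the XPath expression does not have to match the prefix in the document. Only the URI it is bound to matters.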

How to check for opening and closing tags in xml file using java?

I have an XML file like the following:
<file>
<students>
<student>
<name>Arthur</name>
<height>168</height>
</student>
<student>
<name>John</name>
<height>176</height>
</student>
</students>
</file>
How do I check whether for each opening tag, there is an ending tag? For example, if I do not provide the ending tag as:
<file>
<students>
<student>
<name>Arthur</name>
<height>168</height>
// Ending tag for student missing here
<student>
<name>John</name>
<height>176</height>
</student>
</students>
</file>
How do I continue parsing the rest of the file?
I tried a SAX parser as explained here, but it's not very suitable for me, as it throws an exception when I leave out a closing tag as in the second XML example above.
An XML file that does not satisfy your condition ("for each opening tag, there is an ending tag") is not well formed.
Checking that an XML file is well formed is the first job of an XML parser. Hence, you need an XML parser.
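As a rough sketch of using the parser purely as a well-formedness check, the following uses the JDK's SAX parser and reports where the first error occurs. The class and method names are made up for the example. Note that a conforming XML parser stops at the first fatal error, so this tells you where the problem is but does not continue parsing past it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns null if the text is well formed, otherwise a message with the error position. */
    public static String firstError(String xml) {
        try {
            // DefaultHandler ignores all content; its fatalError() rethrows the parse error.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString(); // I/O or parser configuration problems
        }
    }

    public static void main(String[] args) {
        System.out.println(firstError("<file><student><name>Arthur</name></student></file>"));
        System.out.println(firstError("<file><student><name>Arthur</name></file>"));
    }
}
```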
The tutorial you found has a bug in it: characters() may be called multiple times for the same element (source). The proper way to mark the end of an element is to reset the respective boolean states inside endElement(). The comments section has code that shows the required change.
With that issue fixed, you can do error checking in startElement() to ensure that the file is not trying to start an invalid element given the current state. This will also allow you to ensure that a name element is only found inside of a student element.
You can implement the following algorithm (pseudo-code):
String xml = ...
stack = new Stack()
while True:
    tag = extractNextTag(xml)
    // no new tag is found
    if tag == null:
        break
    if (tag.isOpening()):
        stack.push(tag.name)
    else:
        // a closing tag with nothing open is also an error
        if stack.isEmpty():
            error("Open/close tag error")
        oldTagName = stack.pop()
        if (oldTagName != tag.name):
            error("Open/close tag error")
if ! stack.isEmpty():
    error("Open/close tag error")
You can implement the function extractNextTag in 10-20 lines of code, using some knowledge about parsers or just a simple regular expression.
Of course, when you search for a new tag you need to start searching from the character that follows the last tag you found.
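The stack idea can be sketched in Java with a regular expression standing in for extractNextTag. This is only an illustration, not a real parser: it assumes no comments, CDATA sections, or attribute values containing ">", and the class name TagBalanceChecker and its recovery rule (unwind to the matching open tag if there is one) are choices made for the example. Unlike a strict parser, it keeps going after a mismatch and collects every problem it finds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBalanceChecker {
    // group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for a self-closing tag
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)[^>]*?(/?)>");

    /** Returns one message per mismatch; an empty list means every tag was balanced. */
    public static List<String> check(String xml) {
        List<String> errors = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) { // find() resumes right after the previous match
            String name = m.group(2);
            if (!m.group(1).isEmpty()) { // closing tag
                if (open.isEmpty()) {
                    errors.add("unexpected </" + name + ">");
                } else if (open.peek().equals(name)) {
                    open.pop();
                } else {
                    errors.add("expected </" + open.peek() + "> but found </" + name + ">");
                    // Recovery: if this name is open further down, unwind to it;
                    // otherwise ignore the stray closing tag.
                    if (open.contains(name)) {
                        while (!open.pop().equals(name)) {
                            // discard unclosed inner elements
                        }
                    }
                }
            } else if (m.group(3).isEmpty()) { // opening tag that is not self-closing
                open.push(name);
            }
        }
        for (String name : open) {
            errors.add("missing </" + name + ">");
        }
        return errors;
    }

    public static void main(String[] args) {
        String xml = "<file><students><student><name>Arthur</name><height>168</height>"
                + "<student><name>John</name><height>176</height></student>"
                + "</students></file>";
        System.out.println(check(xml)); // prints [expected </student> but found </students>]
    }
}
```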

java dom xml parser get html tags(<p color="something">some text</p>) from xml

I have an xml file with html tags like:
<?xml version="1.0" encoding="utf-8" ?>
<blog>
<blogid>49</blogid>
<title>[FIXED] Job requests page broken</title>
<fulltext>
<img title="page broken" src="images/west/blog/site-broken.jpg" alt="page broken" />
<p><span style="background-color: #ccffcc;">Update 28/05/2011</span>: Job requests page seems to be working OK now. If you find any issues please use the contact page to notify us. Thank you for your patience!</p>
<p>&#160;</p>
<p>Well, what can I say? Why does it always have to be that way? You are trying to create something new and something else gets broken on the way...</p>
</fulltext>
Now I want the whole HTML part between the <fulltext> tags, exactly as it is.
What I get right now is blank, I think because DOM is parsing the HTML tags as well.
I tried XPath, but it is not working on Android.
I don't think you can get this not well-formed XML into a DOM as-is. (EDIT: or is it well-formed?)
You would need to either a) escape the characters, making the XML well-formed and parseable (but probably not into the DOM you want; I guess you want to display the HTML in a different system), b) parse it using a stream processor, or c) fix it using string manipulation (wrap the content in <![CDATA[ ... ]]>) and then parse it into a DOM.
HTH
The HTML inside <fulltext> is itself well-formed markup (much like XHTML), so there is no reason for the DOM parser not to treat those inner tags as XML elements.
Maybe what you're looking for is a way to flatten what's inside <fulltext>?
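One possible way to get the markup inside <fulltext> back as a string, sketched here with the JDK's DOM and Transformer APIs, is to serialize each child node of the element. The class name InnerMarkup and the shortened sample are made up for the example, and whether this behaves identically on Android is not something verified here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerMarkup {

    /** Returns the markup inside the first element with the given tag name, or "" if absent. */
    public static String innerXml(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = doc.getElementsByTagName(tagName).item(0);
            if (target == null) {
                return "";
            }
            // An identity transformer turns DOM nodes back into text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            NodeList children = target.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract inner markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<blog><blogid>49</blogid><fulltext>"
                + "<p>Job requests page seems to be <b>working</b> now.</p>"
                + "</fulltext></blog>";
        System.out.println(innerXml(xml, "fulltext"));
    }
}
```

The result is a re-serialization of the parsed tree, so details such as attribute quoting or empty-element syntax may differ slightly from the original text.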
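Use a library like Jsoup for this purpose.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public static void main(String[] args) {
    // Inner quotes must be escaped inside a Java string literal.
    String html = "<?xml version=\"1.0\"?><foo>" +
            "<bar>Some text &mdash; invalid!</bar></foo>";
    // Parse with Jsoup's XML parser, then select elements by tag name.
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    for (Element e : doc.select("bar")) {
        System.out.println(e);
    }
}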

XML Editor in Java (JSP, servlet)

I am developing an XML editor using JSP and servlets, with a DOM parser.
I have one problem: how do I edit the following XML file without losing elements?
For example:
<book id="b1">
<bookbegin id="bb1">
<para id="p1">This is<b>first</b>line</para>
<para id="p2">This is<b>second</b>line</para>
<para id="p3">This is<b>third</b>line</para>
</bookbegin>
</book>
I am trying to edit the above XML file (which has a DTD) using JSP and servlets, but when I read the text value from the XML, it returns only "first", "second" and "third". How do I read the "This is" and "line" parts? And how do I then store the result back to the XML file using XPath?
Thanks in advance.
The <b> tag inside the <para> tag is another element, not a formatting tag (in XML). Therefore, you need to traverse down to it.
Like @JRL says, the <b> tags are considered well-formed XML and, as a consequence, are split into separate nodes by your DOM processor.
I think you fail to read the other text nodes because you only read text when an XML node has no child elements, which is not the case here.
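To make the mixed content visible, the following sketch walks the direct children of the first <para> with the JDK's DOM API and labels each one. The class name MixedContent and the output labels are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MixedContent {

    /** Lists the direct children of the first <para>, text nodes and elements, in document order. */
    public static List<String> describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> parts = new ArrayList<>();
            Node para = doc.getElementsByTagName("para").item(0);
            if (para == null) {
                return parts;
            }
            NodeList kids = para.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node kid = kids.item(i);
                if (kid.getNodeType() == Node.TEXT_NODE) {
                    // Text that sits directly inside <para>
                    parts.add("text:" + kid.getNodeValue());
                } else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                    // A nested element such as <b>; its own text is one level further down
                    parts.add("element " + kid.getNodeName() + ":" + kid.getTextContent());
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book id=\"b1\"><bookbegin id=\"bb1\">"
                + "<para id=\"p1\">This is<b>first</b>line</para>"
                + "</bookbegin></book>";
        System.out.println(describeChildren(xml)); // prints [text:This is, element b:first, text:line]
    }
}
```

If only the combined text is needed, getTextContent() on the <para> element itself returns all three pieces concatenated. To change one piece, setNodeValue() on the corresponding text node leaves the <b> element untouched.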
