Extracting Text Nodes From XML File Using SAX Parser in JAVA - java

I am currently using SAX to extract information from a number of XML documents. So far, extracting attribute values has been easy, but I have no clue how to extract the actual value of a text node.
For example, in the given XML document:
<w:rStyle w:val="Highlight" />
</w:rPr>
</w:pPr>
- <w:r>
<w:t>Text to Extract</w:t>
</w:r>
</w:p>
- <w:p w:rsidR="00B41602" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
- <w:pPr>
<w:pStyle w:val="Copy" />
I can extract "Highlight" no problem by getting the value from val. But I have no idea how to get into that text node and get out "Text to Extract".
Here is my Java code thus far to pull out the attribute values...
private static final class SaxHandler extends DefaultHandler
{
// invoked when document-parsing is started:
public void startDocument() throws SAXException
{
System.out.println("Document processing starting:");
}
// notifies about finish of parsing:
public void endDocument() throws SAXException
{
System.out.println("Document processing finished. \n");
}
// we enter to element 'qName':
public void startElement(String uri, String localName,
String qName, Attributes attrs) throws SAXException
{
if(qName.equalsIgnoreCase("Relationships"))
{
// do nothing
}
else if(qName.equalsIgnoreCase("Relationship"))
{
// goes into the element and if the attribute is equal to "Target"...
String val = attrs.getValue("Target");
// ...and the value is not null
if(val != null)
{
// ...and if the value contains "image" in it...
if (val.contains("image"))
{
// ...then get the id value
String id = attrs.getValue("Id");
// ...and use the substring method to isolate and print out only the image & number
int begIndex = val.lastIndexOf("/");
int endIndex = val.lastIndexOf(".");
System.out.println("Id: " + id + " & Target: " + val.substring(begIndex+1, endIndex));
}
}
}
else
{
throw new IllegalArgumentException("Element '" +
qName + "' is not allowed here");
}
}
// we leave element 'qName' without any actions:
public void endElement(String uri, String localName, String qName) throws SAXException
{
// do nothing;
}
}
But I have no clue where to start to get into that text node and pull out the values inside. Anyone have some ideas?

Here's some pseudo-code:
private boolean insideElementContainingTextNode;
private StringBuilder textBuilder;
public void startElement(String uri, String localName, String qName, Attributes attrs) {
if ("w:t".equals(qName)) { // or is it localName?
insideElementContainingTextNode = true;
textBuilder = new StringBuilder();
}
}
public void characters(char[] ch, int start, int length) {
if (insideElementContainingTextNode) {
textBuilder.append(ch, start, length);
}
}
public void endElement(String uri, String localName, String qName) {
if ("w:t".equals(qName)) { // or is it localName?
insideElementContainingTextNode = false;
String theCompleteText = this.textBuilder.toString();
this.textBuilder = null;
}
}
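The pseudo-code above can be turned into a small runnable sketch. It assumes the default (non-namespace-aware) `SAXParserFactory`, in which case `qName` carries the prefixed name `w:t`. The class name `TextNodeExtractor`, the helper `extractText`, and the `urn:example` namespace URI are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextNodeExtractor {
    /** Returns the text of every <w:t> element (hypothetical helper name). */
    static List<String> extractText(String xml) {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <w:t>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("w:t".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("w:t".equals(qName)) {
                    texts.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        // The namespace URI is a placeholder, not the real WordprocessingML one.
        String xml = "<w:p xmlns:w=\"urn:example\"><w:r><w:t>Text to Extract</w:t></w:r></w:p>";
        System.out.println(extractText(xml));
    }
}
```

If the factory is switched to namespace-aware mode, compare `localName` against `t` (together with the namespace URI) instead of comparing `qName`.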

Related

Java SAX is not parsing properly

I would appreciate any help on this.
This is my first handler I wrote.
I have a REST web service that returns XML containing links. The structure is quite simple and not deep.
I wrote a handler for this:
public class SAXHandlerLnk extends DefaultHandler {
public List<Link> lnkList = new ArrayList();
Link lnk = null;
private StringBuilder content = new StringBuilder();
@Override
//Triggered when the start of tag is found.
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equals("link")) {
lnk = new Link();
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equals("link")) {
lnkList.add(lnk);
}
else if (qName.equals("applicationCode")) {
lnk.applicationCode = content.toString();
}
else if (qName.equals("moduleCode")) {
lnk.moduleCode = content.toString();
}
else if (qName.equals("linkCode")) {
lnk.linkCode = content.toString();
}
else if (qName.equals("languageCode")) {
lnk.languageCode = content.toString();
}
else if (qName.equals("value")) {
lnk.value = content.toString();
}
else if (qName.equals("illustrationUrl")) {
lnk.illustrationUrl = content.toString();
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
content.append(ch, start, length);
}
}
Some elements in the returned XML can be empty. When this happens, my handler unfortunately assigns the previous element's value to the lnk object. So when illustrationUrl is empty in the XML, lnk.illustrationUrl ends up equal to lnk.value.
Link{applicationCode='onedownload', moduleCode='onedownload',...}
In the above example, I would like moduleCode to be empty or null, because in XML it is an empty tag.
Here is the calling class:
public class XMLRepositoryRestLinksFilterSAXParser {
public static void main(String[] args) throws Exception {
SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandlerLnk handler = new SAXHandlerLnk();
parser.parse({URL}, handler);
for ( Link lnk : handler.lnkList){
System.out.println(lnk);
}
}
}
As stated in my comment, reset the buffer at the start of every element. The callbacks are usually called in the order startElement, characters, (nested?), characters, endElement, where (nested?) represents an optional repeat of the entire sequence.
@Override
// Triggered when the start of a tag is found.
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
// Clear the buffer so an empty element does not inherit the previous element's text.
content.setLength(0);
if (qName.equals("link")) {
lnk = new Link();
}
}
Note that characters may be called multiple times for a single XML element in your document, so always accumulate the text by appending to a StringBuilder, as your handler does, rather than assigning it to a String. See this answer for an example.
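To illustrate the reset, here is a minimal sketch that parses a single link into a map instead of the `Link` class. The class name `EmptyElementDemo`, the helper `parseLink`, and the sample XML are assumptions made for the example. The `StringBuilder` is cleared in `startElement`, so the empty `moduleCode` element ends up with an empty string rather than the previous element's text.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyElementDemo {
    /** Parses one <link> element into an element-name -> text map (hypothetical helper). */
    static Map<String, String> parseLink(String xml) {
        Map<String, String> fields = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder content = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                content.setLength(0); // forget text left over from the previous element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                content.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Record every child element; skip the <link> wrapper itself.
                if (!"link".equals(qName)) {
                    fields.put(qName, content.toString());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<link><applicationCode>onedownload</applicationCode><moduleCode/></link>";
        System.out.println(parseLink(xml)); // moduleCode maps to an empty string
    }
}
```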

Getting SAX Parser attributes

<Details><propname key="workorderid">799</propname>
How do I get 799 from workorderid using SAX parsing?
When I use this code I get "workorderid", but not the value of workorderid:
if(localName.equals("propname")){
String workid = attributes.getValue("key");
if(localName.equals("propname")){
//set one flag here and in endElement() get the value associated with your localname(propname)
String workid = attributes.getValue("key");
Here is some code; try to understand it and adapt it to your needs.
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ExampleHandler extends DefaultHandler {
    private String key;   // value of the "key" attribute, e.g. "workorderid"
    private String item;  // text content of <propname>, e.g. "799"
    private boolean inItem = false;
    private StringBuilder content = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        // Start a fresh buffer for every element.
        content = new StringBuilder();
        if (localName.equalsIgnoreCase("propname")) {
            inItem = true;
            key = atts.getValue("key");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (localName.equalsIgnoreCase("propname") && inItem) {
            // The complete text is only known once the element ends.
            item = content.toString();
            inItem = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        content.append(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        // You can do something here, for example send
        // the collected values somewhere.
    }
}
There may be a mistake somewhere, as I wrote this in a hurry. I hope it helps.
The following will hold the value of the node.
public void characters(char[] ch, int start, int length) throws SAXException {
tempVal = new String(ch,start,length);
}
In the event handler method, you need to get it like this:
if(qName.equals("propname")) {
System.out.println(" node value " + tempVal); // node value
String attr = attributes.getValue("key") ; // will return attribute value for the propname node.
}
For propname, the attribute key has the value workorderid, which is what your code correctly returns.
What you need is the text value of the propname element.
// Pass your tag name, which is propname
NodeList nl = ele.getElementsByTagName(tagName);
if(nl != null && nl.getLength() > 0) {
Element el = (Element)nl.item(0);
textVal = el.getFirstChild().getNodeValue();
}
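Putting the SAX pieces from the answers above together, the following sketch reads both the `key` attribute and the element text in one pass. The class name `PropnameDemo` and the helper `keyAndValue` are made up for the example, and a closing `</Details>` tag is assumed because the snippet in the question is truncated. It uses `qName` on the assumption of a default, non-namespace-aware JDK parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PropnameDemo {
    /** Returns "key=text" for the <propname> element (hypothetical helper). */
    static String keyAndValue(String xml) {
        StringBuilder result = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private String key; // attribute value, available only in startElement
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("propname".equals(qName)) {
                    key = attrs.getValue("key");
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (key != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("propname".equals(qName)) {
                    // The element text is complete only at this point.
                    result.append(key).append('=').append(text.toString());
                    key = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String xml = "<Details><propname key=\"workorderid\">799</propname></Details>";
        System.out.println(keyAndValue(xml));
    }
}
```

The attribute is captured in `startElement` because that is the only callback that receives the `Attributes` object; the text is read in `endElement` because `characters` may deliver it in pieces.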

How to get content of <tagname> that contains other embedded XML tag in Java?

I have an XML document that has HTML tags included:
<chapter>
<h1>title of content</h1>
<p> my paragraph ... </p>
</chapter>
I need to get the content of <chapter> tag and my output will be:
<h1>title of content</h1>
<p> my paragraph ... </p>
My question is similar to this post: How parse XML to get one tag and save another tag inside
But I need to implement it in Java using SAX or DOM or ...?
I found a solution using SAX in this post: SAX Parser : Retrieving HTML tags from XML, but it is very buggy and doesn't work with large amounts of XML data.
Updated:
My SAX implementation:
In some situations it throws an exception: java.lang.StringIndexOutOfBoundsException: String index out of range: -4029
public class MyXMLHandler extends DefaultHandler {
private boolean tagFlag = false;
private char[] temp;
String insideTag;
private int startPosition;
private int endPosition;
private String tag;
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase(tag)) {
tagFlag = true;
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (qName.equalsIgnoreCase(tag)) {
insideTag = new String(temp, startPosition, endPosition - startPosition);
tagFlag = false;
}
}
public void characters(char ch[], int start, int length)
throws SAXException {
temp = ch;
if (tagFlag) {
startPosition = start;
tagFlag = false;
}
endPosition = start + length;
}
public String getInsideTag(String tag) {
this.tag = tag;
return insideTag;
}
}
Update 2: (Using StringBuilder)
I have accumulated characters by StringBuilder in this way:
public class MyXMLHandler extends DefaultHandler {
private boolean tagFlag = false;
private char[] temp;
String insideTag;
private String tag;
private StringBuilder builder;
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase(tag)) {
builder = new StringBuilder();
tagFlag = true;
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (qName.equalsIgnoreCase(tag)) {
insideTag = builder.toString();
tagFlag = false;
}
}
public void characters(char ch[], int start, int length)
throws SAXException {
if (tagFlag) {
builder.append(ch, start, length);
}
}
public String getInsideTag(String tag) {
this.tag = tag;
return insideTag;
}
}
But builder.append(ch, start, length); doesn't append start and end tags such as <EmbeddedTag atr="..."> and </EmbeddedTag> to the buffer. This code prints the output:
title of content
my paragraph ...
Instead of expected output:
<h1>title of content</h1>
<p> my paragraph ... </p>
Update 3:
Finally I have implemented the parser handler:
public class MyXMLHandler extends DefaultHandler {
private boolean tagFlag = false;
private String insideTag;
private String tag;
private StringBuilder builder;
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase(tag)) {
builder = new StringBuilder();
tagFlag = true;
}
if (tagFlag) {
builder.append("<" + qName);
for (int i = 0; i < attributes.getLength(); i++) {
builder.append(" " + attributes.getLocalName(i) + "=\"" +
attributes.getValue(i) + "\"");
}
builder.append(">");
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (tagFlag) {
builder.append("</" + qName + ">");
}
if (qName.equalsIgnoreCase(tag)) {
insideTag = builder.toString();
tagFlag = false;
}
System.out.println("End Element :" + qName);
}
public void characters(char ch[], int start, int length)
throws SAXException {
if (tagFlag) {
builder.append(ch, start, length);
}
}
public String getInsideTag(String tag) {
this.tag = tag;
return insideTag;
}
}
The problem with your code is that you try to remember the start and end positions of the string passed to you via the characters method. What you see in the exception thrown is the result of an inside tag that starts near the end of a character buffer and ends near the beginning of the next character buffer.
With SAX you need to copy the characters when they are offered, because the temporary buffer they occupy might be cleared or reused by the time you need them.
Your best bet is not to remember the positions in the buffers, but to create a new StringBuilder in startElement and add the characters to that, then get the complete string out the builder in endElement.
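A sketch of that StringBuilder approach, extended so that nested tags are written back into the buffer while the wrapping tag itself is left out. The names `InnerMarkupDemo` and `innerXml` are made up for the example, and the `<book>` wrapper in the sample is an assumption. A depth counter replaces the boolean flag so that nested elements with the same name as the target are handled. Note that the parser delivers text unescaped, and this sketch does not re-escape characters such as `&` or `<`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InnerMarkupDemo {
    /** Returns the markup inside the first element named tag (hypothetical helper). */
    static String innerXml(String xml, String tag) {
        String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder builder = new StringBuilder();
            private int depth = 0; // 0 = outside the target element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (depth > 0) {
                    // Nested element: write its start tag back into the buffer.
                    builder.append('<').append(qName);
                    for (int i = 0; i < attrs.getLength(); i++) {
                        builder.append(' ').append(attrs.getQName(i))
                               .append("=\"").append(attrs.getValue(i)).append('"');
                    }
                    builder.append('>');
                    depth++;
                } else if (qName.equals(tag) && result[0] == null) {
                    depth = 1; // entered the target; its own tag is not copied
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    builder.append(ch, start, length); // copy immediately, never keep ch
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (depth > 0) {
                    depth--;
                    if (depth > 0) {
                        builder.append("</").append(qName).append('>');
                    } else {
                        result[0] = builder.toString(); // left the target element
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<book><chapter><h1>title of content</h1><p> my paragraph ... </p></chapter></book>";
        System.out.println(innerXml(xml, "chapter"));
    }
}
```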
Try Digester. I used it years ago (version 1.5), and it was simple to create a mapping for XML like yours. There is a simple article on how to use Digester, but it covers version 1.5; I think the current version is 3.0, which contains a lot of new features ...

Reading nested tags with sax parser

I am trying to read an XML file with the following tags, but the SAX parser is unable to read nested tags like
<active-prod-ownership>
<ActiveProdOwnership>
<Product code="3N3" component="TRI_SCORE" orderNumber="1-77305469" />
</ActiveProdOwnership>
</active-prod-ownership>
Here is the code I am using:
public class LoginConsumerResponseParser extends DefaultHandler {
// ===========================================================
// Fields
// ===========================================================
static String str="default";
private boolean in_errorCode=false;
private boolean in_Ack=false;
private boolean in_activeProdOwnership= false;
private boolean in_consumerId= false;
private boolean in_consumerAccToken=false;
public void startDocument() throws SAXException {
Log.e("i am ","in start document");
}
public void endDocument() throws SAXException {
// Nothing to do
Log.e("doc read", " ends here");
}
/** Gets called on opening tags like:
* <tag>
* Can provide attribute(s), when xml was like:
* <tag attribute="attributeValue">*/
public void startElement(String namespaceURI, String localName,
String qName, Attributes atts) throws SAXException {
if(localName.equals("ack")){
in_Ack=true;
}
if(localName.equals("error-code")){
in_errorCode=true;
}
if(localName.equals("active-prod-ownership")){
Log.e("in", "active product ownership");
in_activeProdOwnership=true;
}
if(localName.equals("consumer-id")){
in_consumerId= true;
}
if(localName.equals("consumer-access-token"))
{
in_consumerAccToken= true;
}
}
/** Gets called on closing tags like:
* </tag> */
public void endElement(String namespaceURI, String localName, String qName)
throws SAXException {
if(localName.equals("ack")){
in_Ack=false;
}
if(localName.equals("error-code")){
in_errorCode=false;
}
if(localName.equals("active-prod-ownership")){
in_activeProdOwnership=false;
}
if(localName.equals("consumer-id")){
in_consumerId= false;
}
if(localName.equals("consumer-access-token"))
{
in_consumerAccToken= false;
}
}
/** Gets called on the following structure:
* <tag>characters</tag> */
public void characters(char ch[], int start, int length) {
if(in_Ack){
str= new String(ch,start,length);
}
if(str.equalsIgnoreCase("success")){
if(in_consumerId){
}
if(in_consumerAccToken){
}
if(in_activeProdOwnership){
str= new String(ch,start,length);
Log.e("active prod",str);
}
}
}
}
But when the in_activeProdOwnership flag is set, it reads only "<" as the contents of the tag.
Please help; I need the whole data to be read.
The tags in your XML file and in your parser do not match. I think you are mixing up tags with attribute names. Here is code that correctly parses your sample XML:
public class LoginConsumerResponseParser extends DefaultHandler {
public void startDocument() throws SAXException {
System.out.println("startDocument()");
}
public void endDocument() throws SAXException {
System.out.println("endDocument()");
}
public void startElement(String namespaceURI, String localName,
String qName, Attributes attrs)
throws SAXException {
if (qName.equals("ActiveProdOwnership")) {
inActiveProdOwnership = true;
} else if (qName.equals("Product")) {
if (!inActiveProdOwnership) {
throw new SAXException("Product tag not expected here.");
}
int length = attrs.getLength();
for (int i=0; i<length; i++) {
String name = attrs.getQName(i);
System.out.print(name + ": ");
String value = attrs.getValue(i);
System.out.println(value);
}
}
}
public void endElement(String namespaceURI, String localName, String qName)
throws SAXException {
if (qName.equals("ActiveProdOwnership"))
inActiveProdOwnership = false;
}
public void characters(char ch[], int start, int length) {
}
public static void main(String args[]) throws Exception {
String xmlFile = args[0];
File file = new File(xmlFile);
if (file.exists()) {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
DefaultHandler handler = new LoginConsumerResponseParser();
parser.parse(file, handler);
}
else {
System.out.println("File not found!");
}
}
private boolean inActiveProdOwnership = false;
}
A sample run will produce the following output:
startDocument()
code: 3N3
component: TRI_SCORE
orderNumber: 1-77305469
endDocument()
I suspect this is what's going wrong:
new String(ch,start,length);
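The constructor call itself is valid (String has a constructor taking char[], int, int), but characters() may be called several times for one element, each time with only a chunk of the text. Each call overwrites str, so you can end up with just a fragment.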
I suggest instead that you make the str field a StringBuilder, not a String, and then use this:
builder.append(ch,start,length);
You then need to clear the StringBuilder each time startElement() is called.

How to adjust my code to this situation for SAX XML parsing in Android

On the advice of someone here on Stack Overflow, I changed my parsing method to SAXParser.
Thanks to different tutorials I was able to get it to work, and I have to say that it does work faster (which is very important for my app).
The problem, however, is that my XML file goes deeper than the example XML files in the tutorials I've seen.
This is a sample of my XML file:
<Message>
<Service>servicename</Service>
<Insurances>
<BreakdownInsurance>
<Name>Insurance name</Name>
<InsuranceNR/>
<LicenseNr/>
</BreakdownInsurance>
<CarDamageInsurance>
<Name>Insurance name 2</Name>
<InsuranceNR></InsuranceNR>
</CarDamageInsurance>
</Insurances>
<Personal>
<Name>my name</Name>
</Personal>
</Message>
I can get the personal details like the name, but my code doesn't seem to work with the insurances. I think this is because they are nested one level deeper.
This is the code I'm using in my Handler class:
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
currentElement = true;
if (localName.equals("Message")) {
geg = new GegevensXML();
}
}
@Override
public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
currentElement = false;
/********** Autopech **********/
if (localName.equals("Name")) {
geg.setAutopechMaatschappij(currentValue);
}
else if (localName.equals("InsuranceNR")){
geg.setAutopechPolis(currentValue);
}
else if (localName.equals("LicenseNr")){
geg.setAutopechKenteken(currentValue);
}
@Override
public void characters(char ch[], int start, int length) {
if (currentElement) {
currentValue = new String(ch, start, length);
currentElement = false;
}
So how must I adjust it?
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
currentElement = true;
if (localName.equals("Message")) {
geg = new GegevensXML();
}
if(localName.equals("BreakdownInsurance"))
{
BreakdownInsurance = true;
}
}
@Override
public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
currentElement = false;
/********** Autopech **********/
if (localName.equals("Name"))
{
if(BreakdownInsurance)
{
geg.setBreakdownInsuranceName(currentValue);
BreakdownInsurance = false;
}
else
{
geg.setAutopechMaatschappij(currentValue);
}
}
else if (localName.equals("InsuranceNR")){
geg.setAutopechPolis(currentValue);
}
else if (localName.equals("LicenseNr")){
geg.setAutopechKenteken(currentValue);
}
Do the same for the other cases. BreakdownInsurance is a boolean; use it as a flag.
The depth should not be a problem. I have more levels and the exact code works fine for me. Could it be that you have several nodes with the same name? "Name" in Personal, and "Name" in those insurances nodes?
Just modify endElement(): add flags to indicate where the current Name should be saved, since Name occurs under both <BreakdownInsurance> and <Personal>.
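As an alternative to one boolean flag per parent, the handler can keep a stack of the currently open elements and look at the parent when a `Name` element ends. This is a sketch under assumptions, not the code from the answers above: the class name `ParentAwareDemo`, the helper `namesByParent`, and the `Parent/Name` key format are made up, and a map stands in for the `GegevensXML` setters.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParentAwareDemo {
    /** Maps "Parent/Name" to the text of each <Name> element (hypothetical helper). */
    static Map<String, String> namesByParent(String xml) {
        Map<String, String> values = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> open = new ArrayDeque<>(); // currently open elements
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                open.push(qName);
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();                  // remove the element that just ended
                String parent = open.peek(); // its parent, or null for the root
                if ("Name".equals(qName)) {
                    values.put(parent + "/Name", text.toString().trim());
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<Message><Insurances><BreakdownInsurance><Name>Insurance name</Name>"
                + "</BreakdownInsurance></Insurances><Personal><Name>my name</Name></Personal></Message>";
        System.out.println(namesByParent(xml));
    }
}
```

With the parent name available, `endElement` can choose the right setter for each `Name` without adding a new flag for every insurance type.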
