How do I parse answers from AWS Mechanical Turk in Java?

How do I parse answers from AWS Mechanical Turk in Java? - java

The answer() method for an Assignment returns a String like this:
<?xml version="1.0" encoding="ASCII"?><QuestionFormAnswers xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionFormAnswers.xsd"><Answer><QuestionIdentifier>blah</QuestionIdentifier><FreeText>toplevel</FreeText></Answer></QuestionFormAnswers>
How am I supposed to parse this to get the actual answers? I see in older versions of the API there's a QuestionFormAnswers type. This is also referenced in the documentation, which states:
public String getAnswer()
The Worker's answers submitted for the HIT contained in a QuestionFormAnswers document, if the Worker provides an answer. If the Worker does not provide any answers, Answer may contain a QuestionFormAnswers document, or Answer may be empty.
Returns:
The Worker's answers submitted for the HIT contained in a QuestionFormAnswers document, if the Worker provides an answer. If the Worker does not provide any answers, Answer may contain a QuestionFormAnswers document, or Answer may be empty.
But it actually returns a String and not a QuestionFormAnswers. How do I parse this string XML result? Can I just use any standard method of parsing XML documents?

The answer appears to be yes, you can use any standard XML parsing technique.
Here is what worked for me:
private static Map<String, String> parseXML(String answerXML) {
try {
List<String> identifierList = new ArrayList<>();
List<String> answerList = new ArrayList<>();
InputSource is = new InputSource(new StringReader(answerXML));
Document document = null;
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(is);
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList identifiers = null;
try {
identifiers = (NodeList) xpath.evaluate("//Answer/QuestionIdentifier", document,
XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
for (int i = 0; i < identifiers.getLength(); i++) {
Node identifier = identifiers.item(i);
String relation = identifier.getTextContent();
identifierList.add(relation);
}
NodeList texts = (NodeList) xpath.evaluate("//Answer/FreeText", document, XPathConstants.NODESET);
for (int i = 0; i < texts.getLength(); i++) {
Node text = texts.item(i);
String answer = text.getTextContent();
answerList.add(answer);
}
Map<String, String> result = new HashMap<>();
for (int k = 0; k < identifierList.size(); k++) {
result.put(identifierList.get(k), answerList.get(k));
}
return result;
} catch (Exception e) {
log.error("Failed to parse XML " + answerXML, e);
}
return null;
}
This creates a map from the input ids to the answers.

Related

Xpath - why only 1 value returned?

Given the xml snippet:
<AddedExtras>
<AddedExtra Code="1234|ABCD" Quantity="1" Supplier="BDA"/>
<AddedExtra Code="5678|EFGH" Quantity="1" Supplier="BDA"/>
<AddedExtra Code="9111|ZXYW" Quantity="1" Supplier="BDA"/>
</AddedExtras>
The following XPath expression:
//*["AddedExtra"]/#Code
when run through a checker evaluates to:
Attribute='Code=1234|ABCD'
Attribute='Code=5678|EFGH'
Attribute='Code=9111|ZXYW'
Why then, does the following code only return the first line?
private String allCodes = "//*["AddedExtra"]/#Code";
get my XML from system and parse it into a Doc:
public Document parseResponse(String response){
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder;
Document doc = null;
//Create a document reader and an XPath object
try {
builder = factory.newDocumentBuilder();
doc = builder.parse(new InputSource((new StringReader(response))));
} catch (ParserConfigurationException | org.xml.sax.SAXException | IOException e) {
e.printStackTrace();
}
return doc;
}
Get the new doc:
public Document getParsedResponse(String response) {
return parseResponse(response);
}
Return the Xpath value from the doc:
public String getAllCodeOptions(String response){
Document doc = getParsedResponse(response);
return getNodeValueFromNodeList(doc, allCodes);
}
new method to read the XML nodes:
public String getNodeValueFromNodeList(Document doc, String expression){
NodeList nodeList = null;
String nodes = null;
try {
nodeList = (NodeList) xpath.compile(expression).evaluate(doc, XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
for(int i=0; i < nodeList.getLength(); i++){
Node node = nodeList.item(i);
nodes = node.getNodeValue();
}
return nodes;
}
returns:
Attribute='Code=1234|ABCD'

You would require to use right evaluate method which takes return type as an argument. Something like below,
NodeSet result = (NodeSet)e.evaluate(e, doc, XPathConstants.NODESET);
for(int index = 0; index < result.getLength(); index ++)
{
Node node = result.item(index);
String name = node.getNodeValue();
}

The problem is that you are asking for only one value.
Try this:
NodeList nodeList = (NodeList)e.evaluate(doc, XPathConstants.NODESET);
for multiple values.
See http://viralpatel.net/blogs/java-xml-xpath-tutorial-parse-xml/ for a tutorial.

Retrieve XML Element names with Java from unknown message format

I am parsing XML from lots of JMS messaging topics, so the structure of each message varies a lot and I'd like to make one general tool to parse them all.
To start, all I want to do is get the element names:
<gui-action>
<action>some action</action>
<params>
<param1>blue</param1>
<param2>tall</param2>
<params>
</gui-action>
I just want to retrieve the strings "gui-action", "action", "params", "param1", and "param2." Duplicates are just fine.
I've tried using org.w3c.dom.Node, Element, NodeLists and I'm not having much luck. I keep getting the element values, not the names.
private Element root;
private Document doc;
private NodeList nl;
//messageStr is passed in elsewhere in the code
//but is a string of the full XML message.
doc = xmlParse( messageStr );
root = doc.getDocumentElement();
nl = root.getChildNodes();
int size = nl.getLength();
for (int i=0; i<size; i++) {
log.info( nl.item(i).getNodeName() );
}
public Document xmlParse( String xml ){
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db;
InputSource is;
try {
//Using factory get an instance of document builder
db = dbf.newDocumentBuilder();
is = new InputSource(new StringReader( xml ) );
doc = db.parse( is );
} catch(ParserConfigurationException pce) {
pce.printStackTrace();
} catch(SAXException se) {
se.printStackTrace();
} catch(IOException ioe) {
ioe.printStackTrace();
}
return doc;
//parse using builder to get DOM representation of the XML file
}
My logged "parsed" XML looks like this:
#text
action
#text
params
#text

Figured it out. I was iterating over only the child nodes, and not including the parent. So now I just filter out the #texts, and include the parent. Derp.
log.info(root.getNodeName() );
for (int i=0; i<size; i++) {
nodeName = nl.item(i).getNodeName();
if( nodeName != "#text" ) {
log.info( nodeName );
}
}
Now if anyone knows a way to get a NodeList of the entire document, that would be awesome.

Xpath approach in case of large files

The class you're gonna see right now is the classic approach to parse an XML document via XPath in Java:
public class Main {
private Document createXMLDocument(String fileName) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(fileName);
return doc;
}
private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpathExpression);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
}
public static void main(String[] args) throws Exception {
Main m = new Main();
Document doc = m.createXMLDocument("tv.xml");
NodeList nodes = m.readXMLNodes(doc, "//serie/eason/#id");
int n = nodes.getLength();
Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
for (int i = 1; i <= n; i++) {
nodes = m.readXMLNodes(doc, "//serie/eason[#id='" + i + "']/episode/text()");
List<String> episodes = new ArrayList<String>();
for (int j = 0; j < nodes.getLength(); j++) {
episodes.add(nodes.item(j).getNodeValue());
}
series.put(i, episodes);
}
for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
System.out.println("Season: " + entry.getKey());
for (String ep : entry.getValue()) {
System.out.println("Episodio: " + ep);
}
System.out.println("+------------------------------------+");
}
}
}
In there I find some methods to be worrying in case of a huge xml file. Like the use of
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the xml document I need to handle is created by the customer and inside you can basically have an indefinite number of records describing emails and their contents (every user has its own personal email, so lots of html in there). I know it's not the smartest approach but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge xml files using xpath?

You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html

First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.

A question on how to solve the string problem in Java

I've created a simple xml file here:
http://roberthan.host56.com/productsNew.xml
which is quite simple, the root node is [products] while all other element nodes are [product]. Under each [product] node, there are two child nodes, [code] and [name], so it basically looks like:
[product]
[code]ddd[/code]
[name]ssss[/name]
[/product]
I've also written up the following Java code to parse this XML file and take out the text content of the [product] node, and add it to a JComboBox.
docBuilder = docFactory.newDocumentBuilder();
doc = docBuilder.parse("http://roberthan.host56.com/productsNew.xml");
NodeList productNodes = doc.getElementsByTagName("product");
productlist.clear();
for (i = 0; i < productNodes.getLength(); i++)
{
Node childNode = productNodes.item(i);
if (childNode.hasChildNodes()) {
NodeList nl = childNode.getChildNodes();
Node nameNode = nl.item(2);
productlist.add(nameNode.getTextContent());
}
}
final JComboBox productComboB = new JComboBox();
Iterator iterator = productlist.iterator();
while(iterator.hasNext())
{
productComboB.addItem(iterator.next().toString());
}
The code is quite straightforward, I firstly parse the xml and get all the product nodes and put them into a nodelist, and the productList is an arrayList. I loop through the all the [product] nodes, for each of them, if it has child nodes, then I take the second child node (which is the [name] node) and put the text content of it in the array list, and finally, I loop through the arrayList and add each item to the combo box.
The problem I got is, if I select the [code] child node, which means "Node nameNode = nl.item(1)", it will work perfectly; however, if I change that item(1) to item(2) to extract all the [name] nodes, the combo box will have a drop down list, but all the items are blank, like I have inserted 10 empty strings.
Also, if I try to add a "Hello World" string into the combo box after the above code, the "Hello World" item will appear after the 10 empty items.
I have spent the whole afternoon debugging this but still no breakthrough, the XML is actually quite simple and the Java is straightforward too. Could anyone share some thoughts with me on this please. Thanks a lot!

It is because the node list contains text nodes also.
If you add the following snippet to your code you will find that
for(int j = 0;j<nl.getLength();j++){
System.out.println(nl.item(j).getNodeName());
}
It will give the following output for each iteration of the product
#text
code
#text
name
#text
This means you have to get the 3rd element to get the name node.
Node nameNode = nl.item(3);
But I'll suggest you to use XPath to solve this problem.
NodeList nodelist = XPathAPI.selectNodeList(doc, "//products/product/name");
for (int i = 0; i < nodelist.getLength(); i++) {
productlist.add(nodelist.item(i).getTextContent());
}

XPath using this expression will easily solve your problem:
String XPATH_EXPRESSION1 = "//name/text()";
e.g.,
public static final String PRODUCTS_NEW = "http://roberthan.host56.com/productsNew.xml";
public static final String XPATH_EXPRESSION1 = "//name/text()";
public XmlFun() {
URL productsUrl;
try {
productsUrl = new URL(PRODUCTS_NEW);
List<String> nameList = xPathExtract(productsUrl.openStream());
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}
private List<String> xPathExtract(InputStream inStream) throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document domDoc = builder.parse(inStream);
XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression xExpr = xpath.compile(XPATH_EXPRESSION1);
NodeList nodes = (NodeList)xExpr.evaluate(domDoc, XPathConstants.NODESET);
List<String> resultList = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
String node = nodes.item(i).getNodeValue();
resultList.add(node);
}
return resultList;
}

Parsing xml with dom4j or jdom or anyhow

I wanna read feed entries and I'm just stuck now. Take this for example : https://stackoverflow.com/feeds/question/2084883 lets say I wanna read all the summary node value inside each entry node in document. How do I do that? I've changed many variations of code this one is closest to what I want to achieve I think :
Element entryPoint = document.getRootElement();
Element elem;
for(Iterator iter = entryPoint.elements().iterator(); iter.hasNext();){
elem = (Element)iter.next();
System.out.println(elem.getName());
}
It goes trough all nodes in xml file and writes their name. Now what I wanted to do next is
if(elem.getName().equals("entry"))
to get only the entry nodes, how do I get elements of the entry nodes, and how to get let say summary and its value? tnx
Question: how to get values of summary nodes from this link

Have you tried jdom? I find it simpler and convenient.
http://www.jdom.org/
To get all children of an xml element, you can just do
SAXBuilder sb = new SAXBuilder();
StringReader sr = new StringReader(xmlDocAsString);
Document doc = sb.build(sr);
Element root = doc.getRootElement();
List l = root.getChildren("entry");
for (Iterator iter = l.iterator(); iter.hasNext();) {
...//do whatever...
}

Here's how you'd do it using vanilla Java:
//read the XML into a DOM
StreamSource source = new StreamSource(new StringReader("<theXml></theXml>"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();
//make XPath object aware of namespaces
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext(){
#Override
public String getNamespaceURI(String prefix) {
if ("atom".equals(prefix)){
return "http://www.w3.org/2005/Atom";
}
return null;
}
#Override
public String getPrefix(String namespaceURI) {
return null;
}
#Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
});
//get all summaries
NodeList summaries = (NodeList) xpath.evaluate("/atom:feed/atom:entry/atom:summary", root, XPathConstants.NODESET);
for (int i = 0; i < summaries.getLength(); ++i) {
Node summary = summaries.item(i);
//print out all the attributes
for (int j = 0; j < summary.getAttributes().getLength(); ++j) {
Node attr = summary.getAttributes().item(j);
System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
}
//print text content
System.out.println(summaries.item(i).getTextContent());
}

if(elem.getName() == "entry")
I have no idea whether this is your problem (you don't really state what your problem is), but never test string equality with --. Instead, use equals():
if(elem.getName().equals("entry"))

A bit late but it might be useful for people googling...
There is a specialized API for dealing with RSS and Atom feeds in Java. It's called Rome, can be found here :
http://java.net/projects/rome/
It is really quite useful, it makes easy to read feed whatever the RSS or Atom version. You can also build feeds and generate the XML with it though I have no experience with this feature.
Here is a simple example that reads a feed and prints out the description nodes of all the entries in the feed :
URL feedSource = new URL("http://....");
feed = new SyndFeedInput().build(new XmlReader(feedSource));
List<SyndEntryImpl> entries = (List<SyndEntryImpl>)feed.getEntries();
for(SyndEntryImpl entry : entries){
System.out.println(entry.getDescription().getValue());
}
Simple enough.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How do I parse answers from AWS Mechanical Turk in Java? - java

Related

Xpath - why only 1 value returned?

Retrieve XML Element names with Java from unknown message format

Xpath approach in case of large files

A question on how to solve the string problem in Java

Parsing xml with dom4j or jdom or anyhow

Categories

Resources