Lucene xml retrieval - java

I have an XML file with about 500 fields.
I have looked into the Apache Commons Digester option,
but it involves a lot of work: I would have to create many classes to cover these 500 fields.
Is there any other way for me to get at the field values?
Suppose I search for 10013. I can get the name of the matching document, but can I also get the value of the name field from XML like the following?
<contact type="individual">
<name>name field</name>
<address>999 W. Prince St.</address>
<city>New York</city>
<province>NY</province>
<postalcode>10013</postalcode>
<country>USA</country>
<telephone>1-212-345----</telephone>
</contact>
Or does Apache Solr give me a better API for this?
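One way to avoid writing a class per field is to flatten each contact element into a map of element name to text. The sketch below does that with the JDK's DOM parser only; the class name `ContactFlattener`, both method names, and the sample values are made up for illustration. In Lucene, each map entry could then be added as a stored field (for example with `Field.Store.YES`) so that the name can be read back from a search hit with `Document.get("name")`; the exact indexing calls depend on the Lucene version and are not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lucene, Solr, or Digester.
final class ContactFlattener {

    /** Turns every <contact> element into a map of child element name -> text. */
    static List<Map<String, String>> flatten(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<Map<String, String>> contacts = new ArrayList<>();
            NodeList nodes = dom.getElementsByTagName("contact");
            for (int i = 0; i < nodes.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = nodes.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace text nodes; keep only child elements.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
                contacts.add(fields);
            }
            return contacts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse contact XML", e);
        }
    }

    /** Returns the name of the first contact with the given postal code, or null. */
    static String nameForPostalCode(List<Map<String, String>> contacts, String postalCode) {
        for (Map<String, String> contact : contacts) {
            if (postalCode.equals(contact.get("postalcode"))) {
                return contact.get("name");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample data invented for the demonstration.
        String xml = "<contacts><contact type=\"individual\"><name>Jane Doe</name>"
                + "<city>New York</city><postalcode>10013</postalcode></contact></contacts>";
        System.out.println(nameForPostalCode(flatten(xml), "10013"));
    }
}
```

The linear scan in `nameForPostalCode` only stands in for the search step; with a real index the lookup would be a query on the postalcode field.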

Related

Java parser to read nested and lengthy xml

I have gone through many Stack Overflow pages and other sites to decide on a parser that fits my requirement.
I need to read large, nested XML files in Java, so a DOM parser would not be a good fit. My XML looks like this (snippet):
<products>
<product>
<productCode></productCode>
<Code>3002191</Code>
<anotherCode></anotherCode>
<entityName>entityName value</entityName>
<entityName2>entityName value</entityName2>
<Type>value</Type>
<List>1</List>
<SecondCode>124</SecondCode>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>045</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>1219</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
</product>
<product>
..
</product>
</products>
Requirement: I need to store the product information in a list of hash maps for further processing with other XML files. At first I thought of using the StAX API for this. However, each docInfo element contains a countries element, so one document can apply to several countries, and I cannot move backward in the stream to save another copy of the document (same document info, different country). Please let me know if this is not clear enough.
Which parser would handle this situation well? I do not have a schema for this XML.
Thanks a lot.
To parse a large amount of XML, the best option is SAX:
https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
You implement the ContentHandler interface and put the logic you need in the callbacks that fire for docInfo and its nested countries.
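A minimal sketch of that idea, using the element names from the question: the handler extends `DefaultHandler` (which implements `ContentHandler`), buffers the fields of the current docInfo, and only writes rows when the closing docInfo tag arrives, so no backward parsing is needed. The class name `DocInfoSaxReader` and the choice of one map per (docInfo, country) pair are assumptions about what the asker wants, not something stated in the answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reader: emits one map per (docInfo, country) pair.
final class DocInfoSaxReader {

    static List<Map<String, String>> read(InputSource source) {
        final List<Map<String, String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private Map<String, String> product;  // simple children of <product>
            private Map<String, String> docInfo;  // simple children of <docInfo>
            private List<String> countries;       // countries of the current <docInfo>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);
                if (qName.equals("product")) {
                    product = new LinkedHashMap<>();
                } else if (qName.equals("docInfo")) {
                    docInfo = new LinkedHashMap<>();
                    countries = new ArrayList<>();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);  // may be called several times per element
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String value = text.toString().trim();
                text.setLength(0);
                switch (qName) {
                    case "products":
                    case "countries":
                        break;
                    case "product":
                        product = null;
                        break;
                    case "country":
                        countries.add(value);
                        break;
                    case "docInfo":
                        // All countries are known now: copy the document once per country.
                        // A docInfo without any country produces no row in this sketch.
                        for (String country : countries) {
                            Map<String, String> row = new LinkedHashMap<>(product);
                            row.putAll(docInfo);
                            row.put("country", country);
                            rows.add(row);
                        }
                        docInfo = null;
                        break;
                    default:
                        if (docInfo != null) {
                            docInfo.put(qName, value);
                        } else if (product != null) {
                            product.put(qName, value);
                        }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse products XML", e);
        }
        return rows;
    }
}
```

For a file on disk, the caller could pass `new InputSource(new FileInputStream(path))`. Note that a row only contains the product fields that appeared before its docInfo, which matches the element order in the snippet.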

Index single Xml-file with Lucene

I'm writing a Java application and want to index an XML file with Lucene so that I can search for a drug that has a given target. The file is 400 MB and contains over 8000 drug entries.
<drug type="biotech" created="2005-06-13" updated="2015-11-27">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
....
<targets>
<target position="1">
<id>BE0000767</id>
<name>Epidermal growth factor receptor</name>
....
</target>
....
</targets>
</drug>
<drug>
....
</drug>
How can I index this file so that each drug entry becomes one Document?
If anyone has useful links, resources, or tips on how to index this XML, please let me know :)
The most flexible strategy is usually to use SolrJ from a small Java application that reads the file and transforms it into a suitable format for indexing in Solr. That way you can easily preprocess certain fields before Solr receives them.
Another option is to use XSL to transform the XML file into something that Solr understands. This can be done either server-side (as with the XSLTUpdateRequestHandler) or client-side (transform the XML document into an update request and submit it to the standard request handler).
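The "small Java application that reads the file" part could look roughly like the StAX sketch below, which streams the file and hands over one object per drug element without loading the 400 MB into memory. `DrugEntryReader`, `DrugEntry`, and the choice of fields (primary ID, name, target names) are assumptions made for illustration. The sink is where a SolrJ or Lucene document would be built and submitted; those calls are deliberately left out.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical streaming splitter: one DrugEntry per <drug> element.
final class DrugEntryReader {

    /** Hypothetical holder for the fields worth indexing. */
    static final class DrugEntry {
        String id;
        String name;
        final List<String> targetNames = new ArrayList<>();
    }

    static void read(Reader in, Consumer<DrugEntry> sink) {
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            Deque<String> path = new ArrayDeque<>();  // open elements, innermost first
            DrugEntry current = null;
            int drugDepth = -1;                       // path size while inside <drug>
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String tag = xml.getLocalName();
                    String parent = path.peek();
                    path.push(tag);
                    boolean directChild = path.size() == drugDepth + 1;
                    if (current == null) {
                        if (tag.equals("drug")) {
                            current = new DrugEntry();
                            drugDepth = path.size();
                        }
                    } else if (directChild && tag.equals("drugbank-id")
                            && "true".equals(xml.getAttributeValue(null, "primary"))) {
                        current.id = text(xml, path);
                    } else if (directChild && tag.equals("name")) {
                        current.name = text(xml, path);
                    } else if (tag.equals("name") && "target".equals(parent)) {
                        current.targetNames.add(text(xml, path));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    path.pop();
                    if (current != null && path.size() < drugDepth) {
                        sink.accept(current);         // </drug> reached: entry is complete
                        current = null;
                    }
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read drug XML", e);
        }
    }

    /** Reads a text-only element; getElementText() consumes its end tag, so pop here. */
    private static String text(XMLStreamReader xml, Deque<String> path) throws XMLStreamException {
        String value = xml.getElementText().trim();
        path.pop();
        return value;
    }
}
```

Tracking the element path matters here because name occurs both directly under drug and under target; matching on the tag name alone would mix the two.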

Transform xml with data from different sources to another xml using java

I have an XML file in which some elements contain certain values. For example:
<item>
<origin>
<![CDATA[KWI]]>
</origin>
<destination>
<![CDATA[DOH]]>
</destination>
</item>
I have an Excel sheet containing the mapping between country codes and port codes:
COUNTRY_CODE PORT_CODE MANAGING_PORT_STATION
KW KWI MPS1
QA DOH MPS2
In the output XML, I need to produce something like:
<itemOut>
<country><![CDATA[KW]]></country>
<managingPortStation>MPS1</managingPortStation>
<dest><![CDATA[DOH]]></dest>
</itemOut>
So in short, I need to combine data from non-XML sources with the input XML file to produce the output XML file.
What should I use to accomplish this? Is it possible with XSLT? Or what APIs are available in Java? I have only skimmed through JAXP; is it worth spending more time on it for my case? I would prefer to do it in Java rather than XSLT, since I am more familiar with it.
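For the pure-Java route, plain JAXP (DOM plus an identity Transformer for serialization) is enough. The sketch below assumes the spreadsheet has already been loaded into a map keyed by PORT_CODE (reading the Excel file itself would need a library such as Apache POI, or a CSV export, and is not shown) and that the country and managing port station are looked up from the origin port, as in the sample output. `PortItemTransformer` and `PortInfo` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical transformer for a single <item>.
final class PortItemTransformer {

    /** One spreadsheet row; the map that holds these is keyed by PORT_CODE. */
    static final class PortInfo {
        final String countryCode;
        final String managingPortStation;

        PortInfo(String countryCode, String managingPortStation) {
            this.countryCode = countryCode;
            this.managingPortStation = managingPortStation;
        }
    }

    static String transform(String itemXml, Map<String, PortInfo> portsByCode) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element item = builder.parse(new InputSource(new StringReader(itemXml)))
                    .getDocumentElement();
            String origin = textOf(item, "origin");
            String destination = textOf(item, "destination");
            PortInfo port = portsByCode.get(origin);
            if (port == null) {
                throw new IllegalArgumentException("Unknown port code: " + origin);
            }

            Document out = builder.newDocument();
            Element itemOut = out.createElement("itemOut");
            out.appendChild(itemOut);
            append(out, itemOut, "country", port.countryCode);
            append(out, itemOut, "managingPortStation", port.managingPortStation);
            append(out, itemOut, "dest", destination);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Ask the serializer to wrap the text of these elements in CDATA sections.
            serializer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "country dest");
            StringWriter writer = new StringWriter();
            serializer.transform(new DOMSource(out), new StreamResult(writer));
            return writer.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not transform item", e);
        }
    }

    /** Text of the first descendant with the given tag; trim() drops the line breaks around CDATA. */
    private static String textOf(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    private static void append(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        Map<String, PortInfo> ports = new HashMap<>();
        ports.put("KWI", new PortInfo("KW", "MPS1"));
        ports.put("DOH", new PortInfo("QA", "MPS2"));
        String item = "<item><origin><![CDATA[KWI]]></origin>"
                + "<destination><![CDATA[DOH]]></destination></item>";
        System.out.println(transform(item, ports));
    }
}
```

The same lookup could also be done in XSLT by passing the mapping in as a second document, but keeping it in a Java map fits the stated preference for Java over XSLT.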

Xstream append to existing XML file

First, let me say I am very new to Java. I've been trying to figure out how to append a chunk of XML to an existing XML file using XStream.
Example XML:
<root>
<first>
<a>Some Value</a>
<b>Some Value B</b>
</first>
<second>
<a>Another Value</a>
<b>Another Value B</b>
</second>
</root>
How would I go about appending the following using Xstream?
<third>
<a>More A</a>
<b>More B</b>
</third>
Have you followed XStream's Two Minute Tutorial yet?
You should settle some implementation choices first: which way do you want to go with XStream?
For example, is the XML document large or small? (If small, you could use a DomDriver. If large, use a StaxDriver.)
Does your XML document use namespaces? If so, be aware that not all XStream parsers are namespace-aware; see the XStream FAQ.
More information can be found here to get you started.
Please provide us your SSCCE so that users can try out your code example.
See details on how to write a SSCCE here and here
Also include a small valid XML file in your SSCCE.
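As a point of comparison while the XStream setup is being sorted out, the append itself can be done with the JDK's DOM API alone. This is not XStream and not what the question asked for; it is a minimal sketch under the assumption that the file is small enough to load into memory, with `XmlAppender` as a made-up name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper using plain DOM instead of XStream.
final class XmlAppender {

    /** Appends the root element of fragmentXml as the last child of documentXml's root. */
    static String append(String documentXml, String fragmentXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(documentXml)));
            Document fragment = builder.parse(new InputSource(new StringReader(fragmentXml)));

            // Nodes belong to the document that created them, so import before appending.
            Node imported = doc.importNode(fragment.getDocumentElement(), true);
            doc.getDocumentElement().appendChild(imported);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not append XML fragment", e);
        }
    }

    public static void main(String[] args) {
        String root = "<root><first><a>Some Value</a><b>Some Value B</b></first>"
                + "<second><a>Another Value</a><b>Another Value B</b></second></root>";
        String third = "<third><a>More A</a><b>More B</b></third>";
        System.out.println(append(root, third));
    }
}
```

With XStream, the new chunk would typically come from serializing an object rather than from a string, which is why the object model and driver choice above need to be settled first.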

Using org.apache.commons.digester.Digester in XML with attributes

I'm going to extract values from this XML/RDF:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:j.0="urn:turismoculturale.itdc.filas-1.0.0-RC1#">
<j.0:Chiesa rdf:ID="turismoCulturale_POI_880">
<j.0:title xml:lang="en">Church of S. Giuda Taddeo or S. Onofrio - Gaeta</j.0:title>
<j.0:title xml:lang="it">Chiesa S. Giuda Taddeo o S. Onofrio - Gaeta</j.0:title>
</j.0:Chiesa>
</rdf:RDF>
I would like to get the "en" title when the current language is "en", and the "it" title otherwise. I am able to set the title value in the Poi bean by using:
Digester digester = new Digester();
digester.setNamespaceAware( true );
digester.setRuleNamespaceURI( "urn:turismoculturale.itdc.filas-1.0.0-RC1#" );
digester.addObjectCreate( "*/Chiesa", Poi.class);
digester.addBeanPropertySetter("*/title", "title");
...
but I don't know whether it is the English title or the Italian one.
OK, first and foremost: don't try to parse RDF/XML with an XML parser. It is never going to work reliably, because the semantics of the XML document are irrelevant with respect to RDF/XML. It is a bad idea (if you know how RDF/XML works), especially in your case, where the RDF/XML is being generated dynamically (you can tell by the namespaces). You need to use an RDF parser to parse RDF.
So that means: don't use an XML-to-Java object mapping tool; use an RDF-to-Java object mapping tool.
Here is a great link explaining how to do this:
http://answers.semanticweb.com/questions/859/best-way-to-convert-rdfxml-file-to-pojos
And another:
http://answers.semanticweb.com/questions/3251/experience-using-java-based-frameworks-for-rdf-to-pojo-and-vice-versa-mapping
Along with links to all the tools in the aforementioned resource:
Jenabean
Empire
AliBaba
RDFReactor
For an out and out RDF parser, look at Jena:
http://incubator.apache.org/jena
It's an Apache project that is also nicely Maven'ed up.
The Commons Digester FAQ says:
Occasionally, people ask how they can fire a rule for an element based on the value of an attribute.
There is no simple way to do this with Digester; the built-in rule-matching engines only provide the ability to match on element name. There is no support available for XPath expressions.
It might be possible to create a custom "filtering" rule that has a child rule, and fires that child rule only when the appropriate conditions are set. There are no examples of such a solution, however.
Digester isn't a very good tool. It's too simplistic. Consider using a more comprehensive event-based API such as StAX.
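A minimal StAX sketch of the attribute check, using the namespace URI and element names from the question. `TitleByLanguage` and `titleFor` are made-up names, and the code assumes a single POI per file, since it returns the first matching title. The earlier caveat still applies: this reads the file as XML, so it only works while the RDF/XML keeps this exact serialization.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: picks a title element by its xml:lang attribute.
final class TitleByLanguage {

    static final String ONTOLOGY_NS = "urn:turismoculturale.itdc.filas-1.0.0-RC1#";

    /** Returns the title in lang if present, else the title in fallbackLang, else null. */
    static String titleFor(String xml, String lang, String fallbackLang) {
        String fallback = null;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())
                        && ONTOLOGY_NS.equals(reader.getNamespaceURI())) {
                    // Attributes must be read before getElementText() moves the cursor.
                    String titleLang = reader.getAttributeValue(XMLConstants.XML_NS_URI, "lang");
                    String text = reader.getElementText();
                    if (lang.equals(titleLang)) {
                        return text;
                    }
                    if (fallbackLang.equals(titleLang)) {
                        fallback = text;
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read POI XML", e);
        }
        return fallback;
    }

    public static void main(String[] args) {
        String xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:j.0=\"" + ONTOLOGY_NS + "\">"
                + "<j.0:Chiesa rdf:ID=\"turismoCulturale_POI_880\">"
                + "<j.0:title xml:lang=\"en\">Church of S. Giuda Taddeo or S. Onofrio - Gaeta</j.0:title>"
                + "<j.0:title xml:lang=\"it\">Chiesa S. Giuda Taddeo o S. Onofrio - Gaeta</j.0:title>"
                + "</j.0:Chiesa></rdf:RDF>";
        System.out.println(titleFor(xml, "en", "it"));
    }
}
```

Calling `titleFor(xml, currentLanguage, "it")` gives the "en title when in en, it title otherwise" behaviour from the question, and the result can be set on the Poi bean directly.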
