I'm writing a Java application and want to index an XML file with Lucene so I can search for a drug that has a given target. The file is 400 MB and contains over 8000 drug entries.
<drug type="biotech" created="2005-06-13" updated="2015-11-27">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
....
<targets>
<target position="1">
<id>BE0000767</id>
<name>Epidermal growth factor receptor</name>
....
</target>
....
</targets>
</drug>
<drug>
....
</drug>
How can I index this file so that each drug entry becomes one Document?
If someone has useful links, resources, or tips on how to index this XML, please let me know :)
The most flexible strategy is usually to just use SolrJ through a small java application that reads the file and transforms it to a suitable format for indexing in Solr. That way you can easily preprocess certain fields before they're received by Solr.
Another option is to use XSL to transform the XML file into something that Solr understands. This can be used either server-side (as with XSLTUpdateRequestHandler linked) or client-side (transform an XML document into an update request and submit it to the standard request handler).
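Whichever indexer you end up with, a 400 MB file is better streamed than loaded into a DOM. Below is a minimal sketch of the "small Java application that reads the file" part, using the JDK's StAX parser to emit one record per drug entry. The DrugRecord class, the choice of fields, and the read() helper are illustrative assumptions, not part of Lucene or SolrJ; the actual indexing call would go in the callback.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DrugStreamReader {

    /** One drug entry; the selected fields are an illustrative assumption. */
    static final class DrugRecord {
        String primaryId;
        String name;
        final List<String> targetNames = new ArrayList<>();
    }

    /** Streams the XML and hands each completed drug entry to the sink. */
    static void read(Reader xml, Consumer<DrugRecord> sink) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader r = factory.createXMLStreamReader(xml);

        Deque<String> path = new ArrayDeque<>(); // names of the currently open elements
        DrugRecord current = null;
        int drugDepth = -1;                      // depth of the open <drug> element

        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String parent = path.peek();     // null at the root
                String name = r.getLocalName();
                path.push(name);
                if (current == null) {
                    if (name.equals("drug")) {
                        current = new DrugRecord();
                        drugDepth = path.size();
                    }
                    continue;
                }
                boolean directChild = path.size() == drugDepth + 1;
                // getElementText() consumes the matching end tag, so pop manually.
                if (directChild && name.equals("drugbank-id")
                        && "true".equals(r.getAttributeValue(null, "primary"))) {
                    current.primaryId = r.getElementText().trim();
                    path.pop();
                } else if (directChild && name.equals("name")) {
                    current.name = r.getElementText().trim();
                    path.pop();
                } else if (name.equals("name") && "target".equals(parent)) {
                    current.targetNames.add(r.getElementText().trim());
                    path.pop();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (current != null && path.size() == drugDepth) {
                    sink.accept(current);        // </drug> reached: one complete entry
                    current = null;
                }
                path.pop();
            }
        }
        r.close();
    }
}
```

Inside the callback you would build one Lucene Document (or one SolrJ input document) from the record, for example with the target names as a multi-valued field, so memory use stays flat regardless of file size.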
I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder, and recursively scan it for documents for indexing into a collection:
bin/post -c gettingstarted afolder/
This has been useful for mass indexing all the files in the folder. Now that I'm moving to production, I plan to use SolrJ for the indexing, as it can do more, such as robustness checks and retries for documents that fail to index.
However, I can't seem to find a way to do the same in SolrJ. Is this possible in SolrJ? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use the ContentStreamUpdateRequest class as shown at Uploading data with SolrJ:
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"));
server.request(req);
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
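For the folder scan and the retries mentioned in the question, a rough sketch using only the JDK could look like the following. FolderScanner, listFiles and withRetries are hypothetical helper names, not SolrJ API; the SolrJ request shown earlier would be the action passed to withRetries for each file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FolderScanner {

    /** Returns every regular file below root (recursively), in sorted order. */
    static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Runs the action up to maxAttempts times and rethrows the last failure. */
    static <T> T withRetries(int maxAttempts, Callable<T> action) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

You would loop over listFiles(...) and wrap each per-file request in withRetries(3, ...). In a real setup you would probably also want a pause between attempts and a log of the files that ultimately failed.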
I have an XML file in which some elements contain certain values. For example:
<item>
<origin>
<![CDATA[KWI]]>
</origin>
<destination>
<![CDATA[DOH]]>
</destination>
</item>
I have an Excel sheet containing the mapping between country codes and port codes:
COUNTRY_CODE PORT_CODE MANAGING_PORT_STATION
KW KWI MPS1
QA DOH MPS2
In the output XML, I need to produce something like:
<itemOut>
<country><![CDATA[KW]]></country>
<managingPortStation>MPS1</managingPortStation>
<dest><![CDATA[DOH]]></dest>
</itemOut>
So in short, I need to combine some non-XML sources with the input XML file to produce the output XML file.
What should I use to accomplish this? Is it possible via XSLT? Or what APIs are available in Java? I have only skimmed through JAXP. Is it worth spending more time on it for my case? I would prefer to do it in Java rather than XSLT, since I am more familiar with it.
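A plain-Java route is feasible with the DOM classes that ship with the JDK. The sketch below assumes the Excel sheet has been exported to CSV with the three columns shown above; PortMapper, loadMapping, convert, and the itemsOut wrapper element are hypothetical names chosen for illustration, and items with an unknown origin simply get no country or managingPortStation element.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PortMapper {

    /** Reads lines like "KW,KWI,MPS1" into PORT_CODE -> {COUNTRY_CODE, MANAGING_PORT_STATION}. */
    static Map<String, String[]> loadMapping(Reader csv) throws IOException {
        Map<String, String[]> byPort = new HashMap<>();
        BufferedReader in = new BufferedReader(csv);
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.trim().split("\\s*,\\s*");
            if (cols.length == 3 && !cols[0].equals("COUNTRY_CODE")) { // skip the header row
                byPort.put(cols[1], new String[] {cols[0], cols[2]});
            }
        }
        return byPort;
    }

    /** Builds the output document from every <item> in the input. */
    static Document convert(Document input, Map<String, String[]> byPort)
            throws ParserConfigurationException {
        Document out = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = out.createElement("itemsOut"); // hypothetical wrapper element
        out.appendChild(root);
        NodeList items = input.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String[] mapped = byPort.get(text(item, "origin"));
            Element itemOut = out.createElement("itemOut");
            root.appendChild(itemOut);
            if (mapped != null) {
                cdataChild(out, itemOut, "country", mapped[0]);
                Element station = out.createElement("managingPortStation");
                station.setTextContent(mapped[1]);
                itemOut.appendChild(station);
            }
            cdataChild(out, itemOut, "dest", text(item, "destination"));
        }
        return out;
    }

    /** Trimmed text of the first descendant with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    private static void cdataChild(Document doc, Element parent, String tag, String value) {
        Element child = doc.createElement(tag);
        child.appendChild(doc.createCDATASection(value));
        parent.appendChild(child);
    }
}
```

The resulting Document can be written to a file with a JAXP Transformer (DOMSource in, StreamResult out). XSLT could do the same job if the mapping were itself available as XML, but with the lookup living in a spreadsheet the Java version keeps everything in one place.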
I have about 3200 URLs to small XML files which contain some data in the form of strings. The XML files are displayed (not downloaded) when I go to the URLs. I need to extract some data from all of those XML files and save it in a single .txt file, XML file, or whatever. How can I automate this process?
*Note: This is what the files look like. I need to copy the 'location' and 'title' from all of them and put them in one single file. How can this be achieved?
<?xml version="1.0"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
<tracklist>
<location>http://radiotool.com/fransn.mp3</location>
<title>France, Paris radio 104.5</title>
</tracklist>
</playlist>
*edit: Fixed XML.
It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:
declare variable $urls as xs:string* external;
<data>{
for $u in $urls return doc($u)//*:tracklist
}</data>
The Java code would be something like:
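// Saxon classes come from net.sf.saxon.s9api (including QName); List/ArrayList from java.util.
Processor proc = new Processor(false); // false: no licensed-edition features required
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile(query).load(); // query: the XQuery text above, as a String
List<XdmItem> urls = new ArrayList<>();
for (String url : inputUrls) { // inputUrls: your collection of URL strings
    urls.add(new XdmAtomicValue(url));
}
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...)
q.run();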
Have a look at the JSoup library here: http://jsoup.org/
It has facilities for fetching and cleaning up the contents of a URL. It is intended for HTML, though, so I'm not sure how well it will handle XML, but it is worth a look.
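If you would rather avoid an extra library, files this small can also be handled with the DOM parser in the JDK. The following is a sketch under a few assumptions: PlaylistExtractor and its methods are made-up names, each file is assumed to hold one location and one title (only the first of each is read), and the result is one tab-separated line per URL.

```java
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class PlaylistExtractor {

    /** Returns {location, title} from one playlist document. */
    static String[] extract(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the files declare a default namespace
        Document doc = factory.newDocumentBuilder().parse(xml);
        return new String[] {first(doc, "location"), first(doc, "title")};
    }

    /** Text of the first element with this local name in any namespace, or "". */
    private static String first(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Downloads each URL in turn and returns one "location<TAB>title" line per file. */
    static List<String> extractAll(List<String> urls) throws Exception {
        List<String> lines = new ArrayList<>();
        for (String u : urls) {
            try (InputStream in = URI.create(u).toURL().openStream()) {
                String[] pair = extract(in);
                lines.add(pair[0] + "\t" + pair[1]);
            }
        }
        return lines;
    }
}
```

The returned lines can be written out in one go with java.nio.file.Files.write(path, lines). With 3200 URLs you may want to catch and log failures per URL instead of letting one bad download abort the whole run.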
I have an XML file with about 500 fields.
I have looked into the Apache Digester option, but it involves a lot of work: I would have to create many classes to cover these 500 fields.
Is there any other way? Suppose I search for 10013: I can get the name of the matched document, but can I also get the name field from the XML below?
<contact type="individual">
<name>name feild</name>
<address>999 W. Prince St.</address>
<city>New York</city>
<province>NY</province>
<postalcode>10013</postalcode>
<country>USA</country>
<telephone>1-212-345----</telephone>
</contact>
Or does Apache Solr give me a better API for this?
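If the goal is only to pull individual fields out of the XML without writing a class per field, XPath from the JDK may already be enough. A minimal sketch, assuming the XML is available as a String and using a hypothetical ContactLookup helper:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ContactLookup {

    /** Returns the name of the first contact with the given postal code, or "" if none matches. */
    static String nameForPostalCode(String xml, String postalCode)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $code through a resolver instead of concatenating user input into the expression.
        xpath.setXPathVariableResolver(
                qname -> "code".equals(qname.getLocalPart()) ? postalCode : null);
        return xpath.evaluate("//contact[postalcode = $code]/name",
                new InputSource(new StringReader(xml)));
    }
}
```

Calling nameForPostalCode(xml, "10013") on the contact above yields the content of its name element. Any of the 500 fields can be addressed the same way by changing the path, so no per-field classes are needed.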
I am trying to dynamically create an XML file in Java to display a timetable. I have created a DTD for my XML file and I have an XSL file I would like to use to transform the XML. I don't know exactly how to continue.
What I've tried so far: on the click of a button, a Servlet is called which generates the content of the XML file as a String (inserting the dynamic parts of the XML into the String). I now have a String containing the content of the XML file. I would like to transform this XML using an XSL file I have on my server and display the result in the page which called the Servlet (doing this via AJAX).
I'm not sure if I'm heading in the right direction; perhaps I shouldn't even create the XML in String form to begin with. So my question is, how do I continue from here? How do I transform the XML string using the XSL file and send it as a response to the AJAX call, so I can insert the generated code into the page? Or, if this is not the way to do it, how do I create a dynamic XML file in a different way that produces the same result?
You can use JAXP for this. It's part of standard Java SE API.
StringReader xmlInput = new StringReader(xmlStringWhichYouHaveCreated);
InputStream xslInput = getServletContext().getResourceAsStream("/file.xsl"); // Path is relative to the web app root and must start with "/". Or load it from wherever it is; as long as you have it as an InputStream, it's fine.
Source xmlSource = new StreamSource(xmlInput);
Source xslSource = new StreamSource(xslInput);
Result xmlResult = new StreamResult(response.getOutputStream()); // XML result will be written to HTTP response.
Transformer transformer = TransformerFactory.newInstance().newTransformer(xslSource);
transformer.transform(xmlSource, xmlResult);
Depending on how complicated and large your XML is going to be I would suggest two options. For small, simple structures Java's DOM implementation (Document) will suffice.
If your XML is more elaborate I would look into JAXB. The benefit there is that there are tools that automatically create Java classes from an XML schema (XSD). So you'd have to transform your DTD into an XSD first, but that shouldn't be a problem. You end up with plain data transfer objects (plain objects with getters/setters for the values of the corresponding XML elements) and parsing/encoding plus setting namespaces correctly is done for you. It's quite convenient but can also be a bit of an overkill for simple XML structures.
In both cases, you can end up with a Document instance (JAXB can marshal its objects into a DOM node) that you can finally transform using JAXP.
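For the DOM route, a small sketch of what that could look like is below. The element and attribute names (timetable, lesson, day) and the TimetableXml helper are made up for illustration; substitute whatever your DTD defines. The stylesheet is passed as a String here only to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TimetableXml {

    /** Builds <timetable><lesson day="...">subject</lesson>...</timetable> from {day, subject} pairs. */
    static Document build(String[][] lessons) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("timetable");
        doc.appendChild(root);
        for (String[] lesson : lessons) {
            Element e = doc.createElement("lesson");
            e.setAttribute("day", lesson[0]);
            e.setTextContent(lesson[1]); // escaping of <, & etc. is handled by the DOM
            root.appendChild(e);
        }
        return doc;
    }

    /** Applies the stylesheet to the document and returns the result as a String. */
    static String render(Document doc, String xsl) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

In the servlet you would pass new StreamResult(response.getOutputStream()) instead of the StringWriter. Compared with concatenating Strings, building the DOM means special characters in the dynamic values cannot break the XML.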
Apache XMLBeans is a nice solution for serializing to and from XML. Here's what you need to do:
Download XMLBeans from http://www.apache.org/dyn/closer.cgi/xmlbeans/binaries
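Use the XMLBeans inst2xsd executable (in the bin dir) to generate an XSD. Note that inst2xsd infers the schema from sample XML instance documents, so feed it an XML file that conforms to your DTD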
Use the XMLBeans ANT task to convert the XSD into classes which you can use in your app
Here's an example ANT script to use XMLBeans to create the classes:
<project name="my_project" basedir="..">
<property name="my_project.project.path" value="${basedir}"/>
<property name="xbean.dir" value="C:/lib/xmlbeans-2.2.0/lib" />
<path id="classpath">
<fileset dir="${xbean.dir}" includes="**/*.jar" />
</path>
<taskdef name="xmlbean" classname="org.apache.xmlbeans.impl.tool.XMLBean" classpathref="classpath" />
<xmlbean schema="${my_project.project.path}/my.xsd" srcgendir="${my_project.project.path}/src-tms-template-filter-fields" classgendir="${my_project.project.path}/bin">
<classpath><path refid="classpath" /></classpath>
</xmlbean>
</project>
You'll now have nice Java classes which you can use for clean code to create the XML from the data stored in your DB. Use BalusC's answer for the XSLT.