How to index an XML file with Lucene [duplicate] - java

I am new to Lucene and I want to index large XML files (15 GB) that contain plain text as well as attributes and many XML tags. How do I parse and index such a huge XML file using Lucene? Any sample or links would help me understand the process. Also, if I use Lucene, will I need a database? I have only ever built indexes against databases before.

Your indexing would be built just as it would be with a database: iterate through all the data you want to index and write it to the index. Use a forward-only streaming parser to read the XML; in Java that is StAX's XMLStreamReader. You will, just as with a database, need to index some kind of primary key so you know what each search result represents.
A database helps when it comes to looking up the indexed data from the primary key. It would be messy to iterate a 15 GiB XML file on every request just to read the data for one key.
A database is not required, but it helps a lot. I would build this as an import tool that reads your XML, dumps it into your database, and then reuses the "normal" database indexing code you have built before.
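For illustration, a minimal sketch of that forward-only approach, assuming Lucene 5.x-era APIs and a file made of flat <record id="..."> elements (the element and field names here are invented, not from the question):

import java.io.FileInputStream;
import java.nio.file.Paths;
import javax.xml.stream.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), cfg)) {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("data.xml"));
            Document doc = null;
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("record".equals(r.getLocalName())) {   // hypothetical record element
                            doc = new Document();
                            // store the primary key so a hit can be mapped back to the source record
                            doc.add(new StringField("id", r.getAttributeValue(null, "id"), Field.Store.YES));
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (doc != null) text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("record".equals(r.getLocalName()) && doc != null) {
                            doc.add(new TextField("body", text.toString(), Field.Store.NO));
                            writer.addDocument(doc);   // written out in segments; the 15 GB file is never held in memory
                            doc = null;
                        }
                        break;
                }
            }
            r.close();
        }
    }
}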

You might like to look at Michael Sokolov's Lux product, which combines Lucene and Saxon:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg84102.html
I haven't used it myself and can't claim to fully understand its capabilities.

Related

How to store java objects on Solr

I want to store java objects as part of the Solr document.
They don't need to be parsed or searched, only be returned as part of the document.
I can convert them to JSON or XML and store the text, but I would prefer something more efficient.
If I could use Java serialization and then add the binary blob to the document, it would be ideal.
I'm aware of the option to convert the binary blob with base64 but I was wondering if there is a more efficient way.
I do not share the opinion of the first two answers.
An additional database call can in some scenarios be completely unnecessary; Solr can act as a NoSQL database, too.
It can even compress some fields, which costs CPU but saves some cache memory for certain kinds of binary data.
Take a look at BinaryField and the lazy loading field declarations in your schema.xml.
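A sketch of the BinaryField route with SolrJ (6.x-era API); the core URL, field names, and payload class are invented for illustration, and the sketch assumes the "payload" field is declared as a binary type in schema.xml:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class StoreBlob {
    // Hypothetical serializable payload class, for illustration only.
    static class MyObject implements Serializable {
        String data = "example";
    }

    public static void main(String[] args) throws Exception {
        // Serialize the Java object to a byte[] with standard Java serialization.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new MyObject());
        }
        // SolrJ base64-encodes byte[] values on the wire; the field must be
        // declared binary in schema.xml, e.g. a BinaryField named "payload".
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("payload", bos.toByteArray());
            solr.add(doc);
            solr.commit();
        }
    }
}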
Since you can construct an id in Solr to go with any document, you can store the object some other way (in a database, for example) and fetch it with the id you get back from Solr.
For example, we store web pages in Solr. When we index one, we create an id that matches the id of a WebPage object created by the ORM in the database.
When a search is performed, we get the id back and load the Java object from the database.
There is no need to store the object in Solr, which was made to store and index documents, not arbitrary Java objects.
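A sketch of that id-lookup pattern with SolrJ (6.x-era API); the core URL, field names, and the WebPage/WebPageDao types are illustrative placeholders, not from the original answer:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PageSearch {
    // Search Solr, then hydrate the real WebPage objects from the database by id.
    public static List<WebPage> search(WebPageDao dao, String queryText) throws Exception {
        List<WebPage> pages = new ArrayList<>();
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build()) {
            QueryResponse rsp = solr.query(new SolrQuery("content:" + queryText));
            for (SolrDocument hit : rsp.getResults()) {
                long id = Long.parseLong(hit.getFieldValue("id").toString());
                pages.add(dao.findById(id));   // hypothetical ORM/DAO lookup
            }
        }
        return pages;
    }
}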

JAVA : file exists Vs searching large xml db

I'm quite new to Java programming and am writing my first desktop app. The app takes a unique ISBN and first checks whether it is already held in the local DB; if it is, it just reads from the local DB; if not, it requests the data from isbndb.com and enters it into the DB. The local DB is in XML format. What I'm wondering is which of the following two methods would create the least overhead when checking whether an entry already exists.
Method 1.) File Exists.
On creating a DB entry, the app would create a separate file for every ISBN, named after the ISBN (e.g. 3846504937540.xml); when checking, it would use the file-exists method with the user-provided ISBN.
Method 2.) SAX XML Parser.
All entries would go into a single large XML file; when checking for an existing entry, a SAX parser would walk the file and compare the user-provided ISBN against those in the XML DB.
Note :
The resulting entries could number in the thousands over time.
Any information would be greatly appreciated.
I don't think either of your methods is all that great. I strongly suggest using a DBMS to store the data. If you don't have a DBMS on the system, or if you want an app that can run on systems without an installed DBMS, take a look at SQLite. You can use it from Java with SQLiteJDBC by David Crawshaw.
As far as your two methods are concerned, the first will generate a huge amount of file clutter, not to mention maintenance and consistency headaches. The second will be slow once you have a sizable number of entries, because you basically have to read (on average) half the database for every query. With a DBMS, you can avoid this by defining indexes for the info you need to look up quickly; the DBMS maintains those indexes automatically.
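For example, with the xerial sqlite-jdbc driver (the maintained successor to Crawshaw's SQLiteJDBC) on the classpath, the existence check becomes a single indexed lookup; the table and column names here are illustrative:

import java.sql.*;

public class IsbnCache {
    public static void main(String[] args) throws Exception {
        // SQLite keeps the whole database in one local file; no server needed.
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:books.db")) {
            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS book (isbn TEXT PRIMARY KEY, xml TEXT)");
            }
            String isbn = "3846504937540";
            // PRIMARY KEY gives an index, so this lookup never scans the whole table.
            try (PreparedStatement ps = db.prepareStatement("SELECT xml FROM book WHERE isbn = ?")) {
                ps.setString(1, isbn);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("cached: " + rs.getString("xml"));
                    } else {
                        System.out.println("not cached; fetch from isbndb.com and INSERT");
                    }
                }
            }
        }
    }
}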
I don't much like the idea of relying on the file system for this task: I don't know how critical your application is, but many things can happen to those XML files :) Plus, if the folder gets very, very big, you would need to split the files into some hierarchical folder structure to keep decent performance.
On the other hand, I don't see the point of using an XML file as a database if you need to update it frequently.
I would use a relational database, and add a new record in a table for each entry, with an index on the isbn_number column.
If you are in the thousands of records, you may very well go with SQLite, and you can later replace it with a more powerful non-embedded DB if you ever need to, with no (or little :) ) code modification.
I think you'd better use a DBMS instead of either of your two methods.
If you want the least overhead just for checking existence, then option 1 is probably what you want, since it is a direct lookup; parsing the XML for every check means a pass through the whole file in the worst case. You can add caching to option 2, but that gets more complicated than option 1.
With option 1, though, beware that there is a limit on how many files you can sensibly store in one directory, so you will probably have to spread the XML files across multiple layers (for example /xmldb/38/46/3846504937540.xml), as sketched below.
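A tiny helper for that layered layout might look like this (the two-level split is arbitrary; pick whatever keeps directories small):

import java.io.File;

// e.g. 3846504937540 -> /xmldb/38/46/3846504937540.xml
static File xmlPathFor(String isbn) {
    return new File("/xmldb/" + isbn.substring(0, 2) + "/"
            + isbn.substring(2, 4) + "/" + isbn + ".xml");
}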
That said, neither of your options is a good way to store data in the long run; you will find them quite restrictive and hard to manage as the data grows.
People have already recommended using a DBMS and I agree. On top of that, I would suggest you look into a document-based database like MongoDB.
Extend your DB table to include not only the XML string but also the ISBN number.
Then select the XML column based on the ISBN column.
Query (Java-escaped): "select XMLString from cacheTable where isbn='" + isbn + "'"
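String concatenation like that is vulnerable to quoting problems and SQL injection; a PreparedStatement version of the same lookup (table and column names taken from the answer above) might look like:

import java.sql.*;

// Returns the cached XML for an ISBN, or null if not present.
static String lookupXml(Connection connection, String isbn) throws SQLException {
    try (PreparedStatement ps = connection.prepareStatement(
            "SELECT XMLString FROM cacheTable WHERE isbn = ?")) {
        ps.setString(1, isbn);   // the driver handles quoting/escaping
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString("XMLString") : null;
        }
    }
}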
A different approach could be to use an ORM like Hibernate.
With an ORM, instead of saving the whole XML document in one column, you use a separate column for each element and attribute, and you can even split your document over several tables for a simpler long-term design.
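A sketch of what that mapping could look like with JPA annotations; the entity and column names are invented for illustration:

import javax.persistence.*;

// Hypothetical mapping: one column per element instead of one XML blob.
@Entity
@Table(name = "book")
public class Book {
    @Id
    @Column(name = "isbn")
    private String isbn;

    @Column(name = "title")
    private String title;

    @Column(name = "author")
    private String author;

    // getters/setters omitted for brevity
}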

Data Retrieving from XML file with some limitation in Java

I am using XML in my project for data Insert/Update/Delete operations.
Currently I am doing those operations from my Java application via XPath.
I am facing a problem when retrieving data from the XML. If there are 1000 records in the XML file, I want to fetch them with a limit (like the LIMIT clause in a MySQL SELECT query) so I can implement pagination in the view. I want to display 100 records at a time, so the end user can click a next button to page through all 1000 records.
Can anyone tell me the best way to full-fill this requirement?
Yes, we can do it with the position() function, but the problem is that I want the data in sorted order. position() returns nodes by their position in the file, and in the file the data may not be in order, so I need to read the data in order as well. I cannot find an XPath query for sorted, paginated data.
You can consider using JAXB instead of direct XML manipulation.
As you are using XPath to access your XML data, one possibility is the position() function to get "paginated" data from the XML, like:
/path/to/some/element[position() >= 100 and position() <= 200]
Of course, you then have to store the boundaries (100 and 200 in this example) between user requests.
OK, if you need sorted output as well... as far as I know there is no sort function in pure XPath (1.0/2.0). Maybe the library you use offers one as an extension, or you may be able to use XSLT with xsl:sort. Or you can use XML binding as described in the other answer. Failing all of those, you can sort in Java after selecting, as sketched below.
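A sketch of that select-then-sort workaround with the standard javax.xml.xpath API; the element and attribute names here are made up, so adjust the expressions to your file:

import java.io.File;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class PagedXmlQuery {
    // Select all matching elements, sort them in Java, then slice out one page.
    public static List<Element> page(String file, int pageNo, int pageSize) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(file));
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate("/records/record", doc, XPathConstants.NODESET);

        List<Element> all = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            all.add((Element) nodes.item(i));
        }
        // The sort happens here, since XPath 1.0/2.0 itself cannot sort.
        all.sort(Comparator.comparing((Element e) -> e.getAttribute("name")));

        int from = Math.min(pageNo * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return all.subList(from, to);
    }
}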

can Lucene be used to search inside db?

Can we use Lucene to search text stored in DB?
I saw this article that shows how to use it for normal articles stored as files
http://javatechniques.com/blog/lucene-in-memory-text-search-example/
Can someone suggest how?
Look at the question below from the Lucene FAQ. If you are using Hibernate, then I recommend you consider Hibernate Search.
How can I use Lucene to index a database?
You should use the Compass Framework. It's built upon Lucene and integrates nicely with several ORMs
Update: you should now use ElasticSearch instead (thanks Pangea)
Can we use Lucene to search text stored in DB?
Yes, you can. Lucene can index data from any kind of database table (MySQL, etc.); it just needs to be fed all the data you want to search so it can build its index.
But don't forget: Lucene is just an index. To access Lucene — that is, to search inside it or to run imports — you need a second piece of software to drive it.
That could be Solr, for example: http://lucene.apache.org/solr/
With that in place, you no longer need a full-text index on the RDBMS side.
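The usual pattern is to pull the rows over JDBC and feed them to an IndexWriter; a minimal sketch, assuming Lucene 5.x-era APIs and an invented articles table on a local MySQL instance:

import java.nio.file.Paths;
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class DbIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), cfg);
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles")) {
            while (rs.next()) {
                Document doc = new Document();
                // store the primary key so a search hit can be traced back to its row
                doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                doc.add(new TextField("title", rs.getString("title"), Field.Store.YES));
                doc.add(new TextField("body", rs.getString("body"), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}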

Querying a Lucene index file

I'm trying to query a Lucene index file through QueryParser. However, I would like to see the format of the index before querying it. Is there a way to look up the structure of a Lucene index, sort of like how I can look up the structure of a regular SQL table?
The reason is that I didn't build this index file myself and would like to find my way around it before querying it.
You can use Luke or, programmatically, IndexReader.getFieldNames().
Luke - Lucene Index Toolbox
Luke is a handy development and diagnostic tool that accesses existing Lucene indexes and lets you display and modify their content in several ways.
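The programmatic route mentioned above, using the Lucene 3.x-era API (getFieldNames() was removed in Lucene 4, where FieldInfos exposes the same information):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Print every field name present in an existing index.
public class ListFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        for (String field : reader.getFieldNames(IndexReader.FieldOption.ALL)) {
            System.out.println(field);
        }
        reader.close();
    }
}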
