I have a Cloudant database with some already populated documents in use, and I'm using the Cloudant Java client to fetch data from it. I plan to change the indexes that are currently used. Basically, I plan to change over from using createIndex() to Cloudant Search (https://github.com/cloudant/java-cloudant#cloudant-search). I would also like to change the fields on which the documents are indexed.
Would changing the index impact the underlying data or cause any migration issues with the existing data when I start to use the new index?
It sounds like you want to change from using Cloudant Query to Cloudant Search. This should be straightforward and safe.
Adding a new index will not change or affect the existing data -- the main thing to be careful of is not deleting your old index before you've migrated your code. The easiest way to do this is by using a new design document for your new search indexes:
Create a new design document containing your search index and upload it to Cloudant (https://github.com/cloudant/java-cloudant#creating-a-search-index).
Migrate your app to use the new search index.
(Optionally) remove the design document containing the indexes that you no longer need. Cloudant will then clean up the index files that are no longer needed (https://github.com/cloudant/java-cloudant#comcloudantclientapidatabaseremovedoc-idrev-id).
I included links to the relevant parts of the Java API, but obviously you could do this through the dashboard.
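If it helps, here is a rough end-to-end sketch of those three steps using the java-cloudant client (2.x style). The design document name "newSearch", the index name "byName", the indexed field, the account/database names and the MyDoc POJO are all made up for the example, and depending on your client version you may prefer the DesignDocumentManager over saving a plain Map, so treat this as a sketch rather than a definitive implementation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.cloudant.client.api.ClientBuilder;
import com.cloudant.client.api.CloudantClient;
import com.cloudant.client.api.Database;

public class SearchIndexMigration {
    public static void main(String[] args) {
        // Connect to the account and database (credentials are placeholders).
        CloudantClient client = ClientBuilder.account("myaccount")
                .username("myuser")
                .password("mypassword")
                .build();
        Database db = client.database("mydb", false);

        // 1. Upload a *new* design document containing the search index,
        //    leaving the old design document (and its index) untouched.
        Map<String, Object> indexFn = new HashMap<>();
        indexFn.put("index",
                "function(doc){ if (doc.name) { index(\"name\", doc.name, {\"store\": true}); } }");
        Map<String, Object> indexes = new HashMap<>();
        indexes.put("byName", indexFn);
        Map<String, Object> designDoc = new HashMap<>();
        designDoc.put("_id", "_design/newSearch");
        designDoc.put("indexes", indexes);
        db.save(designDoc);

        // 2. Point the application code at the new index.
        List<MyDoc> hits = db.search("newSearch/byName")
                .limit(10)
                .includeDocs(true)
                .query("name:apple", MyDoc.class);
        System.out.println(hits.size() + " hits");

        // 3. Only once nothing queries the old index any more, remove the old
        //    design document; Cloudant then cleans up its index files.
        // db.remove("_design/oldDesignDoc", "<rev>");
    }

    // Placeholder POJO used for deserializing search results.
    public static class MyDoc {
        public String _id;
        public String name;
    }
}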
I have a use case where I create month-based indexes in Elasticsearch. The data in these indexes can be updated (append-only to array-type fields) if a document already exists in any month-based index; otherwise the document is created in the current month's index.
Can I do this with a single operation (append if it exists in any index, or else create it in the latest index)? If not, what is the simplest way of achieving this (using Java)?
If you are using the Java API, try the REST High Level Client. You can look up the existing document using the Get API and then send an Update Request to Elasticsearch. While updating the existing document, make sure you include the existing metadata in the content. The Index API is useful for indexing a document for the first time.
Once you are familiar with the concepts, instead of sending an Update Request you can directly send an Index Request, which Elasticsearch itself treats as an update (the whole document is replaced).
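As a rough illustration, here is what that flow could look like with the Java REST High Level Client (Elasticsearch 7.x assumed). The "events-*" index naming pattern, the docId/jsonBody parameters and the class name are assumptions for the example; note also that truly appending to an array field usually needs a scripted (painless) update or a client-side merge, since a plain partial doc update replaces the field value:

import java.io.IOException;
import java.time.YearMonth;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class MonthlyUpsert {

    private final RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(new HttpHost("localhost", 9200, "http")));

    // Append to the existing document if it lives in any monthly index,
    // otherwise create it in the current month's index.
    public void appendOrCreate(String docId, String jsonBody) throws IOException {
        // 1. Search across all monthly indices ("events-*" is an assumed naming pattern).
        SearchRequest search = new SearchRequest("events-*")
                .source(new SearchSourceBuilder()
                        .query(QueryBuilders.idsQuery().addIds(docId)));
        SearchResponse response = client.search(search, RequestOptions.DEFAULT);
        SearchHits hits = response.getHits();

        if (hits.getTotalHits().value > 0) {
            // 2a. Document exists: send a partial update to the index it was found in.
            //     For array fields, merge the array client-side first (or use a painless
            //     script), because .doc() replaces the field value rather than appending.
            String existingIndex = hits.getAt(0).getIndex();
            UpdateRequest update = new UpdateRequest(existingIndex, docId)
                    .doc(jsonBody, XContentType.JSON);
            client.update(update, RequestOptions.DEFAULT);
        } else {
            // 2b. Document does not exist: index it into the current month's index.
            String currentIndex = "events-" + YearMonth.now();  // e.g. events-2024-05
            IndexRequest index = new IndexRequest(currentIndex)
                    .id(docId)
                    .source(jsonBody, XContentType.JSON);
            client.index(index, RequestOptions.DEFAULT);
        }
    }
}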
I am new to Lucene. I want to index large XML files (15 GB) with Lucene; they contain plain text as well as attributes and many XML tags. How do I parse and index such an XML file using Lucene, and if I use Lucene, do I need a database?
How do I parse and index a huge XML file using Lucene? Any sample or links would be helpful for understanding the process. One more thing: if I use Lucene, will I need a database? So far I have only seen and done indexing with databases.
Your indexing would be built just as you would have done it with a database: iterate through all the data you want to index and write it to the index. Use the XmlReader class to parse your XML in a forward-only fashion. Just as with a database, you will need to index some kind of primary key so you know what a search result represents.
A database helps when it comes to looking up the indexed data from the primary key. It will be messy to read the data for a primary key if you need to iterate over a 15 GiB XML file on every request.
A database is not required, but it helps a lot. I would build this as an import tool that reads your XML, dumps it into your database, and then uses the "normal" database indexing code you've built before.
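Here is a minimal sketch of that approach in Java, using StAX's XMLStreamReader as the forward-only parser (the Java counterpart to XmlReader) together with Lucene's IndexWriter. The element name "record", its "id" attribute and the index directory are placeholders for whatever your XML and setup actually use, and the exact Lucene constructors may vary a little between versions:

import java.io.FileInputStream;
import java.nio.file.Paths;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("lucene-index")), config);
             FileInputStream in = new FileInputStream("big.xml")) {

            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            Document doc = null;
            StringBuilder text = new StringBuilder();

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "record".equals(reader.getLocalName())) {
                    // One <record id="..."> element becomes one Lucene document;
                    // the id attribute acts as the "primary key" field.
                    doc = new Document();
                    doc.add(new StringField("id", reader.getAttributeValue(null, "id"), Field.Store.YES));
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS && doc != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && "record".equals(reader.getLocalName())) {
                    doc.add(new TextField("content", text.toString(), Field.Store.NO));
                    writer.addDocument(doc);
                    doc = null;
                }
            }
            reader.close();
        }
    }
}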
You might like to look at Michael Sokolov's Lux product, which combines Lucene and Saxon:
http://www.mail-archive.com/solr-user#lucene.apache.org/msg84102.html
I haven't used it myself and can't claim to fully understand its capabilities.
I am using Elasticsearch for the first time, but I cannot decide which API to use for updates. It can be done with the Update API and also with the Index API, but which one is better performance-wise?
The Update API and the Index API are two different things. With the Index API you overwrite the whole existing document, whereas with the Update API you change or edit parts of the document.
Under the hood, both mark the original document as deleted and create a new document.
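For illustration, the two calls look like this with the Java REST High Level Client (Elasticsearch 7.x assumed; the "products" index and the id "42" are placeholders):

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class UpdateVsIndex {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Update API: merges the given fields into the existing document.
            client.update(new UpdateRequest("products", "42")
                    .doc("{\"price\": 19.99}", XContentType.JSON), RequestOptions.DEFAULT);

            // Index API: replaces the whole document stored under that id.
            client.index(new IndexRequest("products")
                    .id("42")
                    .source("{\"name\": \"widget\", \"price\": 19.99}", XContentType.JSON),
                    RequestOptions.DEFAULT);
        }
    }
}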
I have to change the format of the Elasticsearch document id. I was wondering if it's possible without deleting and re-indexing all the documents.
You have to reindex. The simplest way to apply these changes to your existing data is to create a new index with the new settings and copy all of your documents from the old index to the new index with the bulk API; see:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
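Recent Elasticsearch versions also offer a server-side Reindex API, which can even rewrite the _id with a script while copying, so you don't have to pull every document through your application. This is a different route than the scan/scroll plus bulk approach linked above; a rough sketch with the Java REST High Level Client (index names and the id-rewriting script are made up, so double-check against your client version):

import java.util.Collections;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.ReindexRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

public class ReindexWithNewIds {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            ReindexRequest request = new ReindexRequest();
            request.setSourceIndices("old_index");   // copy from here...
            request.setDestIndex("new_index");       // ...into a freshly created index
            // Rewrite each document's _id on the way over (example format only).
            request.setScript(new Script(ScriptType.INLINE, "painless",
                    "ctx._id = 'v2-' + ctx._id", Collections.emptyMap()));

            BulkByScrollResponse response = client.reindex(request, RequestOptions.DEFAULT);
            System.out.println("Created " + response.getCreated() + " documents in new_index");
        }
    }
}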
Yes, you can do it by fetching the data and re-indexing it. But if you have GBs of data you should run it as a long-running job.
For example, you could fetch the old-format document ids of the indexed data and store/index them in another store such as Cassandra, MongoDB or even SQL (whatever your application needs), mapping each new-format id to the old one. Then, when you fetch the data and use or display it, replace the old id with the mapped newer id.
I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is: how should I configure/execute everything so that it performs well? Right now, here's what I do (and I know it's not optimal):
1- When inserting an object in MongoDB, I then add it to Solr
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, first I retrieve the object from MongoDB, then I update the Solr index with all the properties from the object plus the new ones, and do something like
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, first I query Solr, and then for each document in the returned SolrDocumentList I:
get the id of the document
get the object from MongoDB having the same id to be able to retrieve the properties from there
4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java
So, does anybody have suggestions on how to do this more efficiently for each of the scenarios described here? For example, how do I avoid it taking an hour to rebuild the index when there are a lot of documents in Solr and I add one document at a time? My requirement is that users may add one document at a time, many times, and I'd like them to be able to retrieve it right after.
Your approach is actually good. Some popular frameworks like Compass perform what you describe at a lower level in order to automatically mirror to the index the changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure Solr and Mongo stay in sync. This probably takes less time than you might think, depending on the number of documents, the number of fields, the number of tokens per field and the performance of the analyzers: I often build an index of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers. Just make sure your RAM buffer is not too small, and do not commit/optimize until all documents have been added.
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that:
there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters of your solrconfig.xml config file,
useColdSearcher is set to false in order to have good search performance, or to true if you want changes to the index to be taken into account faster, at the price of slower searches.
Moreover, if it is acceptable for you that the data becomes searchable only X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often.
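With SolrJ that is just a parameter on the update request; a small sketch (the Solr URL, core name, field value and commitWithin interval are placeholders, so adjust them to your setup and client version):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");

        SolrInputDocument document = new SolrInputDocument();
        document.addField("id", "some-document-id");

        // Ask Solr to make the document searchable within 10 seconds,
        // instead of calling commit() explicitly after every add.
        UpdateRequest request = new UpdateRequest();
        request.add(document);
        request.setCommitWithin(10000); // milliseconds
        request.process(server);

        // Shorthand with the same effect:
        // server.add(document, 10000);
    }
}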
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
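With SolrJ, both variants are single method calls on the same SolrServer you already use for adds; a small sketch (the URL, id and query values are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");

        // Delete a single document by its unique key (the <uniqueKey> from schema.xml).
        server.deleteById("some-document-id");

        // Or delete everything that matches a query.
        server.deleteByQuery("source_id:12345");

        // Deletes follow the normal commit lifecycle, just like adds.
        server.commit();
    }
}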
You can also wait for more documents and index them only every X minutes. (Of course this highly depends on your application & requirements.)
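A minimal sketch of that batching idea, assuming a simple in-memory queue flushed on a schedule (the queue handling, the flush interval and the Solr URL are all made up for the example):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    private final SolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
    private final List<SolrInputDocument> pending = new ArrayList<>();

    // Called from your normal write path instead of server.add() + commit().
    public synchronized void queue(SolrInputDocument doc) {
        pending.add(doc);
    }

    // Flush whatever has accumulated: one add request and one commit for the whole batch.
    public synchronized void flush() throws Exception {
        if (pending.isEmpty()) {
            return;
        }
        server.add(pending);
        server.commit();
        pending.clear();
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                flush();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 1, 1, TimeUnit.MINUTES); // flush every minute; tune to your requirements
    }
}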
If your documents are small and you don't need all the data (which is stored in MongoDB), you can put only the fields you need into the Solr document by storing them but not indexing them:
<field name="nameofyourfield" type="stringOrAnyTypeYouUse" indexed="false" stored="true"/>