I want to store Java objects as part of a Solr document.
They don't need to be parsed or searched, only returned as part of the document.
I can convert them to JSON or XML and store the text, but I would prefer something more efficient.
If I could use Java serialization and then add the binary blob to the document, that would be ideal.
I'm aware of the option to encode the binary blob with Base64, but I was wondering if there is a more efficient way.
I do not share the opinions of the first two answers.
An additional database call can be completely unnecessary in some scenarios; Solr can act as a NoSQL database, too.
It can even compress some fields, which costs CPU time but saves cache memory for some kinds of binary data.
Take a look at BinaryField and the lazy loading field declarations within your schema.xml.
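To make the trade-off in the question concrete, here is a minimal sketch of producing the binary blob with standard Java serialization and comparing its size with the Base64 text form. The BlobCodec helper and the WebPage payload class are hypothetical names used only for illustration, and how the bytes are handed to Solr is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical helper: turns a Serializable object into bytes and back.
final class BlobCodec {

    // Hypothetical payload type standing in for "my Java object".
    static final class WebPage implements Serializable {
        private static final long serialVersionUID = 1L;
        final String url;
        final int status;

        WebPage(String url, int status) {
            this.url = url;
            this.status = status;
        }
    }

    // Serialize with plain Java serialization into an in-memory buffer.
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from the stored bytes.
    static Object fromBytes(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] blob = toBytes(new WebPage("http://example.com/", 200));
        // Base64 emits 4 characters for every 3 input bytes (rounded up).
        String text = Base64.getEncoder().encodeToString(blob);
        System.out.println("raw bytes: " + blob.length + ", Base64 chars: " + text.length());

        WebPage copy = (WebPage) fromBytes(Base64.getDecoder().decode(text));
        System.out.println(copy.url + " " + copy.status);
    }
}
```

The Base64 form is always about a third larger than the raw bytes, which is the overhead a binary field type would avoid.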
Since you can construct an id in Solr to pass with any document, you can store the object some other way (in a database, for example) and look it up once you get the id back from Solr.
For example, we're storing web pages in Solr. When we index a page, we create an id that matches the id of a WebPage object created by the ORM in the database.
When a search is performed, we get the id back and load the Java object from the database.
There is no need to store it in Solr (which was built to store and index documents).
I'm working on a project with an existing Cassandra database.
The schema looks like this:
partition key (big int) | clustering key1 (timestamp) | data (text)
1 | 2021-03-10 11:54:00.000 | {a:"somedata", b:2, ...}
My question is: Is there any advantage to storing data in a JSON string?
Will it save some space?
So far I have found only disadvantages:
You cannot (easily) add/drop columns at runtime, since the application could overwrite the JSON string column.
Parsing the JSON string is currently the performance bottleneck.
No, there is no real advantage to storing JSON as a string in Cassandra unless the underlying data in the JSON is really schema-less. It will not save space either; it will in fact use more, because every item has to carry a key as well as a value instead of just the value.
If you can, I would recommend mapping the keys to CQL columns so you can store the values natively and accessing the data is more flexible. Cheers!
Erick is spot-on-correct with his answer.
The only thing I'd add is that storing JSON blobs in a single column makes updates (even more) problematic. If you update a single JSON property, the whole column gets rewritten. The original JSON blob is also still there, just "obsoleted" until compaction runs. The only time storing a JSON blob in a single column makes any sense is when the properties don't change.
And I agree, mapping the keys to CQL columns is a much better option.
I don't disagree with the excellent and already accepted answer by @erick-ramirez.
However, there is often a good case to be made for using frozen UDTs instead of separate columns for related data that is only ever set and retrieved together and will never be filtered on as part of your query.
The "frozen" part is important: it means less work for Cassandra, but it also means that you rewrite the whole value on each update.
This can give a large performance boost over a large number of columns. The nice ScyllaDB people have a great post on that:
If You Care About Performance, Employ User Defined Types
(I know ScyllaDB is not exactly Cassandra, but I've seen multiple articles that say the same thing about Cassandra.)
One downside is that you add work to the application layer and sometimes mapping complex UDTs to your Java types will be interesting.
I am new to jBPM and would like to know whether the preconfigured H2 database stores the objects (DataItems) associated with the process and work item somewhere.
I can see a byte array in both tables, and I am not sure what exactly that byte array stores or how to unmarshal it.
Any sort of information would be really helpful.
Thanks.
The *Info objects do store all relevant data the engine needs in a binary format. However, this is not meant for query purposes. If you want access to variable values, either use the audit logs or use pluggable variable persistence to store them separately (for example, if you make them JPA entities, they will be stored in a separate table).
I am new to Lucene. I want to index a large XML file (15 GB) that contains plain text as well as attributes and a great many XML tags.
How do I parse and index a huge XML file using Lucene? Any sample or links would help me understand the process. Also, if I use Lucene, will I still need a database? So far I have only seen and done indexing with databases.
You build the index just as you would with a database: iterate through all the data you want to index and write it to the index. Use the XmlReader class to parse your XML in a forward-only fashion. Just as with a database, you will need to index some kind of primary key so you know what each search result represents.
A database helps when it comes to looking up the indexed data from the primary key. Reading the data for a primary key gets messy if you have to iterate over a 15 GiB XML file on every request.
A database is not required, but it helps a lot. I would build this as an import tool that reads your XML, dumps it into your database, and then uses the "normal" database indexing code you've built before.
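As a rough illustration of the forward-only parsing step: the answer names XmlReader, and the sketch below uses Java's StAX API (XMLStreamReader) as an analogous pull parser, which is a substitution on my part. The element name "record", the attribute "id", and the StreamingXmlImport class are assumptions made up for the example; a real import would hand each record to the indexer or database instead of collecting them in a list.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingXmlImport {

    // Reads <record id="..."> elements one at a time without loading the whole file.
    // Returns "id=text" strings; the element and attribute names are hypothetical.
    static List<String> readRecords(Reader source) throws XMLStreamException {
        List<String> records = new ArrayList<>();
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(source);
        String id = null;
        StringBuilder text = new StringBuilder();
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT && "record".equals(xml.getLocalName())) {
                    // A new record starts: remember its primary key and reset the text buffer.
                    id = xml.getAttributeValue(null, "id");
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS && id != null) {
                    // Collect text, including text inside nested tags of the record.
                    text.append(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && "record".equals(xml.getLocalName())) {
                    // Record complete: this is where it would be indexed or stored.
                    records.add(id + "=" + text.toString().trim());
                    id = null;
                }
            }
        } finally {
            xml.close();
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<records><record id=\"1\">first text</record>"
                + "<record id=\"2\">second <b>text</b></record></records>";
        for (String record : readRecords(new StringReader(sample))) {
            System.out.println(record);
        }
    }
}
```

For a 15 GB file the Reader would wrap a file stream rather than a string, and only the current record is held in memory at any time.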
You might like to look at Michael Sokolov's Lux product, which combines Lucene and Saxon:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg84102.html
I haven't used it myself and can't claim to fully understand its capabilities.
I am closely following a Lucene tutorial with Lucene 3.6.
I am able to create and perform searches on Document objects, but I would like to get back the original objects I used to create the Documents. Unfortunately, Lucene seems to be serialising/deserialising the Documents, so I have not been able to create a lookup map between them.
How do I keep a relationship between Lucene's Documents and my Objects? Is there a preferred Lucene way of doing this?
I should note that the tutorial didn't work out of the box for me. I had to add a call to IndexWriter.commit() after creating and attaching the Documents, and I also had to call IndexWriterConfig.setMaxBufferedDocs() and IndexWriterConfig.setRAMBufferSizeMB() with large values to stop Lucene from going to the hard drive.
First, you need a unique reference to the original object. If your objects are rows in a database, you might use the primary key; let's assume it is a unique ID.
Secondly, when creating the searchable Documents, just add a field like
doc.add(new Field("id", object.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
You can later read this field from the documents you find and fetch the original object (the database entry) based on the id.
If you have several document types, e.g. database entries and PDFs, store the document type in the same way so that you can handle each type differently.
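If the original objects live in memory rather than in a database, the same idea can be sketched with a plain map keyed by the stored id. The ObjectRegistry class below is a hypothetical helper, not part of Lucene, and the Lucene side is left out entirely: the id string passed to resolve is assumed to be the value of the stored "id" field read back from a search hit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table from the stored "id" field value to the original object.
final class ObjectRegistry<T> {
    private final Map<String, T> byId = new HashMap<>();

    // Call at indexing time with the same id string that is stored in the "id" field.
    void register(String id, T original) {
        byId.put(id, original);
    }

    // Call with the stored "id" value read back from a search hit; null if unknown.
    T resolve(String id) {
        return byId.get(id);
    }

    public static void main(String[] args) {
        ObjectRegistry<StringBuilder> registry = new ObjectRegistry<>();
        StringBuilder original = new StringBuilder("my domain object");
        registry.register("42", original);

        // In Lucene the id would come from the hit's stored field; here it is hard-coded.
        System.out.println(registry.resolve("42") == original);
    }
}
```

Because the lookup goes through the id and not through the Document instance, it does not matter that the Documents returned by a search are different objects from the ones that were indexed.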
Is it advisable to store some information (metadata) about a piece of content in the ID (or key) of that content?
In other words, I am using time-based UUIDs as the IDs (or keys) for some content stored in the database. My application first reads the list of all such IDs from the database and then reads the corresponding content. My idea is to store some extra information about the content in the IDs themselves, so that my software can get at this metadata without fetching the entire content from the database again.
My application context is a website using Java technology and Cassandra database.
So my question is,
Should I do this at all? I am concerned that a lot of processing may be required (when presenting data to the user) to extract the metadata from the IDs. It may be better to retrieve it from the database than to derive it from the ID of the content.
If it is advisable, how should I implement it efficiently? I was thinking of the following:
ID of a content = 'Timebased UUID' + 'UserId'
where 'Timebased UUID' is the ID generated from the timestamp at which the content was added by a user, and 'UserId' is the ID of the user who added that content.
So my example ID would look something like this: e4c0b9c0-a633-15a0-ac78-001b38952a49 (TimeUUID) -- ff7405dacd2b (UserId)
How should I extract the userId from the above ID in the most efficient manner?
Is there a better approach to storing meta information in the IDs?
I hate to say it since you seem to have put a lot of thought into this but I would say this is not advisable. Storing data like this sounds like a good idea at first but ends up causing problems because you will have many unexpected issues reading and saving the data. It's best to keep separate data as separate variables and columns.
If you are really interested in accessing the meta-content without the main content, I would make two column families. One family holds the meta-content and the other the larger main content, and both share the same ID key. I don't know much about Cassandra, but this seems to be the recommended way to do this sort of thing.
I should note that I don't think all of this will be necessary. Unless the users are storing very large amounts of information, the size of the content should be trivial and your retrievals should remain quick.
I agree with AmaDaden. Mixing IDs and data is the first step on a path that leads to a world of suffering. In particular, you will eventually find a situation where the business logic requires the data part to change and the database logic requires the ID not to change. Off the cuff, in your example, there might suddenly be a requirement for a user to be able to merge two accounts to a single user id. If user id is just data, this should be a trivial update. If it's part of the ID, you need to find and update all references to that id.
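The account-merge scenario can be sketched in a few lines. This is an in-memory stand-in under stated assumptions, not Cassandra code: ContentStore, Content, and mergeUsers are hypothetical names, and random UUIDs stand in for the time-based UUIDs from the question. The point it illustrates is that when the user id is ordinary data, a merge rewrites one field and every content ID stays valid.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory stand-in for a table keyed by content ID.
final class ContentStore {

    // One row: the key is only an identifier, the user id is ordinary data.
    static final class Content {
        String userId;
        final String body;

        Content(String userId, String body) {
            this.userId = userId;
            this.body = body;
        }
    }

    private final Map<UUID, Content> rows = new LinkedHashMap<>();

    UUID add(String userId, String body) {
        UUID id = UUID.randomUUID(); // stand-in; the question uses time-based UUIDs
        rows.put(id, new Content(userId, body));
        return id;
    }

    Content get(UUID id) {
        return rows.get(id);
    }

    // Merging accounts only rewrites the data field; no key changes.
    int mergeUsers(String fromUser, String intoUser) {
        int updated = 0;
        for (Content content : rows.values()) {
            if (content.userId.equals(fromUser)) {
                content.userId = intoUser;
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        ContentStore store = new ContentStore();
        UUID first = store.add("ff7405dacd2b", "hello");
        store.add("someone-else", "world");

        int changed = store.mergeUsers("ff7405dacd2b", "merged-user");
        System.out.println(changed + " row(s) updated; id " + first
                + " still resolves to user " + store.get(first).userId);
    }
}
```

Had the user id been embedded in the key, the same merge would require issuing new keys and updating every reference to the old ones.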