How to manage two different entities in SOLR? - java

I have several different entities I want to index in SOLR, for example:
users
products
blogs
All are completely different in schema.
All are searched for in different places in my app.
Is there a way to do it in the same core? Is it the right approach?
Is a core the conceptual equivalent of a table in a relational DB (in which case the answer is obvious)?

Really depends on how you will search this data. The main question is: What will you search for?
If you will search for products (i.e. the search results are products), then design the schema around products. If you search for products by users or blogs, model users/blogs as dynamic/multivalued fields.
If you have an app that searches for products, and another app that searches for blogs, and they're completely unrelated, put them in separate cores.
From the Solr wiki:
The more heterogeneous (different kinds of data) you have in one field or in one index, the less useful it is.
So don't blindly put everything in a single core. Carefully consider what your search scenarios are.

Here is some guidance from the Solr Wiki on Flattening Data into a Single Index. The key takeaway from flattening data is:
This type of approach can be particularly well suited to situations where you need to "blend" results from conceptually distinct sets of documents.
If you want to index your three types and keep them separate and distinct, you can leverage Cores within Solr to keep them fairly isolated, but allow you to manage them under one Solr container.
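To make the separate-cores option concrete, here is a minimal sketch in plain Java (no SolrJ) that routes each entity type to its own core and builds the query URL. The host, the core names and the `SolrCoreRouter` class are assumptions for illustration only; `/solr/<core>/select` is Solr's usual query path.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: one Solr core per entity type.
class SolrCoreRouter {
    // Assumed base URL and core names; adjust to your installation.
    static final String BASE_URL = "http://localhost:8983/solr";
    static final Map<String, String> CORE_BY_ENTITY = Map.of(
            "user", "users",
            "product", "products",
            "blog", "blogs");

    // Builds a select URL against the core that holds the given entity type.
    static String selectUrl(String entityType, String query) {
        String core = CORE_BY_ENTITY.get(entityType);
        if (core == null) {
            throw new IllegalArgumentException("No core configured for: " + entityType);
        }
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return BASE_URL + "/" + core + "/select?q=" + q;
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("product", "name:laptop"));
        System.out.println(selectUrl("blog", "title:solr"));
    }
}
```

If you go with a single core instead, a common convention is to add a discriminator field (for example an assumed `type` field) to every document and restrict each search with a filter query such as `fq=type:product`.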

Related

What is the correct way to structure this kind of data in Firestore?

I have seen videos and read the documentation of Cloud firestore, from Google Firebase service, but I can't figure this out coming from realtime database.
I have this web app in mind in which I want to store my providers from different category of products. I want perform a search query through all my products to find what providers I have for such product, and eventually access that provider info.
I am planning to use this structure for this purpose:
Providers (Collection)
    Provider 1 (Document)
        Name
        City
        Categories
    Provider 2
        Name
        City
Products (Collection)
    Product 1 (Document)
        Name
        Description
        Category
        Provider ID
    Product 2
        Name
        Description
        Category
        Provider ID
So my question is, is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the realtime database, using the provider ID I could search for that provider in the providers section, but with Firestore I am not sure if its possible or if this is right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the solution that fits your needs and makes your job easier. Bear also in mind that there is also no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app, may be insufficient for another app. So there is not a correct solution for everyone. An effective structure for a NoSQL type database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways to achieve the same thing. The first is to keep a reference to the provider in the product object (as you already do); the second is to copy the entire provider object into the product document. This second technique is called denormalization and is quite common practice when it comes to Firebase. We often duplicate data in NoSQL databases to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing to keep in mind: in the same way you are adding data, you need to maintain it. In other words, if you want to update/delete a provider object, you need to do it in every place that it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to either duplicate or keep as references:
Is the data static or will it change over time?
If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync at the same time, then you might have a hard time in the future keeping all those duplicates up to date. This might also mean you spend a lot of money keeping all those documents fresh, as it will require a read and a write for each document for each change. In this case, holding only references will be the winning variant.
In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.
If you need your queries to be very fast, you may prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to depend on local client caches to make this cheaper, depending on the data the client has to read.
In this approach, you duplicate all data for a provider in each product document. This means that the code to write this data is more complex, and you're definitely storing more data: one more provider object for each product document. You'll also need to figure out if and how to keep that copy up to date in each document. On the other hand, reading a product document now gives you all information about the provider in one read.
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
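The read-side difference between the two approaches can be sketched with plain in-memory maps. This is not the Firestore SDK: the `ProviderLookupSketch` class, the field names and the sample values are all made up for illustration, and the `reads` counter only simulates document reads.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for two collections; NOT the Firestore SDK.
class ProviderLookupSketch {
    final Map<String, Map<String, Object>> providers = new HashMap<>();
    final Map<String, Map<String, Object>> products = new HashMap<>();
    int reads = 0; // counts simulated document reads

    Map<String, Object> read(Map<String, Map<String, Object>> collection, String id) {
        reads++;
        return collection.get(id);
    }

    // Reference approach: the product stores only the provider ID (two reads).
    String providerCityByReference(String productId) {
        Map<String, Object> product = read(products, productId);
        Map<String, Object> provider = read(providers, (String) product.get("providerId"));
        return (String) provider.get("city");
    }

    // Denormalized approach: the product embeds a copy of the provider (one read).
    String providerCityByCopy(String productId) {
        Map<String, Object> product = read(products, productId);
        @SuppressWarnings("unchecked")
        Map<String, Object> provider = (Map<String, Object>) product.get("provider");
        return (String) provider.get("city");
    }

    // Hypothetical sample data.
    static ProviderLookupSketch sample() {
        ProviderLookupSketch db = new ProviderLookupSketch();
        Map<String, Object> provider = Map.of("name", "Provider 1", "city", "Springfield");
        db.providers.put("prov1", provider);
        db.products.put("refProduct", Map.of("name", "Product 1", "providerId", "prov1"));
        db.products.put("copyProduct", Map.of("name", "Product 2", "provider", provider));
        return db;
    }

    public static void main(String[] args) {
        ProviderLookupSketch db = sample();
        System.out.println(db.providerCityByReference("refProduct") + ", reads so far: " + db.reads);
        System.out.println(db.providerCityByCopy("copyProduct") + ", reads so far: " + db.reads);
    }
}
```

The sketch leaves out the write side: with embedded copies, a change to the provider has to be applied to every product document that carries a copy.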
For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches, and neither of them is inherently better than the other. It all depends on what your use-cases are and how comfortable you are with this new technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately, in return you have more complex updates and higher storage/memory usage. Note, however, that extra calls are not expensive in Firebase Realtime Database, whereas in Firestore they are. How much data duplication versus how many extra database calls is optimal for you depends on your needs and your willingness to let go of the "Single Point of Definition" mindset, which is very subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also that some database structures are easier to protect with security rules than others. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.

java complex search

I am trying to build a java based web application using Spring Boot and REST architecture using Spring MVC for the following purpose:
Search car parts through multiple set of criteria.
I try to explain it in different scenarios:
find part A of Brand B for Make C of Model D of year x.
find out what parts are available of Brand B for Make C of Model D of Year x.
Search multiple items at once for Vehicle C of Model D of Year x. For example if an engine is damaged and I want to quickly find out whether I have the parts (like pistons, cylinders, gaskets, etc.) in the supply. The result of this search is a list of the parts with their brands and prices.
My primary concerns at the moment are the following two questions:
How should I model the data so that the search scenarios are achieved efficiently? That is, what should the relations between the entities look like in Java and in the persistence system?
What kind of database should I use? SQL or NoSQL?
All the REST end-points will return Json objects.
I will be using Angular with Bootstrap for the front-end
Isn't this scenario a typical "faceted search"? I think that any solution designed to implement faceted search should work fine. For example Solr or Elasticsearch.
The advantage of the "faceted search" for the end users is the option to refine the search. Users can start with a broad search and the system will provide refining filter criteria, based on the current search results.
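As a rough illustration of the idea (not of how Solr or Elasticsearch implement it), the sketch below computes facet counts over an in-memory list and then refines the result set. The `FacetSketch` class, the `Part` record and the sample data are hypothetical, and the code needs Java 16 or later because it uses records.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class FacetSketch {
    // Hypothetical part model, loosely following the question's scenarios.
    record Part(String name, String brand, String make, String model, int year) {}

    // Counts how many of the given parts fall under each value of one facet field.
    static Map<String, Long> facetCounts(List<Part> parts, Function<Part, String> field) {
        return parts.stream()
                .collect(Collectors.groupingBy(field, TreeMap::new, Collectors.counting()));
    }

    // Narrows the current result set by one filter criterion.
    static List<Part> refine(List<Part> parts, Predicate<Part> filter) {
        return parts.stream().filter(filter).collect(Collectors.toList());
    }

    // Made-up sample data.
    static List<Part> sampleParts() {
        return List.of(
                new Part("piston", "BrandA", "MakeC", "ModelD", 2015),
                new Part("gasket", "BrandA", "MakeC", "ModelD", 2015),
                new Part("piston", "BrandB", "MakeC", "ModelD", 2015),
                new Part("cylinder", "BrandB", "MakeC", "ModelE", 2016));
    }

    public static void main(String[] args) {
        List<Part> all = sampleParts();
        // Broad search: show the available brands with their counts.
        System.out.println(facetCounts(all, Part::brand));
        // The user picks a model; the facet counts are recomputed for the narrower set.
        List<Part> forModelD = refine(all, p -> p.model().equals("ModelD"));
        System.out.println(facetCounts(forModelD, Part::brand));
        System.out.println(facetCounts(forModelD, Part::name));
    }
}
```

A search engine does this counting inside the index rather than in application code, which is what makes it practical for large catalogues.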
Today, all the major e-commerce sites have some kind of faceted search, and every search engine supports this type of browsing.
It seems to me that engines like Solr and Elasticsearch are the more natural solution, but even a standard RDBMS like Oracle has support for faceted search.
Faceted search in Solr
Filters vs. Facets: Definitions
I would put the focus on modelling it cleanly rather than efficiently - unless you already know that you will have a massive amount of data. Having it structured cleanly will make it easy to optimise if that is required later.
Normalise your data - there will be plenty of information out there on how to do this. The car industry is becoming more consolidated so many parts are now shared across different models and even different brands.
An ORM like Hibernate can be used to map your entities to your tables. Spring provides extra support in this area, which you might consider as you already plan on using Spring MVC.

Auditing using Data tables vs Separate Audit tables

I am in the process of designing a new java application which has very strict requirements for auditing. A brief context is here:
I have a complex entity with multiple one-to-many nested relationships. If any of the fields changes, I need to consider it a new version of the object, and all of this needs to be audited as well. Now I have two options:
1.) Do not do any update operation; just insert a new entity whenever anything changes. This would require me to create all the related objects again (even if they have not changed), as I do not want to hold references to objects of any previous version. My data tables become my audit tables as well.
OR
2.) Always do an update operation and maintain the auditing information in separate tables. That would add some more complexity in terms of implementation.
I would like to know if there is a good vs bad practice for any of these two approaches.
Thanks,
-csn
What should define your choice is your insert/update/read patterns for both the "live" data and the audits.
Most commonly these patterns are very different for the two kinds.
- Concerning "live" data, it depends a lot on your application, but I can imagine you have significant inserts, significant updates, and a lot of reads. Live data also requires transactionality and has a lot of relationships between tables for which you need to keep consistency. It might require fast and complex search, and indexes on many columns.
- Audits have a lot of inserts, almost no updates, and few reads. Reading them doesn't require complex search (e.g. you only consult audits and sort them by date) or indexes on many columns.
So with increased load and data size you will probably need to split the data and optimize tables for your use cases.
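A minimal in-memory sketch of option 2 from the question (update the live row, append to a separate audit table) might look like the following. The `AuditSketch` class and the `AuditRow` layout are assumptions for illustration; in a real database both writes would normally happen in one transaction. It needs Java 16 or later for the record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AuditSketch {
    // Hypothetical audit row: which entity changed, its new version, and the old/new values.
    record AuditRow(long entityId, int version, String oldValue, String newValue) {}

    final Map<Long, String> liveTable = new HashMap<>();    // current state only
    final Map<Long, Integer> versions = new HashMap<>();    // current version per entity
    final List<AuditRow> auditTable = new ArrayList<>();    // append-only history

    // Updates (or inserts) the live row and appends exactly one audit row.
    void update(long id, String newValue) {
        String oldValue = liveTable.put(id, newValue);        // null on first insert
        int version = versions.merge(id, 1, Integer::sum);    // 1, 2, 3, ...
        auditTable.add(new AuditRow(id, version, oldValue, newValue));
    }

    public static void main(String[] args) {
        AuditSketch store = new AuditSketch();
        store.update(1L, "draft");
        store.update(1L, "final");
        System.out.println("live:  " + store.liveTable);
        store.auditTable.forEach(row -> System.out.println("audit: " + row));
    }
}
```

The live table stays small and can be indexed for search, while the audit list is only ever appended to, which matches the access patterns described above.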

Dynamic Typed Table/Model in Java EE?

Usually with Java EE when we create Model, we define the fields and types of fields through XML or annotation before compilation time. Is there a way to change those in runtime? Or better, is it possible to create a new Model based on the user's input during the runtime? Such that the number of columns and types of fields are dynamic (determined at runtime)?
Help is much appreciated. Thank you.
I felt the need to clarify myself.
Yes, I meant database modeling, when talking about Model.
As for the use cases, I want to provide a means for users to define and create their own tables. Infinite flexibility is not required. However some degree of freedom has to be there: e.g. the users can define what fields are needed to describe their product.
You sound like you want to be able to change both objects and schema according to user input at runtime. This sounds like a chaotic recipe for disaster to me. I've never seen it done.
I have seen general schemas that incorporate foreign key relationships to generic tables of name/value pairs, but these tend to become infinitely flexible abstractions that can neither be easily understood nor get out of their own way when it comes to performance.
I'm betting that your users really don't want infinite flexibility. I'd caution you against taking this direction. Better to get your real use cases straight.
Anything is possible, of course. My direct experience tells me that it's a bad idea that your users will hate if you can pull it off. Best of luck.
I worked on a system where we had such facilities. To stay efficient, we would generate/alter the table dynamically for the customer schema. We also needed to embed a meta-model (the model of the model) to process information in the entities dynamically.
Option 1: With custom tables, you have full flexibility, but it also increases the complexity significantly, notably the update/migration of existing data. Here is a list of things you will need to consider:
What if the type of a column changes?
What if a column is added? Is there a default value?
What if a column is removed? Can I discard the existing information?
How to manage renaming of a column?
How to make things portable across databases?
How to make it efficient at database-level (e.g. indexes) ?
How to manage a human error (e.g. user removes a column then changes its mind)?
How to manage migration (script, deployment, etc.) when a new version of the system is installed at the customer site?
How to have this while using an ORM?
Option 2: A lightweight alternative is to add a few "spare" columns of different types to the business tables (e.g. "USER_DATE_1", "USER_DATE_2", etc.). I've seen that a few times. It will make your DBA scream and is not really considered good practice, but at least it can facilitate a few things (e.g. migration scripts, ORM integration).
Option 3: Another option is to store everything in a table with a structure property/data. But then it's really a disaster for database performance. Anything that is not completely trivial will require many joins. And the DBA will scream even more.
Option 4: It is a mix of options 2 and 3. Core tables are fixed, but a table with property/data can be used to somehow extend them.
In summary: think twice before you go this way. It can be done, but has a significant impact on the design and maintenance of the application.
This is somehow possible using meta-modeling techniques:
tables for table / column / types at the database level
key/value structures at the Java level
But this has obvious limitations (lack of strong typed objects) and can IMHO get quickly very complicated (not even sure how to deal with relations). I wouldn't use this approach to define domain objects entirely, but only to extend existing ones (products, articles, etc).
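Extending an existing domain object this way could look roughly like the sketch below: the core fields stay strongly typed, and user-defined fields live in a key/value map that is checked against a small meta-model. The `ExtensibleProduct` class, its field names and the sample values are hypothetical, and persistence is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

class ExtensibleProduct {
    // Fixed, strongly typed core fields.
    final String name;
    final double price;
    // Meta-model: which custom fields exist and which type each must have.
    private final Map<String, Class<?>> schema;
    // User-defined extension fields, stored as key/value pairs.
    private final Map<String, Object> custom = new HashMap<>();

    ExtensibleProduct(String name, double price, Map<String, Class<?>> schema) {
        this.name = name;
        this.price = price;
        this.schema = schema;
    }

    // Stores a custom value after validating it against the meta-model.
    void set(String field, Object value) {
        Class<?> type = schema.get(field);
        if (type == null) {
            throw new IllegalArgumentException("Unknown custom field: " + field);
        }
        if (!type.isInstance(value)) {
            throw new IllegalArgumentException(field + " must be of type " + type.getSimpleName());
        }
        custom.put(field, value);
    }

    // Reads a custom value back with the type the caller expects.
    <T> T get(String field, Class<T> type) {
        return type.cast(custom.get(field));
    }

    public static void main(String[] args) {
        // In a real system the schema would come from user input stored in the database.
        Map<String, Class<?>> schema = Map.of("warrantyMonths", Integer.class, "color", String.class);
        ExtensibleProduct drill = new ExtensibleProduct("drill", 49.0, schema);
        drill.set("warrantyMonths", 24);
        drill.set("color", "green");
        System.out.println(drill.name + ": " + drill.get("warrantyMonths", Integer.class) + " months");
    }
}
```

The type check happens at runtime only, which is exactly the "lack of strongly typed objects" limitation mentioned above.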
If I remember well, this is what some e-commerce solutions (e.g. BroadVision) were doing.
I think I have found a good answer myself. The new NoSQL databases (HBase, Cassandra) seem to be exactly what I was looking for. Thanks everyone for your answers.

How to model a many-to-many relationship in App Engine?

I have a question regarding how to model a many-to-many relationship in App Engine:
A Blogentry can have many tags, a tag can apply to many blog entries.
I see a couple of scenarios:
1. Use a Set of Strings as an attribute on the blog entry.
   - This allows me to easily query for an entry using a tag.
   - This does not allow me to fetch all tags and their weights (how many entries they apply to).
2. Use an unowned relationship between an Entry and a Tag class (a Set of keys for Tags in the Entry class and vice versa).
   - This allows me to fetch all tags and their weights.
   - This is much more complex to maintain.
   - Are Set attributes lazy-loaded, or would I fetch the entire object graph every time? (Fetch an Entry, which fetches a number of Tags, each in turn fetching a number of Entries.)
3. Use option 1, but maintain data on tags and their weights separately.
   - This has synchronisation issues between the Tag data and the tags in the Entries.
Any input and pointers would be appreciated. I think this is a quite common scenario but I haven't seen any good solutions yet.
As in many other database management systems, many-to-many relationships are not natively supported in the App Engine Datastore, but can be modelled through a "junction table". However, since App Engine's query language does not support joins, this will be very painful to use in your application. Google's BigTable architecture in fact discourages this, because distributed joins are not efficient.
So, I suggest going with the "keep it simple, stupid" rule: use the simplest thing that works. A list of strings in a "Blogentry" object sounds fairly robust. It is prone to race conditions (people making updates in parallel, overwriting each other's changes), but how many people do you have editing the same blog post anyway?
I decided to go with option 3, maintaining a separate list of tags with their weights.
This seems to work ok, although the insert/update code is a bit cluttered.
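For readers wondering what that insert/update code has to do, here is a minimal in-memory sketch of option 3. It does not use the App Engine Datastore API; the `TagWeightSketch` class and its two maps are stand-ins for the entry and tag-weight entities. The weights are adjusted by the difference between an entry's old and new tag sets.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TagWeightSketch {
    final Map<String, Set<String>> tagsByEntry = new HashMap<>(); // tags stored on each entry
    final Map<String, Integer> weightByTag = new HashMap<>();     // separately maintained weights

    // Saves an entry's tags and keeps the weights in sync with the change.
    void saveEntry(String entryId, Set<String> newTags) {
        Set<String> oldTags = tagsByEntry.getOrDefault(entryId, Set.of());
        for (String tag : oldTags) {
            if (!newTags.contains(tag)) {
                // Tag was removed from this entry; drop the counter when it reaches zero.
                weightByTag.computeIfPresent(tag, (t, w) -> w > 1 ? w - 1 : null);
            }
        }
        for (String tag : newTags) {
            if (!oldTags.contains(tag)) {
                // Tag was added to this entry.
                weightByTag.merge(tag, 1, Integer::sum);
            }
        }
        tagsByEntry.put(entryId, new HashSet<>(newTags));
    }

    public static void main(String[] args) {
        TagWeightSketch blog = new TagWeightSketch();
        blog.saveEntry("entry1", Set.of("java", "appengine"));
        blog.saveEntry("entry2", Set.of("java"));
        blog.saveEntry("entry1", Set.of("java")); // removes "appengine" from entry1
        System.out.println(blog.weightByTag);
    }
}
```

In the datastore the entry write and the weight updates are separate writes, so unless they are grouped in a transaction they can drift apart, which is the synchronisation issue noted in the question.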
