Appengine Search API vs Datastore


I am trying to decide whether I should use the App Engine Search API or the Datastore for an App Engine Connected Android Project. The only distinction that the Google documentation makes is:
... an index search can find no more than 10,000 matching documents.
The App Engine Datastore may be more appropriate for applications that
need to retrieve very large result sets.
Given that I am already very familiar with the Datastore, and assuming I don't need 10,000 results, will someone please help me with the following?
Are there any advantages to using the Search API versus using the Datastore for my queries (per the quote above, it seems sensible to use one or the other)? In my case the end user must be able to search, update existing entries, and create new entities. For example, if my app is a bookstore, the user must be able to add new books, add reviews to existing books, and search for a specific book.
My data structure is such that the content will be supplied by the end user. Document vs. Datastore entity: which is cheaper to update (in dollars, etc.)?
Can the Datastore and the Search API supplement each other? What's the advantage? Why would someone consider pairing the two? What's the catch/cost?

Some other info:
The Datastore is a transactional system, which is important in many use cases. The Search API is not. For example, you can't put and delete a document in a search index in a single transaction.
The Datastore has a lot in common with a NoSQL DB like Cassandra, while the Search API is really a textual search engine, very similar to something like Lucene. If you understand how an inverted index works, you'll get a better understanding of how the Search API works.
A very good reason to combine the Datastore API and the Search API is that the Datastore makes it very difficult to do some types of queries (e.g. free text queries, geospatial queries) that the Search API handles very easily. Thus, you could store your main entities in the Datastore, but then use the Search API if you need to search in ways the Datastore doesn't allow. Down the road, I think it would be great if the Datastore and Search API were more tightly integrated, for example by letting you do free text search against indexed Text fields, where App Engine would automatically create a search Document Index behind the scenes for you.

The key difference is that with the Datastore you cannot search inside entities. If you have a book called "War and peace", you cannot find it if a user types "war peace" in a search box. The same goes for reviews, etc. Therefore, the Datastore alone is not really an option for you.
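To make the "war peace" example concrete, here is a minimal sketch of the kind of token-to-document index that Lucene-style search engines are built on. It is a plain in-memory illustration, not the App Engine Search API; the class name InvertedIndexSketch, the document ids, and the AND semantics for multi-word queries are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory index for illustration; not the App Engine Search API.
class InvertedIndexSketch {
    // Maps each lowercase token to the ids of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits text into lowercase alphanumeric tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Registers every token of the text under the given document id.
    void add(String docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Returns the ids of documents that contain every query token (AND semantics).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> ids = postings.getOrDefault(token, new HashSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("book-1", "War and peace");
        index.add("book-2", "The Art of War");
        // Only book-1 contains both "war" and "peace".
        System.out.println(index.search("war peace"));
    }
}
```

A Datastore equality filter on the title property would need the exact string, whereas a lookup like this matches on individual words regardless of order or case.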

The most serious con of the Search API is eventual consistency, as stated here:
https://developers.google.com/appengine/docs/java/search/#Java_Consistency
It means that when you add or update a record with the Search API, the change may not be reflected immediately. Imagine a case where a user uploads a book or updates an account setting, and nothing appears to change because the update hasn't propagated to all servers yet.
I think the Search API is only good for one thing: search. It basically acts as a search engine for your data in the Datastore.
So my advice is to keep data for which the user expects an immediate result in the Datastore, and to use the Search API for searches where the user won't expect an immediate result.

The Datastore only provides a few query operators (such as =, !=, <, >). Nested filters and multiple inequalities are either costly or impossible (timeouts), and search results may contain a lot of false positives. You can do partial string search by tokenizing, but this will bloat your entity. The best way to work around these limitations is to use Structured Properties and/or Ancestor Queries.
The Search API, on the other hand, runs a full-text search on search documents, which is faster and more accurate than NDB queries and does not rely on tokenized data. The downside is that it relies on the documents being kept up to date.
Use the Datastore to process your data (create, update, delete), then run a function to put the data into documents grouped into indexes, then run the searches using the Search API.
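As a rough illustration of the tokenizing workaround and why it bloats the entity, the sketch below generates every word prefix of a title. The idea is that these tokens would be stored in a list property and matched with an equality filter; that storage step is not shown, and the class name PrefixTokensSketch and the minLength parameter are made up for this example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of any App Engine library.
class PrefixTokensSketch {
    // Returns every prefix of at least minLength characters of every word in the text.
    static Set<String> prefixTokens(String text, int minLength) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            for (int end = Math.max(1, minLength); end <= word.length(); end++) {
                tokens.add(word.substring(0, end));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A three-word title already produces eight extra values to store and index.
        Set<String> tokens = prefixTokens("War and peace", 2);
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

The token count grows with the length of every word, and each token becomes another indexed value on the entity, which is the bloat mentioned above. Matching in the middle of a word would need all substrings rather than just prefixes, which is worse still.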

Related

Is it possible to compare two collections in Firestore?

I'm developing an Android app with Java and using Firestore. It's a social network and I have a collection with all the posts. I'm trying to show only those posts that belong to the followed users, so I make a query to show all the posts ordered by timestamp, but I don't know if I can filter them by comparing with the collection "followed" inside "User".
The main collection "Users" has documents, each of which is a user. Inside every user there is a subcollection "followed" that contains the followed users; every document there is a user, and the document id is the same as the user ID.
The posts are stored in another main collection called Posts, so I need to compare the user id inside the "Posts" documents with the ids of the docs in the subcollection "followed". I hope somebody can help me. I have spent a lot of time on this and can't find anything. Thank you.

Firestore does not have the ability to "join" documents in collections as you're describing here. It's relatively straightforward in SQL (if your server has enough memory), but Firestore (and other NoSQL databases) aren't built for this, due to their distributed nature and the way they need to scale.
The only way to do what you want is to write code that reads every document in every collection that needs to be compared, and then performs that comparison on the documents in memory.
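The in-memory comparison could look roughly like the sketch below. It assumes the ids from the "followed" subcollection and the post documents have already been read from Firestore; no Firestore calls are shown, and the Post class, its field names, and the feed method are hypothetical stand-ins for whatever your documents actually contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical client-side filter; the data is assumed to be already fetched.
class FollowedFeedSketch {
    // Minimal stand-in for a document in the Posts collection.
    static final class Post {
        final String authorId;
        final long timestamp;
        final String text;

        Post(String authorId, long timestamp, String text) {
            this.authorId = authorId;
            this.timestamp = timestamp;
            this.text = text;
        }
    }

    // Keeps only posts written by followed users, newest first.
    static List<Post> feed(Set<String> followedIds, List<Post> allPosts) {
        List<Post> result = new ArrayList<>();
        for (Post post : allPosts) {
            if (followedIds.contains(post.authorId)) {
                result.add(post);
            }
        }
        result.sort(Comparator.comparingLong((Post p) -> p.timestamp).reversed());
        return result;
    }

    public static void main(String[] args) {
        Set<String> followed = new HashSet<>(Arrays.asList("alice"));
        List<Post> posts = Arrays.asList(
                new Post("alice", 100L, "first"),
                new Post("bob", 200L, "not followed"),
                new Post("alice", 300L, "second"));
        for (Post post : feed(followed, posts)) {
            System.out.println(post.timestamp + " " + post.text);
        }
    }
}
```

Holding the followed ids in a Set keeps each membership check cheap, but every post still has to be read first, which is the cost the answer is warning about.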

How to query a graph of documents of different types at once in MarkLogic?

Background
I'm using a NoSQL database supporting graphs for the first time. It is a huge medical application handling thousands of patients. It is a greenfield project and we as a team are struggling with our persistence layer. We don't know how relationships should be represented, or whether we should use triples to handle queries involving huge amounts of data. We are using the Java API.
Data structure
Imagine that there are 3 types of JSON documents in our MarkLogic database: Patient, Event, File Evidence.
There are thousands of patients in the application
One patient can have multiple events associated with this patient (admitted, discharged, transferred, prescribed medications, added note, changed internal status etc.)
each event can have multiple files attached to it as an evidence
Assume there are hundreds of thousands of patients, events and files.
Question
Is it possible to query patients with events and files at once? Is using semantics (possible triples: 'patient has event', 'event has file') recommended in our case?
Our approach
We try to use triples to provide relationships between our documents, add them to one graph, use a combination query to fetch the IRIs first, and then fetch the documents by IRI in a second call. We tried self-paced trainings and exploring https://github.com/marklogic/marklogic-samplestack but with no luck. Help from someone who has done this before and would like to share their experience would be great.

In your situation, keep in mind that you can also store the triples in each of the documents themselves (with the inferred subject being the document itself). Then in your example, you could combine cts:triple-range-query with a standard cts:search.
Example:
If I had events and embedded a triple such as [this event] -> ownedByPatient -> [iri/for/patients#12345]
Then I could query:
search for events filtered by fragments where the cts:triple-range-query states that the events are owned by patient 12345
This approach is a combination of semantics and MarkLogic search - using triples to link the appropriate types.
As for different types of documents, triples do not care what they are pointing at: an IRI of a person, an event, etc. It's just about how you model your data and the ontology used to describe the relationships. So you can also approach this with managed triples (not embedded) and treat it all as a graph database pointing at your content (like the approach you are describing).
Once you get further along, you may also decide to force restrictions on the types of relationships using RDF rules.

You've given us very little information to work with to answer such broad questions. Nevertheless, I'll do my best with what you gave.
One option is to organize the data however is most intuitive to you, and use server-side JavaScript (SJS) to combine the documents at query time into whatever you need for a particular query. That SJS could be in the form of a resource extension or a search response transform. A resource extension has the advantage that it could do multiple queries across different document types and piece them together to form an answer. A search response transform, on the other hand, will be given the results of only one query but could do additional queries as needed to bring in more data. Since you only have hundreds of thousands of records, you may not need to stress too much about raw speed.
If you plan to scale to millions of documents and want raw speed, you could keep everything you want to query about one patient in the patient record. That would allow you to find a patient by full-text search through all their records plus field-match on patient-specific data.
That assumes the only search results you ever want are patients. If you want something else, you'll need to let us know what other search results you might want.
When you say "attachment" I think of binary documents with scanned images, no metadata, and no full-text to search. Those would obviously be stored as separate binary documents. If they have metadata or full-text, you'll have to decide whether any of that should be in the big patient record for fast queries or in separate documents. All "attachment" documents that are separate JSON files could have a field that points to the patient by id.
I'd avoid triples at first. As David Ennis pointed out, you can combine triples and search, but it's a bit of a ninja move. One big JSON document per patient is much easier for most developers to understand.

Scalable searching algorithm SQL

So I have a list of users stored in a Postgres database and I want to search them by username on my (Java) backend and present a truncated list to the user on the front end (like Facebook user search). One could of course, in SQL, use
WHERE username = 'john smith';
But I want the search algorithm to be a bit more sophisticated, beginning with near misses, for example
"Michael" ~ "Micheal"
and potentially improving it to use context, e.g. geographical proximity.
This has been done so many times that I feel like I would be reinventing the wheel and doing a bad job of it. Are there libraries that do this? Should this be handled on the backend (so in Java) or in the DB (PostgreSQL)? How do I make this model scalable (i.e. use a model to which complexity is easy to add down the road)?

A sophisticated algorithm will not magically appear; you have to implement it. Your last question is whether you should do it in Java or in the database. In the overwhelming majority of cases it's better to use the database for queries. Things like "Michael" ~ "Micheal" or spatial queries are standard features in many modern SQL databases. You just have to write the appropriate SQL query.
Another question, however, is whether a SQL database is the right tool for "sophisticated queries". You may also consider alternatives like Elasticsearch.
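For a sense of what a "near miss" measure computes, here is a small sketch of Levenshtein edit distance in Java. This is only one possible similarity measure, chosen here for illustration; in PostgreSQL the fuzzystrmatch extension offers a levenshtein function and pg_trgm offers trigram similarity, so you would normally not hand-roll this. The class name EditDistanceSketch is made up for the example.

```java
// Illustrative edit-distance implementation; a database extension would normally do this work.
class EditDistanceSketch {
    // Returns the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Swapping two adjacent letters counts as two substitutions here.
        System.out.println(distance("Michael", "Micheal"));
    }
}
```

A search would then accept usernames whose distance to the query falls under some threshold. Scanning every row this way does not scale, which is one more reason to push the work into an indexed database feature.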

Modeling Objectify data in Google App Engine

I'm developing an app backend in Google App Engine with Objectify. What would be the best way to model the entities for the following data? (The notation below is only for illustration.)
{user:{
    userInfo:{...},
    activities:{
        activityByDay1:{
            activityDescription:{...},
            tasks:{
                task1:{"indexed by date"...},
                task2:{...},...
            }
        },
        activityByDay2:{...},...
    }
}
}
Tasks can be up to 86400 records.
And how do I build a query that returns this structure with the latest 20 tasks and a cursor on activityByDay?
Thank you!

There is no single answer to this question. How you model it will depend on the amount of breadth at each 'level' of the tree and the type of queries you intend to perform. It depends on whether you need those queries to be strongly consistent, whether you need to edit lots of data in a single transaction, and how often various bits of data mutate.
If you want meaningful advice, expand your question to include as much of this information as possible. There is no such thing as "normal form" in the datastore like there is in SQL.

Connecting AppEngine Datastore and Search API

I wonder what the best way is to connect the Datastore and the Search API.
What I'm looking for is that whenever I create some entity (e.g. a product), this product is added to a search index. On update the index should be updated as well, and when deleting the product (you guessed right) the product should be removed from the search index.
When searching for a product I want to do a full-text search on the product index, but instead of the documents I need the real entities. Probably I will need to first search using the index, and then do a second call to the datastore?
What worries me most is keeping the datastore and search index in sync.
And of course going through both the search index and the datastore will not only be cumbersome, but I feel it might also be painful in terms of pagination.
I wonder if some people have already "connected" the datastore and Search API this way and what the results have been, and whether there are any best practices available. The App Engine docs do not say much in this area.

In order to use the Search API, you need to define your searchable data as documents, and then structure them into an index by using the Index class. Thus, for the time being you need to do exactly what you describe: keep the searchable documents in sync with your datastore entities.
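The bookkeeping this implies can be sketched with in-memory stand-ins. Nothing below calls the real Datastore or Search API: the two maps merely play their roles, and the class name ProductCatalogSketch and its methods are hypothetical. The point is the shape of the flow: every write touches both stores, and a search first collects ids from the index and then loads the entities.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-ins; not the App Engine Datastore or Search API.
class ProductCatalogSketch {
    private final Map<String, String> entities = new HashMap<>();    // plays the Datastore: id -> name
    private final Map<String, Set<String>> index = new HashMap<>();  // plays the search index: token -> ids

    private static List<String> tokens(String name) {
        List<String> result = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Create or update: write the entity, drop stale index entries, then re-index.
    void save(String id, String name) {
        String previous = entities.put(id, name);
        if (previous != null) {
            unindex(id, previous);
        }
        for (String token : tokens(name)) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
    }

    // Delete: remove the entity and its index entries.
    void delete(String id) {
        String previous = entities.remove(id);
        if (previous != null) {
            unindex(id, previous);
        }
    }

    private void unindex(String id, String name) {
        for (String token : tokens(name)) {
            Set<String> ids = index.get(token);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    index.remove(token);
                }
            }
        }
    }

    // Two-step search: ids from the index, then entities, skipping ids with no entity.
    List<String> search(String word) {
        List<String> names = new ArrayList<>();
        Set<String> ids = index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
        for (String id : ids) {
            String name = entities.get(id);
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        ProductCatalogSketch catalog = new ProductCatalogSketch();
        catalog.save("p1", "Red Bicycle");
        catalog.save("p2", "Red Helmet");
        catalog.save("p1", "Blue Bicycle"); // update must also drop p1 from "red"
        System.out.println(catalog.search("red"));
    }
}
```

In a real application the two writes are not atomic, so the null check in search, which skips ids whose entity no longer exists, is the kind of guard you would likely want against a stale index.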
