I'm developing an app backend in Google App Engine with Objectify. What would be the best way to model the entities for the following data? (The notation below is for illustration only.)
{user:{
    userInfo:{...},
    activities:{
        activityByDay1:{
            activityDescription:{...},
            tasks:{
                task1:{"indexed by date"...},
                task2:{...},...
            }
        },
        activityByDay2:{...},...
    }
}
}
Tasks can be up to 86400 records.
And how do I build a query that returns this structure with the latest 20 tasks and a cursor on activityByDay?
Thank you!
There is no single answer to this question. How you model it will depend on the amount of breadth at each 'level' of the tree and the type of queries you intend to perform. It depends on whether you need those queries to be strongly consistent, whether you need to edit lots of data in a single transaction, and how often various bits of data mutate.
If you want meaningful advice, expand your question to include as much of this information as possible. There is no such thing as "normal form" in the datastore like there is in SQL.
Related
Background
I'm using a NoSQL database that supports graphs for the first time. It is a huge medical application handling thousands of patients. It is a greenfield project, and we as a team are struggling with our persistence layer. We don't know how relationships should be represented or whether we should use triples to handle queries involving huge amounts of data. We are using the Java API.
Data structure
Imagine that there are 3 types of JSON documents in our MarkLogic database: Patient, Event, File Evidence.
There are thousands of patients in the application.
One patient can have multiple events associated with them (admitted, discharged, transferred, prescribed medications, added note, changed internal status, etc.).
Each event can have multiple files attached to it as evidence.
Assume there are hundreds of thousands of patients, events and files.
Question
Is it possible to query patients with events and files at once? Is using semantics (possible triples: 'patient has event', 'event has file') recommended in our case?
Our approach
We try to use triples to provide relationships between our documents and add them to one graph. We then use a combination query to fetch the IRIs first, and in a second call fetch the documents by IRI. We tried self-paced training and explored https://github.com/marklogic/marklogic-samplestack, but with no luck. Help from someone who has done this before and would like to share their experience would be great.
In your situation, keep in mind that you can also store the triples in each of the documents themselves (with the inferred subject being the document itself). Then, in your example, you could combine cts:triple-range-query with a standard cts:search.
Example:
If I had events and embedded a triple such as [this event] -> ownedByPatient -> [iri/for/patiens#12345]
Then I could query:
search for events filtered by fragments where the cts:triple-range-query states that the events are owned by patient 12345
This approach is a combination of semantics and MarkLogic search - using triples to link the appropriate types.
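To make the embedded-triple idea more concrete, here is a minimal in-memory sketch in plain Java. It does not use the MarkLogic API: the `Triple` and `EventDoc` classes, the document IRIs and the sample data are hypothetical stand-ins. It only shows the shape of the query, a text match on the event combined with a triple match that restricts the results to one patient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; this is not the MarkLogic API.
final class Triple {
    final String subject, predicate, object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// An event document that carries its own triples (the "embedded" approach).
final class EventDoc {
    final String iri;
    final String text;
    final List<Triple> triples;

    EventDoc(String iri, String text, Triple... triples) {
        this.iri = iri;
        this.text = text;
        this.triples = Arrays.asList(triples);
    }
}

final class EmbeddedTripleSearch {
    // Returns the IRIs of events whose text contains the term AND that embed
    // the triple (this event, ownedByPatient, patientIri).
    static List<String> eventsForPatient(List<EventDoc> docs, String term, String patientIri) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (EventDoc doc : docs) {
            boolean textMatch = doc.text.toLowerCase(Locale.ROOT).contains(needle);
            boolean tripleMatch = false;
            for (Triple t : doc.triples) {
                if (t.subject.equals(doc.iri)
                        && t.predicate.equals("ownedByPatient")
                        && t.object.equals(patientIri)) {
                    tripleMatch = true;
                    break;
                }
            }
            if (textMatch && tripleMatch) {
                hits.add(doc.iri);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<EventDoc> docs = Arrays.asList(
            new EventDoc("/events/1", "Patient admitted",
                new Triple("/events/1", "ownedByPatient", "/patients/12345")),
            new EventDoc("/events/2", "Patient discharged",
                new Triple("/events/2", "ownedByPatient", "/patients/99999")));
        System.out.println(eventsForPatient(docs, "admitted", "/patients/12345"));
    }
}
```

In MarkLogic itself the loop would be replaced by the search and triple-range query described above, so the filtering happens inside the database rather than in application code.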
As for different types of documents, triples do not care what they are pointing at: an IRI of a person, an event, etc. It's just about how you model your data itself and the ontology used to describe the relationships. So, you can also approach this with managed triples (not embedded) and treat it all as a graph database pointing at your content (like the approach you are describing).
Once you get further along, you may also decide to force restrictions on the types of relationships using RDF rules.
You've given us very little information to work with to answer such broad questions. Nevertheless, I'll do my best with what you gave.
One option is to organize the data however is most intuitive to you, and use server-side JavaScript (SJS) to combine the documents at query time into whatever you need for a particular query. That SJS could be in the form of a resource extension or a search response transform. A resource extension has the advantage that it could do multiple queries across different document types and piece them together to form an answer. A search response transform, on the other hand, will be given the results of only one query but could do additional queries as needed to bring in more data. Since you only have hundreds of thousands of records, you may not need to stress too much about raw speed.
If you plan to scale to millions of documents and want raw speed, you could keep everything you want to query about one patient in the patient record. That would allow you to find a patient by full-text search through all their records plus field-match on patient-specific data.
That assumes the only search results you ever want are patients. If you want something else, you'll need to let us know what other search results you might want.
When you say "attachment" I think of binary documents with scanned images, no metadata, and no full-text to search. Those would obviously be stored as separate binary documents. If they have metadata or full-text, you'll have to decide whether any of that should be in the big patient record for fast queries or in separate documents. All "attachment" documents that are separate JSON files could have a field that points to the patient by id.
I'd avoid triples at first. As David Ennis pointed out, you can combine triples and search, but it's a bit of a ninja move. One big JSON document per patient is much easier for most developers to understand.
So I have a list of users stored in a PostgreSQL database, and I want to search them by username on my (Java) backend and present a truncated list to the user on the front end (like Facebook user search). One could of course, in SQL, use
WHERE username = 'john smith';
But I want the search algorithm to be a bit more sophisticated, beginning with near misses, for example:
"Michael" ~ "Micheal"
And potentially improving it to use context, e.g. geographical proximity.
This has been done so many times that I feel like I would be reinventing the wheel and doing a bad job of it. Are there libraries that do this? Should this be handled on the backend (so in Java) or in the DB (PostgreSQL)? How do I make this model scalable (i.e. use a model in which complexity is easy to add down the road)?
A sophisticated algorithm will not magically appear; you have to implement it. Your last question is whether you should do it in Java or in the database. In the overwhelming majority of cases it's better to use the database for queries. Things like "Michael" ~ "Micheal" or spatial queries are standard features in many modern SQL databases (PostgreSQL, for example, offers fuzzy string matching through its pg_trgm and fuzzystrmatch extensions and spatial queries through PostGIS). You just have to write the appropriate SQL query.
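To show what a near-miss comparison such as "Michael" ~ "Micheal" involves, here is a minimal sketch of Levenshtein edit distance in plain Java. It is only an illustration of the idea, not a recommendation to do the matching in the application: the class name, the `nearMatches` helper and the distance threshold are assumptions made for this example, and in practice the database would run the equivalent comparison next to the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class FuzzyNameSearch {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the usernames within maxDistance edits of the query,
    // ignoring case and preserving the input order.
    static List<String> nearMatches(List<String> usernames, String query, int maxDistance) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String name : usernames) {
            if (levenshtein(name.toLowerCase(Locale.ROOT), q) <= maxDistance) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("Micheal", "Michelle", "John");
        System.out.println(nearMatches(users, "Michael", 2));
    }
}
```

Plain Levenshtein counts a swapped pair of letters as two edits, which is why the threshold in the example is 2. Scanning every username this way is linear in the table size, which is one more reason to let an indexed database feature do the work.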
Another question, however, is whether an SQL database is the right tool for "sophisticated queries". You may also consider alternatives like Elasticsearch.
I'm making an instant messaging app for Android and using Java and App Engine for the backend.
To store conversations and messages in the backend, I have 2 options (as I see it) to store the data.
Create 2 root entities:
conversation (ID, message IDs) and message (ID, "text").
OR
conversation (ID) and message (child of the conversation entity) (ID, "text")
Though technically both can work, I do not understand the limits of the datastore (e.g. 1 write/sec for some entities), and I am worried about CPU overhead when querying, as well as about having potentially millions of message root entities. I guess I am not sure whether ancestor entities are required, or best, for such an application.
tl;dr what is the best way to architect such a database?
Do not use ancestor queries unless you are sure they fit your needs. This was, to me, the most confusing part about the datastore, because at first parent/child seems like a great way to structure data like a tree.
In short, use them when you must have immediate consistency when you write data. They come with several restrictions regarding total size and writes per second.
Don't worry about having millions of "root" entities. This is precisely what the datastore (and NoSQL in general) is good at.
All datastore queries are efficient; it won't even let you run one that isn't (so you must add all needed indexes beforehand). Thus, don't worry about query performance unless you can't express the query with an index.
In your case, given that a conversation shouldn't be huge and users normally don't type more than 5 entries per second, you could use ancestors, and you will gain immediate consistency within each conversation.
At this point I think it's too broad to ask for the architecture, but this should point you the right way.
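As a rough mental model of the parent/child option, the sketch below treats every key as a path and answers an "ancestor query" by scanning the keys that start with the parent's path. It is a toy in-memory model in plain Java, not the Datastore or Objectify API: the class name, the key format and the method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model, not the Datastore API: every key is a path string,
// and a child's path starts with its parent's path.
final class EntityGroupSketch {
    private final TreeMap<String, String> entities = new TreeMap<>();

    // Zero-padded ids keep lexicographic key order equal to numeric order.
    static String conversationKey(long conversationId) {
        return String.format("Conversation:%010d", conversationId);
    }

    static String messageKey(long conversationId, long messageId) {
        return conversationKey(conversationId) + String.format("/Message:%010d", messageId);
    }

    void putMessage(long conversationId, long messageId, String text) {
        entities.put(messageKey(conversationId, messageId), text);
    }

    // "Ancestor query": every message whose key path starts with the
    // conversation's key, returned in message-id order.
    List<String> messagesIn(long conversationId) {
        String prefix = conversationKey(conversationId) + "/";
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : entities.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the conversation's key range
            }
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        EntityGroupSketch store = new EntityGroupSketch();
        store.putMessage(1, 2, "second");
        store.putMessage(2, 1, "other conversation");
        store.putMessage(1, 1, "first");
        System.out.println(store.messagesIn(1));
    }
}
```

The point of the model is that all messages of one conversation share a key prefix, so they can be read together and kept consistent as a unit, while separate conversations remain independent of each other.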
I am trying to decide whether I should use the App Engine Search API or the Datastore for an App Engine Connected Android Project. The only distinction that the Google documentation makes is:
... an index search can find no more than 10,000 matching documents.
The App Engine Datastore may be more appropriate for applications that need to retrieve very large result sets.
Given that I am already very familiar with the Datastore: Will someone please help me, assuming I don't need 10,000 results?
Are there any advantages to using the Search API versus using Datastore for my queries (per the quote above, it seems sensible to use one or the other)? In my case the end user must be able to search, update existing entries, and create new entities. For example if my app is a bookstore, the user must be able to add new books, add reviews to existing books, search for a specific book.
My data structure is such that the content will be supplied by the end user. Document vs Datastore entity: which is cheaper to update? $$, etc.
Can they supplement each other: Datastore and Search API? What's the advantage? Why would someone consider pairing the two? What's the catch/cost?
Some other info:
The datastore is a transactional system, which is important in many use cases. The search API is not. For example, you can't put and delete a document in a search index in a single transaction.
The datastore has a lot in common with a NoSQL DB like Cassandra, while the search API is really a textual search engine, very similar to something like Lucene. If you understand how a reverse (inverted) index works, you'll get a better understanding of how the search API works.
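For readers who have not met the term, the following is a minimal sketch of an inverted index in plain Java. It is not the Search API and says nothing about how that service is implemented internally; the class name, the tokenizer and the sample book titles are assumptions for illustration. Each word maps to the set of documents containing it, and a multi-word query is answered by intersecting those sets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // token -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    void add(String docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the documents that contain every token of the query (AND semantics).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("book1", "War and Peace");
        index.add("book2", "The Art of War");
        System.out.println(index.search("war peace"));
    }
}
```

Because lookups go from word to document, a query does not need to match a stored property value exactly, which is the kind of free-text matching that is hard to express as a datastore query.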
A very good reason to combine usage of the datastore API and the search API is that the datastore makes it very difficult to do some types of queries (e.g. free text queries, geospatial queries) that the search API handles very easily. Thus, you could store your main entities in the datastore, but then use the search API if you need to search in ways the datastore doesn't allow. Down the road, I think it would be great if the datastore and search API were more tightly integrated, for example by letting you do free text search against indexed Text fields, where App Engine would automatically create a search Document Index behind the scenes for you.
The key difference is that with the Datastore you cannot search inside entities. If you have a book called "War and peace", you cannot find it if a user types "war peace" in a search box. The same with reviews, etc. Therefore, it's not really an option for you.
The most serious con of Search API is Eventual Consistency as stated here:
https://developers.google.com/appengine/docs/java/search/#Java_Consistency
It means that when you add or update a record with the Search API, the change may not be reflected immediately. Imagine a case where a user uploads a book or updates their account settings, and nothing changes because the change hasn't reached all servers yet.
I think Search API is only good for one thing: Search. It basically acts as a search engine for your data in Datastore.
So my advice is to keep data for which the user expects an immediate result in the Datastore, and use the Search API to search data for which the user won't expect an immediate result.
The Datastore only provides a few query operators (=, !=, <, >). Nested filters and multiple inequalities would be either costly or impossible (timeouts), and search results may contain a lot of false positives. You can do partial string search by tokenizing, but this will bloat your entity. The best way to get around these limitations is to use Structured Properties and/or Ancestor Queries.
The Search API, on the other hand, runs a full-text search on search documents, which is faster and more accurate than NDB queries and does not rely on tokenized data. The downside is that it relies on the documents being kept up to date.
Use the Datastore to process your data (create, update, delete), then run a function to store that data as search documents grouped into indexes, then run the searches using the Search API.
I have been reading a lot about ways to do aggregate queries on the datastore (through Stack Overflow and elsewhere). The preponderance of answers is that it cannot be done in a pleasant way. But then those answers are dated, and the same people tend to also claim that you cannot do things such as ORDER BY on the datastore.
As it exists today, you actually can specify ORDER BY on the datastore. So I am wondering if aggregation is also possible.
Consider the scenario where I have five candidates, Alpha, Brave, Charie, Delta and Echo, and 10,000 voters. I want to retrieve the candidates and the number of votes each received, in order. How would I do that on the datastore? I am using Java.
Also, as an aside, if the answer is still no and fanning-in is my best option: is fan-in thread safe? By fanning-in I mean keeping an explicit counter that counts the vote each candidate receives (in a separate table). Could I experience a race condition or some other faults in the data when multiple users are voting concurrently?
If by aggregating you mean having the datastore compute the total # of votes for you, then no, the datastore won't do that.
The best way to do what you're describing is:
Create a set of sharded counters per candidate (search Google for "App Engine sharded counters").
When someone votes, update the sharded counter for the given candidate.
When you want to read the votes, query for your candidates, then for each candidate, query for its sharded counters and sum them up.
Use Memcache for better performance; the GAE sharded counters example available in the docs shows this pretty well.
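The steps above can be sketched in plain Java as follows. This is an in-memory stand-in, not Datastore code: each slot of the array plays the role of one shard entity, and the atomic increment plays the role of the transaction that would update that entity. The class name and the shard count are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// In-memory stand-in for a sharded counter: each array slot plays the role
// of one shard entity, and the atomic increment stands in for a transaction.
final class ShardedCounterSketch {
    private final AtomicLongArray shards;

    ShardedCounterSketch(int numShards) {
        this.shards = new AtomicLongArray(numShards);
    }

    // Picks a shard at random so that concurrent writers rarely contend
    // on the same slot.
    void increment() {
        int shard = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(shard);
    }

    // Reading the count means summing every shard.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, ShardedCounterSketch> votes = new LinkedHashMap<>();
        votes.put("Alpha", new ShardedCounterSketch(20));
        votes.put("Echo", new ShardedCounterSketch(20));
        votes.get("Alpha").increment();
        votes.get("Alpha").increment();
        votes.get("Echo").increment();
        for (Map.Entry<String, ShardedCounterSketch> e : votes.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().total());
        }
    }
}
```

On the race-condition question: in this sketch no vote is lost because every increment is atomic. In the Datastore the same guarantee would have to come from updating each shard inside a transaction; a plain read-modify-write outside a transaction could drop votes when two users vote at the same time.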
Aggregation queries were recently launched and are available for use now: https://cloud.google.com/datastore/docs/aggregation-queries.
Various client libraries also support this feature.