Compute Rank of an Entity in Objectify

My goal is to compute the absolute rank of an entity based on some attribute provided as a string. The basic approach I am using is to issue a query, sort its results in descending order and count all those results which are greater than the attribute value of my particular entity. The query looks something like this
int rank = o.query(Entity.class).order(String.format("-%s", attribute))
        .filter(String.format("%s >", attribute), something).count();
However, something is the part where I am stuck. As far as I understand the concepts of Objectify, querying for specific entity members is out of the question. My next step would be either to use an if-construct to identify the particular entity member (ugly, but fast in terms of dev time), or to start using reflection (a bit less ugly, but slow in terms of dev time).
Either way, I am left with the feeling that I am missing some obvious and/or elegant way to accomplish this task. Any suggestions? Thx.

If I understand this correctly, you want the first entity and you want the count of remaining entities? There are two ways to do this:
1. Use two queries. Use limit(1) on the >= one that will return the first entity. Start it first, but don't materialize the result so that it runs asynchronously in parallel with the second.
2. Instead of count(), run a keys-only query with >=. Keep the first key to do a fetch, and count the rest manually. Keys-only queries cost the same (small ops per count) as count() queries because count() queries are essentially the same thing under the covers.
I would probably go with #2. Either way, I hope that your counts are not large because otherwise you will churn through a lot of small datastore ops and your app will be expensive!
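For the "something" placeholder in the question, reflection is one way to read the attribute value by name. The following is a minimal sketch under stated assumptions: Player is a hypothetical entity, and the in-memory loop only stands in for the datastore query (filter on "attribute >" followed by count()). It returns a 1-based rank, whereas the query in the question yields the bare count.

```java
import java.lang.reflect.Field;
import java.util.List;

// Hypothetical entity used only for this illustration.
class Player {
    String name;
    int score;
    int level;

    Player(String name, int score, int level) {
        this.name = name;
        this.score = score;
        this.level = level;
    }
}

class RankSketch {
    // Reads the field called `attribute` from `entity` via reflection.
    // Primitive fields come back boxed (int -> Integer), so they are Comparable.
    static Comparable<Object> readAttribute(Object entity, String attribute) {
        try {
            Field field = entity.getClass().getDeclaredField(attribute);
            field.setAccessible(true);
            @SuppressWarnings("unchecked")
            Comparable<Object> value = (Comparable<Object>) field.get(entity);
            return value;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot read attribute " + attribute, e);
        }
    }

    // Rank = 1 + number of entities whose attribute is strictly greater.
    // The loop is an in-memory stand-in for the "attribute >" count query.
    static int rank(List<?> all, Object entity, String attribute) {
        Comparable<Object> mine = readAttribute(entity, attribute);
        int greater = 0;
        for (Object other : all) {
            if (readAttribute(other, attribute).compareTo(mine) > 0) {
                greater++;
            }
        }
        return greater + 1;
    }

    public static void main(String[] args) {
        Player a = new Player("a", 50, 3);
        Player b = new Player("b", 70, 1);
        Player c = new Player("c", 60, 2);
        List<Player> all = List.of(a, b, c);
        System.out.println(rank(all, a, "score")); // prints 3
        System.out.println(rank(all, a, "level")); // prints 1
    }
}
```

In a real application only readAttribute would be needed; its result would be passed as the filter value of the datastore query.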


Hibernate: initialization of complex object

I have problems fully loading a very complex object from the DB in a reasonable time and with a reasonable number of queries.
My object has a lot of embedded entities; each entity has references to other entities, which reference yet other entities, and so on (the nesting level is 6).
So, I've created example to demonstrate what I want:
https://github.com/gladorange/hibernate-lazy-loading
I have User.
User has @OneToMany collections of favorite Oranges, Apples, Grapevines and Peaches. Each Grapevine has a @OneToMany collection of Grapes. Each fruit is another entity with just one String field.
I'm creating a user with 30 favorite fruits of each type, and each grapevine has 10 grapes. So in total I have 421 entities in the DB: 30*4 fruits, 10*30 grapes and one user.
And what I want: I want to load them using no more than 6 SQL queries.
And no query should produce a big result set (for this example, "big" means a result set with more than 200 records).
My ideal solution will be the following:
6 requests. The first request returns information about the user, and the size of the result set is 1.
The second request returns information about Apples for this user, and the size of the result set is 30.
The third, fourth and fifth requests return the same as the second (with result set size = 30), but for Grapevines, Oranges and Peaches.
The sixth request returns the Grapes for ALL grapevines.
This is very simple in SQL world, but I can't achieve such with JPA (Hibernate).
I tried following approaches:
Use a fetch join, like from User u join fetch u.oranges .... This is awful. The result set is 30*30*30*30 rows and the execution time is 10 seconds. Number of requests = 3. I tried it without grapes; with grapes the result set is 10 times larger.
Just use lazy loading. This gives the best result in this example (with @Fetch(FetchMode.SUBSELECT) for grapes). But in that case I need to manually iterate over each collection of elements. Also, subselect fetch is too global a setting, so I would like something that works at the query level. Result set and time are near ideal: 6 queries and 43 ms.
Loading with an entity graph. The same as the fetch join, but it also makes a request for every grape to get its grapevine. The resulting time is better (6 seconds), but still awful. Number of requests > 30.
I tried to cheat JPA with "manual" loading of entities in separate query. Like:
SELECT u FROM User u WHERE u.id = 1;
SELECT a FROM Apple a WHERE a.user_id = 1;
This is a little bit worse than lazy loading, since it requires two queries for each collection: the first query to load the entities manually (I have full control over this query, including loading associated entities), and a second query in which Hibernate itself lazy-loads the same entities (this is executed automatically by Hibernate).
Execution time is 52 ms, number of queries = 10 (1 for user, 1 for grape, 4*2 for each fruit collection).
Actually, the "manual" solution in combination with SUBSELECT fetch allows me to use "simple" fetch joins to load necessary entities in one query (like @OneToOne entities), so I'm going to use it. But I don't like that I have to perform two queries to load a collection.
Any suggestions?
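To make the row counts above concrete, here is a small arithmetic sketch (the class and method names are made up for illustration). Join-fetching sibling collections in one query multiplies their sizes, while loading each collection with its own query adds them.

```java
class ResultSetSizeSketch {
    // Rows returned by one query that join-fetches several sibling collections:
    // the sizes multiply, because every combination of elements yields one row.
    static long joinFetchRows(int... collectionSizes) {
        long rows = 1;
        for (int size : collectionSizes) {
            rows *= size;
        }
        return rows;
    }

    // Total rows when each collection is loaded by its own query: the sizes add up.
    static long separateQueryRows(int... resultSetSizes) {
        long rows = 0;
        for (int size : resultSetSizes) {
            rows += size;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Four collections of 30 elements joined in a single query.
        System.out.println(joinFetchRows(30, 30, 30, 30)); // prints 810000
        // User, four fruit collections, and all grapes, one query each.
        System.out.println(separateQueryRows(1, 30, 30, 30, 30, 300)); // prints 421
    }
}
```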
I usually cover 99% of such use cases by using batch fetching for both entities and collections. If you process the fetched entities in the same transaction/session in which you read them, then there is nothing additionally that you need to do, just navigate to the associations needed by the processing logic and the generated queries will be very optimal. If you want to return the fetched entities as detached, then you initialize the associations manually:
User user = entityManager.find(User.class, userId);
Hibernate.initialize(user.getOranges());
Hibernate.initialize(user.getApples());
Hibernate.initialize(user.getGrapevines());
Hibernate.initialize(user.getPeaches());
user.getGrapevines().forEach(grapevine -> Hibernate.initialize(grapevine.getGrapes()));
Note that the last command will not actually execute a query for each grapevine, as multiple grapes collections (up to the specified @BatchSize) are initialized when you initialize the first one. You simply iterate all of them to make sure all are initialized.
This technique resembles your manual approach but is more efficient (queries are not repeated for each collection), and is more readable and maintainable in my opinion (you just call Hibernate.initialize instead of manually writing the same query that Hibernate generates automatically).
I'm going to suggest yet another option on how to lazily fetch collections of Grapes in Grapevine:
@OneToMany
@BatchSize(size = 30)
private List<Grape> grapes = new ArrayList<>();
Instead of doing a sub-select, this one uses in (?, ?, etc) to fetch many collections of Grapes at once. A Grapevine ID is bound to each ? placeholder. This is opposed to querying one List<Grape> collection at a time.
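The idea behind batch fetching can be sketched in plain Java, without Hibernate: the owner ids are split into chunks of at most the batch size, and each chunk becomes one query with an in (...) list. This is only an illustration of the principle. The table and column names are hypothetical, and Hibernate's real batching strategy may pick chunk sizes differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BatchFetchSketch {
    // Splits the owner ids into chunks of at most batchSize elements.
    static List<List<Long>> partition(List<Long> ids, int batchSize) {
        List<List<Long>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    // Builds one illustrative statement; table and column names are hypothetical.
    static String inQuery(int idCount) {
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "select * from grape where grapevine_id in (" + placeholders + ")";
    }

    public static void main(String[] args) {
        List<Long> grapevineIds = new ArrayList<>();
        for (long id = 1; id <= 30; id++) {
            grapevineIds.add(id);
        }
        // With a batch size of 30, all 30 grapes collections fit in one query.
        for (List<Long> batch : partition(grapevineIds, 30)) {
            System.out.println(inQuery(batch.size()) + " -- bound ids: " + batch);
        }
    }
}
```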
That's just yet another technique to your arsenal.
I do not quite understand your demands here. It seems to me you want Hibernate to do something that it's not designed to do, and when it can't, you want a hack-solution that is far from optimal. Why not loosen the restrictions and get something that works? Why do you even have these restrictions in the first place?
Some general pointers:
When using Hibernate/JPA, you do not control the queries. You are not supposed to either (with a few exceptions). How many queries, the order they are executed in, etc, is pretty much beyond your control. If you want complete control of your queries, just skip JPA and use JDBC instead (Spring JDBC for instance.)
Understanding lazy-loading is key to making decisions in these types of situations. Lazy-loaded relations are not fetched when getting the owning entity; instead Hibernate goes back to the database and gets them when they are actually used. This means that lazy-loading pays off if you don't use the attribute every time, but carries a penalty the times you actually use it. (Fetch join is used for eager-fetching a lazy relation. It is not really meant for use with a regular load from the database.)
Query optimization using Hibernate should not be your first line of action. Always start with your database. Is it modelled correctly, with primary keys and foreign keys, normal forms, etc? Do you have search indexes in the proper places (typically on foreign keys)?
Testing for performance on a very limited dataset probably won't give the best results. There will probably be overhead with connections, etc, that is larger than the time spent actually running the queries. Also, there might be random hiccups that cost a few milliseconds, which can make a result misleading.
Small tip from looking at your code: Never provide setters for collections in entities. If actually invoked within a transaction, Hibernate will throw an exception.
tryManualLoading probably does more than you think. First, it fetches the user (with lazy loading), then it fetches each of the fruits, then it fetches the fruits again through lazy-loading. (Unless Hibernate understands that the queries will be the same as when lazy loading.)
You don't actually have to loop through the entire collection in order to initiate lazy-loading. You can do this user.getOranges().size(), or Hibernate.initialize(user.getOranges()). For the grapevine you would have to iterate to initialize all the grapes though.
With proper database design, and lazy-loading in the correct places, there shouldn't be a need for anything other than:
em.find(User.class, userId);
And then maybe a join fetch query if a lazy load takes a lot of time.
In my experience, the most important factor for speeding up Hibernate is search indexes in the database.

Flexible search in database

I have a legacy system that allows users to manage some entities called "TRANSACTION" in the (MySQL) DB, mapped to a Transaction class in Java. Transaction objects have about 30 fields; some of them are columns in the DB, some of them are joins to other tables, like CUSTOMER, PRODUCT, COMPANY and so on.
Users have access to a "Search" screen, where they are allowed to search using a TransactionId and a couple of extra fields, but they want more flexibility. Basically, they want to be able to search using any field in TRANSACTION or any linked table.
I don't know how to make the search both flexible and quick. Is there any way? I don't think that having an index for every combination of columns is a valid solution, but full table scans are also not valid. Is there any reasonable design? I'm using Criteria to build the queries, but this is not the problem.
Also, I think mysql is not using the right indexes, since when I make hibernate log the sql command, I can almost always improve the response time by forcing an index... I'm starting to use something like this trick adapted to Criteria to force a specific index use, but I'm not proud of the "if" chain. I'm getting something like
if(queryDto.getFirstName() != null){
//force index "IDX_TX_BY_FIRSTNAME"
}else if(queryDto.getProduct() != null){
//force index "IDX_TX_BY_PRODUCT"
}
and it feels horrible.
Sorry if the question is "too open". I think this is a typical problem, but I can't find a good approach.
Hibernate is very good for writing while SQL still excels on reading data. JOOQ might be a better alternative in your case, and since you're using MySQL it's free of charge anyway.
JOOQ is like Criteria on steroids, and you can build more complex queries using the exact syntax you'd use for native querying. You have type-safety and all features your current DB has to offer.
As for indexes, you can't simply index every field combination. It's better to index the most used ones and try using compound indexes that cover as many use cases as possible. Sometimes the query executor will not use an index because it's faster otherwise, so it's not always a good idea to force the index. What works on your test environment might not hold for the production system.
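If index hints are kept despite the caveat above, the if chain from the question could be replaced with an ordered lookup table. This is a sketch only: QueryDto here is a hypothetical stand-in for the question's DTO, and the result is just the index name, to be passed to whatever mechanism actually applies the hint.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-in for the question's queryDto.
class QueryDto {
    String firstName;
    String product;

    String getFirstName() { return firstName; }
    String getProduct() { return product; }
}

class IndexHintSketch {
    // Insertion order defines priority, like the order of the if/else chain.
    static final Map<Predicate<QueryDto>, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(q -> q.getFirstName() != null, "IDX_TX_BY_FIRSTNAME");
        RULES.put(q -> q.getProduct() != null, "IDX_TX_BY_PRODUCT");
    }

    // Returns the first index whose rule matches, or empty to let MySQL decide.
    static Optional<String> indexFor(QueryDto dto) {
        for (Map.Entry<Predicate<QueryDto>, String> rule : RULES.entrySet()) {
            if (rule.getKey().test(dto)) {
                return Optional.of(rule.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        QueryDto dto = new QueryDto();
        dto.product = "widget";
        System.out.println(indexFor(dto).orElse("(no hint)")); // prints IDX_TX_BY_PRODUCT
    }
}
```

Adding a new searchable field then means adding one rule rather than another else-if branch.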

Is it better to repeat "rows" or "columns" when designing for app-engine datastore

I'm fairly new to the app-engine datastore but get that it is designed more like a Hashtable than a database table. This leads me to think it's better to have fewer rows (entities) and more columns (object properties) "in general".
That is, you can create a Car object with properties color and count or you can create it with properties redCount, blueCount, greenCount, assuming you know all the colors (dimensions). If you are storing instances of those objects you would have either three or one:
For each color and count, save new entity:
"red", 3
"blue", 8
"green", 4
Or save one entity with properties for each possible color: 3, 8, 4
Obviously there are some design challenges with the latter approach but wondering what the advice is for getting out of relational thinking? Seems datastore is quite happy with hundreds of "columns" / properties.
Good job trying to get out of relational thinking. It's good to move away from the row/table thinking.
A closer approximation, at least on the programming side, would be to think of entities as data structure or class instances stored remotely. These entities have properties. Separate from the entities are indexes, which essentially store lists of entities that match certain criteria for properties.
When you write an entity, the datastore updates that instance in memory/storage, and then updates all the indexes.
When you do a query, you essentially walk through one of the index lists.
That should give you a basic framework to think about the datastore.
When you design for the datastore, you generally have to design for cost, and to a lesser degree, performance. On the write side, you want to minimize the number of indexes. On the read side, you want to minimize the number of entities you're reading, so the idea of having separate entities for red, blue, green could be a bad idea, tripling your read costs if you constantly need to read back the number of red/blue/green cars. There could be some really obscure corner case where this makes sense.
Your design considerations generally should go along the lines of:
What types of queries do I need to do?
How do I structure my data to make these queries easy to do (since the GAE query capabilities are limited)? Would a query be easier if I duplicate data somehow, and would I be able to maintain this duplicated data on my own?
How can I minimize the number of indexes that need to be updated when I update an entity?
Are there any special cases where I must have full consistency and therefore need to adjust the structure so that consistent queries can be made?
Are there any write performance cases I need to be careful about?
Without knowing exactly what kind of query you're going to make, this answer will likely not be right, but it should illustrate how you might want to think of this.
I'll assume you have an application where people register their cars, and you have some dashboard that polls the datastore and displays the number of cars of each color. In that case the traditional mechanism of having a Car class with color and count attributes still makes sense, because it minimizes the number of indexed properties, thus reducing your write costs.
It's a bit of an odd example, because I can't tell if you want to just have a single entity that keeps track of your counts (in which case you don't even need to do a query, you can just fetch that count), or if you have a number of entities of counts that you may fetch and sum up.
If user updates modify the same entity though, you might run into performance problems, you should read through this: https://developers.google.com/appengine/articles/sharding_counters
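The sharding idea from the linked article can be sketched in plain Java. This is an in-memory illustration only: each array slot stands in for one counter-shard entity in the datastore, and the class name and shard count are assumptions made for the example. Writes pick a random shard, so concurrent updates rarely hit the same entity; reads sum all shards.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

class ShardedCounterSketch {
    // Each slot stands in for one shard entity in the datastore.
    private final AtomicLongArray shards;

    ShardedCounterSketch(int shardCount) {
        this.shards = new AtomicLongArray(shardCount);
    }

    // A write touches one randomly chosen shard, spreading contention.
    void increment() {
        int shard = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(shard);
    }

    // A read has to fetch and sum every shard.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ShardedCounterSketch redCars = new ShardedCounterSketch(5);
        for (int i = 0; i < 100; i++) {
            redCars.increment();
        }
        System.out.println(redCars.total()); // prints 100
    }
}
```

The trade-off is visible in the code: more shards mean less write contention but more entities to read for each total.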
I would recommend not storing several values in one cell in a home-grown format, unless it is encoded in JSON or something similar.
{'red':3, 'blue':4}
JSON is ok because it can be easily decoded into a data structure within java like a list or something.
There is nothing wrong with lots of columns in an app. You will get more gains by having a column for red, blue and green. There would have to be a very large number of columns to see a big slow down.
I think it safe to say that there is no significant performance penalty for having a lot of properties (columns) for each entity (row) in a database model. Nor is there a penalty for lots of rows (entities), or even lots of tables (db classes). If I were doing your example, I would definitely set up separate properties for color and count. We always explicitly call out indexed=False/True to ensure we avoid the dread problem of wondering why your indexes are so large when you only have a few properties indexed (forgetting that the default is True). Although GAE gives you nice properties such as lists that can be indexed, these specialized properties are not without their overhead costs. Understand these well whenever you use them.
One thing that I think is important to remember with GAE when plotting your design is that standard queries are slow, and slow equates to increased latency, and increased latency results in more instances, and more expense (and other frustrations). Before defaulting to a standard query, always ask (if this is a mission-critical part of your code) if you can accomplish the same by setting up a more denormalized datastructure. For example, linking a set of entities together using a common key then doing a series of get_by_id() calls can often be advantageous (be sure to manage ndb's auto memcache when doing this - not everything needs to be cached). Ancestor queries are also much faster than standard queries (but impose a 1 update per second limit on the family group.)
Concluding: within reason, the number of properties (columns) in an entity (row) and the total number of classes (tables) will not impose any real issues. However, if you are coming from a standard relational DB background, your inclination will be to use SQL-like queries to move your logic along. Remember that in GAE standard GQL queries are slow and costly, and always think about things like using denormalization to avoid them. GAE is a big, flat, highly performant noSQL-like resource. Use it as such. Take the extra time to avoid reliance on GQL queries; it will be worth it.

JDO on GoogleAppEngine: How to count and group with BigTable

I need to collect some statistics on my entities in the datastore.
As an example, I need to know how many objects of a kind I have, how many objects have some properties set to particular values, etc.
In usual relational DBMS I may use
SELECT COUNT(*) ... WHERE property=<some value>
or
SELECT MAX(*), ... GROUP BY property
etc.
But here I cannot see any of these structures.
Moreover, I cannot load all the objects in memory (e.g. using pm.getExtent(MyCall.class, false)) as I have too many entities (more than 100k).
Do you know any trick to achieve my goal?
Actually it depends on your specific requirements.
That said, a common way is to prepare this stats data in the background.
For example, you can run a few tasks using the Task Queue service. Each task runs a query like select x where x.property == some value, together with a cursor and a sum variable. On the first step, the cursor is empty and the sum is zero. The task then iterates over the query results for up to 1000 items (the query limit) or 9 minutes (the task limit), incrementing the sum for each item. If it has not finished, it adds a request for the next step to the queue, with the new cursor and sum values. A cursor is easily serializable to a string.
At the final step, save the resulting value somewhere, for example in a stats results table.
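The hand-off between steps can be sketched in plain Java. This is an in-memory simulation under stated assumptions: the list stands in for the datastore kind, the cursor is modeled as a stringified offset rather than a real datastore cursor, and the while loop stands in for re-enqueueing a task. In the real setup the query itself applies the filter, so each step would only iterate matching results.

```java
import java.util.ArrayList;
import java.util.List;

class CursorCountSketch {
    static final int PAGE_SIZE = 1000; // mirrors the per-query limit mentioned above

    // State handed from one task run to the next; the cursor is a plain string.
    static final class State {
        final String cursor;
        final long sum;
        final boolean done;

        State(String cursor, long sum, boolean done) {
            this.cursor = cursor;
            this.sum = sum;
            this.done = done;
        }
    }

    // One task run: reads at most PAGE_SIZE rows after the cursor,
    // adds the matching ones to the running sum, and returns the new state.
    static State step(List<String> rows, String wanted, State in) {
        int start = in.cursor.isEmpty() ? 0 : Integer.parseInt(in.cursor);
        int end = Math.min(start + PAGE_SIZE, rows.size());
        long sum = in.sum;
        for (int i = start; i < end; i++) {
            if (rows.get(i).equals(wanted)) {
                sum++;
            }
        }
        return new State(String.valueOf(end), sum, end >= rows.size());
    }

    // Drives the steps; real code would enqueue a new task instead of looping.
    static long countAll(List<String> rows, String wanted) {
        State state = new State("", 0, false);
        while (!state.done) {
            state = step(rows, wanted, state);
        }
        return state.sum;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(i % 5 == 0 ? "x" : "y");
        }
        System.out.println(countAll(rows, "x")); // prints 500
    }
}
```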
Take a look at:
task queues - http://code.google.com/intl/en/appengine/docs/java/taskqueue/
cursor - http://code.google.com/intl/en/appengine/docs/java/datastore/queries.html#Query_Cursors
Also, this stats/aggregation work really depends on your actual task, requirements and project. There are a few ways to accomplish it, each optimal for different tasks. There is no standard way like in SQL.
Support for aggregate functions is limited on GAE. This is primarily an artifact of the schema-less nature of BigTable. The alternative is to maintain the aggregate functions as separate fields yourself to access them quickly.
To do a count, you could do something like this:
Query q = em.createQuery("SELECT count(p) FROM your.package.Class p");
// Cast to Number rather than Integer: JPA providers commonly return a Long for count().
Number count = (Number) q.getSingleResult();
but this will probably count at most 1000 rows, since GAE limits the number of rows fetched to 1000.
Some helpful reading how to work around these issues --
http://marceloverdijk.blogspot.com/2009/06/google-app-engine-datastore-doubts.html
Is there a way to do aggregate functions on Google App Engine?

What is the best way to handle one to many relationships in the low level datastore api?

I've been using the low level datastore API for App Engine in Java for a while now and I'm trying to figure out the best way to handle one to many relationships. Imagine a one to many relationship like "Any one student can have zero or more computers, but every computer is owned by exactly one student".
The two options are to:
have the student entity store a list of Keys of the computers associated with the student
have the computer entity store a single Key of the student who owns the computer
I have a feeling option two is better but I am curious what other people think.
The advantage of option one is that you can get all the 'manys' back without using a Query. One can ask the datastore for all entities using get() and passing in the stored list of keys. The problem with this approach is that you cannot have the datastore do any sorting of the values that get returned from get(). You must do the sorting yourself. Plus, you have to manage a list rather than a single Key.
Option two seems nice because there is no list to maintain. Also, you can sort by properties of the computer as long as there is an index for that property. Imagine trying to get all the computers for a student where the results are sorted by purchase date. With approach two it is a simple query, and no sorting is done in our code (the datastore's index takes care of it).
Sorting is not really hard, but a little more time consuming (~O(nlogn) for a sort) than having a sorted index (~O(n) for going through the index). The tradeoff is an index (space in the datastore) for processing time. As I said my instinct tells me option two is a better general solution because it gives the developer a little more flexibility in getting results back in order at the cost of additional indexes (which with the google pricing model are pretty cheap). Does anyone agree, disagree, or have comments?
Both approaches are valid in different situations, though option two - storing a single reference on the 'many' side - is the more common approach. Which you use depends on how you need to access your data.
Have you considered doing both? Then you could quickly get a list of computers a student owns by key OR use a query which returns results in some sorted order. I don't think maintaining a list of keys on the student model is as intimidating as you think.
Don't underestimate the benefit of fetching entities directly by keys. According to this article, this can be 4-5x faster than queries.
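The two options from the question can be compared with an in-memory sketch. Everything here is hypothetical: the Computer class, the purchaseYear property, and the Map that stands in for the datastore. In the real datastore, option two would get its ordering from an index instead of the explicit sort used here to emulate it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OneToManySketch {
    // Hypothetical computer entity; ownerKey is the single reference of option two.
    static final class Computer {
        final long key;
        final long ownerKey;
        final int purchaseYear;

        Computer(long key, long ownerKey, int purchaseYear) {
            this.key = key;
            this.ownerKey = ownerKey;
            this.purchaseYear = purchaseYear;
        }
    }

    // Option one: batch get by the key list stored on the student,
    // then sort in application code.
    static List<Computer> byKeyList(Map<Long, Computer> store, List<Long> computerKeys) {
        List<Computer> result = new ArrayList<>();
        for (Long key : computerKeys) {
            Computer computer = store.get(key);
            if (computer != null) {
                result.add(computer);
            }
        }
        result.sort(Comparator.comparingInt(c -> c.purchaseYear));
        return result;
    }

    // Option two: "query" on the owner reference. The scan and sort emulate
    // what an index on (owner, purchase date) would provide in the datastore.
    static List<Computer> byOwnerQuery(Map<Long, Computer> store, long studentKey) {
        List<Computer> result = new ArrayList<>();
        for (Computer computer : store.values()) {
            if (computer.ownerKey == studentKey) {
                result.add(computer);
            }
        }
        result.sort(Comparator.comparingInt(c -> c.purchaseYear));
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Computer> store = new HashMap<>();
        store.put(1L, new Computer(1L, 100L, 2012));
        store.put(2L, new Computer(2L, 100L, 2009));
        store.put(3L, new Computer(3L, 200L, 2011));

        for (Computer c : byKeyList(store, List.of(1L, 2L))) {
            System.out.println("option one: " + c.key + " (" + c.purchaseYear + ")");
        }
        for (Computer c : byOwnerQuery(store, 100L)) {
            System.out.println("option two: " + c.key + " (" + c.purchaseYear + ")");
        }
    }
}
```

Both calls return computer 2 before computer 1; the difference lies in who maintains the key list and who does the sorting.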
