GAE Datastore - Objectify Read Performance - java

I'm more used to SQL databases than Google DataStore, but I'm finding that my read performance is much slower than I expected. I just want to check if either I've structured my data incorrectly, or if my performance expectations are unreasonable.
I have a DataStore entity called assessment, and on that entity I have added a property Client. This is modelling a master-detail relationship where client is the master and assessment is the detail.
In the GCP console, my entity looks like this :
There are 112 occurrences of this entity, and to query this entity by client takes just under 3 seconds.
I use Objectify to retrieve the data, with the following statement :
List<Assessment> assessments = ofy().load()
.type(Assessment.class)
.filter("client = ", Key.create(Client.class, clientId))
.list();
Coming from an SQL background, 3 seconds to read from 112 a row table via an index seems unreasonably slow to me. Are my expectations of DataStore wrong, or am I doing something stupid in either the way I've chosen to store the data, or the way that I'm reading it?

I'm leaving this here in case another poor soul comes along ....
It seems the problem was my NordLayer VPN
Log out of that and the response time falls to 68 milliseconds!

Related

Java Hibernate generating 1-off queries taking minutes to complete

When running queries in hibernate it is loading related records with one-off queries.
Short version, can someone verify that this is a N+1 type issue?
And, if so provide a good resource on resolving them?
There are some queries that my application runs that return thousands of records. This is normal, however, (what i think is happening) hibernate is then loading related records using specific one-off queries.
In my case, i am querying the db about 6 times per record in the desired outter-most query. i.e. if there are 500 results in the original query, there are about 3,000 total queries being run.
What i think is happening:
Imagine i have a people table in the DB, i may also have an emails table, phone numbers table, and addresses table. I think that when i query the person table hibernate is fetching related records from phone numbers, emails ... In my case, looking at the generated HQL i can see that hibernate is running queries like this:
11:56:47,413 INFO [stdout] (default task-3) Hibernate: select identityen0_.id as id1_14_, identityen0_.auth_code as auth_cod2_14_, identityen0_.auth_provider_name as auth_pro3_14_, identityen0_.auth_provider_user_access_token as auth_pro4_14_, identityen0_.created_timestamp as created_5_14_, identityen0_.expiration as expirati6_14_, identityen0_.last_updated_timestamp as last_upd7_14_, identityen0_.person_id as person_10_14_, identityen0_.user_auth_provider_id as user_aut8_14_, identityen0_.username as username9_14_ from identities identityen0_ where identityen0_.auth_code=?
Notice that there are hundreds of these queries (one for each identity (person)).
I think this is because looking at the end of the query we can see where identityen0_.auth_code=? which implies that hibernate is doing a single query to get the identity info (one at a time) from a list of auth codes that it has.
This query takes minutes to complete and i am trying to speed that up. The obvious starting point would be to run fewer DB queries (avg latency of DB is 50-250 ms). I am wondering where to even start? Surely hibernate supports some kind of process to resolve this kind of issue, right?
Using hibernate-entitymanager 5.3.20.final
Thanks for any help.

Processing large amount of data from PostgreSQL

I am looking for a way how to process a large amount of data that are loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something or these are my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"
Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
entityManager.detach(myObject);
}
If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
Since you asked for ideas, I have seen this problem being resolved in below options depending on how it fits in your environment:
1) First try with JDBC and Java, simple code and you can do a test run on your database and data to see if this improvement is enough. You will here need to compromise on the other benefits of Hibernate.
2) In point 1, use Multi-threading with multiple connections pulling data to one queue and then you can use that queue to process further or print as you need. you may consider Kafka also.
3) If data is going to further keep on increasing you can consider Spark as the latest technology which can make it all in memory and will be much more faster.
These are some of the options, please like if these ideas help you anywhere.
Why do you 30M keep in memory ??
it's better to rewrite it to pure sql and use pagination based on id
you will be sent 5 as the id of the last comment and you will issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
if you need to update the entire table then you need to put the task in the cron but also there to process it in parts
the memory of the road and downloading 30M is very expensive - you need to process parts 0-20 20-n n+20

Mapping a single row with a BLOB column using iBatis is very slow

Please help me crack this issue, basically a single row is being retrieved from a Oracle db that contains a BLOB column (size of the BLOB is about 350k) and map to a Java object using iBatis 2.5 but the mapping part (result is mapped to a resultmap) is taking around 40 seconds to be complete. Do you maybe know what can be the bottleneck in this situation ?
The easiest way to find a bottleneck on the Oracle side is to examine the wait events of the session running the SQL. Use v$session and or v$active_session_history (if AWR is licesned).
Sometimes "idle" wait events on the Oracle side point to bottlenecks on client or network.

Synchronizing table data across databases

I have one table that records its row insert/update timestamps on a field.
I want to synchronize data in this table with another table on another db server. Two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable
My workflow:
I use a global last_sync_date parameter and query table Master for
the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.
What would be an alternative workflow dealing with this scenario?
It sounds like you need a transactional message queue.
How this works is simple. When you update the master db you can send a message to the message broker (of whatever the update was) which can go to any number of queues. Each slave db can have its own queue and because queue's preserve order the process should eventually synchronize correctly (ironically this is sort of how most RDBMS do replication internally).
Think of the Message Queue as a sort of SCM change-list or patch-list database. That is for the most part the same (or roughly the same) SQL statements sent to master should be replicated to the other databases eventually. Don't worry about loosing messages as most message queues support durability and transactions.
I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.
Based on your comments:
See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not you should know about Message queues as it goes hand-in-hand with batch processing.
RabbitMQ has a good picture diagram of how messaging works
The contents of your message might be the entire row and whether its a CRUD, UPDATE, DELETE. You can use whatever format (e.g. JSON. See spring integration on recommendations).
You could even send the direct SQL statements as a message!
BTW your concern of NOT IN being a performance problem is not a very good one as there are a plethora of work-arounds but given your not wanting to do DB specific things (like triggers and replication) I still feel a message queue is your best option.
EDIT - Non MQ route
Since I gave you a tough time about asking this quesiton I will continue to try to help.
Besides the message queue you can do some sort of XML file like you we were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). Now if you go this route you will want to SELECT * WHERE CREATE_TIME < ? is less than the current time. Basically your only getting the rows at a snapshot.
Now on your other database for the delete your going to remove rows by inner joining on a ID table but with != (that is you can use JOINS instead of slow NOT IN). Luckily you only need all the ids for delete and not the other columns. The other columns you can use a delta based on the the update time stamp column (for update, and create aka insert).
I am not sure about the solution. But I hope these links may help you.
http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm
http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases
Have a look at Oracle GoldenGate:
Oracle GoldenGate is a comprehensive software package for enabling the
replication of data in heterogeneous data environments. The product
set enables high availability solutions, real-time data integration,
transactional change data capture, data replication, transformations,
and verification between operational and analytical enterprise
systems.
SymmetricDS:
SymmetricDS is open source software for multi-master database
replication, filtered synchronization, or transformation across the
network in a heterogeneous environment. It supports multiple
subscribers with one direction or bi-directional asynchronous data
replication.
Daffodil Replicator:
Daffodil Replicator is a Java tool for data synchronization, data
migration, and data backup between various database servers.
Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column -- ie. mark the row as deleted instead of actually deleting it immediately. Delete it after having exported the delete action.
In case you cannot alter schema usage in an existing app:
Can't you use triggers at all? How about a second ("hidden") table that gets populated with every insert/update/delete and which would constitute the content of the next to be generated xml export file? That is a common concept: a history (or "log") table: it would have its own progressing id column which can be used as an export marker.
Very interesting question.
In may case I was having enough RAM to load all ids from master and slave tables to diff them.
If ids in master table are sequential you try to may maintain a set of full filled ranges in master table (ranges with all ids used, without blanks, like 100,101,102,103).
To find removed ids without loading all of them to the memory you may execute SQL query to count number of records with id >= full_region.start and id <= full_region.end for each full filled region. If result of query == (full_region.end - full_region.end) + 1 it means all record in region are not deleted. Otherwise - split region into 2 parts and do the same check for both of them (in a lot of cases only one side contains removed records).
After some length of range (about 5000 I think) it will faster to load all present ids and check for absent using Set.
Also there is a sense to load all ids to the memory for a batch of small (10-20 records) regions.
Make a history table for the table that needs to be synchronized (basically a duplicate of that table, with a few extra fields perhaps) and insert the entire row every time something is inserted/updated/deleted in the active table.
Write a Spring batch job to sync the data to Slave machine based on the history table's extra fields
hope this helps..
A potential option for allowing deletes within your current workflow:
In the case that the trigger restriction is limited to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database to store only the unique identifiers of the deleted rows (or whatever unique key would enable you to most efficiently delete your deleted rows).
Those ids would need to be inserted by a trigger on your master table on delete.
Using the same mechanism as your insert/updates, create a task following your inserts and updates. You could export your helper table to xml, as you noted in your current workflow.
This task would simply delete the rows out of the slave table, then delete all data from your helper table following completion of the task. Log any errors from the task so that you can troubleshoot this since there is no audit trail.
If your database has a transaction dump log, just ship that one.
It is possible with MySQL and should be possible with PostgreSQL.
I would agree with another comment - this requires the usage of triggers. I think another table should hold the history of your sql statements. See this answer about using 2008 extended events... Then, you can get the entire sql, and store the result query in the history table. Its up to you if you want to store it as a mysql query or a mssql query.
Here's my take. Do you really need to deal with this? I assume that the slave is for reporting purposes. So the question I would ask is how up to date should it be? Is it ok if the data is one day old? Do you plan a nightly refresh?
If so, forget about this online sync process, download the full tables; ship it to the mysql and batch load it. Processing time might be a lot quicker than you think.

solr indexing performance with mysql driver

so, using solr 4.0
I have a fairly straight-up setup of an entity, with 1 sub entity (1:N relation)
the data to import sits on a mysql server
the main table has about 30 million records
the sub table has about 5 million records(most parent entities don't have the sub entity, the rest generally have a single 1)
I am running into rather horrible indexing(importing) performance. about 80 entities(docs) per second. so to index this table it'll in theory take few days.
now from what I am seeing that solr reports is, for example, if I tell it to index the first 1000 entities it actually issues 1000+ queries to sql. I have also tried setting the batchSize property for the data source with no luck... only -1 works(otherwise out of memory exception).
really not sure what I can do to optimize this, is there no PROPER data importer for mysql?
you could use CachedSqlEntityProcessor so that the sub entity query at least is cached...
Thought the cachedEntity approach helped me in another issue, I have found that using nested entities is usually not just the went to go.
The logic to fire the sub entity query for each "root" entity is just never going to work.
I've re-written my statements to SQL JOIN which fetches both root and sub entities as a single row and mapped to fields accordingly and performance improved significantly.

Categories