I have a database with a quite large entity model.
In total, there are 14 tables with around 100k records on average.
When I tested my application, which takes one entity from the database, converts it to JSON, and returns it to the caller, it took 7 seconds to get the entry.
This doesn't seem to be an initialization issue, because if I make the same call twice in a row, both calls take around 10 seconds to get the data.
When I enable SQL logging, I find that for each entity read, Hibernate sends hundreds of SQL queries to the database.
(The actual number depends on how many connections/licenses/products/services etc. the entity has, but for my test entry I got 286 queries, which took around 7 seconds in total.)
When the database is empty (except for the data that the test should return), it still takes around 6 seconds.
I suppose the issue is that I have the fetch type of my @OneToMany and @ManyToMany associations set to LAZY, but when I set them to EAGER, I get the multiple fetch bags error.
@javax.persistence.OneToMany(fetch = javax.persistence.FetchType.LAZY)
@org.hibernate.annotations.LazyCollection(org.hibernate.annotations.LazyCollectionOption.FALSE)
@javax.persistence.JoinColumn(name = "ProductId")
This is an example of a @OneToMany relation I have, which simulates eager fetching without triggering the multiple fetch bags error.
Is this a common issue?
And how do I improve the read performance?
Hibernate does not allow more than one collection of type List to be marked as EAGER.
The best way to load the data eagerly is to use a JOIN FETCH:
select e from YourEntity e JOIN FETCH e.yourList l
Another option could be to use an EntityGraph to define the EAGER loading.
Read more about that topic here:
https://vladmihalcea.com/hibernate-facts-the-importance-of-fetch-strategy/
If you use Spring Data JPA repository interfaces you can use NamedEntityGraphs:
https://docs.spring.io/spring-data/jpa/docs/current/reference/html/#jpa.entity-graph
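For illustration, with Spring Data JPA a named entity graph lets you keep the mappings LAZY and opt into eager loading per query. A minimal sketch (the entity, field, and graph names here are made up, not taken from the question):

import java.util.List;
import java.util.Optional;

import javax.persistence.*;

import org.springframework.data.jpa.repository.EntityGraph;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
@NamedEntityGraph(
        name = "Product.withLicenses",
        attributeNodes = @NamedAttributeNode("licenses"))
class Product {

    @Id
    private Long id;

    // The mapping stays LAZY; the graph decides what gets fetched per query.
    // Note: asking one graph to fetch several List (bag) collections at once
    // runs into the same multiple-bag limitation, so fetch them in separate
    // queries (or map them as Sets).
    @OneToMany(fetch = FetchType.LAZY)
    @JoinColumn(name = "ProductId")
    private List<License> licenses;
}

@Entity
class License {

    @Id
    private Long id;
}

interface ProductRepository extends JpaRepository<Product, Long> {

    // Applies the named graph to this query only.
    @EntityGraph(value = "Product.withLicenses", type = EntityGraph.EntityGraphType.LOAD)
    Optional<Product> findWithLicensesById(Long id);
}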
I'm writing a batch process with Spring Batch. I have to move about 2,000,000 records from the data source (an Oracle database) to the target (a Kafka broker). I'm hesitating over which ItemReader I should choose for this job:
JdbcCursorItemReader: if I understand correctly, it will open a cursor and iterate through the ResultSet of ALL of those records one by one; performance is not an issue; under the hood the database keeps a snapshot of the records satisfying the where clause at the time of query execution;
RepositoryItemReader: might be less performant; partitioning is based on a paging mechanism, so the query is executed for each page; there is a possibility of omitting some records that are written to the database during the fetch of the 2,000,000 records, which wouldn't happen in the former case (is my reasoning even correct?)
Summary: I want to send all 2,000,000 records, as they were at the time of the query execution, in a partitioned manner. Am I overthinking this problem? Maybe skipping new records isn't such a problem, given that future executions of the job will pick up updates? Or maybe my reasoning regarding RepositoryItemReader is not correct?
Keeping a cursor open for extended periods of time is not always ideal. Depending on the DB you're using, it may not be optimized; for example, some DBs do not honor fetchSize and will retrieve results one by one as they are requested.
I would go with the RepositoryItemReader or one of the PagingItemReader implementations.
I'm not quite following whether your concern is that you DO or DO NOT want to omit new records.
If you DO want to omit new records, you should be able to add a predicate to your where clause to not go past a certain ID or timestamp value. If neither of those is available, you can set maxItemCount() on the reader based on a count query you execute up front, before the job (in a listener, for example).
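For illustration, a minimal sketch of such a paging reader (SourceRecord and SourceRecordRepository are placeholders for your own entity and Spring Data repository, not names from the question):

import java.util.Collections;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.springframework.batch.item.data.RepositoryItemReader;
import org.springframework.batch.item.data.builder.RepositoryItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.domain.Sort;
import org.springframework.data.repository.PagingAndSortingRepository;

// Placeholder entity and repository standing in for the real Oracle-backed ones.
@Entity
class SourceRecord {
    @Id
    private Long id;
}

interface SourceRecordRepository extends PagingAndSortingRepository<SourceRecord, Long> {
}

@Configuration
class ReaderConfig {

    @Bean
    public RepositoryItemReader<SourceRecord> recordReader(SourceRecordRepository repository) {
        return new RepositoryItemReaderBuilder<SourceRecord>()
                .name("recordReader")
                .repository(repository)
                .methodName("findAll")   // any paged finder method works here
                .pageSize(1000)          // each page runs as its own query
                .sorts(Collections.singletonMap("id", Sort.Direction.ASC)) // stable order across pages
                .build();
    }
}

If you also want to cap the read at a snapshot taken before the job starts, setMaxItemCount(...) can be called on the built reader with the result of that up-front count query.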
I have searched for this enough and haven't found the answer yet, so I am asking.
According to the Google Cloud Datastore docs:
There is a write throughput limit of about one transaction per second
within a single entity group.
Now let's say I have an entity User and another entity Car. They have a common parent, so User + Car + their parent is one entity group. Right?
Let's assume that in the Datastore, User and Car have a million instances/rows each.
Suppose I fire a transactional query to update an instance/row in the Datastore.
My confusion is: how many entity group instances get locked (and are thus subject to the write limit) in Google Datastore?
A. User + Car (Comprehensively with twenty million instances)
B. Just 1 instance of User + Car? (1 user row & 1 car row)
In database parlance, User is an Entity Kind/Table. So does the entire Table/Kind get locked for one write operation, or does just one instance/Row get locked?
If A is the case, does that mean that for 1 write, all 20 million rows of User + Car entities will be locked? That's crazy. What if I have to update all 20 million rows? If a write operation updates just 1 row, will 20 million rows require 20 million seconds to avoid any contention?
an entity group is a set of entities connected through ancestry to a
common root element. The organization of data into entity groups can
limit what transactions can be performed:
See the "Python" docs here. I'm surprised it wasn't somewhere in the Java documentation you linked.
I finally found the answer, in this Datastore article:
In the example above, each organization may need to update the record of any person in the organization. Consider a scenario where there are 1,000 people in the “ateam” and each person may have one update per second on any of the properties. As a result, there may be up to 1,000 updates per second in the entity group, a result which would not be achievable because of the update limit. This illustrates that it is important to choose an appropriate entity group design that considers performance requirements. This is one of the challenges of finding the optimal balance between eventual consistency and strong consistency.
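To make the entity-group boundary concrete, here is a rough sketch using the com.google.cloud.datastore Java client (the kinds and key names are made up, and the builder method names are from memory, so double-check them against the client version you use):

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.PathElement;

public class EntityGroupSketch {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // The entity group is defined by the ancestor path in the key,
        // not by the kind ("table"). Both keys below share the ancestor
        // Parent:root-1, so they are in the same entity group and the
        // ~1 write/second limit applies to that group as a whole.
        Key userKey = datastore.newKeyFactory()
                .addAncestors(PathElement.of("Parent", "root-1"))
                .setKind("User")
                .newKey("user-42");

        Key carKey = datastore.newKeyFactory()
                .addAncestors(PathElement.of("Parent", "root-1"))
                .setKind("Car")
                .newKey("car-7");

        // A User with a different parent (or no parent at all) lives in a
        // different entity group and is not affected by writes to this one.
        Key otherUserKey = datastore.newKeyFactory()
                .setKind("User")
                .newKey("user-43");

        System.out.println(userKey + "\n" + carKey + "\n" + otherUserKey);
    }
}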
So, using Solr 4.0:
I have a fairly straightforward setup of an entity with 1 sub-entity (a 1:N relation).
The data to import sits on a MySQL server.
The main table has about 30 million records.
The sub table has about 5 million records (most parent entities don't have the sub-entity; the rest generally have a single one).
I am running into rather horrible indexing (importing) performance: about 80 entities (docs) per second, so indexing this table will in theory take a few days.
From what Solr reports, if I tell it to index the first 1000 entities, it actually issues 1000+ queries to MySQL. I have also tried setting the batchSize property for the data source, with no luck... only -1 works (otherwise I get an out-of-memory exception).
I'm really not sure what I can do to optimize this. Is there no PROPER data importer for MySQL?
You could use CachedSqlEntityProcessor so that at least the sub-entity query is cached...
Though the cached-entity approach helped me with another issue, I have found that using nested entities is usually just not the way to go.
The logic of firing the sub-entity query for each "root" entity is just never going to perform well.
I've rewritten my statements as a SQL JOIN that fetches both root and sub-entities as a single row, mapped the columns to fields accordingly, and performance improved significantly.
How can I implement several threads with multiple connections (or a shared connection), so that a single large table's data can be downloaded quickly?
In my application, I am downloading a table with 12 lacs (1 lac = 100,000) records, which takes at least 4 hours to download at normal connection speed, and even longer over a slow connection.
So I need to implement several threads in Java to download a single table's data using multiple connection objects (or a shared one), but I have no idea how to do this.
How do I position a record pointer in several threads, and then how do I combine all the threads' records into a single large file?
First of all, it is not advisable to fetch and download such a huge amount of data onto the client. If you need the data for display purposes, then you don't need more records than fit on your screen; you can paginate the data and fetch one page at a time. If you are fetching it and processing it in memory, then you will surely run out of memory on your client.
If you need to do this anyway, despite that suggestion, then you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (1 to many pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. It is, again, not advisable to have 100 threads with 100 open connections to the DB; this is just an example. Limit the number of threads to some optimal value, and also limit the number of records each thread pulls. You can limit the number of records pulled from the DB on the basis of rownum.
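A rough sketch of that idea (the table, columns, chunk size, and connection details are placeholders; it assumes a numeric, mostly contiguous ID you can range over, which plays the same role as limiting by rownum):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTableDownload {

    private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder
    private static final int THREADS = 8;           // keep this modest; each thread holds a connection
    private static final long CHUNK = 100_000L;     // rows per task

    public static void main(String[] args) throws Exception {
        long maxId = 1_200_000L; // e.g. SELECT MAX(id) FROM big_table, run once up front
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        for (long start = 0; start <= maxId; start += CHUNK) {
            final long from = start, to = start + CHUNK;
            pool.submit(() -> dumpRange(from, to));
        }
        pool.shutdown();
    }

    // Each task opens its own connection and writes its own part file;
    // the part files can be concatenated afterwards into one large file.
    private static void dumpRange(long from, long to) {
        String sql = "SELECT id, payload FROM big_table WHERE id >= ? AND id < ?";
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(sql);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("part-" + from + ".csv"))) {
            ps.setLong(1, from);
            ps.setLong(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + "," + rs.getString("payload"));
                    out.newLine();
                }
            }
        } catch (SQLException | IOException e) {
            throw new RuntimeException("Range " + from + "-" + to + " failed", e);
        }
    }
}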
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory", then you could try processing one row (or a batch of rows) at a time, then the next batch, and so on, thus avoiding loading the entire table into memory (but this is still a single thread, so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB: setting the fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries, so the sort order can't change on you halfway through (for instance, if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same ResultSet/query, you won't get accidental duplicates or anything like that). Here's a tutorial doing it this way.
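A sketch of that single-reader/multiple-workers setup (table name, queue size, and connection details are placeholders; the poison-pill object is just one way to signal the end of the data):

import java.sql.*;
import java.util.concurrent.*;

public class SingleReaderManyWorkers {

    private static final Object POISON_PILL = new Object(); // signals "no more rows"

    public static void main(String[] args) throws Exception {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(10_000);
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker threads: take rows off the queue and process them.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Object item = queue.take();
                        if (item == POISON_PILL) {
                            queue.put(POISON_PILL); // let the other workers see it too
                            return;
                        }
                        process((String) item);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single reader thread: one query, one ResultSet, no re-ordering surprises.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "password");
             Statement st = con.createStatement()) {
            st.setFetchSize(1_000); // fetch-size hint; how it is honored is driver-specific
            try (ResultSet rs = st.executeQuery("SELECT payload FROM table_name")) {
                while (rs.next()) {
                    queue.put(rs.getString("payload"));
                }
            }
        }
        queue.put(POISON_PILL);
        pool.shutdown();
    }

    private static void process(String row) {
        // placeholder for the real per-row work
        System.out.println(row.length());
    }
}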
Second way: pagination. Basically every thread somehow knows which chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something limit 10 offset XXX". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads, incremented by the calling thread].
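A rough sketch of that second way (here each worker claims the next offset itself via a shared counter; the table, page size, and connection details are placeholders):

import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class OffsetPaginationReaders {

    private static final int PAGE_SIZE = 10;                          // "limit 10" from the example
    private static final AtomicLong NEXT_OFFSET = new AtomicLong(0);  // the shared XXX

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(OffsetPaginationReaders::readPages);
        }
        pool.shutdown();
    }

    private static void readPages() {
        String sql = "SELECT id, payload FROM table_name ORDER BY id LIMIT ? OFFSET ?";
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            while (true) {
                long offset = NEXT_OFFSET.getAndAdd(PAGE_SIZE); // claim the next chunk
                ps.setInt(1, PAGE_SIZE);
                ps.setLong(2, offset);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // process the row
                    }
                }
                if (rows < PAGE_SIZE) {
                    return; // past the end of the table
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}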
The problem is the "order by something": it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of the table. If it's indexed, this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by chunking by date instead, for example select * from table_name where date < XXX and date > YYY (with no limit clause, though a thread could still use limit clauses to work through a particular date range, updating as it goes, or sorting and chunking since it's a smaller range, which is less painful).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10, followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not, then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful: maybe synchronize your process around the "select and update" queries, or lock the table and/or rows appropriately, to avoid possible collisions (Postgres, for instance, requires a special SERIALIZABLE option).
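A rough JDBC sketch of that claim-then-read pattern, using MySQL-style UPDATE ... LIMIT (table and column names are placeholders; as noted above, other databases need different syntax or explicit locking to make the claim safe):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class ReserveRowsBatcher {

    // Claims up to batchSize unclaimed rows for this worker, then reads them back.
    // Runs inside one transaction so the claim and the read are consistent.
    static List<String> claimBatch(Connection con, int batchSize) throws SQLException {
        String workerKey = UUID.randomUUID().toString();
        con.setAutoCommit(false);
        try {
            try (PreparedStatement claim = con.prepareStatement(
                    "UPDATE table_name SET lock_column = ? WHERE lock_column IS NULL LIMIT ?")) {
                claim.setString(1, workerKey);
                claim.setInt(2, batchSize);
                claim.executeUpdate();
            }
            List<String> rows = new ArrayList<>();
            try (PreparedStatement read = con.prepareStatement(
                    "SELECT payload FROM table_name WHERE lock_column = ?")) {
                read.setString(1, workerKey);
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        rows.add(rs.getString("payload"));
                    }
                }
            }
            con.commit();
            return rows;
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}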
Fourth way (related to the third), mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous IDs and use it to reference the rows in the first. Or, if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch IDs to rows, like update table_name set batch_number = mod(rownum, 20000); then each row has a batch number assigned to it, and threads can be assigned batches (or assigned "every 9th batch" or whatnot). Or, similarly, update table_name set row_counter_column = rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the table, you can peel off a fraction of it: if you want 100 batches, the second batch is "where x >= 0.01 and x < 0.02", and so on. (Idea inspired by how Wikipedia is able to get a "random" page: it assigns each row a random float at insert time.)
The thing you really want to avoid is some kind of change in sort order halfway through. For instance, if you don't specify a sort order and just query like select * from table_name limit 10 offset XXX from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns rows halfway through [for instance, if new data is added], meaning you may skip rows or whatnot.
The question "Using Hibernate's ScrollableResults to slowly read 90 million records" also has some related ideas (especially for Hibernate users).
Another option: if you know some column (like "id") is mostly contiguous, you can just iterate through it "by chunks" (get the max, then iterate numerically over the chunks), or do the same with some other column that is "chunkable", as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also further processing that data in multiple threads. Such approaches do not always require all the data to be accumulated in memory; it can be processed in groups and/or sliding windows, and you only need to either accumulate a result or pass the data further on (to other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw text, this could be a random-sized cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied to your query to allow paging. This could be something like:
Driver Program: Split my data into 4 parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into (pseudo) SQL like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
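A small sketch of that driver/worker split in Java (the query, partition count, and connection details are placeholders loosely following the pseudo-SQL above):

import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedDriver {

    private static final int PARTITIONS = 4;

    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(PARTITIONS);
        // Driver: split the work into 4 parts and start 4 workers.
        for (int partition = 1; partition <= PARTITIONS; partition++) {
            final int part = partition;
            workers.submit(() -> readPartition(part));
        }
        workers.shutdown();
    }

    // Worker: "give me part N of the data" via the extra where condition.
    private static void readPartition(int partition) {
        String sql = "SELECT id, payload FROM source_table WHERE trunc(created_date) = trunc(SYSDATE) - ?";
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, partition);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process or forward the row; accumulating everything in memory is not required
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException("Partition " + partition + " failed", e);
        }
    }
}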
In the end it is all pretty conventional, nothing super advanced.