How to remove duplicate rows in a Cursor (Android SDK)? - java

Currently, I have made three queries (resulting in three cursors) and then merged the cursors using the MergeCursor class. However, this has produced duplicates in the merged cursor, and I can't seem to find a way to remove them. What would be the ideal way to fix this problem?

A Cursor is an object tied to the ResultSet, not the data therein.
If the three result sets can contain the same rows, their primary keys will need to be fetched in order to de-duplicate them - the Cursor implementation does not provide this function. There are several options, two named here:
As alluded to in an earlier comment - do this server-side and have the merged result returned. For example, send the base query from the client and have the server run the three queries and merge the results - databases excel at set operations, and there is almost never a performance gain in doing this programmatically.
Launch one task that in turn runs the three queries and does the work to fetch the rows, returning just the distinct set of keys (a rough sketch of this follows below).
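If you do the de-duplication on the client, one approach is to walk the merged cursor once and copy only unseen keys into a MatrixCursor. This is only a minimal sketch: the projection (_id, name) and the helper name are assumptions for illustration, not part of the original code.

import android.database.Cursor;
import android.database.MatrixCursor;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: collapses duplicate rows in an already-merged cursor
// by tracking primary keys. Column names are assumed for illustration.
static Cursor deduplicate(Cursor merged) {
    String[] projection = {"_id", "name"};   // assumed projection
    MatrixCursor result = new MatrixCursor(projection);
    Set<Long> seenIds = new HashSet<>();
    while (merged.moveToNext()) {
        long id = merged.getLong(merged.getColumnIndexOrThrow("_id"));
        if (seenIds.add(id)) {               // add() returns false for keys already seen
            String name = merged.getString(merged.getColumnIndexOrThrow("name"));
            result.addRow(new Object[] { id, name });
        }
    }
    return result;
}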

Related

Split database into smaller ones. Too much data in single commit

I need some advice :)
I have a database with almost 70 tables, many of which hold over a dozen million large records. I want to split it into a few smaller databases: one for each big client's data and one main database for the rest of the clients' data (while also moving some of the data into a NoSQL database). Because of the many complicated relations between tables, before copying the data I was disabling the triggers that check the correctness of the foreign keys, and then, just before the commit, enabling them again.
This was all working with a small amount of data, but now, when I try to copy one of the big clients' data, I run into Java heap size / GC out-of-memory problems.
I could increase the heap size, but that's not the point here.
I'm selecting data by a specific id from every table that has any relation to the client's data and copying it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I tried selecting the data in portions (e.g. chunks of 5000 rows instead of all 50 000 in one go), but it fails at exactly the same point.
And here I am asking for advice on how to manage this problem. I think it is all because I am trying to copy all the data in one big fat commit. The reason is that I have to disable the triggers while copying, but I also must re-enable them before I can commit my changes.
When I try to copy one of the big clients' data, I run into Java heap size / GC out-of-memory problems.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, adding a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver does not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
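A minimal sketch of what that looks like, assuming the PostgreSQL JDBC driver; the table and column names are placeholders, and note that cursor mode also requires autocommit to be off:

import java.sql.*;

// Sketch: stream rows from the source database with a small fetch size.
// Method and identifier names are assumptions for illustration.
static void copyClientData(String url, String user, String password, long clientId) throws SQLException {
    try (Connection conn = DriverManager.getConnection(url, user, password)) {
        conn.setAutoCommit(false);                   // cursor-based fetching needs autocommit off
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM client_data WHERE client_id = ?")) {   // placeholder query
            ps.setLong(1, clientId);
            ps.setFetchSize(1000);                   // fetch 1000 rows per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // insert the row into the target database here, then let it go;
                    // do not accumulate the rows in a collection
                }
            }
        }
        conn.commit();
    }
}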

JDBC Pagination: vendor specific sql versus result set fetchSize

There are a lot of different tutorials across the internet about pagination with JDBC and iterating over huge result sets.
So, basically there are a number of approaches I've found so far:
Vendor-specific SQL
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jooq
Sorry for mixing all of these up, but I need someone to clear this up for me.
I have a simple task where a service consumer asks for results with a pageNumber and pageSize. It looks like I have two options:
Use vendor-specific SQL
Hold the connection/statement/result set in memory and rely on the JDBC fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation, it holds the result set; then all you do is call request(long n) and another n rows are processed. Of course, everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls and have to clean that ResultSet up if the client forgets to exhaust or close it. (Note: the ResultSet here is the Java ResultSet class, not the actual data.)
So, what is the recommended way of doing pagination? Is vendor-specific SQL considered slow compared to holding the connection?
I am using Oracle. ScrollableResultSet is not recommended for huge result sets, as it caches the whole result set on the client side. proof
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor-specific SQL to fetch a "page" and I would do it just like that.
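For example, on Oracle 12c and later a page could be requested with the OFFSET/FETCH syntax (older versions need a ROWNUM subquery instead). A rough sketch, with the query and identifiers as placeholders and an existing connection assumed:

// Hedged sketch of vendor-specific pagination; table and column names are illustrative.
String sql = "SELECT id, name FROM items ORDER BY id "
           + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setInt(1, pageNumber * pageSize);   // rows to skip
    ps.setInt(2, pageSize);                // rows to return
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the row into your DTO
        }
    }
}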
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed/ every row is seen exactly once/ the results across pages are consistent? Multiple fetches from a single cursor guarantees that every row in the result is seen exactly once. Separate paginated queries might not-- new data could have been added or removed between queries being executed, your sort might not be fully deterministic, etc.
ScrollableResultSet caches results on the client side - this requires memory. But PostgreSQL, for example, does this by default and nobody complains; some databases simply use the client's memory to hold the whole result set. In most cases the database has to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution - using rownum - as implemented by Hibernate does not guarantee correct (consistent) results if data is modified between executions and the default isolation level is used.
It really depends on the use case. Changing Oracle's init parameters for the maximum number of connections and open cursors requires a database restart.
So ScrollableResultSet and cursors can only be used when you can predict the number of (concurrent) users.

Google App-Engine Datastore is extremely slow

I need help in understanding why the below code is taking 3 to 4 seconds.
UPDATE: Use case for my application is to get the activity feed of a person since last login. This feed could contain updates from friends or some new items outside of his network that he may find interesting. The Activity table stores all such activities and when a user logs in, I run a query on the GAE-DataStore to return above activities. My application supports infinite scrolling too, hence I need the cursor feature of GAE. At a given time, I get around 32 items but the activities table could have millions of rows (as it contains data from all the users).
Currently the Activity table is small and contains only 25 records, and the Java code below reads only 3 records from it.
Each record in the Activity table has 4 UUID fields.
I cannot imagine how the query would behave if the table contained millions of rows and result contained 100s of rows.
Is there something wrong with the code I have below?
(I am using Objectify and app-engine cursors)
Filter filter = new FilterPredicate("creatorID", FilterOperator.EQUAL, userId);
Query<Activity> query = ofy().load().type(Activity.class).filter(filter);
query = query.startAt(Cursor.fromWebSafeString(previousCursorString));
QueryResultIterator<Activity> itr = query.iterator();
while (itr.hasNext()) {
    Activity a = itr.next();
    System.out.println(a);
}
I have gone through Google App Engine Application Extremely slow and verified that response time improves if I keep on refreshing my page (which calls the above code). However, the improvement is only ~30%
Compare this with any other database and the response time for such tiny data is in milliseconds, not even 100s of milliseconds.
Am I wrong in expecting a regular database kind of performance from the GAE DataStore?
I do not want to turn on memcache just yet as I want to improve this layer without caching first.
Not exactly sure what your query is supposed to do, but it doesn't look like it requires a cursor query. In my humble opinion the only valid use case for cursor queries is a paginated query for data with a limited count of result rows. Since your query does not have a limit, I don't see why you would want to use a cursor at all.
If you need millions of results, you're probably doing ad-hoc analysis of the data (no human could ever interpret millions of raw data rows) and you might be better off using BigQuery instead of the App Engine datastore. I'm just guessing here, but for normal front-end apps you rarely need millions of rows in a result, only a few (maybe hundreds at times), which you filter from the total available rows.
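If the feed is paged at 32 items per scroll (as in the question), pairing the cursor with an explicit limit might look roughly like this. This is a sketch assuming the Objectify v4+ API used in the question; the null check and page size handling are added assumptions:

// Sketch: load one page of the feed per request and hand the next cursor back to the client.
Query<Activity> q = ofy().load().type(Activity.class)
        .filter("creatorID", userId)
        .limit(32);
if (previousCursorString != null) {
    q = q.startAt(Cursor.fromWebSafeString(previousCursorString));
}
QueryResultIterator<Activity> itr = q.iterator();
while (itr.hasNext()) {
    Activity a = itr.next();
    // render the feed item
}
String nextCursor = itr.getCursor().toWebSafeString();  // return to the client for the next page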
Another thing:
Are you sure that it is the query itself that takes so long? It might just as well be the wrapper around the query. Since you are using cursors, you have to re-issue the query until there are no more results, and handling this could be costly.
Lastly:
Are you testing on App Engine itself or on the local development server? The devserver obviously cannot simulate a cloud and thus could be slower (or faster) than the real thing at times. The devserver also does not know about instance warm-up times, which matter when your query spawns new instances.
Speaking of cloud: the thing about cloud databases is not that they have the best performance for very little data, but that they scale and perform consistently with a couple of hundred rows and a couple of billion rows.
Edit:
After performing a retrieval operation, the application can obtain a cursor, which is an opaque base64-encoded string marking the index position of the last result retrieved.
[...]
The cursor's position is defined as the location in the result list after the last result returned. A cursor is not a relative position in the list (it's not an offset); it's a marker to which the Datastore can jump when starting an index scan for results. If the results for a query change between uses of a cursor, the query notices only changes that occur in results after the cursor. If a new result appears before the cursor's position for the query, it will not be returned when the results after the cursor are fetched.
(Datastore Queries)
These two statements make me believe that query performance should be consistent with or without cursor queries.
Here are some more things you might want to check:
How do you register your entity classes with objectify?
What does your actual test code look like? I'd like to see how and where you measure.
Can you share a comparison between cursor query and query without cursors?
Improvement across multiple requests could be the result of Objectify's integrated caching. You might want to disable caching for datastore performance tests.
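If I read the Objectify API correctly, the global (memcache) cache can be bypassed per request with cache(false); treat this as an assumption to verify against the Objectify version in use:

// Hedged sketch: bypass Objectify's global cache for a raw datastore timing test.
// Whether cache(false) is available depends on the Objectify version.
List<Activity> result = ofy().cache(false)
        .load().type(Activity.class)
        .filter("creatorID", userId)
        .list();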

Multiple Prepared Statements or a Batch

My question is very simple and in the title. Google and Stack Overflow are giving me nothing, so I figured it was time to ask a question.
I am currently in the process of writing an SQL query for when users register on my site. I have ALWAYS used only prepared statements, because the extra coding of callable statements and the performance hit of regular statements are both turn-offs. However, this query has me thinking about possible alternatives to my previous one-size-fits-all (prepared statements) ways.
This query makes a total of 4 round trips to the database. The steps are:
Insert a user into the database, get back the generated key (their user id) within a result set.
Take the user id and insert a row into the album table. Get back a generated key (album id)
Take the album id and insert a row into the images table. Get back a generated key (image id)
Take the image id and update the user table's current default column with the image id
Aside: for anyone interested in the way I am getting the keys back after my inserts, it is with Statement.RETURN_GENERATED_KEYS, and you can read a great article about it here - IBM Article
So anyway, I'd like to know whether the use of 4 round-trip (but cacheable) prepared statements is okay, or whether I should go with batched (but not cacheable) statements.
JDBC batch statements let you reduce the number of round trips under the condition that there is no data dependency among the rows you are inserting or updating. Your scenario fails this condition, because the changes depend on each other's data: each of statements 2 through 4 must pick up an ID generated by the statement before it.
On the other hand, four round trips are definitely suboptimal. That is why scenarios like yours call for stored procedures: you can put all this logic into a create_user_proc and return the user ID back to the caller. All insertions from 1 to 4 would happen inside your SQL code, letting you manage the ID dependencies in SQL. You would be able to call this stored procedure in a single round trip, which is definitely faster, especially if you process multiple user registrations per minute.
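Calling such a procedure from JDBC might look roughly like this; the name create_user_proc comes from the answer above, while the parameter list and OUT parameters are assumptions for illustration, and an existing connection is assumed:

// Hedged sketch: one round trip through a stored procedure instead of four statements.
// The parameter list is hypothetical; adapt it to the actual procedure definition.
String call = "{CALL create_user_proc(?, ?, ?, ?)}";
try (CallableStatement cs = conn.prepareCall(call)) {
    cs.setString(1, username);                   // IN: registration data
    cs.registerOutParameter(2, Types.BIGINT);    // OUT: generated user id
    cs.registerOutParameter(3, Types.BIGINT);    // OUT: generated album id
    cs.registerOutParameter(4, Types.BIGINT);    // OUT: generated image id
    cs.execute();
    long userId = cs.getLong(2);
}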
I would advise writing one stored procedure that does all four operations, passing all the required params from the application (to the stored procedure) at once; inside the stored procedure you can retrieve the generated keys as you go.
To increase performance and reduce database round trips, I agree with dasblinkenlight and ajduke - stored procedures will achieve this.
But is this really a performance bottleneck in your application?
How often do users register on your site?
Compare this to how often information is read from these tables (once per page access?)
If the information in these tables is read thousands of times more often than it is written via new registrations, then it might not be worth going for the stored procedure approach.
Why you might not want to use stored procedures and stick to prepared statements:
not as portable as using prepared statements (a different syntax/language for each database, some simpler databases don't even support them)
will not work with ORM solutions such as JPA* - you mentioned using PreparedStatements directly, so this probably does not apply to you, at least not now, but it might limit you later on if you want to use an ORM in the future
*JPA 2.1 might actually support stored procedures, but as of writing it has not yet been released.

Does using Limit in query using JDBC, have any effect in performance?

If we use the LIMIT clause in a query that also has an ORDER BY clause and execute the query via JDBC, will there be any effect on performance? (using a MySQL database)
Example:
SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1
I read in one of the threads on this forum that, by default, a set number of rows is fetched at a time. How can I find the default fetch size?
I want only one record. Originally, I was using the following:
SQL Query:
SELECT modelName FROM Cars ORDER BY manuDate DESC
In the Java code, I was extracting it as follows:
if (resultSet.next()) {
    // do something here.
}
Definitely, the LIMIT 1 will have a positive effect on performance. Instead of the entire data set of matches (well, depending on the default fetch size) being returned from the DB server to the Java code, only one row will be returned. This saves a lot of network bandwidth and Java memory usage.
Always delegate constraints like LIMIT, ORDER BY, WHERE, etc. to the SQL language as much as possible instead of doing it on the Java side. The DB will do it much better than your Java code ever can (if the table is properly indexed, of course). You should try to write the SQL query so that it returns exactly the information you need.
The only disadvantage of writing DB-specific SQL queries is that SQL is not entirely portable among different DB servers, which would require you to change the queries every time you switch DB servers. But in the real world it is very rare to switch to a completely different DB make anyway, and externalizing the SQL strings to XML or properties files helps a lot regardless.
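A minimal sketch of pushing the limit into the query rather than filtering in Java; the connection handling is omitted and the identifiers follow the question:

// Sketch: let MySQL apply ORDER BY and LIMIT, so at most one row crosses the wire.
String sql = "SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1";
try (PreparedStatement ps = conn.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {
    if (rs.next()) {
        String newestModel = rs.getString("modelName");
        // do something with the single row
    }
}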
There are two ways the LIMIT could speed things up:
by producing less data, which means less data gets sent over the wire and processed by the JDBC client
by potentially having MySQL itself look at fewer rows
The second one of those depends on how MySQL can produce the ordering. If you don't have an index on manuDate, MySQL will have to fetch all the rows from Cars, then order them, then give you the first one. But if there's an index on manuDate, MySQL can just look at the first entry in that index, fetch the appropriate row, and that's it. (If the index also contains modelName, MySQL doesn't even need to fetch the row after it looks at the index -- it's a covering index.)
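If such a covering index does not already exist, it could be created along these lines; this is only a sketch and the index name is an assumption:

// Hedged sketch: a composite index covering both the ORDER BY column and the selected column,
// so MySQL can answer the query from the index alone.
stmt.executeUpdate("CREATE INDEX idx_cars_manudate_model ON Cars (manuDate, modelName)");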
With all that said, watch out! If manuDate isn't unique, the ordering is only partially deterministic (the order for all rows with the same manuDate is undefined), and your LIMIT 1 therefore doesn't have a single correct answer. For instance, if you switch storage engines, you might start getting different results.
