In a Scala program I use JDBC to get data from a simple table with 20 rows in an SQL database (Hive).
The table contains movie titles rated by users, with rows in the following format:
user_id, movie_title, rating, date.
I start a first JDBC cursor enumerating users. Next, with JDBC cursor 2, for every user I find the movie titles he rated. Next, with JDBC cursor 3, for every title the current user rated, I find other users who also rated that title. As a result I get groups of users, where every user in a group rated at least one title in common with the user who started that group. I need to get all such groups existing in the dataset.
So to group users by movie I do 3 nested select requests, in pseudo-code:
1) select distinct user_id
2) for each user_id:
select distinct movie_title //select all movies that user saw
3) for each movie_title:
select distinct user_id //select all users who saw this movie
On a local table with 20 rows these nested queries take 26 minutes! The program returns the first user_id only after a minute!
Given that the real app will have to deal with 10^6 users, is there any way to optimize the 3 nested selects in this case?
Without seeing the exact code it is difficult to assess why it is taking so long. Given you've got only 20 rows, there must be something fundamentally wrong there.
However, as general advice, I'd suggest looking back at the solution and considering whether it can't be run as a single SQL query (instead of running hundreds of queries), which will allow you to benefit from features like indexes and save you a huge amount of network traffic.
Assuming you have the following table Movies(user_id: NUMERIC, movie_title: VARCHAR(50), rating: NUMERIC, date: DATE), try running something along these lines (I haven't tested it, so it might need a bit of tweaking):
SELECT DISTINCT m1.user_id, m2.user_id
FROM Movies m1, Movies m2
WHERE m1.user_id != m2.user_id
AND m1.movie_title = m2.movie_title
Once you've got the results, you can group them in your Java/Scala code by the first user_id and load them into a Multimap-like data structure.
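If it helps, here is a rough, untested Java sketch of that post-processing step over plain JDBC (the connection URL is just a placeholder for your Hive setup):

import java.sql.*;
import java.util.*;

public class MovieGroups {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection URL -- adjust for your Hive/JDBC environment.
        String url = "jdbc:hive2://localhost:10000/default";
        String sql = "SELECT DISTINCT m1.user_id, m2.user_id " +
                     "FROM Movies m1, Movies m2 " +
                     "WHERE m1.user_id != m2.user_id " +
                     "AND m1.movie_title = m2.movie_title";

        // Multimap-like structure: each user mapped to every user who shares at least one title.
        Map<Long, Set<Long>> groups = new HashMap<>();

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                long userId = rs.getLong(1);       // m1.user_id
                long otherUserId = rs.getLong(2);  // m2.user_id
                groups.computeIfAbsent(userId, k -> new HashSet<>()).add(otherUserId);
            }
        }

        groups.forEach((user, others) -> System.out.println(user + " -> " + others));
    }
}

This replaces the three nested cursors with a single round trip; the grouping then happens in memory.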
So I'm a bit lost and don't really know how to handle this one...
Consider that I have 2 DB tables in Talend. First,
a table invoices_only which has as fields the invoiceNummer and the authors, like this.
Then, a table invoices_table with the fields (invoiceNummer, article, quantity and price), and for one invoice I can have many articles, for example.
Through a tMap I want to obtain a table invoice_table_result with new columns: one for the article position and another for the total price. For the position I know that I can use something like the Numeric.sequence("s1",1,1) function, but I don't know how to restart my counter when a new invoice number is found; the total price is of course just a basic multiplication.
So my result should be something like this.
Here is a draft of my Talend job; I'm doing a lookup on the invoiceNummer between the tables invoice_only and invoices.
Any advice? Thanks.
A trick I use is to do the sequence like this:
Numeric.sequence("s" + row.InvoiceNummer, 1, 1)
This way, the sequence gets incremented while you're still on the same InvoiceNummer, and a new one is started whenever a new InvoiceNummer is found.
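Conceptually, the trick works like one counter per invoice number, each starting at 1. A minimal plain-Java sketch of that idea (just an illustration of the behaviour, not Talend's actual implementation):

import java.util.HashMap;
import java.util.Map;

public class KeyedSequence {
    private final Map<String, Integer> counters = new HashMap<>();

    // Returns 1 the first time a key is seen, then 2, 3, ... for that same key.
    public int next(String key) {
        return counters.merge(key, 1, Integer::sum);
    }

    public static void main(String[] args) {
        KeyedSequence seq = new KeyedSequence();
        System.out.println(seq.next("FA2000"));  // 1
        System.out.println(seq.next("FA2000"));  // 2
        System.out.println(seq.next("FA2001"));  // 1 -- a new invoice number restarts at 1
    }
}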
There are two ways to achieve it:
tJavaFlex
SQL
tJavaFlex
You can compare the current row with the previous row and reset the sequence value using the function below:
// previousInvoiceNummer is a variable you maintain yourself in the tJavaFlex code
if (!row.InvoiceNummer.equals(previousInvoiceNummer)) {
    Numeric.resetSequence(seqName, startValue);
}
SQL
Once the data is loaded into the tables, create a post-job and use an update query to update the records. You have to select the records and rank the values (ideally ordering by whatever column defines the article order within an invoice), then perform the update on top of that select.
select invoicenumber, row_number() over (partition by invoicenumber order by invoicenumber) as position from table_name where -- conditions, if any
Update statements vary with respect to the database; please mention which database you are using so that I can provide the update query.
I would recommend achieving this through SQL.
I have this scenario: a user wants to see tons of information about himself. For example: age, name, status, income, job, hobby, children's names, wife's name, chief's name, grandfather's/grandmother's names. About 50 variables, and he can choose any of those variables to be shown.
So I have this *Impl.java class that takes 50 params. Of the 50 params, let's say 25 will be null and the others will be shown, and it will return the selected information.
How can I create an SQL query to get only the columns selected via the params? Should I create a procedure and then run the select? Or is what I'm trying to achieve a bad idea?
I'm using Web Services and Spring JDBC. If more information is required, I'll edit.
Building a SELECT statement to return arbitrarily selected columns can be tricky (dynamic SQL) at best and dangerous (SQL Injection) at worst. If there are only 50 columns and the query used to pull them is relatively trivial*, I'd say write the query to pull all possible values for one user and then have the application sift and sort through the data they actually want to see.
*It really does seem like the query should be trivial. At a super-high average of 25 bytes per column that'd be 1250 bytes, aka nothing in 21st century terms, and at maybe one row per table joined via primary key it should still be sub-100th-second work.
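For what it's worth, here is a rough sketch of that approach with Spring's JdbcTemplate (the user_info table and the column names it implies are made-up placeholders for your actual schema):

import org.springframework.jdbc.core.JdbcTemplate;

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class UserInfoDao {
    private final JdbcTemplate jdbcTemplate;

    public UserInfoDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Pull every column for the user once, then keep only the fields the caller asked for.
    public Map<String, Object> findSelectedInfo(long userId, Set<String> requestedFields) {
        Map<String, Object> allColumns =
            jdbcTemplate.queryForMap("SELECT * FROM user_info WHERE user_id = ?", userId);

        Map<String, Object> selected = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : allColumns.entrySet()) {
            if (requestedFields.contains(entry.getKey())) {
                selected.put(entry.getKey(), entry.getValue());
            }
        }
        return selected;
    }
}

The SQL stays static (no dynamic column lists, no injection risk), and the filtering down to the 25 or so wanted fields happens in the application.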
I am working on a MySQL database with 3 tables - workout_data, excercises and sets. I'm facing issues generating reports based on these three tables.
To add more information: a number of sets make up an exercise, and a number of exercises make up a workout.
I currently have the metrics for which a report is to be generated from the data in these tables. I have to generate reports for the past 42 days, including this week. The queries that join these tables run for a long time before I get the report.
For example, the sets table has more than 1 million records just for the past 42 days. The id in this table corresponds to the excercise_id in the excercises table. The id of the excercises table corresponds to the workout_id in the workout_data table.
I'm running this query and it takes more than 10 minutes to get the data. I have to prepare a report and show it to the user in the browser. But due to this long-running query the webpage times out and the user is not able to see the report.
Any advice on how to achieve this?
SELECT REPORTSETS.USER_ID,REPORTSETS.WORKOUT_LOG_ID,
REPORTSETS.SET_DATE,REPORTSETS.EXCERCISE_ID,REPORTSETS.SET_NUMBER
FROM EXCERCISES
INNER JOIN REPORTSETS ON EXCERCISES.ID=REPORTSETS.EXCERCISE_ID
where user_id=(select id from users where email='testuser1#gmail.com')
and substr(set_date,1,10)='2013-10-29'
GROUP BY REPORTSETS.USER_ID,REPORTSETS.WORKOUT_LOG_ID,
REPORTSETS.SET_DATE,REPORTSETS.EXCERCISE_ID,REPORTSETS.SET_NUMBER
Two things:
First, you have the following WHERE clause item to pull out a single day's data.
AND substr(set_date,1,10)='2013-10-29'
This defeats the use of an index on the date column. If your set_date column has a DATETIME datatype, what you want is
AND set_date >= '2013-10-29'
AND set_date < '2013-10-29' + INTERVAL 1 DAY
This will allow the use of a range scan on an index on set_date. It looks to me like you might want a compound index on (user_id, set_date). But you should muck around with EXPLAIN to figure out whether that's right.
Second, you're misusing GROUP BY. That clause is pointless unless you have some kind of summary function like SUM() or GROUP_CONCAT() in your query. Do you want ORDER BY?
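If you run this from JDBC, here is a rough sketch of binding the day's boundaries as parameters (it assumes set_date really is a DATE/DATETIME column as discussed above, takes the user id directly instead of the email subquery, and drops the join to EXCERCISES since no columns from it are selected):

import java.sql.*;
import java.time.LocalDate;

public class DailyReport {
    // Bind the day's boundaries so MySQL can range-scan an index such as (user_id, set_date).
    public static void printSets(Connection conn, long userId, LocalDate day) throws SQLException {
        String sql = "SELECT user_id, workout_log_id, set_date, excercise_id, set_number " +
                     "FROM reportsets " +
                     "WHERE user_id = ? AND set_date >= ? AND set_date < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            ps.setDate(2, Date.valueOf(day));               // e.g. 2013-10-29 00:00:00
            ps.setDate(3, Date.valueOf(day.plusDays(1)));   // exclusive upper bound
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("workout_log_id") + " " + rs.getInt("set_number"));
                }
            }
        }
    }
}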
Comments on your SQL that you might want to look into:
1) Do you have an index on USER_ID and SET_DATE?
2) Your datatype for SET_DATE looks wrong, is it a varchar? Storing it as a date will mean that the db can optimise your search much more efficiently. At the moment the substring method will be called countless times per query as it has to be run for every row returned by the first part of your where clause.
3) Is the group by really required? Unless I'm missing something the 'group by' part of the statement brings nothing to the table ;)
It should make a significant difference if you could store the date either as a date, or in the format you need to make the comparison. Performing a substr() call on every date must be time consuming.
Surely the suggestions about tuning the query will help improve its speed. But I think the main point here is what can be done with more than a million records before the session times out. What if you have 2 or 3 million records, will some performance tuning solve the problem? I don't think so. So:
1) If you want to display the data in a browser, use pagination and query (for example) only the first 100 records (see the sketch after this list).
2) If you want to generate a report (like a PDF), then use an asynchronous method (e.g. JMS).
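A rough JDBC sketch of the pagination idea from point 1 (column names follow the query in the question; the page size of 100 is just an example):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class PagedReport {
    // Fetch one page of report rows; a small page size keeps the response fast enough for a browser.
    public static List<String> fetchPage(Connection conn, long userId, int page, int pageSize)
            throws SQLException {
        String sql = "SELECT workout_log_id, set_date, excercise_id, set_number " +
                     "FROM reportsets WHERE user_id = ? " +
                     "ORDER BY set_date LIMIT ? OFFSET ?";
        List<String> rows = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            ps.setInt(2, pageSize);           // e.g. 100
            ps.setInt(3, page * pageSize);    // page is zero-based here
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString("set_date") + " " + rs.getString("set_number"));
                }
            }
        }
        return rows;
    }
}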
I am new to databases, and before I start learning MySQL and using a driver in Java to connect with my database on the server side, I wanted to get the design of my database down first. I have two columns in the database, CRN NUMBER and DEVICE_TOKEN. The CRN number will be a string of five digits, and the DEVICE_TOKEN will be a device token string (an iOS device token for push notifications).

Let me try to describe what I am trying to do. Users will send my server data from the iOS app, mainly their device token for push notifications and a CRN (course) they want to watch. There are going to be MANY device tokens requesting to watch the same CRN number. I wanted to know the most efficient way to store these in a database.

I am going to have one thread looking through all of the rows in the DB and polling the website for each CRN. If the event I am looking for takes place, I want to notify every device token associated with that CRN. Initially, I wanted to have one column be the CRN and the other column be the DEVICE_TOKENS. I have learned, though, that this is not possible, and that each field should hold only a single value. Can someone help me figure out the most efficient way to design this database?
CRN DEVICE_TOKEN
12345 "string_of_correct_size"
12345 "another_device_token"
Instead of making multiple requests to the website for the same CRN, it would be MUCH more efficient to poll the website ONCE per unique CRN per iteration, and then notify all device tokens of the change. How should I store this information? Thanks for your time.
In this type of problem, where you have a one-to-many relationship (one CRN with many DEVICE_TOKENs), you want a separate table to store the CRNs, with a unique ID assigned to each new CRN. A separate table should then be made for your DEVICE_TOKENS that relates to it and has columns for a unique ID, the CRN, and the DEVICE_TOKEN.
With this schema, you can go through the rows of the CRN table, poll against each CRN, and then just do a simple JOIN with the DEVICE_TOKEN table to find all subscribed devices if a change occurs.
The most normal way to do this would be to normalize out the Courses with a foreign key from the device tokens. E.g. Two tables:
Courses
id CRN
1 12345
InterestedDevices
id course DEVICE_TOKEN
1 1 "string_of_correct_size"
2 1 "another_device_token"
You can then find interested devices with SQL like the following:
SELECT *
FROM Courses
JOIN InterestedDevices ON Courses.id = InterestedDevices.course
WHERE Courses.CRN = ?
This way you avoid duplicating the course information over and over.
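And here is a rough JDBC sketch of running that query from Java, assuming the CRN is stored as the five-digit string described in the question:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class DeviceLookup {
    // Returns every device token subscribed to the given CRN.
    public static List<String> tokensForCrn(Connection conn, String crn) throws SQLException {
        String sql = "SELECT InterestedDevices.DEVICE_TOKEN " +
                     "FROM Courses " +
                     "JOIN InterestedDevices ON Courses.id = InterestedDevices.course " +
                     "WHERE Courses.CRN = ?";
        List<String> tokens = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, crn);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    tokens.add(rs.getString("DEVICE_TOKEN"));
                }
            }
        }
        return tokens;
    }
}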
I was asked in an interview to write an SQL query which fetches the first three records with the highest value in some column of a table. I had written a query which fetched all the records with the highest value, but didn't see how exactly I can get only the first three records of those.
Could you help me with this?
Thanks.
SELECT TOP 3 * FROM Table ORDER BY FieldName DESC
From here, but might be a little out of date:
PostgreSQL:
SELECT * FROM Table ORDER BY FieldName DESC LIMIT 3
MS SQL Server:
SELECT TOP 3 * FROM Table ORDER BY FieldName DESC
MySQL:
SELECT * FROM Table ORDER BY FieldName DESC LIMIT 3
Select Top 3....
Depending on the database engine, either
select top 3 * from table order by column desc
or
select * from table order by column desc limit 3
The syntax for TOP 3 varies widely from database to database.
Unfortunately, you need to use those constructs for the best performance.
Libraries like Hibernate help here, because they can translate a common API into the various SQL dialects.
Since you are asking about Java, it is also possible to just SELECT everything from the database (with an ORDER BY) and fetch only the first three rows, as sketched below. Depending on how the query needs to be executed, this might be good enough (especially if no sorting has to happen on the database thanks to appropriate indexes, for example when you sort by primary key fields).
But in general, you want to go with an SQL solution.
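A rough JDBC sketch of that client-side option (table and column names are placeholders):

import java.sql.*;

public class TopThree {
    // Ask the JDBC driver to cap the result at three rows; the ORDER BY still runs on the database.
    public static void printTopThree(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.setMaxRows(3);
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM SomeTable ORDER BY FieldName DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("FieldName"));
                }
            }
        }
    }
}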
In Oracle you can also use WHERE ROWNUM < 4 (wrap the ORDER BY query in a subquery first, since ROWNUM is applied before the sort)...
Also on MySQL there is a LIMIT keyword (I think).