Make statistics out of a SQL table [closed] - java

I have a table in my database where I record readings from several sensors, like this:
CREATE TABLE [test].[readings] (
    [timestamp_utc] DATETIME2(0) NOT NULL, -- 48 bits
    [sensor_id] INT NOT NULL, -- 32 bits
    [site_id] INT NOT NULL, -- 32 bits
    [reading] REAL NOT NULL, -- 64 bits
    PRIMARY KEY([timestamp_utc], [sensor_id], [site_id])
)

CREATE TABLE [test].[sensors] (
    [sensor_id] int NOT NULL,
    [measurement_type_id] int NOT NULL,
    [site_id] int NOT NULL,
    [description] varchar(255) NULL,
    PRIMARY KEY ([sensor_id], [site_id])
)
And I want to easily make statistics out of all these readings.
Some queries I would like to do:
Get me all readings for site_id = X between date_hour1 and date_hour2
Get me all readings for site_id = X and sensor_id in <list> between date_hour1 and date_hour2
Get me all readings for site_id = X and sensor measurement type = Z between date_hour1 and date_hour2
Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2
Get me all readings for site_id = X, aggregated (average) by DAY between date_hour1 and date_hour2 but in UTC+3 (this should give a different result than previous query because now the beginning and ending of days are shifted by 3h)
Get me min, max, std, mean for all readings for site_id = X between date_hour1 and date_hour2
So far I have been using Java to query the database and perform all this processing locally. But this ends up being slow, and the code is a mess to write and maintain (too many loops, generic helper functions for repeated tasks, a large/verbose code base, etc.)...
To make things worse, the readings table is huge (hence the importance of the primary key, which also serves as a performance index), and maybe I should be using a time-series database for this (are there any good ones?). I am using SQL Server.
What is the best way to do this? I feel I am reinventing the wheel, because all of this is kind of an analytics app...
I know these queries sound simple, but when you try to parameterize all of this you can end up with a monster like this:
-- Sums all device readings, returns timestamps in localtime according to utcOffset (if utcOffset = 00:00, then timestamps are in UTC)
CREATE PROCEDURE upranking.getSumOfReadingsForDevices
    @facilityId int,
    @deviceIds varchar(MAX),
    @beginTS datetime2,
    @endTS datetime2,
    @utcOffset varchar(6),
    @resolution varchar(6) -- NO, HOURS, DAYS, MONTHS, YEARS
AS BEGIN
    SET NOCOUNT ON -- http://stackoverflow.com/questions/24428928/jdbc-sql-error-statement-did-not-return-a-result-set
    DECLARE @deviceIdsList TABLE (
        id int NOT NULL
    );
    DECLARE @beginBoundary datetime2,
            @endBoundary datetime2;
    SELECT @beginBoundary = DATEADD(day, -1, @beginTS);
    SELECT @endBoundary = DATEADD(day, 1, @endTS);
    -- We shift sign from the offset because we are going to convert the zone for the entire table and not beginTS endTS themselves
    SELECT @utcOffset = CASE WHEN LEFT(@utcOffset, 1) = '+' THEN STUFF(@utcOffset, 1, 1, '-') ELSE STUFF(@utcOffset, 1, 1, '+') END
    INSERT INTO @deviceIdsList
        SELECT convert(int, value) FROM string_split(@deviceIds, ',');
    SELECT SUM(reading) as reading,
           timestamp_local
    FROM (
        SELECT reading,
               upranking.add_timeoffset_to_datetime2(timestamp_utc, @utcOffset, @resolution) as timestamp_local
        FROM upranking.readings
        WHERE device_id IN (SELECT id FROM @deviceIdsList)
          AND facility_id = @facilityId
          AND timestamp_utc BETWEEN @beginBoundary AND @endBoundary
    ) as innertbl
    WHERE timestamp_local BETWEEN @beginTS AND @endTS
    GROUP BY timestamp_local
    ORDER BY timestamp_local
END
GO
This is a procedure that receives the site id (facilityId in this case), the list of sensor ids (deviceIds in this case), the beginning and ending timestamps, their UTC offset as a string like "+xx:xx" or "-xx:xx", and finally the resolution, which basically determines how the result will be aggregated by SUM (taking the UTC offset into consideration).
And since I am using Java, at first glance I could use Hibernate or something, but I feel Hibernate wasn't made for this type of query.

Your structure looks good at first glance, but looking at your queries makes me think there are tweaks you may want to try. Performance is never an easy subject, and there is no "one size fits all" answer. Here are some considerations:
Do you want better read or write performance? If you want better read performance you need to reconsider your indexes. Sure, you have a primary key, but most of your queries don't make use of it (all three fields). Try creating an index for [sensor_id], [site_id] (see the sketch after this list).
Can you use a cache? If some searches are recurrent and your app is the single point of entry to your database, then evaluate whether your use cases would benefit from caching.
If the readings table is huge, then consider using some sort of partitioning strategy. Check out the MSSQL partitioning documentation.
If you don't need real-time data, then try some sort of search engine such as Elasticsearch.
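To make the index suggestion concrete, and to show one of the listed queries pushed into SQL Server instead of being aggregated in Java, here is a rough, untested sketch against the [test].[readings] schema from the question (conn/stmt and the bind variables are assumed to exist):
// Index from the suggestion above:
stmt.executeUpdate("CREATE NONCLUSTERED INDEX IX_readings_sensor_site"
        + " ON [test].[readings]([sensor_id], [site_id])");

// "Average by DAY for site X between two instants", computed by the database:
PreparedStatement ps = conn.prepareStatement(
        "SELECT CAST(timestamp_utc AS date) AS day_utc, AVG(reading) AS avg_reading"
        + " FROM [test].[readings]"
        + " WHERE site_id = ? AND timestamp_utc BETWEEN ? AND ?"
        + " GROUP BY CAST(timestamp_utc AS date)"
        + " ORDER BY day_utc");
ps.setInt(1, siteId);
ps.setObject(2, beginTs);  // java.time.LocalDateTime; recent MSSQL JDBC drivers map it to datetime2
ps.setObject(3, endTs);
ResultSet dailyAverages = ps.executeQuery();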

Related

Why is Oracle Pivot producing non-existent results?

I manage a database holding a large amount of climate data collected from various stations. It's an Oracle 12.2 DB, and here's a synopsis of the relevant tables:
FACT = individual measurements at a particular time
UTC_START = time in UTC at which the measurement began
LST_START = time in local standard time (to the particular station) at which the measurement began
SERIES_ID = ID of the series to which the measurement belongs (FK to SERIES)
STATION_ID = ID of the station at which the measurement occurred (FK to STATION)
VALUE = value of the measurement
Note that UTC_START and LST_START always have a constant difference per station (the LST offset from UTC). I have confirmed that there are no instances where the difference between UTC_START and LST_START is anything other than what is expected.
SERIES = descriptive data for series of data
SERIES_ID = ID of the series (PK)
NAME = text name of the series (e.g. Temperature)
STATION = descriptive data for stations
STATION_ID = ID of the station (PK)
SITE_ID = ID of the site at which a station is located (most sites have one station, but a handful have 2)
SITE_RANK = rank of the station within the site if there is more than one station.
EXT_ID = external ID for a site (provided to us)
The EXT_ID of a site applies to all stations at that site (but may not be populated unless SITE_RANK == 1, not ideal, I know, but not the issue here), and data from lower ranked stations is preferred. To organize this data into a consumable format, we're using a PIVOT to collect measurements occurring at the same site/time into rows.
Here's the query:
WITH
primaries AS (
SELECT site_id, ext_id
FROM station
WHERE site_rank = 1
),
data as (
SELECT d.site_id, d.utc_start, d.lst_start, s.name, d.value FROM (
SELECT s.site_id, f.utc_start, f.lst_start, f.series_id, f.value,
ROW_NUMBER() over (PARTITION BY s.site_id, f.utc_start, f.series_id ORDER BY s.site_rank) as ORDINAL
FROM fact f
JOIN station s on f.station_id = s.station_id
) d
JOIN series s ON d.series_id = s.series_id
WHERE d.ordinal = 1
AND d.site_id = ?
AND d.utc_start >= ?
AND d.utc_start < ?
),
records as (
SELECT * FROM data
PIVOT (
MAX(VALUE) AS VALUE
FOR NAME IN (
-- these are a few series that we would want to collect by UTC_START
't5' as t5,
'p5' as p5,
'solrad' as solrad,
'str' as str,
'stc_05' as stc_05,
'rh' as rh,
'smv005_05' as smv005_05,
'st005_05' as st005_05,
'wind' as wind,
'wet1' as wet1
)
)
)
SELECT r.*, p.ext_id
FROM records r JOIN primaries p on r.site_id = p.site_id
Here's where things get odd. This query works perfectly in SQLAlchemy, IntelliJ (using OJDBC thin), and Oracle SQL Developer. But when it's run from within our Java program (same JDBC URL and credentials, using plain old JDBC statements and result sets), it gives results that don't make sense. Specifically, for the same station, it will return 2 rows with the same UTC_START but different LST_START (recall that I have verified that this 100% does not occur anywhere in the FACT table). Just to ensure there was no weird parameter handling going on, we tested hard-coding values for the placeholders, and copy-and-pasted the exact same query between various clients, and the only one that returns these strange results is the Java program (which is using the exact same OJDBC jar as IntelliJ).
If anyone has any insight or possible causes, it would be greatly appreciated. We're at a bit of a loss right now.
It turns out that Nathan's comment was correct. Though it seems counter-intuitive (to me, at least), it appears that calling ResultSet.getString on a DATE column will in fact convert it to a Timestamp first. Timestamp has the unfortunate default behavior of using the system default timezone unless you explicitly specify otherwise.
This default behavior meant that daylight saving time was taken into account when we didn't intend it to be, leading to the odd behavior described.
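In practical terms, the fix is to avoid that implicit conversion when reading the DATE columns. A minimal sketch of two alternatives (not the poster's exact code; the java.time variant needs a driver recent enough to support JDBC 4.2 mappings):
// Interpret the DATE column in an explicit time zone instead of the JVM default:
Calendar utcCal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
Timestamp utcStart = rs.getTimestamp("UTC_START", utcCal);

// Or skip java.sql.Timestamp entirely and read a zone-less java.time value:
LocalDateTime utcStartLdt = rs.getObject("UTC_START", LocalDateTime.class);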

Fastest way to compare millions of rows in one table with millions of rows in another [closed]

I want to compare two tables with millions of records in each and get the matching data from the comparison.
To get the matching data from both tables, we first require that the name in table1 is not equal to the name in table2. Then the city in table1 must be equal to the city in table2, and finally the date_of_birth in table1 must be within a +-1 year range of the date_of_birth in table2.
A single row in Table 1 can have multiple matches with data in Table 2.
Also, for each match I need a unique record ID, and the multiple matches of a single Table 1 row must share the same unique record ID.
I tried with Java code and a PL/SQL procedure, but both take hours, as this involves comparing millions of rows with millions of rows. Is there any faster way to do this matching?
"I tried using java by storing data from both tables in list via jdbc connection and then iterating one list with the other. But it was very slow and took many hours to complete, even got time out exception many time."
Congratulations. This is the first step on the road to enlightenment. Databases are much better at handling data than Java. Java is a fine general programming language but databases are optimized for relational data processing: they just do it faster, with less CPU, less memory and less network traffic.
"I also created an sql procedure for the same, it was some what faster
than java program but still took a lot time (couple of hours) to
complete."
You are on the verge of the second step to enlightenment: row-by-row processing (i.e. procedural iteration) is slow. SQL is a set-based paradigm. Set processing is much faster.
To give concrete advice we need some specifics about what you are really doing, but as an example this query would give you the set of matches for these columns in both tables:
select col1, col2, col3
from huge_table_1
INTERSECT
select col1, col2, col3
from huge_table_2
The MINUS operator would give you the rows in huge_table_1 which aren't in huge_table_2. Flip the tables to get the obverse set.
select col1, col2, col3
from huge_table_1
MINUS
select col1, col2, col3
from huge_table_2
Embrace the Joy of Sets!
"we are first comparing the name in huge_table_1 should not be equal
to name in huge_table_2. Then we are comparing city in huge_table_1
should be equal to city in huge_table_2 and then finally we are
comparing date_of_birth in huge_table_1 should be with in +-1 year
range of date_of-birth in huge_table_2"
Hmmm. Starting off with an inequality is usually bad, especially in large tables. The chances are you will have lots of non-matching names with those matching criteria. But you could try something like this:
select * from huge_table_1 ht1
where exists
      ( select null from huge_table_2 ht2
        where ht2.city = ht1.city
        and ht1.date_of_birth between add_months(ht2.date_of_birth, -12)
                                  and add_months(ht2.date_of_birth, 12)
        and ht2.name != ht1.name)
/
Select data from both tables, sorted by the key fields, then iterate them in parallel and compare. Comparison time should be fast, so the total run time should be only slightly more than the sum of the run times of the two ordered queries.
UPDATE
New information shows that a partial cross-join of the data is desired:
left.name <> right.name
left.city = right.city
abs(left.birthDate - right.birthDate) <= 1 year
So, given that there is one equality test, you can process the data in chunks, where a chunk is all records with the same city.
Comparison will progress as follows:
Select data from both tables, sorted by city.
Iterate both result sets in parallel.
Load all records from one result set (left) with the next city, i.e. load the next chunk. Store them in memory in a TreeMap<LocalDate, List<Person>>.
Iterate all records from the other result set (right) with the same city, i.e. process the chunk.
For each record in right, find records within 1 year of birthDate by calling subMap(), like this:
Collection<List<Person>> coll =
    leftTree.subMap(right.birthDate.minusYears(1), true,
                    right.birthDate.plusYears(1), true)
            .values();
Iterate those records and skip the ones with the same name. These are the left records that "match" the given right record.
If needed, you can flatten that and filter the names using stream:
List<Person> matches = coll.stream()
    .flatMap(List::stream)
    .filter(p -> ! p.name.equals(right.name))
    .collect(Collectors.toList());
Optionally replacing the collect() with the actual processing logic.
When done processing the chunk as described in step 4, i.e. when you see the next city, clear the TreeMap and repeat from step 3 for the next chunk (the next city); a consolidated sketch of this loop follows.
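Putting steps 2 to 5 together, the per-chunk loop might look roughly like this (a sketch only; Person, the city-sorted queries and the process() handler are assumptions, not code from the question):
// Both ResultSets are assumed to be ordered by city; the left side is loaded one city at a time.
static void compareByCityChunks(ResultSet left, ResultSet right) throws SQLException {
    boolean hasLeft = left.next(), hasRight = right.next();
    while (hasLeft && hasRight) {
        int cmp = left.getString("city").compareTo(right.getString("city"));
        if (cmp < 0) { hasLeft = left.next(); continue; }    // city present only on the left
        if (cmp > 0) { hasRight = right.next(); continue; }  // city present only on the right
        String city = left.getString("city");
        // Step 3: load the left chunk for this city into a TreeMap keyed by birth date.
        TreeMap<LocalDate, List<Person>> leftTree = new TreeMap<>();
        while (hasLeft && city.equals(left.getString("city"))) {
            Person p = new Person(left.getString("name"), city, left.getDate("date_of_birth").toLocalDate());
            leftTree.computeIfAbsent(p.birthDate, d -> new ArrayList<>()).add(p);
            hasLeft = left.next();
        }
        // Step 4: probe the tree with every right row of the same city.
        while (hasRight && city.equals(right.getString("city"))) {
            Person r = new Person(right.getString("name"), city, right.getDate("date_of_birth").toLocalDate());
            leftTree.subMap(r.birthDate.minusYears(1), true, r.birthDate.plusYears(1), true)
                    .values().stream()
                    .flatMap(List::stream)
                    .filter(l -> !l.name.equals(r.name))   // the "name must differ" rule
                    .forEach(l -> process(l, r));          // assumed handler for a matched pair
            hasRight = right.next();
        }
        // Step 5: leftTree goes out of scope here; the next iteration handles the next city.
    }
}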
Advantages of this logic:
Data is only sent once from the database server, i.e. the repetition of data caused by the partial cross-join is eliminated from the relatively slow data link.
The two queries can be sourced from two different databases, if needed.
Memory footprint is kept down, by only retaining data for one city of one of the queries at a time (chunk of left).
Matching logic can be multi-threaded, if needed, for extra performance, e.g.
Thread 1 loads left chunk into TreeMap, and gives it to thread 2 for processing, while thread 1 begins loading next chunk.
Thread 2 iterates right and finds matching records by calling subMap(), iterating the submap, giving matching left and right records to thread 3 for processing.
Thread 3 processes a matching pair.

Range Scan in Cassandra-2.1.2 taking a lot of time

My use case is like this: I am inserting 10 million rows in a table described as follows:
keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)
and the input data is as follows:
keyval - some timestamp
rangef - a random number
arrayval - a byte array
I am using a composite primary key because, after inserting 10 million rows, I want to perform a range scan on keyval. As keyval contains a timestamp, my query will be along the lines of "give me all the rows between this time and that time". Hence, to perform this kind of SELECT query, I made my primary key composite.
Now, during ingestion, the performance was very good and satisfactory. But when I ran the query described above, the performance was very low. When I queried "bring me all the rows between t1 and t1 + 3 minutes", almost 500k records were returned in 160 seconds.
My query is like this
Statement s = QueryBuilder.select().all().from(keySpace, tableName)
        .allowFiltering()
        .where(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);
for (Row row : rs)
{
    count++;
}
System.out.println("Batch2 count = " + count);
I am using the default partitioner, i.e. the Murmur3 partitioner.
My cluster configuration is -
No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node = 8G
The rest of the configuration is default.
How I can improve my range scan performance?
You are actually performing a full table scan and not a range scan. This is one of the slowest queries possible for Cassandra and is usually only used by analytics workloads. If at any time your queries require ALLOW FILTERING for an OLTP workload, something is most likely wrong. Basically, Cassandra has been designed with the knowledge that queries which require accessing the entire dataset will not scale, so a great deal of effort is made to make it simple to partition and quickly access data within a partition.
To fix this you need to rethink your data model and think about how you can restrict the data to queries on a single partition.
RussS is correct that your problems are caused both by the use of ALLOW FILTERING and by not limiting your query to a single partition.
How I can improve my range scan performance?
By limiting your query with a value for your partitioning key.
PRIMARY KEY (rangef, keyval)
If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
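For example, with the same DataStax QueryBuilder API used in the question, restricting to one partition first means the keyval range no longer needs ALLOW FILTERING (a sketch; someRangef is a partition value you would have to know, or iterate over, on the application side):
Statement s = QueryBuilder.select().all().from(keySpace, tableName)
        .where(QueryBuilder.eq("rangef", someRangef))      // partition key first
        .and(QueryBuilder.gte("keyval", 1411516800))       // range on the clustering key
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);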
I made a secondary index on keyval, and my query performance improved: from 160 seconds it dropped to 40 seconds. So does indexing the keyval field make sense?
The problem with relying on secondary indexes, is that they may seem fast at first, but get slow over time. Especially with a high-cardinality column like a timestamp (Keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to get a small number of results. It's always better to duplicate your data in a new query table, than to rely on a secondary index query.
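One sketch of such a query table for this case, bucketing rows by hour so that a time-range read touches only a few partitions (the table name and bucketing scheme are assumptions, not from the question):
// Create the denormalized query table once (CQL executed through the same Java driver session):
sess.execute("CREATE TABLE IF NOT EXISTS " + keySpace + ".readings_by_hour ("
        + " hour_bucket bigint, keyval bigint, rangef bigint, arrayval blob,"
        + " PRIMARY KEY (hour_bucket, keyval))");
// Write each reading to both tables; read a time range by hitting the hour buckets it spans:
ResultSet byHour = sess.execute(
        "SELECT * FROM " + keySpace + ".readings_by_hour"
        + " WHERE hour_bucket = ? AND keyval >= ? AND keyval <= ?",
        1411516800L - (1411516800L % 3600), 1411516800L, 1411516980L);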

Efficient solution for grouping same values in a large dataset

At my job I was asked to develop and implement a solution for the following problem:
Given a dataset of 30M records, extract (key, value) tuples from a particular dataset field, group them by key and value, and store the number of identical values for each key. Write the top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in the form of serialized XML.
I came up with the solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples, dump them to disk. Each tuple goes to a file with a name pattern like '<key>/chunk-<N>', so all values for a given key are stored in one directory. Within one file, values are stored sorted.
Step 2. Iterate over all '<key>' directories and merge their chunk files into one, grouping identical values. Since the values are stored sorted, it is trivial to merge them with O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words for each key) sequentially read its values using PriorityQueue to maintain top 5000 values without loading all the values into memory. Write queue content to the database.
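(For reference, the top-5000 selection in step 3 is essentially a bounded min-heap. A rough sketch, where ValueCount is a hypothetical value/count pair and readMergedValues() stands in for the sequential read of a merged file:)
PriorityQueue<ValueCount> top = new PriorityQueue<>(Comparator.comparingLong((ValueCount vc) -> vc.count));
for (ValueCount vc : readMergedValues(key)) {  // sequential read; nothing else kept in memory
    top.offer(vc);
    if (top.size() > 5000) {
        top.poll();                            // drop the currently least frequent value
    }
}
// 'top' now holds the key's 5000 most frequent values, ready to be written to the database.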
I spent about a week on this task, mainly because I had not worked with Spring-Batch previously and because I tried to emphasize scalability, which requires a careful implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
And the question is: do you know a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware about MapReduce-like frameworks, but I can't use them because the application is supposed to be run on a simple PC with 3 cores and 1GB for Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file management tasks in this solution, which is definitely not free.
As you have three cores, you could parse three keys at the same time (as long as the file system can handle the load).
Your solution seems reasonable and efficient, however I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL 2008, but the concept should be workable in most any modern RDBMS).
The SQL between /* START */ and /* END */ would be the statements you need to execute in your code.
BEGIN
    -- database table
    DECLARE @tbl TABLE (
        k INT -- key
        , v INT -- value
        , c INT -- count
        , UNIQUE CLUSTERED (k, v)
    )
    -- insertion loop (for testing)
    DECLARE @x INT
    SET @x = 0
    SET NOCOUNT OFF
    WHILE (@x < 1000000)
    BEGIN
        --
        SET @x = @x + 1
        DECLARE @k INT
        DECLARE @v INT
        SET @k = CAST(RAND() * 10 as INT)
        SET @v = CAST(RAND() * 100 as INT)
        -- the INSERT / UPDATE code
        /* START this is the sql you'd run for each row */
        UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
        IF @@ROWCOUNT = 0
            INSERT INTO @tbl VALUES (@k, @v, 1)
        /* END */
        --
    END
    SET NOCOUNT ON
    -- final select
    DECLARE @topN INT
    SET @topN = 50
    /* START this is the sql you'd run once at the end */
    SELECT
        a.k
        , a.v
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
            , k
            , v
        FROM @tbl
    ) a
    WHERE a.rid < @topN
    /* END */
END
Gee, it doesn't seem like much work to try the old fashioned way of just doing it in-memory.
I would try just doing it first, then if you run out of memory, try one key per run (as per @Storstamp's answer).
If using the "simple" solution is not an option due to the size of the data, my next choice would be to use an SQL database. However, as most of these require quite much memory (and coming down to a crawl when heavily overloaded in RAM), maybe you should redirect your search into something like a NoSQL database such as MongoDB that can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and may probably do it a bit more efficient than your solution, due to the fact that all data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the /chunk- files, merging them etc.
You will end up with a solution that is probably much easier to administrate, and it will also allow you to set up different kind of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object, because the solution is a bit hard to maintain and does not use proven technologies that basically do part of the same thing you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.

Service usage limiter implementation

I need to limit multiple service usages for multiple customers. For example, customer customer1 can send max 1000 SMS per month. My implementation is based on one MySQL table with 3 columns:
date TIMESTAMP
name VARCHAR(128)
value INTEGER
For every service usage (sending an SMS) one row is inserted into the table. value holds the usage count (e.g. if an SMS was split into 2 parts then value = 2). name holds the limiter name (e.g. customer1-sms).
To find out how many times the service was used this month (March 2011), a simple query is executed:
SELECT SUM(value) FROM service_usage WHERE name = 'customer1-sms' AND date > '2011-03-01';
The problem is that this query is slow (0.3 sec). We are using indexes on columns date and name.
Is there some better way to implement service usage limitation? My requirement is that it must be flexible (e.g. I may need to know the usage within the last 10 minutes, or within the current month). I am using Java.
Thanks in advance
You should have one index on both columns, not two indexes on each of the columns. This should make the query very fast.
If it still doesn't, then you could use a table with a month, a name and a value, and increment the value for the current month each time an SMS is sent. This would remove the sum from your query. It would still need an index on (month, name) to be as fast as possible, though.
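A rough sketch of that counter variant from Java/JDBC, assuming a hypothetical monthly_usage table with a unique key on (usage_month, name); MySQL's INSERT ... ON DUPLICATE KEY UPDATE keeps the increment to a single statement:
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO monthly_usage (usage_month, name, used) VALUES (?, ?, ?)"
        + " ON DUPLICATE KEY UPDATE used = used + VALUES(used)")) {
    ps.setString(1, "2011-03");         // month bucket
    ps.setString(2, "customer1-sms");   // limiter name
    ps.setInt(3, smsParts);             // e.g. 2 for a two-part SMS
    ps.executeUpdate();
}
// Checking the limit then becomes an index lookup:
// SELECT used FROM monthly_usage WHERE usage_month = ? AND name = ?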
I found one solution to my problem. Instead of inserting each usage increment, I will insert a running total (the last value plus the increment):
BEGIN;
-- select the last value
SELECT value FROM service_usage WHERE name = %name ORDER BY date DESC LIMIT 1;
-- insert it to the database
INSERT INTO service_usage VALUES (CURRENT_TIMESTAMP, %name, %value + %increment);
COMMIT;
To find out service usage since %date:
SELECT value AS value1 FROM test WHERE name = %name ORDER BY date DESC LIMIT 1;
SELECT value AS value2 FROM test WHERE name = %name AND date <= %date ORDER BY date DESC LIMIT 1;
The result will be value1 - value2
This way I'll need transactions. I'll probably implement it as stored procedure.
Any additional hints are still appreciated :-)
It's worth trying to replace your "=" with "like". Not sure why, but in the past I've seen this perform far more quickly than the "=" operator on varchar columns.
SELECT SUM(value) FROM service_usage WHERE name like 'customer1-sms' AND date > '2011-03-01';
Edited after comments:
Okay, now I can sorta re-create your issue - the first time I run the query, it takes around 0.03 seconds, subsequent runs of the query take 0.001 second. Inserting new records causes the query to revert to 0.03 seconds.
Suggested solution:
COUNT does not show the same slow-down. I would change the business logic so that every time the user sends an SMS you insert a record with value "1"; if the message is a multipart message, simply insert two rows.
Replace the "sum" with a "count".
I've applied this to my test data, and even after inserting a new record, the "count" query returns in 0.001 second.
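For completeness, the COUNT variant issued from JDBC would look something like this sketch (conn is an open connection to the same MySQL database):
PreparedStatement ps = conn.prepareStatement(
        "SELECT COUNT(*) FROM service_usage WHERE name = ? AND date > ?");
ps.setString(1, "customer1-sms");
ps.setTimestamp(2, Timestamp.valueOf("2011-03-01 00:00:00"));
try (ResultSet rs = ps.executeQuery()) {
    rs.next();
    long usedThisMonth = rs.getLong(1);  // one row per SMS part, so COUNT(*) is the usage
}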
