Is Siddhi unable to group by more than one variable? - java

I have the following stream definition:
String eventStreamDefinition =
"define stream cdrEventStream (nodeId string, phone string, timeStamp long, isOutgoingCall bool); ";
And the query:
String query = "#info(name = 'query1') from cdrEventStream#window.externalTime(timeStamp,5 sec) select nodeId, phone, timeStamp, isOutgoingCall, count(nodeId) as callCount group by phone,isOutgoingCall insert all events into outputStream;";
But when I try to compile them I get:
org.wso2.siddhi.query.compiler.exception.SiddhiParserException: You have an error in your SiddhiQL at line 1:267, extraneous input ',' expecting {'#', STREAM, DEFINE, TABLE, FROM, PARTITION, WINDOW, SELECT, GROUP, BY, HAVING, INSERT, DELETE, UPDATE, RETURN, EVENTS, INTO, OUTPUT, EXPIRED, CURRENT, SNAPSHOT, FOR, RAW, OF, AS, OR, AND, ON, IS, NOT, WITHIN, WITH, BEGIN, END, NULL, EVERY, LAST, ALL, FIRST, JOIN, INNER, OUTER, RIGHT, LEFT, FULL, UNIDIRECTIONAL, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, FALSE, TRUE, STRING, INT, LONG, FLOAT, DOUBLE, BOOL, OBJECT, ID_QUOTES, ID}
at org.wso2.siddhi.query.compiler.internal.SiddhiErrorListener.syntaxError(SiddhiErrorListener.java:34)
at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:65)
at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:558)
at org.antlr.v4.runtime.DefaultErrorStrategy.reportUnwantedToken(DefaultErrorStrategy.java:377)
at org.antlr.v4.runtime.DefaultErrorStrategy.sync(DefaultErrorStrategy.java:275)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.group_by(SiddhiQLParser.java:3783)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.query_section(SiddhiQLParser.java:3713)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.query(SiddhiQLParser.java:1903)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.execution_element(SiddhiQLParser.java:619)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.execution_plan(SiddhiQLParser.java:550)
at org.wso2.siddhi.query.compiler.SiddhiQLParser.parse(SiddhiQLParser.java:152)
at org.wso2.siddhi.query.compiler.SiddhiCompiler.parse(SiddhiCompiler.java:61)
at org.wso2.siddhi.core.SiddhiManager.createExecutionPlanRuntime(SiddhiManager.java:59)
The only way I can get the query to compile is by removing isOutgoingCall from the group by clause. The Siddhi docs state that grouping by more than one variable should be possible. Is this a bug?
This is on version 3.0.0-alpha.

Siddhi 3.0.0 supports grouping by several variables. I just checked your query against the released 3.0.0 (not the alpha) and it compiled. Can you please try it with the released version?
Tip: You can use Siddhi Try It to try out your queries easily.

Related

Regex changes to a DDL (Java)

I have a process that gets a DDL from Impala and makes a few changes for it to work on SQL Server.
I get something like this from Impala
CREATE EXTERNAL TABLE xxx.yyy (
year INT,
day INT,
mmm_yyyy DATE,
2target_revenue_day DECIMAL(38,6),
2budget_day DECIMAL(38,6),
last_6_months STRING,
load_timestamp TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3a://xxx'
TBLPROPERTIES ('')
I managed to remove the "EXTERNAL TABLE" bit as I only need "TABLE",
changed "STRING" to "VARCHAR" and "TIMESTAMP" to "DATETIME2".
I also removed the bit at the bottom, i.e. STORED AS PARQUET
LOCATION 's3a://xxx'
TBLPROPERTIES ('')
My problem is that some of the column names, like year, day and 2target_revenue_day, need to be wrapped in quotes, otherwise the script won't work (they are reserved words or start with a digit).
I need to find a way to either wrap all column names in quotes, or just the ones which are reserved words or start with a digit.
Any idea how to go about it?
Thank you
You could key the pattern off of a word immediately preceding one of a set of known data types. Depending on when you perform that step, you'll need to customize that list to match either the Impala or the SQL Server types.
(\w+)\s+(?:BOOLEAN|CHAR|DATE|DECIMAL|DOUBLE|FLOAT|INT|REAL|STRING|TIMESTAMP|VARCHAR|etc)
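A minimal sketch of how that pattern could be applied in Java to quote every column name. The class name ColumnQuoter and the exact type list are assumptions for illustration; the list would need to match whichever dialect the DDL is in at that step.

```java
import java.util.regex.Pattern;

class ColumnQuoter {
    // A column name is taken to be the word directly before a known data type.
    // The type list is an assumption; extend it to match the DDL being processed.
    private static final Pattern COLUMN = Pattern.compile(
        "(\\w+)(\\s+(?:BOOLEAN|CHAR|DATE|DATETIME2|DECIMAL|DOUBLE|FLOAT|INT|REAL|STRING|TIMESTAMP|VARCHAR)\\b)");

    // Wraps each detected column name in double quotes and leaves the rest untouched.
    static String quoteColumns(String ddl) {
        return COLUMN.matcher(ddl).replaceAll("\"$1\"$2");
    }

    public static void main(String[] args) {
        String ddl = "CREATE TABLE xxx.yyy (\n"
            + "year INT,\n"
            + "2target_revenue_day DECIMAL(38,6),\n"
            + "mmm_yyyy DATE\n"
            + ")";
        System.out.println(quoteColumns(ddl));
    }
}
```

The trailing \b keeps INT from matching inside a longer word such as INTEGER, so a type missing from the list simply leaves that column unquoted.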
With regard to columns that start with a digit,
this has worked for me:
variable.replaceAll("(\\d{1}[a-z]+[a-z0-9_]*)", "\"$0\"");
It finds any column name that begins with a digit and wraps it in quotes.
With regard to reserved words, I've had to manually look for words like year, month, day, date, etc. and replace them with a quoted name, e.g. "year", "month", etc.
variable.replace(" date ", " \"date\" ").replace(" year ", " \"year\" ").replace(" month ", " \"month\" ").replace(" day ", " \"day\" ");
I hope someone will find this useful.
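The two steps above could also be combined into a single pass, sketched below. It assumes each column definition starts on its own line, as in the DDL from the question. The class name SelectiveQuoter and the short reserved-word list are hypothetical and would need extending for real use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveQuoter {
    // Hypothetical, incomplete reserved-word list for illustration only.
    private static final Set<String> RESERVED =
        new HashSet<>(Arrays.asList("year", "month", "day", "date"));

    // The first word on each line (after optional indentation), followed by whitespace.
    private static final Pattern LEADING_NAME = Pattern.compile("(?m)^([ \\t]*)(\\w+)(?=\\s)");

    // Quotes a leading name only if it starts with a digit or is a reserved word.
    static String quoteWhereNeeded(String ddl) {
        Matcher m = LEADING_NAME.matcher(ddl);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2);
            boolean needsQuotes = Character.isDigit(name.charAt(0))
                || RESERVED.contains(name.toLowerCase(Locale.ROOT));
            String replacement = m.group(1) + (needsQuotes ? "\"" + name + "\"" : name);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteWhereNeeded(
            "year INT,\nmmm_yyyy DATE,\n2budget_day DECIMAL(38,6),"));
    }
}
```

Anchoring on the start of the line means a type keyword such as DATE later in the line is never touched, which the plain replace(" date ", ...) calls cannot guarantee.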

Why is Oracle Pivot producing non-existent results?

I manage a database holding a large amount of climate data collected from various stations. It's an Oracle 12.2 DB, and here's a synopsis of the relevant tables:
FACT = individual measurements at a particular time
UTC_START = time in UTC at which the measurement began
LST_START = time in local standard time (to the particular station) at which the measurement began
SERIES_ID = ID of the series to which the measurement belongs (FK to SERIES)
STATION_ID = ID of the station at which the measurement occurred (FK to STATION)
VALUE = value of the measurement
Note that UTC_START and LST_START always have a constant difference per station (the LST offset from UTC). I have confirmed that there are no instances where the difference between UTC_START and LST_START is anything other than what is expected.
SERIES = descriptive data for series of data
SERIES_ID = ID of the series (PK)
NAME = text name of the series (e.g. Temperature)
STATION = descriptive data for stations
STATION_ID = ID of the station (PK)
SITE_ID = ID of the site at which a station is located (most sites have one station, but a handful have 2)
SITE_RANK = rank of the station within the site if there are more than 1 stations.
EXT_ID = external ID for a site (provided to us)
The EXT_ID of a site applies to all stations at that site (but may not be populated unless SITE_RANK == 1, not ideal, I know, but not the issue here), and data from lower ranked stations is preferred. To organize this data into a consumable format, we're using a PIVOT to collect measurements occurring at the same site/time into rows.
Here's the query:
WITH
primaries AS (
SELECT site_id, ext_id
FROM station
WHERE site_rank = 1
),
data as (
SELECT d.site_id, d.utc_start, d.lst_start, s.name, d.value FROM (
SELECT s.site_id, f.utc_start, f.lst_start, f.series_id, f.value,
ROW_NUMBER() over (PARTITION BY s.site_id, f.utc_start, f.series_id ORDER BY s.site_rank) as ORDINAL
FROM fact f
JOIN station s on f.station_id = s.station_id
) d
JOIN series s ON d.series_id = s.series_id
WHERE d.ordinal = 1
AND d.site_id = ?
AND d.utc_start >= ?
AND d.utc_start < ?
),
records as (
SELECT * FROM data
PIVOT (
MAX(VALUE) AS VALUE
FOR NAME IN (
-- these are a few series that we would want to collect by UTC_START
't5' as t5,
'p5' as p5,
'solrad' as solrad,
'str' as str,
'stc_05' as stc_05,
'rh' as rh,
'smv005_05' as smv005_05,
'st005_05' as st005_05,
'wind' as wind,
'wet1' as wet1
)
)
)
SELECT r.*, p.ext_id
FROM records r JOIN primaries p on r.site_id = p.site_id
Here's where things get odd. This query works perfectly in SQLAlchemy, IntelliJ (using OJDBC thin), and Oracle SQL Developer. But when it's run from within our Java program (same JDBC URLs and credentials, using plain old JDBC statements and result sets), it gives results that don't make sense. Specifically, for the same station, it will return 2 rows with the same UTC_START but different LST_START (recall that I have verified that this 100% does not occur anywhere in the FACT table). Just to ensure there was no weird parameter handling going on, we tested hard-coding values in for the placeholders, and copy-and-pasted the exact same query between various clients, and the only one that returns these strange results is the Java program (which is using the exact same OJDBC jar as IntelliJ).
If anyone has any insight or possible causes, it would be greatly appreciated. We're at a bit of a loss right now.
It turns out that Nathan's comment was correct. Though it seems counter-intuitive (to me, at least), it appears that calling ResultSet.getString on a DATE column will in fact convert the value to a Timestamp first. Timestamp has the unfortunate default behavior of using the system default time zone unless you explicitly specify otherwise.
This default behavior meant that daylight saving time was taken into account when we didn't intend it to be, leading to the odd behavior described.
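A small sketch of the effect, using java.time so it does not depend on the machine's default zone. The UTC-6 offset and the America/Chicago zone are assumptions for illustration; the question does not say which zone the JVM was running in. The readAsUtc helper shows one possible way to pass an explicit calendar to ResultSet.getTimestamp. It is not taken from the original program.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.TimeZone;

class DstOffsetDemo {
    // Hours between the UTC wall-clock time and the wall-clock time in the given zone.
    static long offsetHours(Instant instant, ZoneId zone) {
        LocalDateTime utc = LocalDateTime.ofInstant(instant, ZoneOffset.UTC);
        LocalDateTime local = LocalDateTime.ofInstant(instant, zone);
        return Duration.between(utc, local).toHours();
    }

    // Possible fix: read the column with an explicit UTC calendar instead of
    // relying on getString and the JVM default time zone.
    static LocalDateTime readAsUtc(ResultSet rs, String column) throws SQLException {
        Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        Timestamp ts = rs.getTimestamp(column, utc);
        return ts == null ? null : LocalDateTime.ofInstant(ts.toInstant(), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        Instant winter = Instant.parse("2019-01-15T12:00:00Z");
        Instant summer = Instant.parse("2019-07-15T12:00:00Z");
        ZoneId fixedLst = ZoneOffset.ofHours(-6);          // local standard time, no DST
        ZoneId dstZone = ZoneId.of("America/Chicago");     // observes DST

        System.out.println("fixed offset: " + offsetHours(winter, fixedLst)
            + " / " + offsetHours(summer, fixedLst));
        System.out.println("DST zone:     " + offsetHours(winter, dstZone)
            + " / " + offsetHours(summer, dstZone));
    }
}
```

With a fixed offset the difference from UTC is the same in both seasons. In a zone that observes DST it shifts by an hour in summer, which is the kind of shift that made LST_START look inconsistent.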

Select count(*) returns a row even when I dont expect it

So I am querying a MySql database from my java application and I am trying to use a query,
Select count(*) from table where `NUMERIC`='1'
to count the rows from a database. When I run this query it works fine, and I get a 1 returned (I am using a test db with 12 records, Numeric has values 1-12 so this makes sense). However I wanted to try to break this and do some error handling. I changed my query to
Select count(*) from table where `Numeric`='1adjfa'
I expected this to return 0, but it still returns 1. In fact, as long as the value starts with 1 it will match; if I change the value to just 'adjfa' then it returns 0. I have confirmed this through both my Java app and MySQL Workbench. Any ideas as to why this returns 1, even with the junk at the end?
Two different data types can not be compared. Instead one of the two needs to be cast/coerced to the same data type as the other.
In your case you're not doing the coercion, so the DB Engine is doing an implicit coercion.
Based on data-type-order-of-precedence, the database engine chooses the string to be coerced to a numeric.
The value '1adjfa' therefore becomes a 1, and then your comparison is being made.
This results in your query effectively being:
Select count(*) from table where `Numeric` = 1
You should either not be comparing numerics and strings, or do the coercion yourself, for example...
Select count(*) from table where CAST(`Numeric` AS CHAR(32)) = '1adjfa'
In terms of breaking the query, I'm hoping that in your application you're actually using parameterised queries. This will allow you to define the data-type of the parameter, and your application should throw the error if the wrong data-type is supplied.
Numeric has a number data type. To make the comparison with '1adjfa', the DB engine tries to convert the string to a number as well, which results in 1; the rest gets cut off.
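On the Java side, a sketch of the parameterised approach might look like the following. The table name my_table and the class name StrictCount are hypothetical. The idea is to reject non-numeric input in the application and bind the value as an int, so MySQL never has to coerce a string.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class StrictCount {
    // Rejects input such as "1adjfa" instead of letting the database truncate it to 1.
    static int parseStrictInt(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a whole number: " + raw, e);
        }
    }

    // Counts matching rows with a typed bind parameter.
    // my_table is a hypothetical table name; `Numeric` is the column from the question.
    static long countMatching(Connection conn, String raw) throws SQLException {
        int value = parseStrictInt(raw);
        String sql = "SELECT COUNT(*) FROM my_table WHERE `Numeric` = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, value);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStrictInt("1"));
        try {
            parseStrictInt("1adjfa");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```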

jooq batch insert issue (duplicating the first row)

I'm trying to use jOOQ for batch inserts into my postgres database.
What I'm trying to do is:
BatchBindStep bbs = context.batch(context.insertInto(TABLENAME,FIELD1,FIELD2,....).values("?","?",...));
bbs = bbs.bind(v1a,v2a).bind(v1b,v2b)....;
bbs.execute();
as described at http://www.jooq.org/doc/3.1/manual-single-page/#batch-execution
To make it clear, I want to insert thousands of rows in one query, not by using a batch with thousands of queries :
// 2. a single query
// -----------------
create.batch(create.insertInto(AUTHOR, ID, NAME).values("?", "?"))
.bind(1, "Erich Gamma")
.bind(2, "Richard Helm")
.bind(3, "Ralph Johnson")
.bind(4, "John Vlissides")
.execute();
The problem is:
To get to the point where the BatchBindStep accepts a .bind() call, one needs to have called
context.batch with an argument that has .values(...) as the last call.
The documentation states that "?" has to be used. This is typed as String, and may work only for tables where all columns are varchars, since jOOQ does static typing.
This irritates me. I tried my luck with arbitrary default values (null,0...) just to go through the values(...) step, hoping that since these "values" are not really
values that I want to batch insert, they get overwritten later by the binds.
As a matter of fact, they will.
TWICE for the first row. Which completely baffles me.
To repeat, I CAN do batch inserts, but the first row gets inserted TWICE. My intuition is that it has to do with the "values" call (at least there is a conceptual problem in the DSL with the typing).
Has anyone tried to use jOOQ for batch inserts, and how does one do that without inserting the first row twice?
P.S. This happens when I try to use
.values("?", "?", "?", "?", "?", "?", "?", "?", "?","?","?","?","?","?","?","?","?","?","?","?","?","?")
:
"The method values(Integer, String, String, String, String, String, String, String, String, String, Double, Double, String, String, String, String, Timestamp, String, String, String,
String, String) in the type
InsertValuesStep22 is not applicable for the arguments (String, String, String, String, String, String, String, String, String, String, String, String, String, String,
String, String, String, String, String, String, String, String)"
So clearly, the typing is wrong, when I try to adapt the example from the documentation.
The example from the documentation was wrong. It has now been fixed:
http://www.jooq.org/doc/latest/manual/sql-execution/batch-execution
In principle, as you've noticed, it doesn't matter what dummy bind values you're passing to the insert statement, as those values will be replaced when binding the values specified by the various .bind() calls. So in principle, some correct solutions would be:
// Passing in null
create.insertInto(AUTHOR, ID, NAME).values((Integer) null, null);
// Passing in a dummy value (even with a wrong type)
create.insertInto(AUTHOR, ID, NAME).values(Arrays.asList("?", "?"));
jOOQ integration tests suggest that batch insertion works correctly. The issue you have been experiencing with double-inserts of the first record would be surprising. Either this is a subtle bug that is not visible from your current question, or you might have called .bind() one too many times?

Service usage limiter implementation

I need to limit multiple service usages for multiple customers. For example, customer customer1 can send max 1000 SMS per month. My implementation is based on one MySQL table with 3 columns:
date TIMESTAMP
name VARCHAR(128)
value INTEGER
For every service usage (sending SMS) one row is inserted to the table. value holds usage count (eg. if SMS was split to 2 parts then value = 2). name holds limiter name (eg. customer1-sms).
To find out how many times the service was used this month (March 2011), a simple query is executed:
SELECT SUM(value) FROM service_usage WHERE name = 'customer1-sms' AND date > '2011-03-01';
The problem is that this query is slow (0.3 sec). We are using indexes on columns date and name.
Is there a better way to implement service usage limitation? My requirement is that it must be flexible (e.g. I may need to know usage within the last 10 minutes, or within the current month). I am using Java.
Thanks in advance
You should have one composite index covering both columns, not a separate index on each column. This should make the query very fast.
If it still doesn't, then you could use a table with a month, a name and a value, and increment the value for the current month each time an SMS is sent. This would remove the sum from your query. It would still need an index on (month, name) to be as fast as possible, though.
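The per-month counter idea can be sketched in memory as follows. This only illustrates the data shape (one running value per month and limiter name); in the real system the map would be the table described above, and the class name MonthlyUsage is made up for the example.

```java
import java.time.YearMonth;
import java.util.HashMap;
import java.util.Map;

class MonthlyUsage {
    // One running total per (month, limiter name), mirroring a (month, name, value) table.
    private final Map<String, Long> totals = new HashMap<>();

    private static String key(YearMonth month, String name) {
        return month + "|" + name;
    }

    // Equivalent of incrementing the value for the current month.
    void record(YearMonth month, String name, long amount) {
        totals.merge(key(month, name), amount, Long::sum);
    }

    // Reading usage is a single lookup; no SUM over individual rows is needed.
    long usage(YearMonth month, String name) {
        return totals.getOrDefault(key(month, name), 0L);
    }

    boolean withinLimit(YearMonth month, String name, long limit) {
        return usage(month, name) < limit;
    }

    public static void main(String[] args) {
        MonthlyUsage usage = new MonthlyUsage();
        YearMonth march = YearMonth.of(2011, 3);
        usage.record(march, "customer1-sms", 2); // SMS split into two parts
        usage.record(march, "customer1-sms", 1);
        System.out.println(usage.usage(march, "customer1-sms"));
        System.out.println(usage.withinLimit(march, "customer1-sms", 1000));
    }
}
```

The trade-off is that a month-level counter cannot answer "usage within the last 10 minutes", so it would complement rather than replace the detailed rows if that flexibility is required.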
I found one solution to my problem. Instead of inserting each service usage increment, I will insert the running total (the previous value plus the increment):
BEGIN;
-- select the last value
SELECT value FROM service_usage WHERE name = %name ORDER BY date DESC LIMIT 1;
-- insert the new running total into the database
INSERT INTO service_usage (date, name, value) VALUES (CURRENT_TIMESTAMP, %name, %value + %increment);
COMMIT;
To find out service usage since %date:
SELECT value AS value1 FROM test WHERE name = %name ORDER BY date DESC LIMIT 1;
SELECT value AS value2 FROM test WHERE name = %name AND date <= %date ORDER BY date DESC LIMIT 1;
The result will be value1 - value2
This way I'll need transactions. I'll probably implement it as stored procedure.
Any additional hints are still appreciated :-)
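The running-total scheme above can be sketched in Java with a sorted map standing in for the table. The class name CumulativeUsage is made up for the example, and the sketch assumes events are recorded in chronological order, which the transaction in the SQL version is meant to guarantee.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

class CumulativeUsage {
    // Timestamp -> running total at that moment (assumes chronological inserts).
    private final TreeMap<Instant, Long> runningTotal = new TreeMap<>();

    // Reads the last total and stores last + increment, like the SELECT/INSERT pair.
    void record(Instant when, long increment) {
        long previous = runningTotal.isEmpty() ? 0L : runningTotal.lastEntry().getValue();
        runningTotal.put(when, previous + increment);
    }

    // value1 = latest total, value2 = total at or before 'since'; usage = value1 - value2.
    long usageSince(Instant since) {
        if (runningTotal.isEmpty()) {
            return 0L;
        }
        long value1 = runningTotal.lastEntry().getValue();
        Map.Entry<Instant, Long> atOrBefore = runningTotal.floorEntry(since);
        long value2 = atOrBefore == null ? 0L : atOrBefore.getValue();
        return value1 - value2;
    }

    public static void main(String[] args) {
        CumulativeUsage usage = new CumulativeUsage();
        usage.record(Instant.parse("2011-03-01T00:00:00Z"), 2);
        usage.record(Instant.parse("2011-03-10T00:00:00Z"), 1);
        usage.record(Instant.parse("2011-03-20T00:00:00Z"), 4);
        System.out.println(usage.usageSince(Instant.parse("2011-03-05T00:00:00Z")));
    }
}
```

Each lookup touches at most two entries regardless of how many events exist, which is the point of storing totals instead of increments.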
It's worth trying to replace your "=" with "like". Not sure why, but in the past I've seen this perform far more quickly than the "=" operator on varchar columns.
SELECT SUM(value) FROM service_usage WHERE name like 'customer1-sms' AND date > '2011-03-01';
Edited after comments:
Okay, now I can sorta re-create your issue - the first time I run the query, it takes around 0.03 seconds, subsequent runs of the query take 0.001 second. Inserting new records causes the query to revert to 0.03 seconds.
Suggested solution:
COUNT does not show the same slow-down. I would change the business logic so that every time the user sends an SMS you insert a record with value "1"; if the message is a multipart message, simply insert one row per part (two rows for a two-part message).
Replace the "sum" with a "count".
I've applied this to my test data, and even after inserting a new record, the "count" query returns in 0.001 second.
