BigQuery WORM work-around for updated data - java

Using Google's "electric meter" example from a few years back, we would have:
MeterID (Datastore Key) | MeterDate (Date) | ReceivedDate (Date) | Reading (double)
Presuming we receive updated info (say, from an out-of-calibration or busted meter) and insert a new row with the same MeterID and MeterDate, using a window function to grab the newest ReceivedDate for each MeterID+MeterDate pair would only cost more when there are multiple records for that pair, right?
Sadly, we are flying without a SQL expert, but it seems like the query should look like:
SELECT
  meterDate,
  NTH_VALUE(reading, 1) OVER (PARTITION BY meterDate ORDER BY receivedDate DESC) AS reading
FROM [BogusBQ:TableID]
WHERE meterID = {ID}
  AND meterDate BETWEEN {startDate} AND {endDate}
Am I missing anything else major here? Would adding 'AND NOT IS_NAN(reading)' cause the Window Function to return the next row, or nothing? (Then we could use NaN to signify "deleted".)

Your SQL looks good. A couple of suggestions:
- I would use FIRST_VALUE to be a bit more explicit, but otherwise it should work.
- If you can, use NULL instead of NaN. Or better yet, add a new BOOLEAN column to mark deleted rows.
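A sketch combining both suggestions (legacy BigQuery SQL, to match the bracketed table reference above; is_deleted is the hypothetical new BOOLEAN column):

SELECT meterDate, reading
FROM (
  SELECT
    meterDate,
    FIRST_VALUE(reading) OVER (PARTITION BY meterDate ORDER BY receivedDate DESC) AS reading,
    FIRST_VALUE(is_deleted) OVER (PARTITION BY meterDate ORDER BY receivedDate DESC) AS is_deleted
  FROM [BogusBQ:TableID]
  WHERE meterID = {ID}
    AND meterDate BETWEEN {startDate} AND {endDate}
)
WHERE NOT is_deleted

Because the outer filter runs after the window functions, you still pick the newest row per pair first and only then drop it if it is marked deleted - the window function never silently falls back to an older reading. As with your original query, if a pair can have several rows you may also want to keep only one of them, e.g. by additionally selecting ROW_NUMBER() over the same partition and ordering, and filtering it to 1 in the outer query.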


Set WatermarkStrategy for Event Timestamps

I am trying to do a windowed aggregation query on a data stream that contains over 40 attributes in Flink. The stream's schema contains an epoch timestamp which I want to use for the WatermarkStrategy so I can actually define tumbling windows over it.
I know from the docs that you can define a timestamp using the SQL API in a CREATE TABLE query by first applying TO_TIMESTAMP_LTZ to the epochs to convert them to a proper timestamp, which can then be used in the following WATERMARK FOR statement. Since I have a really huge schema, though, I do not want to provide the schema by writing out the complete CREATE TABLE statement with all its columns. Instead, I want to deserialize using a custom class derived from the proto file that contains the schema. As far as I know, this is only possible by providing a deserializer for the KafkaSourceBuilder and calling the returns function of the stream with the class that protoc generated from the proto file. Which means I have to define the table using the Stream API.
Inspired by the answer to this question, I do it like this:
WatermarkStrategy<Row> watermarkStrategy = WatermarkStrategy
    .<Row>forBoundedOutOfOrderness(Duration.ofSeconds(10))
    .withTimestampAssigner((event, ts) -> (Long) event.getField("ts"));
tableEnv.createTemporaryView(
    "bidevents",
    stream
        .returns(BiddingEvent.BidEvent.class)
        .map(e -> Row.of(
            e.getTracking().getCampaign().getId(),
            e.getTracking().getAuction().getId(),
            Timestamp.from(Instant.ofEpochSecond(e.getTimestamp().getMilliseconds() / 1000))
        ))
        .returns(Types.ROW_NAMED(
            new String[] {"campaign_id", "auction_id", "ts"},
            Types.STRING, Types.STRING, Types.SQL_TIMESTAMP))
        .assignTimestampsAndWatermarks(watermarkStrategy)
);
tableEnv.executeSql("DESCRIBE bidevents").print();
Table resultTable = tableEnv.sqlQuery("" +
    "SELECT " +
    "  TUMBLE_START(ts, INTERVAL '1' DAY) AS window_start, " +
    "  TUMBLE_END(ts, INTERVAL '1' DAY) AS window_end, " +
    "  campaign_id, " +
    "  COUNT(DISTINCT auction_id) AS auctions " +
    "FROM bidevents " +
    "GROUP BY TUMBLE(ts, INTERVAL '1' DAY), campaign_id");
DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
resultStream.print();
env.execute();
I get this error:
Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Window aggregate can only be defined over a time attribute column, but TIMESTAMP(9) encountered.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[flink-dist-1.15.1.jar:1.15.1]
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist-1.15.1.jar:1.15.1]
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist-1.15.1.jar:1.15.1]
at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:291) ~[flink-dist-1.15.1.jar:1.15.1]
This seems kind of logical, since in line 3 I cast a java.sql.Timestamp to a Long value, which it is not (though the stack trace does not indicate that an error occurred during that cast). But when I do not convert the epoch (a Long) to a Timestamp during the map statement, I get this exception:
"Cannot apply '$TUMBLE' to arguments of type '$TUMBLE(<BIGINT>, <INTERVAL DAY>)'"
How can I assign the watermark AFTER the map statement and use the column in the later SQL query to create a tumbling window?
===== UPDATE =====
Thanks to a comment from David, I understand that I need the column to be of type TIMESTAMP(p) with precision p <= 3. To my understanding this means that my timestamp may not be more precise than full milliseconds. So I tried different ways to create Java timestamps (java.sql.Timestamp and java.time.LocalDateTime) that correspond to the Flink timestamps.
Some examples are:
1 Trying to convert epochs into a LocalDateTime by setting nanoseconds (the second parameter of ofEpochSecond) to 0:
LocalDateTime.ofEpochSecond(e.getTimestamp().getMilliseconds() / 1000, 0, ZoneOffset.UTC)
2 After reading the answer from Svend in this question, who uses LocalDateTime.parse on timestamps that look like "2021-11-16T08:19:30.123", I tried this:
LocalDateTime.parse(
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss").format(
        LocalDateTime.ofInstant(
            Instant.ofEpochSecond(e.getTimestamp().getMilliseconds() / 1000),
            ZoneId.systemDefault()
        )
    )
)
As you can see, these timestamps even have only seconds granularity (which I checked by looking at the printed output of the stream I created), which I assume should mean they have a precision of 0. But when using this stream to define a table/view, it once again has the type TIMESTAMP(9).
3 I also tried it with SQL timestamps:
new Timestamp(e.getTimestamp().getMilliseconds() )
This also did not change anything. I somehow always end up with a precision of 9.
Can somebody please help me how I can fix this?
Ok, I found the solution to the problem. If you have a stream containing a timestamp which you want to define as the event-time column for watermarks, you can use this function:
Table inputTable = tableEnv.fromDataStream(
    stream,
    Schema.newBuilder()
        .column("campaign_id", "STRING")
        .column("auction_id", "STRING")
        .column("ts", "TIMESTAMP(3)")
        .watermark("ts", "SOURCE_WATERMARK()")
        .build()
);
The important part is that you can "cast" the timestamp ts from TIMESTAMP(9) down to TIMESTAMP(3), or any other precision below 4, and that you can set the column to carry the watermark.
Another point that seems important to me: only timestamps of type java.time.LocalDateTime actually worked for later use as watermarks for tumbling windows.
Any other attempt to influence the precision of the timestamps by creating java.sql.Timestamp or java.time.LocalDateTime differently failed. This seemed to be the only viable way.
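For reference, a minimal sketch of the map step that fits the schema above (the getter names come from the question's code; emitting java.time.LocalDateTime in UTC and using Types.LOCAL_DATE_TIME are my assumptions, not something confirmed by the original poster):

.map(e -> Row.of(
    e.getTracking().getCampaign().getId(),
    e.getTracking().getAuction().getId(),
    // a java.time.LocalDateTime with millisecond precision; UTC is an assumed zone choice
    LocalDateTime.ofInstant(
        Instant.ofEpochMilli(e.getTimestamp().getMilliseconds()),
        ZoneOffset.UTC)))
.returns(Types.ROW_NAMED(
    new String[] {"campaign_id", "auction_id", "ts"},
    Types.STRING, Types.STRING, Types.LOCAL_DATE_TIME))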

Getting from DB XXX.0E0 (XXX stands for a number)

I'm trying to run a select statement using JdbcTemplate.
select statement:
SELECT currency, SUM(amount) AS total
FROM table_name
WHERE user_id IN (:userIdList)
GROUP BY currency
DB Table has three columns:
user_id
currency
amount
Example:
user_id | currency | amount
---------------------------
1       | EUR      | 9000
2       | EUR      | 1000
3       | USD      | 124
When I run this code
namedParamJDBCTemplate.query(query,
    new MapSqlParameterSource("userIdList", userIdList),
    new ResultSetExtractor<Map<String, Object>>() {
        @Override
        public Map<String, Object> extractData(ResultSet resultSet) throws SQLException, DataAccessException {
            Map<String, Object> mapRet = new HashMap<>();
            while (resultSet.next()) {
                mapRet.put(resultSet.getString("currency"), resultSet.getString("total"));
            }
            return mapRet;
        }
    });
I'm getting the result set as a map, but the result of the amount looks like this :
EUR -> 10000.0E0
USD -> 124.0E0
When I run the same query in the DB (not via code), the result set is fine and without the '0E0'.
How can I get only EUR -> 10000 and USD-> 124 without the '0E0'?
.0E0 is, I think, the exponent of the number. So 124.0E0 stands for 124.0 multiplied by ten raised to the power of 0 (written 124 x 10^0). Anything raised to the power of 0 is 1, so you've got 124 x 1, which, of course, is the right value.
(If it were, e.g., 124.5E3, this would mean 124500.)
This notation is more commonly used to work with large numbers, because 5436.7E20 is much more readable than 543670000000000000000000.
Without knowing your database background, I can only suppose that this notation arises from the conversion of the numeric field to a string (in resultSet.getString("total")). Therefore, you should ask yourself whether you really need the result as a string (or could just use .getFloat or similar, also changing your HashMap's value type). If so, you still have some possibilities:
Convert the value to a string later → e.g. String.valueOf(resultSet.getFloat("total"))
Truncate the .0E0 → e.g. resultSet.getString("total").replace(".0E0", "") (attention: this won't work if, for some reason, you get another suffix like .5E3; it will also cut off any positions after the decimal point)
Perhaps find a database, JDBC, or driver setting that suppresses the E notation.
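One more option, as a small untested sketch (it assumes your driver reports the SUM(amount) column as a SQL numeric type): read the value as a BigDecimal and render it yourself, since BigDecimal.toPlainString() never uses E notation.

// Inside extractData: fetch the aggregate as a number instead of a String.
// stripTrailingZeros() turns 10000.0 into 10000; toPlainString() avoids E notation.
java.math.BigDecimal total = resultSet.getBigDecimal("total");
mapRet.put(resultSet.getString("currency"), total.stripTrailingZeros().toPlainString());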

Dataflow GCS to BigQuery - How to output multiple rows per input?

Currently I am using the Google-provided gcs-text-to-bigquery template and feeding in a transform function to transform my JSONL file. The JSONL is pretty nested and I want to be able to output multiple rows per row of the newline-delimited JSON by doing some transforms.
For example:
{'state': 'FL', 'metropolitan_counties': [{'name': 'miami dade', 'population': 100000}, {'name': 'county2', 'population': 100000}, …], 'rural_counties': [{'name': 'county1', 'population': 100000}, {'name': 'county2', 'population': 100000}, …], 'total_state_pop': 10000000, …}
There will obviously be more counties than 2, and each state will have one of these lines. The output my boss wants is one row per county, with the state, county type (metropolitan or rural), county name, county population, and total state population as columns.
When I do the gcs-to-bq text transform, I end up getting only one line per state (so I'll get miami dade county from FL, and then whatever the first county is in my transform for the next state). I read a little bit and I think this is because of the mapping in the template that expects one output per JSON line. It seems I could do a ParDo (a DoFn? not sure what that is), or there is a similar option with beam.Map in Python. There is some business logic in the transforms (right now it's about 25 lines of code, as the JSON has more columns than I showed, but those are pretty simple).
Any suggestions on this? Data is coming in tonight/tomorrow, and there will be hundreds of thousands of rows in a BQ table.
The template I am using is currently in Java, but I can translate it to Python pretty easily, as there are a lot of examples online in Python. I know Python better, and I think it's easier given the different types (sometimes a field can be null), and the examples I saw look simpler and less daunting. However, I'm open to either.
Solving that in Python is somewhat straightforward. Here's one possibility (not fully tested):
from __future__ import absolute_import

import ast
import os

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account.json'

pipeline_args = [
    '--job_name=test'
]
pipeline_options = PipelineOptions(pipeline_args)


def jsonify(element):
    return ast.literal_eval(element)


def unnest(element):
    state = element.get('state')
    state_pop = element.get('total_state_pop')
    if state is None or state_pop is None:
        return
    for type_ in ['metropolitan_counties', 'rural_counties']:
        for e in element.get(type_, []):
            name = e.get('name')
            pop = e.get('population')
            county_type = (
                'Metropolitan' if type_ == 'metropolitan_counties' else 'Rural'
            )
            if name is None or pop is None:
                continue
            yield {
                'State': state,
                'County_Type': county_type,
                'County_Name': name,
                'County_Pop': pop,
                'State_Pop': state_pop
            }


with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    schema = 'State:STRING,County_Type:STRING,County_Name:STRING,County_Pop:INTEGER,State_Pop:INTEGER'
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    )
This will only succeed if you are working with batch data. If you have streaming data, change beam.io.Write(beam.io.BigQuerySink(...)) to beam.io.WriteToBigQuery.
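For the streaming case, the final pipeline step would look roughly like this (same hypothetical table name and schema string as above):

# beam.io.WriteToBigQuery replaces the Write(BigQuerySink(...)) step and
# works for both batch and streaming pipelines.
| 'Write to BQ' >> beam.io.WriteToBigQuery(
    'project_id:dataset_id.table_name', schema=schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)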

Comparing Date type in Oracle PACKAGE does not work

I made an Oracle package like the one below,
and I will pass a string parameter like '2014-11-05'.
--SEARCH 2014 11 04
FUNCTION SEARCHMYPAGE(v_created_after IN DATE, v_created_before IN DATE)
RETURN CURSORTYPE
IS
  rtn_cursor CURSORTYPE;
BEGIN
  OPEN rtn_cursor FOR
    SELECT news_id
    FROM (
      SELECT news_id,
             news_title, news_desc,
             created, news_cd
      FROM news
    )
    WHERE 1=1
      AND (created BETWEEN decode(v_created_after, '', to_date('2000-01-01', 'YYYY-MM-DD'), to_date(v_created_after, 'YYYY-MM-DD'))
           AND (decode(v_created_before, '', sysdate, to_date(v_created_before, 'YYYY-MM-DD')) + 0.999999));
  RETURN rtn_cursor;
END SEARCHMYPAGE;
I confirmed my parameter in the Eclipse console message, since I am working in the Eclipse IDE.
I have contents which were created between 2014-10-29 and 2014-10-31.
When I pass '2014-11-01' as created_after, it returns 0 records. (But I expected all contents, since every record was created between 10-29 and 10-31.)
Can you see anything wrong with my function?
Thanks :D
create function search_my_page(p_created_after in date, p_created_before in date)
return cursortype
is
  rtn_cursor cursortype;
begin
  open rtn_cursor for
    select news_id
    from news
    where created between
      nvl(p_created_after, date '1234-01-01')
      and
      nvl(p_created_before, sysdate) + interval '1' day - interval '1' second;
  return rtn_cursor;
end search_my_page;
/
Changes:
Re-wrote the predicates - there was a misplaced parenthesis changing the meaning.
Replaced to_date with date literals and variables. Since you're already using the ANSI date format, you might as well use literals. And date variables do not need to be cast to dates.
Replaced DECODE with the simpler NVL.
Removed extra parentheses.
Renamed v_ to p_. It's typical to use p_ to mean "parameter" and v_ for "(local) variable".
Removed the extra inline view. Inline views are normally underused, but in this case it doesn't seem to help much.
Removed unnecessary 1=1.
Replaced 0.99999 with date intervals, to make the math clearer.
Changed to lower case (this ain't COBOL), added underscores to function name.
Changed 2000-01-01 to 1234-01-01. If you use a magic value it should look unusual - don't try to hide it.
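For completeness, a hypothetical call of the rewritten function (this assumes cursortype is a REF CURSOR type visible at the call site, e.g. declared in your package spec; the dates are made up):

declare
  c cursortype;
begin
  -- date literals avoid the implicit string-to-date conversions
  -- that the original decode/to_date calls were working around
  c := search_my_page(date '2014-10-01', date '2014-11-05');
  -- fetch rows from c as usual, then:
  close c;
end;
/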

Auto-complete a tuple

I have a database table like this
Port Code | Country   | Port_Name
---------------------------------
1234      | Australia | port1
2345      | India     | Mumbai
2341      | Australia | port2
...
The table consists of around 12,000 entries. I need to auto-complete as the user enters the query, and the query can be a port code, a country, or a port name. For example, if the user's partial query is '12', the drop-down should display 1234 | Australia | port1. The problem I'm facing is that I'm querying the database on every keystroke, which makes the auto-complete really slow. Is there a way to optimize this?
In SmartGWT, use a ComboBoxItem and override its getPickListFilterCriteria like this:
ComboBoxItem portSelect = new ComboBoxItem("PORT_ATTRIB", "") {
    @Override
    public Criteria getPickListFilterCriteria() {
        Criteria criteria = null;
        if (getValue() != null && getValue() instanceof String) {
            criteria = new AdvancedCriteria(OperatorId.AND, new Criterion[] {
                new Criterion("portValue", OperatorId.EQUALS, getDisplayValue())
            });
        }
        return criteria;
    }
};
Every key press will give you a criteria which you can pass to your query. The query will be something like select * from port where portName like '<criteria>%' or portCode like '<criteria>%'.
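On the server side, a minimal JDBC sketch of that prefix lookup (the port table and column names are assumptions based on the question; connection and prefix come from your own code). Binding the prefix as a parameter, rather than concatenating it into the SQL, also avoids SQL injection from the user's partial input:

// Prefix search over port name and code, one query per keystroke.
String sql = "select port_code, country, port_name from port "
           + "where port_name like ? or port_code like ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    // match rows whose name or code starts with the typed prefix
    ps.setString(1, prefix + "%");
    ps.setString(2, prefix + "%");
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // populate the drop-down from rs
        }
    }
}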
You could do this with Lucene and a RAMDirectory. Build an index on your data and implement a data lookup service that checks from time to time whether changes occurred in the database, or push any other update from your database into your Lucene index. See Lucene for indexing your DB, and for querying use the MultiFieldQueryParser.
Is your database indexed correctly? Lookups on indexed columns should be pretty fast - 12k rows is not a great deal for any relational DB.
Another thing I could suggest is to load the table data into an in-memory table. I've done this in MySQL a long time back: http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html . This helps especially if the data does not change very frequently - a one-time load of the data into an in-memory table is quick, and after that all queries are executed against the in-memory table and are amazingly fast.
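A sketch of that idea in MySQL (the port table and column names are assumptions based on the question; note the MEMORY engine holds the copy in RAM and does not survive a server restart):

-- one-time (or periodic) load of the lookup data into a RAM-backed copy
CREATE TABLE port_mem ENGINE=MEMORY
  AS SELECT port_code, country, port_name FROM port;

-- autocomplete queries then hit the in-memory copy, e.g. for the prefix '12':
SELECT port_code, country, port_name
FROM port_mem
WHERE port_name LIKE '12%' OR port_code LIKE '12%';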
