Java/Spring: connect to several BigQuery datasets

Everything is mostly in the title.
I have an API already connected to a BigQuery dataset, which is queried regularly. Soon, my process will require new data stored in another BigQuery dataset.
So I started checking whether it's possible to connect one Spring API to two different BQ datasets. Unless I missed it, I didn't find any information for this specific case in the BQ documentation.
As the API is already connected, the spring.cloud.gcp.bigquery.* property values are already defined. As such, I can't use those properties to define the new connection.
So, is it possible to connect one API to several BigQuery datasets? If so, how can I do that with the properties files?

Could you not achieve what you are after by doing something like the following, i.e. building one BigQuery client per project/dataset:
public BigQuery bigQuery1() {
    return BigQueryOptions.newBuilder()
        .setCredentials(
            ServiceAccountCredentials.fromStream(
                new FileInputStream("your_json_service_keyfile.json")))
        .setProjectId("bigQuery1")
        .build()
        .getService();
}

public BigQuery bigQuery2() {
    return BigQueryOptions.newBuilder()
        .setCredentials(
            ServiceAccountCredentials.fromStream(
                new FileInputStream("your_json_service_keyfile.json")))
        .setProjectId("bigQuery2")
        .build()
        .getService();
}
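To keep the two connections configurable from the properties files rather than hard-coded, one option is to expose both clients as Spring beans driven by custom properties of your own. A minimal sketch, assuming hypothetical property keys (bq.primary.*, bq.secondary.*) and key-file paths of your choosing; this is not an official spring.cloud.gcp.bigquery.* feature:
// application.properties (custom keys, assumed names):
//   bq.primary.project-id=my-first-project
//   bq.primary.credentials=/path/first-key.json
//   bq.secondary.project-id=my-second-project
//   bq.secondary.credentials=/path/second-key.json

import java.io.FileInputStream;
import java.io.IOException;

import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BigQueryClientsConfig {

    @Bean(name = "primaryBigQuery")
    public BigQuery primaryBigQuery(
            @Value("${bq.primary.project-id}") String projectId,
            @Value("${bq.primary.credentials}") String keyFile) throws IOException {
        return BigQueryOptions.newBuilder()
                .setProjectId(projectId)
                .setCredentials(ServiceAccountCredentials.fromStream(new FileInputStream(keyFile)))
                .build()
                .getService();
    }

    @Bean(name = "secondaryBigQuery")
    public BigQuery secondaryBigQuery(
            @Value("${bq.secondary.project-id}") String projectId,
            @Value("${bq.secondary.credentials}") String keyFile) throws IOException {
        return BigQueryOptions.newBuilder()
                .setProjectId(projectId)
                .setCredentials(ServiceAccountCredentials.fromStream(new FileInputStream(keyFile)))
                .build()
                .getService();
    }
}
You can then inject the client you need with @Qualifier("primaryBigQuery") or @Qualifier("secondaryBigQuery"). Also note that a dataset is addressed at query time (project.dataset.table), so a single BigQuery client can already query several datasets in the same project; separate clients are mainly needed for different projects or credentials.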

Related

What is the default Provisioned Throughput(read and write capacity unit) for locally running DynamoDb?

I'm new to DynamoDB and trying to do a bulk insert of around 5.5k items into my locally running DynamoDB using Java. Despite my best efforts and various tweaks, I am only able to do this in around 100 seconds (even after using the executor framework).
I posted my code here, but didn't get an answer.
To improve the insertion rate I tried changing the Provisioned Throughput value several times while creating the table, then I got to know that when running locally, DynamoDB ignores the throughput values. So I think it's my DynamoDB that is not able to handle so many write requests at a time, and when I do it on the AWS server, the performance might improve.
This is the code I was running to create the table:
import java.util.Arrays;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public static void main(String[] args) throws Exception {
    // Point the SDK at the local DynamoDB endpoint
    AmazonDynamoDBClient client = new AmazonDynamoDBClient().withEndpoint("http://localhost:8000");
    DynamoDB dynamoDB = new DynamoDB(client);
    String tableName = "Database";
    try {
        System.out.println("Creating the table, wait...");
        Table table = dynamoDB.createTable(tableName,
                Arrays.asList(new KeySchemaElement("Type", KeyType.HASH),
                        new KeySchemaElement("ID", KeyType.RANGE)),
                Arrays.asList(new AttributeDefinition("Type", ScalarAttributeType.S),
                        new AttributeDefinition("ID", ScalarAttributeType.S)),
                new ProvisionedThroughput(10000L, 10000L));
        table.waitForActive();
        System.out.println("Table created successfully. Status: " + table.getDescription().getTableStatus());
    } catch (Exception e) {
        System.err.println("Cannot create the table: ");
        System.err.println(e.getMessage());
    }
}
But to be sure, I want to know: what are the default read and write capacity units of a locally running DynamoDB instance?
DynamoDBLocal does not implement the throughput limitations in any way. If your throughput is limited, it is limited by the hardware you are running it on.
From the DynamoDB Local docs:
The speed of read and write operations on table data is limited only by the speed of your computer.
According to this answer to a related question, the performance is noticeably poor because DynamoDB Local uses SQLite behind the scenes. Since the implementation is different from the real DynamoDB, we should expect that the performance will be different as well.
If you need to do any performance testing with DynamoDB, you should use the real DynamoDB. My company has used DynamoDB for applications with tens of thousands of reads and writes per second without any scaling problems from DynamoDB, so I can attest that the real DynamoDB will perform much better than DynamoDB Local.
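Separately from the throughput question, the bulk insert itself is usually much faster when the writes are batched instead of put one by one. A rough sketch using the Document API's batchWriteItem, assuming the same Type/ID schema as the table above (bulkInsert is a hypothetical helper and retry handling is simplified):
import java.util.List;

import com.amazonaws.services.dynamodbv2.document.BatchWriteItemOutcome;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.TableWriteItems;

static void bulkInsert(DynamoDB dynamoDB, List<Item> items) {
    // BatchWriteItem accepts at most 25 items per request, so write in chunks.
    for (int i = 0; i < items.size(); i += 25) {
        List<Item> chunk = items.subList(i, Math.min(i + 25, items.size()));
        TableWriteItems writeItems = new TableWriteItems("Database").withItemsToPut(chunk);
        BatchWriteItemOutcome outcome = dynamoDB.batchWriteItem(writeItems);
        // Items that were not written (e.g. due to throttling) come back in
        // getUnprocessedItems() and should be retried with backoff in real code.
        while (!outcome.getUnprocessedItems().isEmpty()) {
            outcome = dynamoDB.batchWriteItemUnprocessed(outcome.getUnprocessedItems());
        }
    }
}
Against the real DynamoDB service you would additionally need enough provisioned (or on-demand) capacity for the batches to go through without heavy throttling.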

Proper approach to adding context from an external source to records in Kafka Streams

I have records that are processed with Kafka Streams (using the Processor API). Let's say a record has city_id and some other fields.
In the Kafka Streams app I want to add the current temperature in the target city to the record.
Temperature<->City pairs are stored in e.g. Postgres.
In a Java application I'm able to connect to Postgres using JDBC and build a new HashMap<CityId, Temperature>, so I'm able to look up the temperature based on city_id. Something like tempHM.get(record.city_id).
There are several questions about how to best approach it:
Where should I initialize the context data?
Originally, I have been doing it within AbstractProcessor::init(), but that seems wrong, as it's initialized for each thread and also reinitialized on rebalance.
So I moved it to before the topology is built, and the processors are built with it. The data is fetched only once, independently of all processor instances.
Is that a proper and valid approach? It works, but...
HashMap<CityId, Temperature> tempHM = new HashMap<CityId, Temperature>();
// Connect to DB and initialize tempHM here

Topology topology = new Topology();
topology
    .addSource(SOURCE, stringDeserializer, protoDeserializer, "topic-in")
    .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(tempHM), SOURCE)
    .addSink(SINK, "topic-out", stringSerializer, protoSerializer, TemperatureAppender.NAME);
How do I refresh the context data?
I would like to refresh the temperature data every 15 minutes, for example. I was thinking of using a container around the HashMap instead of a plain HashMap, which would handle it:
abstract class ContextContainer<T> {
    T context;
    Date lastRefreshAt;

    ContextContainer(Date now) {
        refresh(now);
    }

    abstract void refresh(Date now);

    abstract Duration getRefreshInterval();

    T get() {
        return context;
    }

    boolean isDueToRefresh(Date now) {
        return lastRefreshAt == null
            || lastRefreshAt.getTime() + getRefreshInterval().toMillis() < now.getTime();
    }
}

final class CityTemperatureContextContainer extends ContextContainer<HashMap<CityId, Temperature>> {
    CityTemperatureContextContainer(Date now) {
        super(now);
    }

    void refresh(Date now) {
        if (!isDueToRefresh(now)) {
            return;
        }
        HashMap<CityId, Temperature> context = new HashMap<>();
        // Connect to DB, fetch the data and fill the map here
        lastRefreshAt = now;
        this.context = context;
    }

    Duration getRefreshInterval() {
        return Duration.ofMinutes(15);
    }
}
This is a brief concept written in the SO textarea, so it might contain some syntax errors, but I hope the point is clear.
Then I pass it into the processor like .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(cityTemperatureContextContainer), SOURCE)
And in the processor I do:
public void init(final ProcessorContext context) {
    context.schedule(
        Duration.ofMinutes(1),
        PunctuationType.STREAM_TIME,
        timestamp -> {
            cityTemperatureContextContainer.refresh(new Date(timestamp));
            tempHM = cityTemperatureContextContainer.get();
        }
    );
    super.init(context);
}
Is there a better way? The main question is about finding the proper concept; I can implement it myself once I have that. There aren't many resources on the topic out there, though.
In the Kafka Streams app I want to add the current temperature in the target city to the record. Temperature<->City pairs are stored in e.g. Postgres.
In a Java application I'm able to connect to Postgres using JDBC and build a new HashMap<CityId, Temperature>, so I'm able to look up the temperature based on city_id. Something like tempHM.get(record.city_id).
A better alternative would be to use Kafka Connect to ingest your data from Postgres into a Kafka topic, read this topic into a KTable in your application with Kafka Streams, and then join this KTable with your other stream (the stream of records "with city_id and some other fields"). That is, you will be doing a KStream-to-KTable join.
Think of it like this:
Architecture view:
DB (here: Postgres) --Kafka Connect--> Kafka --> Kafka Streams application
Data view:
Postgres table ----------------------> topic --> KTable
Example connectors for your use case are https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc and https://www.confluent.io/hub/debezium/debezium-connector-postgresql.
One of the advantages of the Kafka Connect based setup above is that you no longer need to talk directly from your Java application (which uses Kafka Streams) to your Postgres DB.
Another advantage is that you don't need to do "batch refreshes" of your context data (you mentioned every 15 minutes) from your DB into your Java application, because the application would get the latest DB changes in real-time automatically via the DB->KConnect->Kafka->KStreams-app flow.
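For illustration, here is a minimal sketch of the join side with the Streams DSL; the topic names, the CityRecord type and its getCityId()/withTemperature() methods are placeholders, serdes are omitted, and both topics must be keyed by city_id and co-partitioned:
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Table of the latest temperature per city, fed by Kafka Connect from Postgres,
// keyed by city_id.
KTable<String, Double> temperatures = builder.table("city-temperatures");

// The stream of records with city_id and some other fields.
KStream<String, CityRecord> records = builder.stream("topic-in");

records
    // Re-key the stream by city_id so it can be joined with the KTable.
    .selectKey((key, record) -> record.getCityId())
    // Non-windowed KStream-KTable join: each record is enriched with the
    // latest known temperature for its city.
    .join(temperatures, (record, temperature) -> record.withTemperature(temperature))
    .to("topic-out");
Compared to the punctuator-based refresh, the KTable is updated continuously as changes arrive from Kafka Connect, and Kafka Streams keeps the lookup state local to the application.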

Apache Beam Dataflow BigQuery

How can I get the list of tables from a Google BigQuery dataset using Apache Beam with the DataflowRunner?
I can't find how to get the tables from a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.
# Declare the library
from google.cloud import bigquery

# Prepare a BigQuery client
client = bigquery.Client(project='your_project_name')

# Prepare a reference to the dataset
dataset_ref = client.dataset('your_data_set_name')

# Make the API request
tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))
Reference:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets
You can try using the google-cloud-examples Maven repo. There's a class by the name of BigQuerySnippets that makes an API call to get the table metadata, and you can fetch the schema from it. Please note that the API quota limit is a maximum of 6 concurrent requests per second.
The purpose of Dataflow is to create pipelines, so the ability to make such API requests is not included. You have to use the BigQuery Java Client Library to get the list of tables and then provide it to your Apache Beam pipeline (a sketch of that follows the snippet below).
// 'bigquery' is a com.google.cloud.bigquery.BigQuery client
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
    // do something with each table, e.g. collect table.getTableId()
}
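Once you have the table list, a rough sketch of the US-to-EU copy with BigQueryIO might look like the following. The dataset and project names are placeholders, the destination tables are assumed to already exist with matching schemas, and the pipeline's temp/staging locations must be in compatible regions:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Options carry the DataflowRunner, project, region, tempLocation, etc.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);

// tableNames is the list collected with bigquery.listTables(...) above
for (String tableName : tableNames) {
    pipeline
        .apply("Read " + tableName,
            BigQueryIO.readTableRows().from("my-project:us_dataset." + tableName))
        .apply("Write " + tableName,
            BigQueryIO.writeTableRows()
                .to("my-project:eu_dataset." + tableName)
                // Destination tables assumed to exist with the same schema;
                // otherwise supply .withSchema(...) and CREATE_IF_NEEDED.
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
}

pipeline.run().waitUntilFinish();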

How can I append timestamp to rdd and push to elasticsearch

I am new to Spark Streaming and Elasticsearch. I am trying to read data from a Kafka topic using Spark and store the data as an RDD. In the RDD I want to append a timestamp as soon as new data comes in, and then push it to Elasticsearch.
lines.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // rdd.collect().forEach(System.out::println);
        String timeStamp = new SimpleDateFormat("yyyy::MM::dd::HH::mm::ss").format(new Date());
        List<String> myList = new ArrayList<String>(Arrays.asList(timeStamp.split("\\s+")));
        List<String> f = rdd.collect();
        Map<List<String>, ?> rddMaps = ImmutableMap.of(f, 1);
        Map<List<String>, ?> myListrdd = ImmutableMap.of(myList, 1);
        JavaRDD<Map<List<String>, ?>> javaRDD = sc.parallelize(ImmutableList.of(rddMaps));
        JavaEsSpark.saveToEs(javaRDD, "sample/docs");
    }
});
Spark?
As far as I understand, Spark Streaming is meant for real-time streaming computation, like map, reduce, join and window operations. There seems to be no need to use such a powerful tool when all we need is to add a timestamp to each event.
Logstash?
If that is the situation, Logstash may be more suitable for our case.
Logstash records the timestamp when an event arrives, and it also has persistent queues and dead letter queues that ensure data resiliency. It has native support for pushing data to Elasticsearch (after all, they belong to the same family of products), which makes pushing data to it very easy:
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "logstash-%{type}-%{+YYYY.MM.dd}"
    }
}
More
For more about Logstash, see its introduction and a sample Logstash config file in the documentation.
Hope this is helpful.
Ref
Deploying and Scaling Logstash
If all you're using Spark Streaming for is getting the data from Kafka to Elasticsearch, a neater way (and one that needs no coding) would be to use Kafka Connect.
There is an Elasticsearch Kafka Connect sink. Depending on what you want to do with the timestamp (e.g. use it for index routing, or add it as a field), you can use Single Message Transforms (there's an example of them here).
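For illustration, a standalone sink configuration with a Single Message Transform that adds the Kafka record timestamp as a document field could look roughly like this (topic, field and connector names are assumptions, and the exact settings depend on your Elasticsearch sink connector version):
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=your-kafka-topic
connection.url=http://localhost:9200
type.name=_doc
key.ignore=true
schema.ignore=true

# Single Message Transform: add the Kafka record timestamp as a field on each document
transforms=addTimestamp
transforms.addTimestamp.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addTimestamp.timestamp.field=ingest_timestamp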

How to improve BigQuery read performance

We're using BigQuery to retrieve the full content of a big table. We're using the publicly available publicdata:samples.natality table.
Our code follows the Google instructions as described in their Java API doc.
We're able to retrieve this table at around 1,300 rows/sec, which is amazingly slow. Is there a faster way to retrieve the full result of a query, or is this as fast as it gets?
The recommended way to retrieve a large amount of data from a BigQuery table is not to use tabledata.list to page through the full table, as that example does. That example is optimized for reading a small number of rows from the results of a query.
Instead, you should run an extract job that exports the entire content of the table to Google Cloud Storage, from which you can then download the full content.
https://cloud.google.com/bigquery/exporting-data-from-bigquery
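A minimal sketch of such an extract job with the Java client library, assuming you have read access to the table and your own GCS bucket in a compatible location (the bucket name and format are placeholders; the * wildcard lets BigQuery shard large exports across multiple files):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

// Export the natality sample table to your own GCS bucket as sharded CSV files.
Table table = bigquery.getTable(TableId.of("publicdata", "samples", "natality"));
Job extractJob = table.extract("CSV", "gs://your-bucket/natality-*.csv");
extractJob = extractJob.waitFor();
if (extractJob.getStatus().getError() != null) {
    throw new RuntimeException(extractJob.getStatus().getError().toString());
}
// The exported files can now be downloaded from GCS, e.g. with gsutil or the Storage client.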
To download a table fast you can use the Google BigQuery Storage API client for Java.
It lets you download the table in an efficient binary format such as Avro or Arrow.
Using the basic Arrow example in the documentation, I managed to download ~1 million rows per second.
I think you can use it to download a query result as well, by writing the result into a temporary table.
The code to get the temporary table of the result looks like this:
public static TableId getTemporaryTable(String query) throws InterruptedException {
    // 'bigquery' is a BigQuery client, e.g. BigQueryOptions.getDefaultInstance().getService()
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(query)
            .setUseLegacySql(false)
            .build();
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).build());
    queryJob = queryJob.waitFor(); // Wait for the query to complete.
    return ((QueryJobConfiguration) queryJob.getConfiguration()).getDestinationTable();
}
References:
Google cloud documentation
GitHub repository
