Flink Table API not able to convert DataSet to DataStream - java

I am using the Flink Table API with Java and I want to convert a DataSet to a DataStream. Following is my code:
TableEnvironment tableEnvironment = new TableEnvironment();
Table tab1 = table.where("related_value < 2014").select("related_value, ref_id");
DataSet<MyClass> ds2 = tableEnvironment.toDataSet(tab1, MyClass.class);
DataStream<MyClass> d = tableEnvironment.toDataStream(tab1, MyClass.class);
But when I try to execute this program, it throws the following exception:
org.apache.flink.api.table.ExpressionException: Invalid Root for JavaStreamingTranslator: Root(ArraySeq((related_value,Double), (ref_id,String))). Did you try converting a Table based on a DataSet to a DataStream or vice-versa?
I want to know how we can convert a DataSet to a DataStream using the Flink Table API.
Another thing I want to know: for pattern matching there is the Flink CEP library available, but is it feasible to use the Flink Table API for pattern matching?

Flink's Table API was not designed to convert a DataSet into a DataStream and vice versa. It is not possible to do that with the Table API and there is also no other way to do it with Flink at the moment.
Unifying the DataStream and DataSet APIs (handling batch processing as a special case of streaming, i.e., as bounded streams) is on the long-term roadmap of Flink.

You cannot convert to the DataStream API when using TableEnvironment; you must create a StreamTableEnvironment to convert from a Table to a DataStream, something like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final EnvironmentSettings fsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
final StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(env, fsSettings);
DataStream<MyClass> finalRes = fsTableEnv.toAppendStream(tableNameHere, MyClass.class);
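For context, a fuller sketch tying this back to the original question could look like the following. This is an illustration, not a verified answer: it assumes a serializable POJO MyClass with fields related_value and ref_id plus a matching constructor, and a Flink version (around 1.11) where the string expression syntax and toAppendStream are still available:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env,
        EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

// Turn an existing DataStream<MyClass> into a Table, filter it, and convert it back.
DataStream<MyClass> input = env.fromElements(new MyClass(2010.0, "a"), new MyClass(2020.0, "b"));
Table tab1 = tEnv.fromDataStream(input)
        .where("related_value < 2014")
        .select("related_value, ref_id");
DataStream<MyClass> result = tEnv.toAppendStream(tab1, MyClass.class);

result.print();
env.execute();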
Hope this helps somehow.
Kind regards!

Related

Apache Beam Dataflow BigQuery

How can I get the list of tables from a Google BigQuery dataset using Apache Beam with the DataflowRunner?
I can't find a way to get the tables from a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.
# Declare the library
from google.cloud import bigquery

# Prepare a BigQuery client
client = bigquery.Client(project='your_project_name')

# Prepare a reference to the dataset
dataset_ref = client.dataset('your_data_set_name')

# Make the API request
tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))
Reference:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets
You can try using the google-cloud-examples Maven repo. There's a class by the name of BigQuerySnippets that makes an API call to get the table metadata, and you can fetch the schema from it. Please note that the API quota limit is a maximum of 6 concurrent requests per second.
The purpose of Dataflow is to create pipelines, so the ability to make such API requests is not included. You have to use the BigQuery Java Client Library to get the data and then provide it to your Apache Beam pipeline.
DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
    // do something
}
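Building on that, here is a rough sketch (an assumption, not a verified recipe) of how the listed table names could be fed into a Beam pipeline that copies each table from the US dataset to the EU dataset. It reuses bigquery and tables from the snippet above; projectId, usDataset, euDataset and options are placeholder names, and the destination tables are assumed to already exist:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

Pipeline p = Pipeline.create(options);
for (Table table : tables.iterateAll()) {
    String tableName = table.getTableId().getTable();
    p.apply("Read " + tableName,
            BigQueryIO.readTableRows().from(projectId + ":" + usDataset + "." + tableName))
     .apply("Write " + tableName,
            BigQueryIO.writeTableRows()
                .to(projectId + ":" + euDataset + "." + tableName)
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
}
p.run().waitUntilFinish();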

How can I append a timestamp to an RDD and push it to Elasticsearch

I am new to Spark Streaming and Elasticsearch. I am trying to read data from a Kafka topic using Spark and storing the data as an RDD. I want to append a timestamp to the RDD as soon as new data comes in, and then push it to Elasticsearch.
lines.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // rdd.collect().forEach(System.out::println);
        String timeStamp = new SimpleDateFormat("yyyy::MM::dd::HH::mm::ss").format(new Date());
        List<String> myList = new ArrayList<String>(Arrays.asList(timeStamp.split("\\s+")));
        List<String> f = rdd.collect();
        Map<List<String>, ?> rddMaps = ImmutableMap.of(f, 1);
        Map<List<String>, ?> myListrdd = ImmutableMap.of(myList, 1);
        JavaRDD<Map<List<String>, ?>> javaRDD = sc.parallelize(ImmutableList.of(rddMaps));
        JavaEsSpark.saveToEs(javaRDD, "sample/docs");
    }
});
Spark?
As far as I understand, Spark Streaming is for real-time streaming data computation, like map, reduce, join and window. There seems to be no need to use such a powerful tool when all we need is to add a timestamp to each event.
Logstash?
If this is the situation, Logstash may be more suitable for our case.
Logstash records the timestamp when an event arrives, and it also has a persistent queue and dead letter queues that ensure data resiliency. It has native support for pushing data to Elasticsearch (after all, they belong to the same family of products), which makes it very easy to push data there.
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
  }
}
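For the input side (reading from the Kafka topic mentioned in the question), a minimal sketch of the corresponding input block might look like this; the bootstrap server and topic name are placeholders:
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["your_topic"]
  }
}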
More
For more about Logstash, here is an introduction.
Here is a sample Logstash config file.
Hope this is helpful.
Ref
Deploying and Scaling Logstash
If all you're using Spark Streaming for is getting the data from Kafka to Elasticsearch, a neater way that does not need any coding would be to use Kafka Connect.
There is an Elasticsearch Kafka Connect sink. Depending on what you want to do with a timestamp (e.g. use it for index routing, or add it as a field) you can use Single Message Transforms (there's an example of them here).
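As a rough sketch (assuming the Confluent Elasticsearch sink connector; the topic name and field name below are placeholders), a connector configuration that also stamps each record with its Kafka timestamp via the InsertField transform could look like this:
name=es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=your_topic
connection.url=http://localhost:9200
type.name=_doc
key.ignore=true
schema.ignore=true
# Single Message Transform: add the record's Kafka timestamp as a field
transforms=addTs
transforms.addTs.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addTs.timestamp.field=ingest_ts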

Where clause in Phoenix integration with Spark

I am trying to read some data from Phoenix into Spark using its Spark integration:
String connectionString="jdbc:phoenix:auper01-01-20-01-0.prod.vroc.com.au,auper01-02-10-01-0.prod.vroc.com.au,auper01-02-10-02-0.prod.vroc.com.au:2181:/hbase-unsecure";
Map<String, String> options2 = new HashMap<String, String>();
options2.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
//options2.put("dbtable", url);
options2.put("table", "VROC_SENSORDATA_3");
options2.put("zkUrl", connectionString);
DataFrame phoenixFrame2 = this.hc.read().format("org.apache.phoenix.spark")
        .options(options2)
        .load();
System.out.println("The phoenix table is:");
phoenixFrame2.printSchema();
phoenixFrame2.show(20, false);
But I need to do a select with a where clause. I also used the dbtable option, which is used for a JDBC connection in Spark, but I guess it doesn't have any effect!
Based on the documentation
"In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering."
But there seems to be no way to filter the parallel read from Phoenix. It would be really inefficient to read the whole table into a Spark DataFrame and then do the filtering, yet I can't find a way to apply a where clause. Does anyone know how to apply a where clause in the code above?
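For what it's worth, the pushdown predicates the documentation mentions are usually expressed as a DataFrame filter rather than as a separate option; a sketch of that (untested here, with a hypothetical column name and value) would be:
DataFrame phoenixFrame2 = this.hc.read().format("org.apache.phoenix.spark")
        .options(options2)
        .load()
        .filter("SENSOR_ID = 'some_sensor'");   // hypothetical column and value

phoenixFrame2.show(20, false);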

Spark converting a Dataset to RDD

I have a Dataset[String] and need to convert it to an RDD[String]. How?
Note: I've recently migrated from Spark 1.6 to Spark 2.0. Some of my clients were expecting an RDD, but now Spark gives me a Dataset.
As stated in the Scala API documentation, you can call .rdd on your Dataset:
val myRdd : RDD[String] = ds.rdd
A Dataset is a strongly typed DataFrame, so both Dataset and DataFrame can use .rdd to convert to an RDD.
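If you happen to be on the Java API instead, the equivalent is toJavaRDD(); a small sketch, assuming an existing SparkSession named spark:
Dataset<String> ds = spark.createDataset(Arrays.asList("a", "b"), Encoders.STRING());
JavaRDD<String> rdd = ds.toJavaRDD();   // same result as ds.javaRDD()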

BSON Message To Map in JAVA

We are currently sending messages to a Redis queue, which is being picked up by our Java application.
Does anyone have an idea how to convert the BSON message to a Map in Java?
Here is an example MSG in BSON we pop from the Redis queue:
\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00
You can use the MongoDB Java driver.
Parse your BSON data just like this (buf is a ByteBuffer holding the raw BSON bytes):
RawDBObject obj = new RawDBObject(buf);
obj.toMap();
Done.
https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/RawDBObject.java
Or the official BSON site may help:
http://bsonspec.org/#/implementation
You can use a BSON parser to parse your BSON input. Google gives me bson4jackson but I have never tried it myself.
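To make the driver-based approach above concrete, here is a minimal sketch using BasicBSONDecoder from the org.bson package that ships with the MongoDB Java driver; the byte array is the example document from the question, which decodes to {"hello": "world"}:
import java.util.Map;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;

public class BsonToMap {
    public static void main(String[] args) {
        // \x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00
        byte[] data = new byte[] {
            0x16, 0x00, 0x00, 0x00,            // total document length (22 bytes)
            0x02,                              // element type 0x02 = UTF-8 string
            'h', 'e', 'l', 'l', 'o', 0x00,     // field name "hello"
            0x06, 0x00, 0x00, 0x00,            // string length including trailing NUL
            'w', 'o', 'r', 'l', 'd', 0x00,     // field value "world"
            0x00                               // end of document
        };
        BSONObject obj = new BasicBSONDecoder().readObject(data);
        Map<?, ?> map = obj.toMap();           // {hello=world}
        System.out.println(map);
    }
}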
