How to change dataflow job graph during runtime with arguments? - java

I am using Dataflow to read data from a JDBC table and load the results into a BigQuery table. There is one parameter, "flag", that I want to pass at runtime; if the flag is set to True, the results should also be loaded into an additional BigQuery table.
To summarise:
If the flag is set to False - read table A from JDBC, write to table A in BigQuery.
If the flag is set to True - read table A from JDBC, write to table A as well as table B in BigQuery.
Please refer to the sample code of my pipeline:
public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    ValueProvider<String> gcsFlag = options.getGcsFlag();

    PCollection<TableRow> inputData = pipeline.apply("Reading JDBC Table",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
                .create(options.getDriverClassName(), options.getJdbcUrl())
                .withUsername(options.getUsername()).withPassword(options.getPassword()))
            .withQuery(options.getSqlQuery())
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(new CustomRowMapper()));

    inputData.apply(
        "Write to BigQuery Table 1",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
            .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
            .to(options.getOutputTable()));

    if (gcsFlag.get().equals("TRUE")) {
        inputData.apply(
            "Write to BigQuery Table 2",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable2()));
    }

    pipeline.run();
}
The challenge I am facing is that I have to supply the flag while compiling and creating the Dataflow template: the job graph is constructed only at template-creation time, so I am not able to reuse the same template for the other case.
Is there a way to pass the ValueProvider<String> flag at runtime so that the job graph is effectively decided at runtime? That would let me reuse the same template for both cases. Similarly, I also want to provide the SQL query (options.getSqlQuery()) at runtime, so that I can use the same template for all the tables I want to read from the source.
Any help is appreciated.

Once you create the DAG it cannot change at runtime.
Still, there is a way to address your problem: have a look at the Beam Partition pattern:
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
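If Partition alone does not quite fit (you always want table A and only sometimes table B), a closely related trick is to keep both write branches in the graph and gate the second one with a DoFn that reads the ValueProvider at execution time instead of at template-creation time. The sketch below is only an illustration against the code in the question; GateByFlagFn is a made-up helper name, and it assumes the same MyOptions getters used above.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Hypothetical helper: only emits rows when the runtime flag is TRUE.
// Calling flag.get() is fine here because @ProcessElement runs at execution time.
class GateByFlagFn extends DoFn<TableRow, TableRow> {
    private final ValueProvider<String> flag;

    GateByFlagFn(ValueProvider<String> flag) {
        this.flag = flag;
    }

    @ProcessElement
    public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
        if ("TRUE".equalsIgnoreCase(flag.get())) {
            out.output(row);
        }
    }
}

// In main(), replace the compile-time if-block with an unconditional branch:
inputData
    .apply("Gate on runtime flag", ParDo.of(new GateByFlagFn(options.getGcsFlag())))
    .apply("Write to BigQuery Table 2",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
            .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
            .to(options.getOutputTable2()));
Note that the second write stage still exists in the graph, so depending on the write method it may still issue an (empty) load job even when the gate emits nothing. For the query, JdbcIO.Read also has a withQuery(ValueProvider<String>) overload as far as I recall, so getSqlQuery can be declared as a ValueProvider in MyOptions and supplied per launch.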

Related

GCP - Bigquery to Kafka as streaming

I have a Dataflow application (Java) running in GCP that is able to read data from a BigQuery table and write it to Kafka. However, the application runs in batch mode, whereas I would like it to run as a stream, reading data continuously from the BigQuery table and writing to the Kafka topic.
BigQuery table: a partitioned table with an insert_time column (timestamp of when the record was inserted into the table) and a message column.
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
        "select message, processed from `myprojectid.mydatasetname.mytablename` "
            + "where processed = false "
            + "order by insert_time desc ")
    .apply("Windowing", Window.into(FixedWindows.of(Duration.standardMinutes(1))));

tablesRows
    .apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
    .apply("Writing Messages", KafkaIO.<String, String>write()
        .withBootstrapServers(bootStrapURLs)
        .withTopic(options.getKafkaInputTopics())
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class)
        .withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected)));

pipeline.run();
Note: I have tried the options below but no luck yet.
Option 1: I tried options.streaming(true); it runs as a stream but finishes after the first successful write.
Option 2: Applied a trigger:
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3: Forcibly making the collection unbounded:
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.
Some of the advice in Side Input Patterns in the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that article discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.
This could allow your one time query to become a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(Sum.longsGlobally().withoutDefaults())
        .apply(
            ParDo.of(
                new DoFn<Long, Map<String, String>>() {
                  @ProcessElement
                  public void process(
                      @Element Long input,
                      @Timestamp Instant timestamp,
                      OutputReceiver<Map<String, String>> o) {
                    // Read from BigQuery here and for each row output a record, e.g.:
                    // o.output(PlaceholderExternalService.readTestData(timestamp));
                  }
                }))
        .apply(
            Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.

Apache Beam Streaming unable to write to BigQuery column-based partition

I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I explored a couple of patterns, but none of them succeeded; a few of them are below.
Using SerializableFunction to determine TableDestination
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s", project, dataset, table);
    return new TableDestination(dest, null, timePartitioning);
}
I also tried to format the partition column obtained from the input and add it to the table spec with a $ decorator, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    // input.get("processingDate") ... convert to a string in MMddYYYY format
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
    return new TableDestination(dest, null, timePartitioning);
}
However, none of these succeed; they fail with errors such as:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the defined date column and not on processing time.
Thank you!
I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
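Putting that together, a minimal sketch of the write (reusing TableSchemaFactory, project, dataset and table from the question) declares the partitioning on the TableDestination but leaves the table spec undecorated, so BigQuery routes each row by the column value:
BigQueryIO.writeTableRows()
    .withSchema(TableSchemaFactory.buildLineageSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
        @Override
        public TableDestination apply(ValueInSingleWindow<TableRow> input) {
            TimePartitioning timePartitioning = new TimePartitioning().setField("processingdate");
            // No "$YYYYMMDD" decorator: when streaming into a column-partitioned table,
            // target the base table and let BigQuery pick the partition from the column.
            String dest = String.format("%s.%s.%s", project, dataset, table);
            return new TableDestination(dest, null, timePartitioning);
        }
    });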

BigQueryIO read get TableSchema

What I want to do is read an existing table and generate a new table that has the same schema as the original table plus a few extra columns (computed from some columns of the original table). The original table's schema may gain columns without notice to me (the fields I am using in my Dataflow job won't change), so I would like to always read the schema instead of defining some custom class that contains it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
.getDatasetService(options)
.getTable(projectId, dataset, table)
.getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses in Get TableSchema from BigQuery result PCollection<TableRow> but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?
Due to how BigQueryIO is set up now, it needs to query the table schema before the pipeline begins to run. This is a good feature idea, but it's not feasible within a single pipeline. In the example you linked, the table schema is queried before the pipeline runs.
If new columns are added, then unfortunately a new pipeline must be relaunched.
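If relaunching is acceptable, one way to fetch the schema at graph-construction time in SDK 2.x, without touching the package-private BigQueryServicesImpl, is the standalone google-cloud-bigquery client. This is only a sketch under that assumption; the client returns its own Schema model, so a small mapping step is still needed if a com.google.api.services.bigquery.model.TableSchema is required:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;

// Fetch the current schema once, before constructing the pipeline graph.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
Schema schema = bigquery
    .getTable(TableId.of(projectId, dataset, table)) // same projectId/dataset/table as above
    .getDefinition()
    .getSchema();

// Use the field list to build the output schema (original fields plus computed columns).
for (Field field : schema.getFields()) {
    System.out.println(field.getName() + " : " + field.getType());
}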

Access dataBase using Apache-Spark-SQL

Hi, I'm new to Apache Spark with Java.
Is this a correct approach or not?
The code works, but performance-wise it is very slow, and I don't know which approach is best for accessing the data on every loop iteration.
Dataset<Row> javaRDD = sparkSession.read().jdbc(dataBase_url, "sample", properties);
javaRDD.toDF().registerTempTable("sample");
Dataset<Row> Users = sparkSession.sql("SELECT DISTINCT FROM_USER FROM sample ");
List<Row> members = Users.collectAsList();
for (Row row : members) {
Dataset<Row> userConversation = sparkSession.sql("SELECT DESCRIPTION FROM sample WHERE FROM_USER ='"+ row.getDecimal(0) +"'");
userConversation.show();
}
Try collecting a set of all the users and then executing a single query like:
sparkSession.sql("SELECT DESCRIPTION FROM sample WHERE FROM_USER IN (...)");
This way you execute only one query, so you pay the overhead of the DB connection only once.
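For completeness, a rough sketch of building that IN list in Java from the code in the question (column and table names are taken from the question, and the quoting assumes FROM_USER holds decimal values):
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

List<Row> members = Users.collectAsList();
String inList = members.stream()
    .map(row -> "'" + row.getDecimal(0) + "'")
    .collect(Collectors.joining(", "));

// One round trip instead of one query per user.
Dataset<Row> conversations = sparkSession.sql(
    "SELECT FROM_USER, DESCRIPTION FROM sample WHERE FROM_USER IN (" + inList + ")");
conversations.show();
A join or a groupBy on the original Dataset would avoid the string building entirely, but even the single IN query removes the per-user round trips.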
Another approach, if you run Spark over HDFS and this is a one-time query, is to use a tool like Sqoop to load the SQL table into Hadoop and use the data natively in Spark.

Spark read() works but sql() throws Database not found

I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with spark.sql(), the following exception is thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset dataset =
spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "wiki");
put("table", "treated_article");
}
}).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you use a table identifier that isn't in the catalogue, it will throw errors like the one you posted. The read command doesn't require a catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "wiki");
put("table", "treated_article");
}
}).load();
Then use one of the catalogue registry functions (see the sketch after this list):
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name
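For example, a minimal sketch (the view name is arbitrary here) that registers the DataSet created above and then queries it through spark.sql():
Dataset<Row> articles = spark.read().format("org.apache.spark.sql.cassandra")
    .option("keyspace", "wiki")
    .option("table", "treated_article")
    .load();

// Register under a name the catalogue can resolve; SQL then works against it.
articles.createOrReplaceTempView("treated_article");
spark.sql("SELECT * FROM treated_article").show();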
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The DataStax (my employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive Metastore, which Spark uses as a catalogue. This makes all tables accessible without manual registration.
This method allows for select statements to be used without an accompanying CREATE VIEW
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the place where this would be specified is taken by the keyspace. The closest documentation for something like this that I can find is in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING clause, but I don't think that will work inside a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.
