Hello, I want to continuously read data from a relational database so that new rows are picked up as they arrive. I wrote the code below, but it executes the SELECT query only once instead of repeatedly.
The iteration only iterates over the result stream; it never executes the query again.
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery(JDBCConfig.SELECT_FROM_SOURCE)
        .setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);

SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
        .keyBy(0)
        .fold(null, new FoldFunction<Row, Row>() {
            @Override
            public Row fold(Row row1, Row row) throws Exception {
                Date dt = (Date) row.getField(2);
                return row;
            }
        });

IterativeStream<Row> iteration = source.iterate();
iteration.closeWith(iteration.filter(new FilterFunction<Row>() {
    @Override
    public boolean filter(Row row) throws Exception {
        if (Integer.parseInt(row.getField(0).toString()) > 0) {
            return true;
        }
        return false;
    }
}));
//iteration.print();

source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery(JDBCConfig.INSERT_SQL)
        .setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR, Types.TIMESTAMP_WITH_TIMEZONE})
        .finish());

environment.execute();
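For illustration, this is roughly what a source that re-runs the SELECT on an interval might look like. It is a sketch only, not my current code; the poll interval and the column accessors are assumptions based on the sink types above:

DataStream<Row> polledSource = environment.addSource(new RichSourceFunction<Row>() {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        Class.forName(JDBCConfig.DRIVER_CLASS);
        while (running) {
            // Re-execute the SELECT on every pass so new rows are picked up.
            try (Connection connection = DriverManager.getConnection(JDBCConfig.DB_URL);
                 PreparedStatement statement = connection.prepareStatement(JDBCConfig.SELECT_FROM_SOURCE);
                 ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    Row row = new Row(3);
                    row.setField(0, resultSet.getLong(1));      // BIGINT
                    row.setField(1, resultSet.getString(2));    // VARCHAR
                    row.setField(2, resultSet.getTimestamp(3)); // TIMESTAMP
                    ctx.collect(row);
                }
            }
            Thread.sleep(10_000); // placeholder poll interval
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
});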
I am new to Spring and I am trying to do something like this. Let us say my table has the following columns:
task
is_completed
completed_at
The user will provide the following options as query parameters:
is_completed = true or false
from_date = dd-mm-yyyy
to_date = dd-mm-yyyy
I will check each parameter one by one and then filter the table.
In Django, I can do something like this:
tasks = Task.objects.all()  # All the tasks will be stored in tasks
tasks = tasks.filter(is_completed=True)  # Completed tasks are filtered from all tasks
tasks = tasks.filter(completed_at__gte=from_date, completed_at__lte=to_date)  # Completed tasks are filtered by completion date
How can I achieve this with Spring JPA? Is there a way to save the filtered results and query those filtered results again, instead of querying the entire database each time?
That way I can check whether each parameter has a value, like this:
if (is_completed == true) {
    // filter completed tasks
}
if (from_date != null) {
    // filter completed tasks that were completed on or after this date
}
if (to_date != null) {
    // filter completed tasks that were completed up to this date
}
The problem with the current approach is that I have to write an SQL query for each combination of parameters, which becomes complex when there are many parameters.
Let's consider this to be your Task class:
class Task {
    private String id;
    private boolean isCompleted;
    private LocalDate completedAt;

    public Task(String id, boolean isCompleted, LocalDate completedAt) {
        this.id = id;
        this.isCompleted = isCompleted;
        this.completedAt = completedAt;
    }

    public boolean isCompleted() {
        return isCompleted;
    }

    public LocalDate getCompletedAt() {
        return completedAt;
    }

    @Override
    public String toString() {
        return "Task{" +
                "id='" + id + '\'' +
                ", isCompleted=" + isCompleted +
                ", completedAt=" + completedAt +
                '}';
    }
}
Below is code you can use to filter the data:

import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TaskFilter {
    public static void main(String[] args) {
        // Sample user input
        boolean isCompleted = true;
        LocalDate fromDate = LocalDate.parse("2020-11-04");
        LocalDate toDate = LocalDate.parse("2021-11-06");

        // Simulate data retrieved from a JPA repository, e.g. repository.findAll()
        List<Task> tasks = List.of(
                new Task("1", true, LocalDate.parse("2020-10-04")),
                new Task("2", false, LocalDate.parse("2010-12-02")),
                new Task("3", false, LocalDate.parse("2021-04-24")),
                new Task("4", true, LocalDate.parse("2021-03-12"))
        );

        // Create a stream over the retrieved data
        Stream<Task> tasksStream = tasks.stream();

        // Filter the stream based on the user input
        if (isCompleted) {
            tasksStream = tasksStream.filter(task -> task.isCompleted());
        }
        if (fromDate != null) {
            tasksStream = tasksStream.filter(task -> task.getCompletedAt().isAfter(fromDate));
        }
        if (toDate != null) {
            tasksStream = tasksStream.filter(task -> task.getCompletedAt().isBefore(toDate));
        }

        // Finally, collect the stream into a list. This terminal operation is required,
        // because a stream is not executed until a terminal operation is called.
        List<Task> filteredTaskList = tasksStream.collect(Collectors.toList());
        System.out.println(filteredTaskList);
    }
}
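If you would rather push the filtering into the database instead of filtering in memory, Spring Data JPA Specifications are another option. A rough sketch, assuming a Task JPA entity with the fields above and a hypothetical TaskRepository; the interface, service, and field names are assumptions, not code from your project:

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.JpaSpecificationExecutor;

// Hypothetical repository; Task is assumed to be a JPA entity.
interface TaskRepository extends JpaRepository<Task, String>, JpaSpecificationExecutor<Task> {}

class TaskQueryService {
    private final TaskRepository taskRepository;

    TaskQueryService(TaskRepository taskRepository) {
        this.taskRepository = taskRepository;
    }

    List<Task> findTasks(Boolean isCompleted, LocalDate fromDate, LocalDate toDate) {
        List<Specification<Task>> specs = new ArrayList<>();

        // Add a predicate only for the parameters the user actually supplied.
        if (isCompleted != null) {
            specs.add((root, query, cb) -> cb.equal(root.get("isCompleted"), isCompleted));
        }
        if (fromDate != null) {
            specs.add((root, query, cb) ->
                    cb.greaterThanOrEqualTo(root.<LocalDate>get("completedAt"), fromDate));
        }
        if (toDate != null) {
            specs.add((root, query, cb) ->
                    cb.lessThanOrEqualTo(root.<LocalDate>get("completedAt"), toDate));
        }

        // AND the predicates together; with no parameters, every task is returned.
        Specification<Task> combined = specs.stream().reduce(Specification::and).orElse(null);
        return taskRepository.findAll(combined);
    }
}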
I have an application which reads a SQL ResultSet, parses the schema, and writes the data to a Parquet file. I'm seeing consistently high CPU usage in the function that writes to the Parquet file, and I'm not sure if this is normal.
I am running the application with -Xmx512m.
I've run JProfiler to profile the application, and resultSet.getValue() and parquetWriter.write() came up as the method-invocation hotspots. I also noticed the org.apache.hadoop.fs.FileSystem$Statistics thread spending a long time in the waiting state.
When I ran HPROF, it reported java.net.SocketInputStream.socketRead0 as a massive consumer of CPU time.
fixedColNames, colTypes, and colNames are all Hashtables that are built when the ResultSet is pulled in from SQL. I did this to avoid repeated calls to ResultSet.getMetaData().
All in all, I've not figured out what specifically is causing my application to use so much CPU.
I've attached the profiler snapshot and the thread stacks too.
[Attached images: CPU, cores, profiler snapshot, threads]
Attached Thread Dumps:
https://gist.github.com/skitiz/87c12af3fa2b3c31113365e8c4d8dc74
https://gist.github.com/skitiz/152f4968694fa612636f92af8a6087cb
https://gist.github.com/skitiz/dd7c1cea7d4a527c36ba24f6bf73ee87
https://gist.github.com/skitiz/af37552dd3340611b46040dc98338818
https://gist.github.com/skitiz/9806469affdd651b41ea2bce3393e604
private String saveParquetFile(
        Schema schema, ResultSet resultset, String fileName, String offsetKey) throws Exception {
    Path outFile = new Path(config.getExtractFilePath() + fileName);
    String offsetValue = null;
    ParquetWriter<Object> parquetWriter =
        AvroParquetWriter.builder(outFile)
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .build();
    GenericRecordBuilder genericRecordBuilder = new GenericRecordBuilder(schema);
    int rowCount = 0;
    while (resultset.next()) {
        for (int i = 1; i <= colCount; i++) {
            genericRecordBuilder.set(
                fixedColNames.get(i), extractResult(colTypes.get(i), colNames.get(i), resultset, 1));
        }
        rowCount = rowCount + 1;
        // System.out.println(count++);
        parquetWriter.write(genericRecordBuilder.build());
        if (offsetKey != null) {
            offsetValue = resultset.getString(offsetKey);
        }
    }
    parquetWriter.close();
    rawDataLakeLoaderLog.setRows(rowCount);
    return offsetValue;
}
private Object extractResult(int mapping, String columnName, ResultSet resultSet, int flag)
        throws SQLException {
    Object temp = null;
    switch (mapping) {
        case Types.BIT:
            if (flag == 1) {
                temp = resultSet.getBoolean(columnName);
                return resultSet.wasNull() ? null : temp;
            }
            return Schema.Type.BOOLEAN;
        case Types.TINYINT:
            if (flag == 1) {
                temp = resultSet.getShort(columnName);
                return resultSet.wasNull() ? null : temp;
            }
            return Schema.Type.INT;
        case Types.SMALLINT:
        case Types.INTEGER:
            if (flag == 1) {
                temp = resultSet.getInt(columnName);
                return resultSet.wasNull() ? null : temp;
            }
            return Schema.Type.INT;
        case Types.BIGINT:
            if (flag == 1) {
                temp = resultSet.getLong(columnName);
                return resultSet.wasNull() ? null : temp;
            }
            return Schema.Type.LONG;
        default:
            if (flag == 1) {
                temp = resultSet.getString(columnName);
                return resultSet.wasNull() ? null : temp;
            }
            return Schema.Type.STRING;
    }
}
I expected the CPU usage to be somewhere around 5-10%, but it comes out at about 25%.
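For illustration of the socketRead0 point: time spent in java.net.SocketInputStream.socketRead0 is usually the JVM waiting for data from the database rather than doing work in my own code, so the JDBC fetch size is one setting that could matter. This is a sketch only; the helper name, query, and batch size are placeholders, and the actual effect depends on the JDBC driver:

// Hypothetical helper showing where a larger fetch size would be configured
// before the ResultSet is handed to saveParquetFile().
private ResultSet openResultSet(Connection connection, String query) throws SQLException {
    PreparedStatement statement = connection.prepareStatement(query);
    // Hint the driver to transfer rows in larger batches, reducing network round trips.
    statement.setFetchSize(5000); // placeholder batch size
    return statement.executeQuery();
}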
I am currently trying to develop a Dataflow pipeline in order to replace some partitions of a partitioned table. I have a custom partition field which is a date. The input of my pipeline is a file with potentially different dates.
I developed a pipeline:
PipelineOptionsFactory.register(BigQueryOptions.class);
BigQueryOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
Pipeline p = Pipeline.create(options);

PCollection<TableRow> rows = p.apply("ReadLines", TextIO.read().from(options.getFileLocation()))
        .apply("Convert To BQ Row", ParDo.of(new StringToRowConverter(options)));

ValueProvider<String> projectId = options.getProjectId();
ValueProvider<String> datasetId = options.getDatasetId();
ValueProvider<String> tableId = options.getTableId();
ValueProvider<String> partitionField = options.getPartitionField();
ValueProvider<String> columnNames = options.getColumnNames();
ValueProvider<String> types = options.getTypes();

rows.apply("Write to BQ", BigQueryIO.writeTableRows()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withCustomGcsTempLocation(options.getGCSTempLocation())
        .to(new DynamicDestinations<TableRow, String>() {
            @Override
            public String getDestination(ValueInSingleWindow<TableRow> element) {
                TableRow date = element.getValue();
                String partitionDestination = (String) date.get(partitionField.get());
                SimpleDateFormat from = new SimpleDateFormat("yyyy-MM-dd");
                SimpleDateFormat to = new SimpleDateFormat("yyyyMMdd");
                try {
                    partitionDestination = to.format(from.parse(partitionDestination));
                    LOG.info("Table destination " + partitionDestination);
                    return projectId.get() + ":" + datasetId.get() + "." + tableId.get() + "$" + partitionDestination;
                } catch (ParseException e) {
                    e.printStackTrace();
                    return projectId.get() + ":" + datasetId.get() + "." + tableId.get() + "_rowsWithErrors";
                }
            }

            @Override
            public TableDestination getTable(String destination) {
                TimePartitioning timePartitioning = new TimePartitioning();
                timePartitioning.setField(partitionField.get());
                timePartitioning.setType("DAY");
                timePartitioning.setRequirePartitionFilter(true);
                TableDestination tableDestination = new TableDestination(destination, null, timePartitioning);
                LOG.info(tableDestination.toString());
                return tableDestination;
            }

            @Override
            public TableSchema getSchema(String destination) {
                return new TableSchema().setFields(buildTableSchemaFromOptions(columnNames, types));
            }
        })
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
);

p.run();
}
When I trigger the pipeline locally, it successfully replaces the partitions whose dates appear in the input file. Nevertheless, when I deploy it on Google Cloud Dataflow and run the template with the exact same parameters, it truncates all the data, and at the end my table only contains the data from the file I uploaded.
Do you know why there is such a difference?
Thank you!
You set BigQueryIO.Write.CreateDisposition to CREATE_IF_NEEDED, and this is paired with BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE, so even if the table exists, it may be recreated. This is why you see your table being replaced.
See this document [1] for details.
[1] https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/BigQueryIO.Write.CreateDisposition#CREATE_IF_NEEDED
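For reference, the write disposition is what controls what happens to data already in the destination: WRITE_TRUNCATE replaces it, WRITE_APPEND adds to it, and WRITE_EMPTY fails if data exists. Below is a sketch of the same write with the non-truncating disposition, where destinations stands for your DynamicDestinations instance from the question; whether appending fits your partition-replacement goal is something to verify against your SDK version:

rows.apply("Write to BQ", BigQueryIO.writeTableRows()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withCustomGcsTempLocation(options.getGCSTempLocation())
        .to(destinations) // the same DynamicDestinations implementation as above
        // WRITE_APPEND adds rows to whatever is already in the destination,
        // instead of replacing it as WRITE_TRUNCATE does.
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);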
I need help with a routine I've written to dump the content of a class (which represents a database table) into a new table in an MS Access database. My code is the following:
public void dumpDB() throws IOException, Exception {
    // for each table
    for (String tableName : this.DB.getTablesNames()) {
        System.out.println(tableName);
        int nColumns = 0;
        ModelDatabaseTable table = this.DB.getTable(tableName);
        // create a table builder
        TableBuilder DBTableBuilder = new TableBuilder(tableName);
        // get the data types of the columns
        Map<String, DataType> columns = table.getColumns();
        // for each column
        for (String columnName : columns.keySet()) {
            System.out.println(columnName);
            // get its data type
            DataType dt = columns.get(columnName);
            // create a column with the corresponding data type and max length
            // and add it to the table builder
            ColumnBuilder cb = new ColumnBuilder(columnName).setType(dt).setMaxLength();
            DBTableBuilder.addColumn(cb);
            nColumns += 1;
        }
        // if the table has columns
        if (nColumns > 0) {
            // save it to the actual database: the exception is raised here
            Table DBTable = DBTableBuilder.toTable(this.DBConnection);
            // copy all the table's rows
            for (ModelDatabaseRow row : table.getRows()) {
                List<String> values = new ArrayList<String>();
                for (String columnName : columns.keySet()) {
                    String columnValue = row.getColumn(columnName);
                    values.add(columnValue);
                }
                DBTable.addRow(values.toArray());
            }
        }
    }
}
When I try to save the table to the actual database, I get the exception:
java.lang.IllegalArgumentException: invalid fixed length size
at com.healthmarketscience.jackcess.ColumnBuilder.validate(ColumnBuilder.java:361)
at com.healthmarketscience.jackcess.impl.TableCreator.validate(TableCreator.java:207)
at com.healthmarketscience.jackcess.impl.TableCreator.createTable(TableCreator.java:130)
at com.healthmarketscience.jackcess.impl.DatabaseImpl.createTable(DatabaseImpl.java:954)
at com.healthmarketscience.jackcess.TableBuilder.toTable(TableBuilder.java:223)
at modelDatabase.AccessModelDatabaseBuilder.dumpDB(AccessModelDatabaseBuilder.java:153)
at modelDatabase.AccessModelDatabaseBuilder.main(AccessModelDatabaseBuilder.java:37)
The DataTypes were read earlier from the same database I am now writing to (I am basically updating the database), using this code:
for (Column column : DBTable.getColumns()) {
    table.addColumn(column.getName(), column.getType(), "");
}
What am I doing wrong?
From the Jackcess forum thread, the solution is to wrap the call to the setMaxLength() method in a check for variable-length types:
if (dt.isVariableLength()) {
    cb.setMaxLength();
}
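Applied to the column loop in dumpDB(), that guard would look roughly like this (a sketch; the surrounding variables are the ones from the question):

for (String columnName : columns.keySet()) {
    DataType dt = columns.get(columnName);
    ColumnBuilder cb = new ColumnBuilder(columnName).setType(dt);
    // setMaxLength() only makes sense for variable-length data types; calling it
    // on a fixed-length type is what triggers "invalid fixed length size".
    if (dt.isVariableLength()) {
        cb.setMaxLength();
    }
    DBTableBuilder.addColumn(cb);
    nColumns += 1;
}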
I am using the Elasticsearch bulk update Java API. Below is the script I am using for the bulk update. In the mapping, the nested object's 'name' is specified as a string field.
String updateScript = "if (ctx._source.containsKey(\"nestedObjects\") && ctx._source.nestedObjects.size() > 0)
{
    int nestedSize = ctx._source.nestedObjects.size();
    boolean isUpdated = false;
    for (int i = 0; i < nestedSize; i++)
    {
        if (ctx._source.nestedObjects[i].containsKey(\"name\"))
        {
            if (ctx._source.nestedObjects[i].name == \"ram\")
            {
                ctx._source.nestedObjects[i].name = \"ricky\";
                isUpdated = true;
            }
        }
    }
    if (!isUpdated)
    {
        ctx._source.nestedObjects.add(\"name\":\"ricky\");
    }
}";
Below is the code I am using for the bulk update:
BulkRequestBuilder bulkRequestBuilder = client.prepareBulk();
for (int i = 0; i < 5; i++)
{
    String documentId = String.valueOf(i);
    bulkRequestBuilder.add(client.prepareUpdate(indexName, type, documentId)
            .setScript(updateScript)
            .setRouting(routingName)
            .request());
}
BulkResponse bulkResponse = bulkRequestBuilder.execute().actionGet();
Bulk Failure Message:
message [VerifyError[(class: ASMAccessorImpl_2153668671377692494610, method: getValue signature: (Ljava/lang/Object;Ljava/lang/Object;Lorg/elasticsearch/common/mvel2/integration/VariableResolverFactory;)Ljava/lang/Object;) Expecting to find integer on stack]]
Note: Only a few records are not getting updated. If I run the update again, some other records get the same error and are not updated, while the records that failed the first time are updated on the second attempt.
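A rough sketch of how the per-item failures in the BulkResponse could be inspected so the failed documents can be retried (the retry strategy itself is an assumption; the response-inspection calls are standard client API):

// Collect the ids of the bulk items that failed so they can be re-submitted.
List<String> failedIds = new ArrayList<>();
if (bulkResponse.hasFailures()) {
    for (BulkItemResponse item : bulkResponse) {
        if (item.isFailed()) {
            failedIds.add(item.getId());
            System.err.println(item.getId() + " -> " + item.getFailureMessage());
        }
    }
}
// failedIds can then be used to build and execute a second, smaller bulk request.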