Creating a table within a dataset in BigQuery programmatically - java

Is it possible to create a table within a dataset in BigQuery using the API in Java? I know it's possible with
bq mk --schema <fileName> -t <project>:<dataset>.<table>
but I can't find a way to do it programmatically.

I haven't used the Java BigQuery library personally¹, but it looks like you should call BigQuery.create(TableInfo, TableOption...). That documentation has this example code - assuming you already have an instance of a BigQuery interface implementation, of course:
String datasetName = "my_dataset_name";
String tableName = "my_table_name";
String fieldName = "string_field";
TableId tableId = TableId.of(datasetName, tableName);
// Table field definition
Field field = Field.of(fieldName, Field.Type.string());
// Table schema definition
Schema schema = Schema.of(field);
TableDefinition tableDefinition = StandardTableDefinition.of(schema);
TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();
Table table = bigquery.create(tableInfo);
Obviously your schema construction is likely to be a bit more involved for a real table, but that should get you started. I can't see any way of loading a schema from a file, but if your schema file is machine-readable in a simple way (e.g. JSON) you could probably write your own parser fairly easily. (And contribute it to the project, should you wish...)
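For illustration, here is a minimal sketch of such a parser, assuming a bq mk-style JSON schema file (a flat array of objects with "name" and "type" keys, no nested or repeated fields) and that Gson is on the classpath; newer versions of the client library take a LegacySQLTypeName where older ones used Field.Type. Treat it as a starting point, not a complete implementation:

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SchemaFileParser {
    // Reads a bq-style JSON schema file and builds a client-library Schema from it.
    static Schema parseSchema(String path) throws IOException {
        try (FileReader reader = new FileReader(path)) {
            List<Field> fields = new ArrayList<>();
            for (JsonElement element : JsonParser.parseReader(reader).getAsJsonArray()) {
                JsonObject field = element.getAsJsonObject();
                String name = field.get("name").getAsString();
                String type = field.get("type").getAsString(); // e.g. "STRING", "INTEGER"
                fields.add(Field.of(name, LegacySQLTypeName.valueOf(type)));
            }
            return Schema.of(fields.toArray(new Field[0]));
        }
    }
}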
¹ I'm the main author of the C# BigQuery library though, so I know what to look for.

Related

Using BigQuery in a Spring Boot microservice

I am referring to this documentation for running a query in my Spring Boot app: https://docs.spring.io/spring-cloud-gcp/docs/current/reference/html/bigquery.html
I am thinking of going with something similar to this example they mention:
// BigQuery client object provided by our autoconfiguration.
@Autowired
BigQuery bigquery;

public void runQuery() throws InterruptedException {
    String query = "SELECT column FROM table;";
    QueryJobConfiguration queryConfig =
            QueryJobConfiguration.newBuilder(query).build();

    // Run the query using the BigQuery object
    for (FieldValueList row : bigquery.query(queryConfig).iterateAll()) {
        for (FieldValue val : row) {
            System.out.println(val);
        }
    }
}
However, I have some questions. The documentation mentions that the GcpBigQueryAutoConfiguration class configures an instance of BigQuery for you by inferring your credentials and Project ID from the machine's environment. In the configuration section it also says that spring.cloud.gcp.bigquery.datasetName is the BigQuery dataset that the BigQueryTemplate and BigQueryFileMessageHandler are scoped to, and that it is required.
But what if I have different projects and different datasets that I want to use based on different conditions in my code? Will it be enough to define the credentials in the machine's environment and, in the part where the query is defined, concatenate the project ID and dataset name onto the table name, to get something like this:
String query = "SELECT column FROM <project_id>+<dataset>+<table>;";
Is this correct, or is there a better way to do it?
EDIT:
The second question is: if I need to apply a WHERE condition in the query, can that also be done simply with string concatenation, for example like this?
String query = "SELECT column FROM <project_id>+<dataset>+<table> where id =<ID>;";
I ask because I found a resource that handles it like this instead:
final var queryJobConfiguration = QueryJobConfiguration
        .newBuilder("SELECT * FROM " + tableId.getTable() + " WHERE id = @id")
        .addNamedParameter("id", QueryParameterValue.numeric(BigDecimal.valueOf(userId)))
        .setDefaultDataset(dataset)
        .build();
I don't know why they used .addNamedParameter here.
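For reference, here is a hedged sketch of how the two ideas usually combine with the plain com.google.cloud.bigquery client: the project, dataset and table are written as one fully qualified, backtick-quoted name chosen at runtime, and the WHERE value is bound as a named parameter (@id in the SQL maps to addNamedParameter("id", ...)) instead of being concatenated, which avoids quoting problems and SQL injection. The project, dataset and table names here are made up:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class QuerySketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Fully qualified table name chosen at runtime instead of relying on
        // spring.cloud.gcp.bigquery.datasetName (the values are placeholders).
        String table = "`my-project.my_dataset.my_table`";

        QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder("SELECT column FROM " + table + " WHERE id = @id")
                .addNamedParameter("id", QueryParameterValue.int64(42L))
                .build();

        for (FieldValueList row : bigquery.query(queryConfig).iterateAll()) {
            System.out.println(row);
        }
    }
}

Concatenating only the identifiers (project, dataset, table) is generally acceptable as long as those values come from configuration rather than user input; values used in the WHERE clause are better passed as named parameters.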

How to create a table in BigQuery from an existing table structure using Java?

I would like to know how to create a BigQuery table via Java with a known table structure obtained from an already existing table in my BigQuery. This requirement is similar to the following SQL statement:
create table `deom.hezuo.device_20220301` like `deom.hezuo.device_20220228`;
That is, create a new table deom.hezuo.device_20220301 according to the structure (and any indexes) of deom.hezuo.device_20220228. How can I do this from a Java program?
The approach I have found so far in the Java client library is like this, but it requires me to fill in the fields of the table structure manually, which is very troublesome.
BigQuery bigQuery = getBigQuery();
String datasetName = "hezuo";
String tableName = "device_20220227";
TableId tableId = TableId.of(datasetName, tableName);
TableDefinition tableDefinition = StandardTableDefinition.newBuilder().build();
TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();
Table table = bigQuery.create(tableInfo);
Or is there a way to directly execute the SQL statement for creating a table on BigQuery through Java?
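One possible approach, as a sketch rather than a definitive answer: read the existing table's definition with the client library and reuse its schema when creating the new table. The dataset and table names are the ones from the question; partitioning or clustering settings, if any, would need to be copied separately:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CopyTableStructure {
    public static void main(String[] args) {
        BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

        // Read the schema of the existing table...
        Table source = bigQuery.getTable(TableId.of("hezuo", "device_20220228"));
        Schema schema = source.getDefinition().getSchema();

        // ...and create the new table with the same schema.
        bigQuery.create(TableInfo.newBuilder(
                TableId.of("hezuo", "device_20220301"),
                StandardTableDefinition.of(schema)).build());
    }
}

Alternatively, BigQuery's DDL supports CREATE TABLE ... LIKE, so the statement from the question could be submitted as a query job, for example via bigQuery.query(QueryJobConfiguration.of("create table `deom.hezuo.device_20220301` like `deom.hezuo.device_20220228`")).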

How to change dataflow job graph during runtime with arguments?

I am using Dataflow to read data from a JDBC table and load results to a BigQuery table. There is one parameter "flag" that I want to pass during runtime and if the flag is set True, results should be loaded to an additional table in BigQuery.
To summarise:
If the flag is set False - Read table A from JDBC, write to table A in BigQuery
If the flag is set True - Read table A from JDBC, write to table A as well as B in BigQuery.
Please refer to the sample code of my pipeline:
public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    ValueProvider<String> gcsFlag = options.getGcsFlag();

    PCollection<TableRow> inputData = pipeline.apply("Reading JDBC Table",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
                .create(options.getDriverClassName(), options.getJdbcUrl())
                .withUsername(options.getUsername()).withPassword(options.getPassword()))
            .withQuery(options.getSqlQuery())
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(new CustomRowMapper()));

    inputData.apply(
        "Write to BigQuery Table 1",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
            .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
            .to(options.getOutputTable()));

    if (gcsFlag.get().equals("TRUE")) {
        inputData.apply(
            "Write to BigQuery Table 2",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable2()));
    }

    pipeline.run();
}
The challenge I am facing is that I have to pass the ValueProvider while compiling and creating the Dataflow template. The job graph is constructed only at template-creation time, and I am not able to reuse the same template for other cases.
Is there a way to pass the ValueProvider<String> flag at runtime so that the job graph is constructed at runtime? That way I could reuse the same template for both cases. Similarly, I also want to provide the SQL query (options.getSqlQuery()) at runtime, so that I can use the same template for all the tables I want to read from the source.
Any help is appreciated.
When you create the DAG, it can't change at runtime. Still, you have a chance to address your problem: try the Beam Partition pattern:
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
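A closely related variant, sketched below under the question's own option names (MyOptions, getGcsFlag(), getOutputTable2()), keeps both writes in the graph permanently and uses a gating DoFn that reads the ValueProvider at execution time, so the second table only receives data when the flag is TRUE. This is a sketch replacing the if (gcsFlag.get().equals("TRUE")) block from the question; ParDo and DoFn come from org.apache.beam.sdk.transforms:

// Both writes always exist in the template; the flag is only consulted per
// element at execution time, so the same template works for both cases.
PCollection<TableRow> gatedForTable2 = inputData.apply("Gate for Table 2",
        ParDo.of(new DoFn<TableRow, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // ValueProvider.get() is allowed here because this runs at execution time.
                String flag = c.getPipelineOptions().as(MyOptions.class).getGcsFlag().get();
                if ("TRUE".equalsIgnoreCase(flag)) {
                    c.output(c.element());
                }
            }
        }));

gatedForTable2.apply(
        "Write to BigQuery Table 2",
        BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable2()));

The SQL query can only be templated if it is consumed at execution time; JdbcIO.read().withQuery(...) also accepts a ValueProvider<String>, so passing options.getSqlQuery() through as a ValueProvider should already make it work. It is worth verifying how WRITE_TRUNCATE behaves on an empty input for your case.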

Encrypting specific columns in Hibernate: What to do with existing data and how to correctly implement @ColumnTransformer?

I have built a web application with a MySQL database that holds patient data. According to GDPR, patients' names must be encrypted inside the database. For connecting to the database and performing operations on it, I use Hibernate 5.
Searching the web, I have found a lot of information about how to encrypt a specific column or columns inside a database table, mainly the following three approaches:
Using Hibernate's @ColumnTransformer annotation, which is the least destructive to existing code and requires the least code to be written
Using Jasypt and its Hibernate integration, which is more destructive to existing code and requires a few lines of code
Implementing a JPA attribute converter, which requires quite a few lines to be written
I decided to use @ColumnTransformer, which seems to be the easiest to implement. If you think one of the other approaches is better, please say so and explain why.
My question, however, has to do with existing data. My database already contains unencrypted data, which must be encrypted to work with the @ColumnTransformer implementation. I intend to use the following annotations:
@ColumnTransformer(
        read = "pgp_sym_decrypt(lastName, 'mySecretKey')",
        write = "pgp_sym_encrypt(?, 'mySecretKey')"
)
and
@ColumnTransformer(
        read = "pgp_sym_decrypt(firstName, 'mySecretKey')",
        write = "pgp_sym_encrypt(?, 'mySecretKey')"
)
to the corresponding columns.
How should I encrypt existing data to comply with the above annotations? What SQL code should I use?
MySQL supports the following functions:
AES_ENCRYPT(str, key_str);
AES_DECRYPT(crypt_str, key_str);
However, I can't update all MySQL entries using the following (because aes_encrypt returns binary):
UPDATE Patient SET firstName=AES_ENCRYPT(firstName, "mySecretKey"), lastName=AES_ENCRYPT(lastName, "mySecretKey") //NOT WORKING
The solution is:
Rename the existing columns with the MySQL commands below (CHANGE has to restate the column definition; VARCHAR(255) is assumed here, so adjust it to the actual type):
ALTER TABLE Patient CHANGE firstName `firstName-old` VARCHAR(255);
ALTER TABLE Patient CHANGE lastName `lastName-old` VARCHAR(255);
Create two new MySQL columns of type VARBINARY(512) with the following commands:
ALTER TABLE Patient ADD COLUMN lastName VARBINARY(512) NOT NULL;
ALTER TABLE Patient ADD COLUMN firstName VARBINARY(512) NOT NULL;
Update the new columns from the old ones with the following command:
UPDATE `gourvas_platform`.`Patient` SET firstName = AES_ENCRYPT(`firstName-old`, 'mySecretKey'), lastName = AES_ENCRYPT(`lastName-old`, 'mySecretKey');
Now we can safely delete the old columns.
Finally, use the following Hibernate @ColumnTransformer annotations:
@ColumnTransformer(
        read = "AES_DECRYPT(lastName, 'mySecretKey')",
        write = "AES_ENCRYPT(?, 'mySecretKey')"
)
and
@ColumnTransformer(
        read = "AES_DECRYPT(firstName, 'mySecretKey')",
        write = "AES_ENCRYPT(?, 'mySecretKey')"
)
Note: Because I'm using MySQL 5.7 and AES_DECRYPT returns binary data instead of a string, I need to cast the result to text, so the above @ColumnTransformer annotations need to be changed to the following:
@ColumnTransformer(
        read = "cast(aes_decrypt(lastName, 'mySecretKey') as char(255))",
        write = "aes_encrypt(?, 'mySecretKey')"
)
and
@ColumnTransformer(
        read = "cast(aes_decrypt(firstName, 'mySecretKey') as char(255))",
        write = "aes_encrypt(?, 'mySecretKey')"
)
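To tie it together, here is a minimal sketch of how these annotations sit on the entity, assuming a hypothetical Patient entity on Hibernate 5 with javax.persistence. The key appears inside the annotation strings only because that is how @ColumnTransformer works; the real key should be kept out of version control:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.ColumnTransformer;

@Entity
public class Patient {

    @Id
    private Long id;

    // Stored encrypted in the VARBINARY column; Hibernate applies the SQL
    // expressions below whenever it reads or writes the field.
    @Column(name = "firstName", columnDefinition = "VARBINARY(512)")
    @ColumnTransformer(
            read = "cast(aes_decrypt(firstName, 'mySecretKey') as char(255))",
            write = "aes_encrypt(?, 'mySecretKey')")
    private String firstName;

    @Column(name = "lastName", columnDefinition = "VARBINARY(512)")
    @ColumnTransformer(
            read = "cast(aes_decrypt(lastName, 'mySecretKey') as char(255))",
            write = "aes_encrypt(?, 'mySecretKey')")
    private String lastName;

    // getters and setters omitted
}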

BigQueryIO read get TableSchema

What I want to do is read an existing table and generate a new table that has the same schema as the original table plus a few extra columns (computed from some columns of the original table). The original table's schema can grow without notice to me (the fields I am using in my Dataflow job won't change), so I would like to always read the schema instead of defining some custom class that contains it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
.getDatasetService(options)
.getTable(projectId, dataset, table)
.getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses in Get TableSchema from BigQuery result PCollection<TableRow> but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?
Due to how BigQueryIO is set up now, it needs to know the table schema before the pipeline begins to run. This is a good feature idea, but it's not feasible within a single pipeline. In the example you linked, the table schema is queried before running the pipeline.
If new columns are added, then unfortunately a new pipeline must be relaunched.
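If a schema lookup before pipeline construction is acceptable, one workaround (a sketch, not an official Beam API) is to call the low-level BigQuery REST client directly, which returns the same com.google.api.services.bigquery.model.TableSchema type that BigQueryIO works with; this is essentially the request that BigQueryServicesImpl made for you in SDK 1.x:

import java.io.IOException;
import java.security.GeneralSecurityException;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class SchemaLookup {
    // Fetches the table schema before the pipeline is constructed, using the
    // low-level BigQuery API client; the returned TableSchema is the same
    // model type that BigQueryIO consumes.
    static TableSchema fetchSchema(String projectId, String dataset, String table)
            throws IOException, GeneralSecurityException {
        Bigquery client = new Bigquery.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(GoogleCredentials.getApplicationDefault()))
                .setApplicationName("schema-lookup")
                .build();
        return client.tables().get(projectId, dataset, table).execute().getSchema();
    }
}

The schema is still captured at construction time, so the caveat above stands: if columns are added later, the pipeline has to be rebuilt to pick them up.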
