BigQueryIO read get TableSchema - java

What I want to do is read an existing table and generate a new table that has the same schema as the original plus a few extra columns (computed from some columns of the original table). The original table's schema can grow without notice to me (the fields I use in my Dataflow job won't change), so I would like to always read the schema instead of defining a custom class that hard-codes it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
.getDatasetService(options)
.getTable(projectId, dataset, table)
.getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses in Get TableSchema from BigQuery result PCollection<TableRow> but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?

Due to how BigQueryIO is set up right now, it needs to know the table schema before the pipeline begins to run. Reading it dynamically is a good feature idea, but it's not feasible in a single pipeline today; in the example you linked, the table schema is queried before running the pipeline.
If new columns are added, then unfortunately the pipeline must be relaunched.
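In the meantime, a common workaround is to fetch the schema yourself before constructing the pipeline. A minimal sketch using the google-cloud-bigquery client library (the helper class below is an assumption, not Beam API, and it returns the client library's Schema rather than the api-services TableSchema, so a small conversion step may still be needed before handing it to BigQueryIO.Write):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class SchemaLookup {
    // Hypothetical helper: look up the table's current schema at pipeline construction time.
    public static Schema fetchSchema(String projectId, String dataset, String table) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Table t = bigquery.getTable(TableId.of(projectId, dataset, table));
        return t.getDefinition().getSchema();
    }
}

From that Schema you can rebuild the field list, append your computed columns, and pass the result to your write transform.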

Related

Encrypting specific columns in Hibernate: What to do with existing data and how to correctly implement @ColumnTransformer?

I have built a web application with a MySQL database that holds patient data. Under the GDPR, patients' names must be encrypted inside the database. For connecting to and performing operations on the database, I use Hibernate 5.
Searching the web, I found plenty of information on how to encrypt a specific column or columns inside a database table, mainly the following three approaches:
Using Hibernate's @ColumnTransformer annotation, which is the least destructive to existing code and requires the least code to be written
Using Jasypt and its Hibernate integration, which is more destructive to existing code and requires a few lines of code
Implementing a JPA AttributeConverter, which requires quite a few lines to be written
I decided to use @ColumnTransformer, which seems to be the easiest to implement. If you think one of the other approaches is better, please say so and explain why.
My question, however, has to do with existing data. My database already contains unencrypted data, which must be encrypted for the @ColumnTransformer implementation to work. I intend to use the following annotations:
@ColumnTransformer(
    read = "pgp_sym_decrypt(lastName, 'mySecretKey')",
    write = "pgp_sym_encrypt(?, 'mySecretKey')"
)
and
@ColumnTransformer(
    read = "pgp_sym_decrypt(firstName, 'mySecretKey')",
    write = "pgp_sym_encrypt(?, 'mySecretKey')"
)
to the corresponding columns.
How should I encrypt existing data to comply with the above annotations? What SQL code should I use?
MySQL supports the following functions:
AES_ENCRYPT(str, key_str);
AES_DECRYPT(crypt_str,key_str);
However, I can't update all the existing rows with the following statement, because AES_ENCRYPT returns a binary string that doesn't fit the existing VARCHAR columns:
UPDATE Patient SET firstName=AES_ENCRYPT(firstName, 'mySecretKey'), lastName=AES_ENCRYPT(lastName, 'mySecretKey'); -- NOT WORKING
The solution is:
Rename the existing columns with the following MySQL commands:
ALTER TABLE Patient CHANGE firstName `firstName-old` VARCHAR(255); -- repeat the column's original definition here
ALTER TABLE Patient CHANGE lastName `lastName-old` VARCHAR(255);
Create two new columns of type VARBINARY(512):
ALTER TABLE Patient ADD COLUMN lastName VARBINARY(512) NOT NULL;
ALTER TABLE Patient ADD COLUMN firstName VARBINARY(512) NOT NULL;
Update the new columns from the old ones with the following command:
UPDATE `gourvas_platform`.`Patient` SET firstName = AES_ENCRYPT(`firstName-old`, 'mySecretKey'), lastName = AES_ENCRYPT(`lastName-old`, 'mySecretKey');
Now we can safely drop the old columns.
Finally, use the following Hibernate @ColumnTransformer annotations:
@ColumnTransformer(
    read = "AES_DECRYPT(lastName, 'mySecretKey')",
    write = "AES_ENCRYPT(?, 'mySecretKey')"
)
and
@ColumnTransformer(
    read = "AES_DECRYPT(firstName, 'mySecretKey')",
    write = "AES_ENCRYPT(?, 'mySecretKey')"
)
Note: Because I'm using MySQL 5.7 and AES_DECRYPT returns a binary string rather than text, I need to cast the result to a character type. So the above @ColumnTransformer annotations need to be changed to the following:
@ColumnTransformer(
    read = "CAST(AES_DECRYPT(lastName, 'mySecretKey') AS CHAR(255))",
    write = "AES_ENCRYPT(?, 'mySecretKey')"
)
and
@ColumnTransformer(
    read = "CAST(AES_DECRYPT(firstName, 'mySecretKey') AS CHAR(255))",
    write = "AES_ENCRYPT(?, 'mySecretKey')"
)
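For reference, a minimal sketch of how these transformers could sit on the JPA entity (the Patient class shown here is an assumption based on the question, not code taken from it):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.ColumnTransformer;

@Entity
public class Patient {

    @Id
    private Long id;

    // Stored encrypted in the VARBINARY(512) column; Hibernate applies the
    // read/write expressions on every select and insert/update.
    @Column(name = "firstName")
    @ColumnTransformer(
        read = "CAST(AES_DECRYPT(firstName, 'mySecretKey') AS CHAR(255))",
        write = "AES_ENCRYPT(?, 'mySecretKey')"
    )
    private String firstName;

    @Column(name = "lastName")
    @ColumnTransformer(
        read = "CAST(AES_DECRYPT(lastName, 'mySecretKey') AS CHAR(255))",
        write = "AES_ENCRYPT(?, 'mySecretKey')"
    )
    private String lastName;

    // getters and setters omitted
}

Hard-coding the key in an annotation string is obviously not ideal for production; consider externalizing it.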

Apache Beam Dataflow BigQuery

How can I get the list of tables from a Google BigQuery dataset using apache beam with DataflowRunner?
I can't find how to list the tables of a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.
# Declare the library
from google.cloud import bigquery

# Prepare a BigQuery client
client = bigquery.Client(project='your_project_name')

# Prepare a reference to the dataset
dataset_ref = client.dataset('your_data_set_name')

# Make the API request
tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))
Reference:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets
You can try the google-cloud-examples Maven repo. There's a class named BigQuerySnippets that makes an API call to get the table metadata, and you can fetch the schema from it. Please note the API quota limit of 6 maximum concurrent requests per second.
The purpose of Dataflow is to create pipelines, so the ability to make this kind of API request is not included. You have to use the BigQuery Java Client Library to get the data and then provide it to your Apache Beam pipeline.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
    // do something with each table, e.g. table.getTableId().getTable()
}
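Putting the two pieces together, a rough sketch of how the listed tables could drive per-table copies in one pipeline (Beam 2.x is assumed; the project/dataset names are placeholders, the destination tables are assumed to exist already in the EU dataset, and a real cross-region copy also needs appropriately located temp storage, which is not shown):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.Table;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyDatasetTables {
    public static void main(String[] args) {
        String projectId = "my-project";    // placeholder
        String srcDataset = "us_dataset";   // placeholder: dataset in the US
        String dstDataset = "eu_dataset";   // placeholder: dataset already created in the EU

        // List the source tables with the client library, outside the pipeline
        List<String> tableNames = new ArrayList<>();
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        for (Table t : bigquery.listTables(DatasetId.of(projectId, srcDataset)).iterateAll()) {
            tableNames.add(t.getTableId().getTable());
        }

        // Build one read/write pair per table
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        for (String name : tableNames) {
            p.apply("Read_" + name, BigQueryIO.readTableRows()
                    .from(projectId + ":" + srcDataset + "." + name))
             .apply("Write_" + name, BigQueryIO.writeTableRows()
                    .to(projectId + ":" + dstDataset + "." + name)
                    // CREATE_NEVER assumes the destination tables were created up front
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
        }
        p.run().waitUntilFinish();
    }
}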

Creating a table within a dataset in BigQuery programmatically

Is it possible to create a table within a dataset in BigQuery using the API in Java? I know it's possible with
bq mk --schema <fileName> -t <project>:<dataset>.<table>
but I can't find a way to do it programmatically.
I haven't used the Java BigQuery library personally [1], but it looks like you should call BigQuery.create(TableInfo, TableOption...). That documentation has this example code - assuming you already have an instance of a BigQuery interface implementation, of course:
String datasetName = "my_dataset_name";
String tableName = "my_table_name";
String fieldName = "string_field";
TableId tableId = TableId.of(datasetName, tableName);
// Table field definition
Field field = Field.of(fieldName, Field.Type.string());
// Table schema definition
Schema schema = Schema.of(field);
TableDefinition tableDefinition = StandardTableDefinition.of(schema);
TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();
Table table = bigquery.create(tableInfo);
Obviously your schema construction is likely to be a bit more involved for a real table, but that should get you started. I can't see any way of loading a schema from a file, but if your schema file is machine-readable in a simple way (e.g. JSON) you could probably write your own parser fairly easily. (And contribute it to the project, should you wish...)
[1] I'm the main author of the C# BigQuery library, though, so I know what to look for.
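Following up on the write-your-own-parser suggestion, a hedged sketch that reads a bq-style JSON schema file ([{"name": ..., "type": ...}, ...]) into a Schema; the Jackson dependency, the class name, and the flat-fields-only handling are assumptions:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SchemaFileParser {
    // Handles simple [{"name": "...", "type": "STRING"}, ...] files;
    // nested RECORD fields and modes are left out of this sketch.
    public static Schema parse(File schemaFile) throws Exception {
        JsonNode fields = new ObjectMapper().readTree(schemaFile);
        List<Field> parsed = new ArrayList<>();
        for (JsonNode f : fields) {
            parsed.add(Field.of(
                    f.get("name").asText(),
                    LegacySQLTypeName.valueOf(f.get("type").asText())));
        }
        return Schema.of(parsed.toArray(new Field[0]));
    }
}

Note that this uses LegacySQLTypeName from more recent versions of the client library, whereas the snippet above uses the older Field.Type factory; adjust to whichever version you depend on.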

Spark read() works but sql() throws Database not found

I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with spark.sql(), the following exception is thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset dataset =
    spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you use a table identifier that isn't in the catalogue, it throws errors like the one you posted. The read command doesn't need the catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
spark.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "wiki");
            put("table", "treated_article");
        }
    }).load();
Then use one of the catalogue registry functions
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name
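For example, a minimal sketch of that registration in Java (the variable names are assumed; spark is the existing SparkSession from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> articles = spark.read()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "wiki")
        .option("table", "treated_article")
        .load();

// Register under a name the catalogue knows about
articles.createOrReplaceTempView("treated_article");

// Temp views live outside any database, so reference the view without the wiki. prefix
spark.sql("SELECT * FROM treated_article").show();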
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
  USING org.apache.spark.sql.cassandra
  OPTIONS (
    table "words",
    keyspace "test",
    cluster "Test Cluster",
    pushdown "true"
  )
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
  USING org.apache.spark.sql.cassandra
  OPTIONS (
    table "words",
    keyspace "test"
  );
SELECT * FROM words;
// Hello 1
// World 2
The DataStax (my employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive Metastore, which Spark uses as a catalogue. This makes all tables accessible without manual registration.
This method allows SELECT statements to be used without an accompanying CREATE VIEW.
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the place where that would be specified is taken by the keyspace. The closest documentation for something like this that I can find is in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING statement, but I don't think that will work inside of a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.

Can I force quotes for STRING fields in a JobConfigurationExtract for BigQuery?

There is a table we would like to export to a customer by means of a JobConfigurationExtract to Google Cloud Storage, using Java. I ran into an issue when exchanging CSV information with this customer: they must receive comma-separated CSV files in which string fields always have surrounding quotes.
I noticed that, by default, no quotes are added.
I also noticed that in the query explorer, quotes are added when a delimiter is present in one of the data values.
A small snippet of code showing how we configure this job:
Job exportJob = new Job();
JobConfiguration jobConfiguration = new JobConfiguration();
JobConfigurationExtract configurationExtract = new JobConfigurationExtract();
configurationExtract.setSourceTable(sourceTable);
configurationExtract.setFieldDelimiter(",");
configurationExtract.setPrintHeader(true);
configurationExtract.setDestinationUri(destinationUri);
//configurationExtract.setForcedQuotes(true) <=wish there was something like this.
jobConfiguration.setExtract(configurationExtract);
exportJob.setConfiguration(jobConfiguration);
Bigquery bigquery = getBigQuery();
Job resultJob = bigquery.jobs().insert(projectId, exportJob).execute();
Is there a way to achieve this, without making a very complicated query that concats quotes around strings?
There isn't a way to do this, other than, as you suggested, writing a query that writes out string fields with quotes. However, this is a reasonable feature request. Can you file it at the BigQuery public issue tracker here: https://code.google.com/p/google-bigquery/ so that we can prioritize it and you can keep track of progress?
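For the query workaround mentioned above, a hedged sketch of what it could look like with the same API (the table and field names are invented, and the bigquery client and projectId are reused from the snippet in the question; the quoted rows land in a staging table that the extract job then reads from):

import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;

// Wrap STRING fields in explicit quotes and materialize the result (legacy SQL syntax)
String quotedQuery =
    "SELECT CONCAT('\"', stringField, '\"') AS stringField, numericField "
  + "FROM [myproject:mydataset.sourceTable]";

JobConfigurationQuery queryConfig = new JobConfigurationQuery()
    .setQuery(quotedQuery)
    .setDestinationTable(new TableReference()
        .setProjectId("myproject")
        .setDatasetId("mydataset")
        .setTableId("sourceTable_quoted"))
    .setWriteDisposition("WRITE_TRUNCATE")
    .setAllowLargeResults(true);

Job queryJob = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));
bigquery.jobs().insert(projectId, queryJob).execute();
// ...then point the JobConfigurationExtract above at sourceTable_quoted instead of the source table.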
