Apache Beam Dataflow BigQuery - java

How can I get the list of tables from a Google BigQuery dataset using Apache Beam with the DataflowRunner?
I can't find a way to get the tables of a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.

# Declare the library
from google.cloud import bigquery

# Prepare a BigQuery client
client = bigquery.Client(project='your_project_name')

# Prepare a reference to the dataset
dataset_ref = client.dataset('your_data_set_name')

# Make the API request
tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))
Reference:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets

You can try using the google-cloud-examples Maven repo. There's a class named BigQuerySnippets that makes an API call to get the table metadata, from which you can fetch the schema. Please note that the API quota is limited to a maximum of 6 concurrent requests per second.
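For reference, a minimal sketch of what that snippet boils down to with the google-cloud-bigquery client, fetching a table's metadata and schema (the dataset and table names below are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class GetTableSchema {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // "my_dataset" and "my_table" are placeholders for your own names.
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
        Schema schema = table.getDefinition().getSchema();
        System.out.println(schema);
    }
}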

The purpose of Dataflow is to create pipelines, so the ability to make this kind of API request is not included. You have to use the BigQuery Java Client Library to get the table list and then provide it to your Apache Beam pipeline.
DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
    // do something
}
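Putting that together with the question, a rough sketch (the project and dataset names are placeholders, and this is only one way to wire it up) of listing the tables outside the pipeline and handing the names to Beam:

import java.util.ArrayList;
import java.util.List;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.Table;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class ListTablesIntoPipeline {
    public static void main(String[] args) {
        // List the tables with the client library, outside the Beam graph.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        DatasetId datasetId = DatasetId.of("my-project", "my_us_dataset"); // placeholders
        List<String> tableNames = new ArrayList<>();
        for (Table table : bigquery.listTables(datasetId, TableListOption.pageSize(100)).iterateAll()) {
            tableNames.add(table.getTableId().getTable());
        }
        // Hand the table names to the pipeline; downstream transforms can read/copy each table.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("TableNames", Create.of(tableNames));
        p.run().waitUntilFinish();
    }
}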

Related

Java/Spring: connect to several BigQuery datasets

Everything is mostly in the title.
I have an API already connected to a BigQuery dataset, which is queried regularly. Soon, my process will need new data stored in another BigQuery dataset.
So I started to check whether it's possible to connect one Spring API to two different BQ datasets. Unless I missed it, I didn't find any information for this specific case in the BQ documentation.
As the API is already connected, the property values of spring.cloud.gcp.bigQuery.* are already defined. As such, I can't use those properties to define the new connection.
So, is it possible to connect one API to several BigQuery datasets? If so, how can I do that with the properties files?
Could you not achieve what you are after by doing something like:
// Client for the first project
public BigQuery bigQuery1() throws IOException {
    return BigQueryOptions.newBuilder()
        .setCredentials(
            ServiceAccountCredentials.fromStream(
                new FileInputStream("your_json_service_keyfile.json")))
        .setProjectId("bigQuery1")
        .build()
        .getService();
}

// Client for the second project
public BigQuery bigQuery2() throws IOException {
    return BigQueryOptions.newBuilder()
        .setCredentials(
            ServiceAccountCredentials.fromStream(
                new FileInputStream("your_json_service_keyfile.json")))
        .setProjectId("bigQuery2")
        .build()
        .getService();
}
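If that suits you, each client can then be used independently; in a Spring app you would typically expose the two methods as separate bean definitions and inject whichever one you need. A minimal usage sketch (the query is just a placeholder, and exception handling is omitted):

// TableResult and QueryJobConfiguration come from the com.google.cloud.bigquery library.
TableResult resultA = bigQuery1().query(QueryJobConfiguration.of("SELECT 1"));
TableResult resultB = bigQuery2().query(QueryJobConfiguration.of("SELECT 1"));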

Is it possible to import the Java method `wrapBigQueryInsertError` into a Dataflow streaming pipeline written in Python?

I'm trying to create a Dataflow streaming pipeline with Python 3 that reads messages from a Pub/Sub topic and ends up writing them to a BigQuery table "from scratch". In the Dataflow Java template named PubSubToBigQuery.java (which does what I'm looking for), I've seen a piece of code in the third step that handles the Pub/Sub messages, transformed into table rows, that fail when you try to insert them into the BigQuery table. Finally, in the code for steps 4 and 5, those rows are flattened and inserted into an error table:
Step 3:
PCollection<FailsafeElement<String, String>> failedInserts =
    writeResult
        .getFailedInsertsWithErr()
        .apply(
            "WrapInsertionErrors",
            MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
        .setCoder(FAILSAFE_ELEMENT_CODER);
Steps 4 & 5
PCollectionList.of(
        ImmutableList.of(
            convertedTableRows.get(UDF_DEADLETTER_OUT),
            convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
    .apply("Flatten", Flatten.pCollections())
    .apply(
        "WriteFailedRecords",
        ErrorConverters.WritePubsubMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ValueProviderUtils.maybeUseDefaultDeadletterTable(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec(),
                    DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
            .build());

failedInserts.apply(
    "WriteFailedRecords",
    ErrorConverters.WriteStringMessageErrors.newBuilder()
        .setErrorRecordsTable(
            ValueProviderUtils.maybeUseDefaultDeadletterTable(
                options.getOutputDeadletterTable(),
                options.getOutputTableSpec(),
                DEFAULT_DEADLETTER_TABLE_SUFFIX))
        .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
        .build());
In order to do this, I suspect that the key lies in the first import of the template:
package com.google.cloud.teleport.templates;
import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;
Is this method available in Python?
If not, is there some way to achieve the same thing in Python other than checking that the structure and the field data types of the records to be inserted match what the BigQuery table expects?
That kind of workaround slows down my streaming pipeline too much.
In Beam Python, when performing a streaming BigQuery write, the rows that failed during the write are returned by the transform; see https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1248
So you can process them in the same way as the Java template does.

BigQueryIO read get TableSchema

What I want to do is read an existing table and generate a new table that has the same schema as the original plus a few extra columns (computed from some columns of the original table). The original table's schema can grow without notice to me (the fields I use in my Dataflow job won't change), so I would like to always read the schema instead of defining some custom class that contains it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
    .getDatasetService(options)
    .getTable(projectId, dataset, table)
    .getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses in Get TableSchema from BigQuery result PCollection<TableRow> but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?
Due to how BigQueryIO is set up now, it needs to know the table schema before the pipeline begins to run. This is a good feature idea, but it's not feasible within a single pipeline. In the example you linked, the table schema is queried before running the pipeline.
If new columns are added, then unfortunately a new pipeline must be relaunched.
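One workaround sketch for SDK 2.x (project, dataset, and table names are placeholders): fetch the schema with the standalone google-cloud-bigquery client before constructing the pipeline, then derive the extended output schema from it.

// All classes here come from the com.google.cloud.bigquery client library.
// Runs at pipeline-construction time, before the pipeline is built and run.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
Schema schema = bigquery
    .getTable(TableId.of("my_project", "my_dataset", "my_table"))
    .getDefinition()
    .getSchema();
// Build the output schema from `schema` plus the extra computed columns, convert it to a
// com.google.api.services.bigquery.model.TableSchema, and pass it to BigQueryIO's withSchema(...).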

How to improve BigQuery read performance

We're using BigQuery to retrieve the full content of a big table, the publicly available publicdata:samples.natality.
Our code follows Google's instructions as described in their Java API docs.
We're able to retrieve this table at around 1,300 rows/sec, which is amazingly slow. Is there a faster way to retrieve the full result of a query, or is this as fast as it gets?
The recommended way to retrieve a large amount of data from a BigQuery table is not to use tabledata.list to page through the full table, as that example does; that API is optimized for reading a small number of rows from a query result.
Instead, you should run an extract job that exports the entire content of the table to Google Cloud Storage, from which you can then download it:
https://cloud.google.com/bigquery/exporting-data-from-bigquery
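For example, a hedged sketch of such an extract job with the Java client library (the project, dataset, table, and bucket names are placeholders):

// Export the table to GCS as Avro; the files can then be downloaded from the bucket.
// BigQuery, Table, TableId, and Job come from the com.google.cloud.bigquery library.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
Table table = bigquery.getTable(TableId.of("my_project", "my_dataset", "my_table"));
Job extractJob = table.extract("AVRO", "gs://my-bucket/export/natality-*.avro");
extractJob.waitFor(); // blocks until the extract job completes (throws InterruptedException)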
To download a table fast you can use the Google BigQuery Storage Client for Java.
It lets you download tables in efficient binary formats such as Avro or Arrow.
Using the basic Arrow example in the documentation, I managed to download ~1 million rows per second.
I think you can use it to download a query result by writing the result into a temporary table.
The code to get the temporary table of the result looks like this:
public static TableId getTemporaryTable(String query) throws InterruptedException {
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(query)
            .setUseLegacySql(false)
            .build();
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).build());
    queryJob = queryJob.waitFor(); // Wait for the query to complete.
    return ((QueryJobConfiguration) queryJob.getConfiguration()).getDestinationTable();
}
References:
Google cloud documentation
GitHub repository
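For reference, a condensed sketch of the Storage Read API usage described above (the project, dataset, and table names are placeholders, and the Arrow decoding step is left as a comment):

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public class StorageReadSketch {
    public static void main(String[] args) throws Exception {
        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            // Create a read session over the (placeholder) table, requesting Arrow output.
            ReadSession session = client.createReadSession(
                CreateReadSessionRequest.newBuilder()
                    .setParent("projects/my-project")
                    .setReadSession(
                        ReadSession.newBuilder()
                            .setTable("projects/my-project/datasets/my_dataset/tables/my_table")
                            .setDataFormat(DataFormat.ARROW))
                    .setMaxStreamCount(1)
                    .build());
            // Stream the rows back from the single read stream.
            ReadRowsRequest request = ReadRowsRequest.newBuilder()
                .setReadStream(session.getStreams(0).getName())
                .build();
            ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(request);
            for (ReadRowsResponse response : stream) {
                // Decode response.getArrowRecordBatch() with an Arrow reader to get the rows.
                System.out.println("Received a batch of " + response.getRowCount() + " rows");
            }
        }
    }
}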

Transform Cassandra query result to POJO with Astyanax

I am working on a Spring web application that uses Cassandra with the Astyanax client. I want to transform the result data retrieved from Cassandra queries into a POJO, but I do not know which library or Astyanax API supports this.
For example, I have a User column family (CF) with some basic properties (username, password, email), and other related additional information can be added to this CF. Then I fetch one User row from that CF, using an OperationResult<ColumnList<String>> to hold the returned data, like this:
OperationResult<ColumnList<String>> columns = getKeyspace().prepareQuery(getColumnFamily()).getRow(rowKey).execute();
What I want to do next is populate "columns" into my User object. Here I have two problems, and I would appreciate your help solving them:
1/ What is the best structure for the User class to hold the corresponding data retrieved from the User CF? My suggestion is:
public class User {
    String userName, password, email; // Basic properties
    Map<String, Object> additionalInfo;
}
2/ How can I transform the Cassandra data into this POJO using a generic method (so that it can be applied to every CF that has a mapped POJO)?
I am sorry if anything in my questions is naive; I have only been working with NoSQL concepts, Cassandra, and Astyanax for two weeks.
Thank you so much for your help.
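(For illustration, one hand-rolled way to do the mapping in point 2 is sketched below; it is not a library API, it assumes the column values are strings, and it uses the field names from the User class above. A generic version could use reflection, which is what the mapping libraries suggested in the answers that follow do for you.)

// Column and ColumnList come from com.netflix.astyanax.model.
public static User toUser(ColumnList<String> columns) {
    User user = new User();
    user.additionalInfo = new HashMap<>();
    for (Column<String> column : columns) {
        String name = column.getName();
        String value = column.getStringValue();
        if ("username".equals(name)) {
            user.userName = value;
        } else if ("password".equals(name)) {
            user.password = value;
        } else if ("email".equals(name)) {
            user.email = value;
        } else {
            // Anything not mapped to a basic property goes into the additional-info map.
            user.additionalInfo.put(name, value);
        }
    }
    return user;
}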
You can try Achilles: https://github.com/doanduyhai/achilles, a JPA-compliant Entity Manager for Cassandra.
Right now there is a complete implementation using the Thrift API via Hector.
The CQL3 implementation using the Datastax Java Driver is in progress. A beta version will be available in a few months (July-August 2013).
CQL3 is great, but it's still too low level because you need to extract the data yourself from the ResultSet. It's like going back to the time when only JDBC Template was available.
Achilles is there to fill the gap.
I would suggest using a library like Playorm, with which you can easily perform CRUD operations on your entities. See this example of how you can create a User object; you can then get the POJO easily with
User user1 = mgr.find(User.class, email);
Assuming that email is your NoSqlId (the primary key, or row key, in Cassandra).
I use com.netflix.astyanax.mapping.Mapping and com.netflix.astyanax.mapping.MappingCache for exactly this purpose.
