How to improve BigQuery read performance - java

We're using BigQuery to retrieve the full content of a big table, the publicly available publicdata:samples.natality.
Our code follows Google's instructions as described in their API doc - java.
We're able to retrieve this table at around 1,300 rows/sec, which is amazingly slow. Is there a faster way to retrieve the full result of a query, or is this as fast as it gets?

The recommended way to retrieve a large amount of data from a BigQuery table is not to page through the full table with tabledata.list, as that example does; that approach is optimized for reading a small number of rows from a query result.
Instead, you should run an extract job that exports the entire content of the table to Google Cloud Storage, from which you can then download it.
https://cloud.google.com/bigquery/exporting-data-from-bigquery
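For reference, a minimal sketch of such an extract job with the google-cloud-bigquery Java client might look like the following (the project, dataset, table, and bucket names are placeholders, and Avro is just one of the supported export formats):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportTable {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Source table and a wildcard GCS URI so large tables are sharded into multiple files.
        TableId sourceTable = TableId.of("your-project", "your_dataset", "your_table");
        String destinationUri = "gs://your-bucket/export/your_table-*.avro"; // placeholder bucket

        ExtractJobConfiguration extractConfig =
            ExtractJobConfiguration.newBuilder(sourceTable, destinationUri)
                .setFormat("AVRO")
                .build();

        Job job = bigquery.create(JobInfo.of(extractConfig));
        job = job.waitFor(); // Block until the extract job finishes.
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
        // The exported files can now be downloaded from Cloud Storage, e.g. with gsutil.
    }
}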

To download a table fast you can use the Google BigQuery Storage Client for Java.
It lets you download tables in efficient binary formats such as Avro or Arrow.
Using the basic Arrow example in the documentation, I managed to download ~1 million rows per second.
I think you can use it to download a query result by writing the result into a temporary table.
The code to get the temporary table of the result looks like this:
public static TableId getTemporaryTable(String query) throws InterruptedException {
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(query)
            .setUseLegacySql(false)
            .build();
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).build());
    queryJob = queryJob.waitFor(); // Wait for the query to complete.
    return ((QueryJobConfiguration) queryJob.getConfiguration()).getDestinationTable();
}
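As an untested sketch, the TableId returned above could then be handed to the Storage Read API roughly like this (following the shape of the documentation's Arrow example; Arrow decoding and error handling are omitted, and the single-stream setting is only for brevity):
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

public static void readTemporaryTable(String project, TableId tempTable) throws Exception {
    try (BigQueryReadClient client = BigQueryReadClient.create()) {
        String srcTable = String.format("projects/%s/datasets/%s/tables/%s",
            tempTable.getProject(), tempTable.getDataset(), tempTable.getTable());

        ReadSession.Builder sessionBuilder =
            ReadSession.newBuilder().setTable(srcTable).setDataFormat(DataFormat.ARROW);

        CreateReadSessionRequest request =
            CreateReadSessionRequest.newBuilder()
                .setParent(String.format("projects/%s", project))
                .setReadSession(sessionBuilder)
                .setMaxStreamCount(1) // use more streams to read in parallel
                .build();

        ReadSession session = client.createReadSession(request);

        ReadRowsRequest readRowsRequest =
            ReadRowsRequest.newBuilder().setReadStream(session.getStreams(0).getName()).build();

        for (ReadRowsResponse response : client.readRowsCallable().call(readRowsRequest)) {
            // Decode response.getArrowRecordBatch() with the Arrow library,
            // as shown in the documentation's Arrow example.
            System.out.println("Received batch with " + response.getRowCount() + " rows");
        }
    }
}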
References:
Google cloud documentation
GitHub repository

Related

Is it possible to use the Java method `wrapBigQueryInsertError` in a Dataflow streaming pipeline written in Python?

I'm trying to create a Dataflow streaming pipeline with Python 3 that reads messages from a Pub/Sub topic and ends up writing them to a BigQuery table "from scratch". In the Dataflow Java template named PubSubToBigQuery.java (which does what I'm looking for), I've seen a piece of code in the third step that handles those Pub/Sub messages, transformed into table rows, that fail when you try to insert them into the BigQuery table. Finally, in the code for steps 4 and 5, those are flattened and inserted into an error table:
Step 3:
PCollection<FailsafeElement<String, String>> failedInserts =
    writeResult
        .getFailedInsertsWithErr()
        .apply(
            "WrapInsertionErrors",
            MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
        .setCoder(FAILSAFE_ELEMENT_CODER);
Steps 4 & 5:
PCollectionList.of(
        ImmutableList.of(
            convertedTableRows.get(UDF_DEADLETTER_OUT),
            convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
    .apply("Flatten", Flatten.pCollections())
    .apply(
        "WriteFailedRecords",
        ErrorConverters.WritePubsubMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ValueProviderUtils.maybeUseDefaultDeadletterTable(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec(),
                    DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
            .build());

failedInserts.apply(
    "WriteFailedRecords",
    ErrorConverters.WriteStringMessageErrors.newBuilder()
        .setErrorRecordsTable(
            ValueProviderUtils.maybeUseDefaultDeadletterTable(
                options.getOutputDeadletterTable(),
                options.getOutputTableSpec(),
                DEFAULT_DEADLETTER_TABLE_SUFFIX))
        .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
        .build());
To do this, I suspect that the key lies in the first import in the template:
package com.google.cloud.teleport.templates;
import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;
Is this method available in Python?
If not, is there some way to achieve the same thing in Python other than checking that the structure and field data types of the records to be inserted match what the BigQuery table expects?
That kind of workaround slows down my streaming pipeline too much.
In Beam Python, when performing a streaming BigQuery write, the rows that failed during the write are returned by the transform. See https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1248
So you can process these in the same way as the Java template does.

Writing Data from RDS to Disk in JOOQ

My use case is that I have to run a query on an RDS instance and it returns 2 million records. Now I want to write the result directly to disk instead of bringing it into memory and then copying it to disk.
The following statement will bring all the records into memory; I want to transfer the results directly to a file on disk.
Result<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update 1:
I found the following way to read it:
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
How many records does it fetch at once, and are those records brought into memory first?
Update 2:
By default, the MySQL driver fetches all the records at once. If the fetch size is set to Integer.MIN_VALUE, it fetches one record at a time. If you want to fetch the records in batches, set useCursorFetch=true in the connection properties.
Related wiki: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html
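As an illustration (plain JDBC, with placeholder connection details and table name), the two modes look roughly like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

static void streamBooksToDisk() throws SQLException {
    // Option 1: streaming result set, one row at a time (fetch size = Integer.MIN_VALUE).
    try (Connection conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass");
         Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(Integer.MIN_VALUE); // tell Connector/J not to buffer the whole result
        try (ResultSet rs = stmt.executeQuery("select * from book")) {
            while (rs.next()) {
                // write the current row to the file on disk here
            }
        }
    }
    // Option 2: batched server-side cursor. Use a URL like
    // "jdbc:mysql://host/db?useCursorFetch=true" and call stmt.setFetchSize(5000)
    // to pull 5000 rows per round trip instead of all rows at once.
}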
Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
        .selectFrom(BOOK)
        .fetchSize(size)
        .fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
Please read your JDBC driver's manual about how it interprets the fetch size, noting that MySQL is "special".

Apache Beam Dataflow BigQuery

How can I get the list of tables from a Google BigQuery dataset using apache beam with DataflowRunner?
I can't find out how to get the tables from a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.
Declare library
from google.cloud import bigquery
Prepares a bigquery client
client = bigquery.Client(project='your_project_name')
Prepares a reference to the new dataset
dataset_ref = client.dataset('your_data_set_name')
Make API request
tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))
Reference:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets
You can try using the google-cloud-examples Maven repo. There's a class named BigQuerySnippets that makes an API call to get the table metadata, and you can fetch the schema from it. Please note that the API quota limit is a maximum of 6 concurrent requests per second.
The purpose of Dataflow is to create pipelines, so the ability to make such API requests is not included. You have to use the BigQuery Java Client Library to get the data and then provide it to your Apache Beam pipeline.
DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
    // do something
}
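One rough way to hand that list to a pipeline (a sketch only; the project and dataset names are placeholders, and the actual per-table copy transforms still have to be added) is to materialize the table specs first and feed them in with Create:
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class ListTablesPipeline {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        DatasetId datasetId = DatasetId.of("your-project", "your_us_dataset");

        // Collect fully qualified table specs ("project:dataset.table") outside the pipeline.
        List<String> tableSpecs = new ArrayList<>();
        for (Table table : bigquery.listTables(datasetId, TableListOption.pageSize(100)).iterateAll()) {
            TableId id = table.getTableId();
            tableSpecs.add(id.getProject() + ":" + id.getDataset() + "." + id.getTable());
        }

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Each element can then drive a per-table read/write step in the pipeline.
        p.apply("TableSpecs", Create.of(tableSpecs));
        p.run().waitUntilFinish();
    }
}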

Getting a MySQL table's key and engine information from a statement's metadata using java

I am in the process of writing a Java class that will read tables from a database that exists on one database server and will then recreate the tables in another database that resides on a different server.
With the above in mind, I am obtaining most of the tables' metadata from a result set that reads from the source database. I say most because I am unsure where I can find information on the keys, the auto-increment setting, and the engine.
Can I get this information via the statement's metadata? Or should I be looking elsewhere for it? Possibly the database's metadata?
If this helps, here is a snippet of the code - as you can see, quite basic stuff.
Statement sourceStmt = sourceConnection.createStatement();
ResultSet sourceRS = sourceStmt.executeQuery("select * from " + tableName);
// this is how I am getting the metadata and am not sure if this is correct
// or not in regards to wanting to get the key and engine type information
sourceRS.getMetaData();
Any information you can offer is greatly appreciated.
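For what it's worth, a minimal sketch of the database-metadata route (JDBC DatabaseMetaData for the primary keys, plus MySQL's SHOW TABLE STATUS for engine and auto-increment) could look like this:
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

static void printTableDetails(Connection sourceConnection, String tableName) throws SQLException {
    // Primary key columns come from the connection's metadata, not the statement's.
    DatabaseMetaData dbMeta = sourceConnection.getMetaData();
    try (ResultSet keys = dbMeta.getPrimaryKeys(null, null, tableName)) {
        while (keys.next()) {
            System.out.println("PK column: " + keys.getString("COLUMN_NAME"));
        }
    }

    // Engine and the auto-increment counter are MySQL-specific, so query them directly.
    try (Statement stmt = sourceConnection.createStatement();
         ResultSet rs = stmt.executeQuery("show table status like '" + tableName + "'")) {
        if (rs.next()) {
            System.out.println("Engine: " + rs.getString("Engine"));
            System.out.println("Auto_increment: " + rs.getString("Auto_increment"));
        }
    }
}
Alternatively, SHOW CREATE TABLE returns the full DDL, including keys and engine, in a single statement.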

produce hfiles for multiple tables to bulk load in a single map reduce

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into the HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load the outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution.
Do you think that it is possible?
My way is:
Use HFileOutputFormat as well; when the job is completed, doBulkLoad writes into table1.
Keep a List of puts in the mapper, and a MAX_PUTS value as a global.
When puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
Notice: you must have a cleanup function to write the remaining puts; see the sketch below.
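A minimal sketch of that cleanup, inside the same mapper class as the snippet above (reusing its puts list, table2, and the hbase.table.name.dic key), might be:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Flush whatever is left in the put buffer when the mapper finishes.
    if (!puts.isEmpty()) {
        Configuration conf = context.getConfiguration();
        HTable table = new HTable(conf, conf.get("hbase.table.name.dic", table2));
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.close(); // close() flushes the buffered puts
        puts.clear();
    }
}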
