How to batch insert data into Google BigQuery from a Java service?

I've read through a few similar questions on SO and GCP docs - but did not get a definitive answer...
Is there a way to batch insert data from my Java service into BigQuery directly, without using intermediary files, PubSub, or other Google services?
The key here is the "batch" mode: I do not want to use the streaming API as it costs a lot.
I know there are other ways to do batch inserts using Dataflow, Google Cloud Storage, etc. - I am not interested in those; I need to do batch inserts programmatically for my use case.
I was hoping to use the REST batch API but it looks like it is deprecated now: https://cloud.google.com/bigquery/batch
Alternatives that are pointed to by the docs are:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll REST request - but it looks like it works in streaming mode, inserting rows one at a time (and costing a lot)
a Java client library: https://developers.google.com/api-client-library/java/google-api-java-client/dev-guide
After following the links and references, I found this specific API method promising: https://googleapis.dev/java/google-api-client/latest/index.html?com/google/api/client/googleapis/batch/BatchRequest.html
with the following usage pattern:
Create a BatchRequest object from this Google API client instance.
Sample usage:
client.batch(httpRequestInitializer)
.queue(...)
.queue(...)
.execute();
Is this API using batch mode rather than streaming, and is it the right way to go?
thank you!

The "batch" version of writing data is called a "load job" in the Java client library. The bigquery.writer method creates an object which can be used to write data bytes as a batch load job. Set the format options based on the type of file you'd like to serialize to.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobStatistics.LoadStatistics;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
public class LoadLocalFile {

  public static void main(String[] args) throws IOException, InterruptedException {
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    Path csvPath = FileSystems.getDefault().getPath(".", "my-data.csv");
    loadLocalFile(datasetName, tableName, csvPath, FormatOptions.csv());
  }

  public static void loadLocalFile(
      String datasetName, String tableName, Path csvPath, FormatOptions formatOptions)
      throws IOException, InterruptedException {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);
      WriteChannelConfiguration writeChannelConfiguration =
          WriteChannelConfiguration.newBuilder(tableId).setFormatOptions(formatOptions).build();

      // The location and JobName must be specified; other fields can be auto-detected.
      String jobName = "jobId_" + UUID.randomUUID().toString();
      JobId jobId = JobId.newBuilder().setLocation("us").setJob(jobName).build();

      // Imports a local file into a table.
      try (TableDataWriteChannel writer = bigquery.writer(jobId, writeChannelConfiguration);
          OutputStream stream = Channels.newOutputStream(writer)) {
        // This example writes CSV data from a local file,
        // but bytes can also be written in batch from memory.
        // In addition to CSV, other formats such as
        // Newline-Delimited JSON (https://jsonlines.org/) are
        // supported.
        Files.copy(csvPath, stream);
      }

      // Get the Job created by the TableDataWriteChannel and wait for it to complete.
      Job job = bigquery.getJob(jobId);
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load local file to the table due to an error: \n"
                + completedJob.getStatus().getError());
        return;
      }

      // Get output status from the completed job (the pre-wait Job snapshot may not have it yet).
      LoadStatistics stats = completedJob.getStatistics();
      System.out.printf("Successfully loaded %d rows. \n", stats.getOutputRows());
    } catch (BigQueryException e) {
      System.out.println("Local file not loaded. \n" + e.toString());
    }
  }
}
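As the comments in the sample note, the same load-job channel can also be fed bytes held in memory. Below is a minimal sketch of that variant (the dataset and table names, the rows string, and its JSON content are placeholder assumptions), using FormatOptions.json() for newline-delimited JSON:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class LoadFromMemoryExample {
  public static void main(String[] args) throws IOException, InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of("MY_DATASET_NAME", "MY_TABLE_NAME");

    // Configure the load job to parse newline-delimited JSON instead of CSV.
    WriteChannelConfiguration config =
        WriteChannelConfiguration.newBuilder(tableId)
            .setFormatOptions(FormatOptions.json())
            .build();
    JobId jobId =
        JobId.newBuilder().setLocation("us").setJob("jobId_" + UUID.randomUUID()).build();

    // One JSON object per line; this placeholder string stands in for data built in memory.
    String rows = "{\"name\":\"Alice\",\"age\":30}\n{\"name\":\"Bob\",\"age\":25}\n";

    // Write the bytes through the channel; closing it starts the batch load job.
    try (TableDataWriteChannel writer = bigquery.writer(jobId, config)) {
      writer.write(ByteBuffer.wrap(rows.getBytes(StandardCharsets.UTF_8)));
    }

    // Wait for the load job to finish, exactly as in the file-based example above.
    Job completedJob = bigquery.getJob(jobId).waitFor();
    if (completedJob != null && completedJob.getStatus().getError() == null) {
      System.out.println("Rows loaded from memory.");
    }
  }
}
Closing the TableDataWriteChannel is what actually starts the load job, so the waitFor() step stays the same as in the file-based sample.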
Resources:
https://cloud.google.com/bigquery/docs/batch-loading-data#loading_data_from_local_files
https://cloud.google.com/bigquery/docs/samples/bigquery-load-from-file
system test which writes JSON from memory

Related

Google Dataproc API (through Java) does not submit Job to cluster

I was trying to get this section of code to submit a Hadoop job request based on this code sample:
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.HadoopJob;
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobMetadata;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubmitJob {

  public static void submitJob() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    submitJob(projectId, region, clusterName);
  }

  public static void submitJob(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the job controller client.
    JobControllerSettings jobControllerSettings =
        JobControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a job controller client with the configured settings. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (JobControllerClient jobControllerClient =
        JobControllerClient.create(jobControllerSettings)) {

      // Configure cluster placement for the job.
      JobPlacement jobPlacement = JobPlacement.newBuilder().setClusterName(clusterName).build();

      // Configure Hadoop job settings.
      HadoopJob hadJob =
          HadoopJob.newBuilder()
              .setMainClass("my jar file")
              .addArgs("input")
              .addArgs("output")
              .build();
      Job job = Job.newBuilder().setPlacement(jobPlacement).setHadoopJob(hadJob).build();

      // Submit an asynchronous request to execute the job.
      OperationFuture<Job, JobMetadata> submitJobAsOperationAsyncRequest =
          jobControllerClient.submitJobAsOperationAsync(projectId, region, job);

      // THIS IS WHERE IT SEEMS TO TIMEOUT VVVVVVVV
      Job response = submitJobAsOperationAsyncRequest.get();

      // Print output from Google Cloud Storage.
      Matcher matches =
          Pattern.compile("gs://(.*?)/(.*)").matcher(response.getDriverOutputResourceUri());
      matches.matches();
      Storage storage = StorageOptions.getDefaultInstance().getService();
      Blob blob = storage.get(matches.group(1), String.format("%s.000000000", matches.group(2)));
      System.out.println(
          String.format("Job finished successfully: %s", new String(blob.getContent())));
    } catch (ExecutionException e) {
      // If the job does not complete successfully, print the error message.
      System.err.println(String.format("submitJob: %s ", e.getMessage()));
    }
  }
}
When running this sample, the code seems to time out on Job response = submitJobAsOperationAsyncRequest.get(), and the Job is never submitted to my Google Cloud. I've checked all my project, region, and cluster names and I'm sure that is not the issue. I also have the following dependencies installed for the sample:
jar files
I believe I am not missing any .jar files.
Any suggestions? I appreciate any and all help.

Avro Schema for GenericRecord: Be able to leave blank fields

I'm using Java to convert JSON to Avro and store these to GCS using Google DataFlow.
The Avro schema is created on runtime using SchemaBuilder.
One of the fields I define in the schema is an optional LONG field; it is defined like this:
SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record(mainName).fields();
Schema concreteType = SchemaBuilder.nullable().longType();
fields.name("key1").type(concreteType).noDefault();
Now, when I create a GenericRecord using the schema above and "key1" is not set, putting the resulting GenericRecord to the context of my DoFn (context.output(res);) gives the following error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 256
I also tried doing the same thing with withDefault(0L) and got the same result.
What am I missing?
Thanks
It works fine for me when defined as below. Try printing the schema, which will help you compare; you can also remove the nullable() wrapper on the long type and build it through the fluent API instead:
fields.name("key1").type().nullable().longType().longDefault(0);
Here is the complete code that I used to test:
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaBuilder.FieldAssembler;
import org.apache.avro.SchemaBuilder.RecordBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import java.io.File;
import java.io.IOException;

public class GenericRecordExample {

  public static void main(String[] args) {
    FieldAssembler<Schema> fields;
    RecordBuilder<Schema> record = SchemaBuilder.record("Customer");
    fields = record.namespace("com.example").fields();
    fields = fields.name("first_name").type().nullable().stringType().noDefault();
    fields = fields.name("last_name").type().nullable().stringType().noDefault();
    fields = fields.name("account_number").type().nullable().longType().longDefault(0);
    Schema schema = fields.endRecord();
    System.out.println(schema.toString());

    // we build our first customer
    GenericRecordBuilder customerBuilder = new GenericRecordBuilder(schema);
    customerBuilder.set("first_name", "John");
    customerBuilder.set("last_name", "Doe");
    customerBuilder.set("account_number", 999333444111L);
    Record myCustomer = customerBuilder.build();
    System.out.println(myCustomer);

    // writing to a file
    final DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
      dataFileWriter.create(myCustomer.getSchema(), new File("customer-generic.avro"));
      dataFileWriter.append(myCustomer);
      System.out.println("Written customer-generic.avro");
    } catch (IOException e) {
      System.out.println("Couldn't write file");
      e.printStackTrace();
    }

    // reading from a file
    final File file = new File("customer-generic.avro");
    final DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
    GenericRecord customerRead;
    try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)) {
      customerRead = dataFileReader.next();
      System.out.println("Successfully read avro file");
      System.out.println(customerRead.toString());

      // get the data from the generic record
      System.out.println("First name: " + customerRead.get("first_name"));

      // read a non existent field
      System.out.println("Non existent field: " + customerRead.get("not_here"));
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
If I understand your question correctly, you're trying to accept JSON strings and save them in a Cloud Storage bucket, using Avro as your coder for the data as it moves through Dataflow. There's nothing immediately obvious from your code that looks wrong to me. I have done this, including saving the data to Cloud Storage and to BigQuery.
You might consider a simpler, and probably less error-prone, approach: define a Java class for your data and use Avro annotations on it so that the coder works properly. Here's an example:
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
@DefaultCoder(AvroCoder.class)
public class Data {
    public long nonNullableValue;
    // Boxed Long so the field can actually hold a null value.
    @Nullable public Long nullableValue;
}
Then, use this type in your DoFn implementations as you likely already are. Beam should be able to move the data between workers properly using Avro, even when the fields marked @Nullable are null.
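For illustration only, here is a minimal sketch of such a DoFn (the input format, field parsing, and class name ParseToDataFn are assumptions, not part of the original answer):
import org.apache.beam.sdk.transforms.DoFn;

// A hypothetical DoFn that builds the annotated Data class above and emits it.
// Because Data uses @DefaultCoder(AvroCoder.class), Beam serializes it with Avro between workers.
public class ParseToDataFn extends DoFn<String, Data> {
    @ProcessElement
    public void processElement(ProcessContext context) {
        // Placeholder parsing: assume "nonNullable,maybeNullable" comma-separated input.
        String[] parts = context.element().split(",", -1);
        Data data = new Data();
        data.nonNullableValue = Long.parseLong(parts[0]);
        // Leave the nullable field as null when the second column is empty.
        data.nullableValue = (parts.length > 1 && !parts[1].isEmpty())
            ? Long.valueOf(parts[1])
            : null;
        context.output(data);
    }
}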

Generate CSV from Java object and move to Azure Storage without intermediate location

Is it possible to create a file such as a CSV from a Java object and move it to Azure Storage without using a temporary location?
According to your description, it seems that you want to upload a CSV file without taking up local disk space, so I suggest you use a stream to upload the CSV content to Azure File Storage.
Please refer to the sample code below:
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.file.CloudFile;
import com.microsoft.azure.storage.file.CloudFileClient;
import com.microsoft.azure.storage.file.CloudFileDirectory;
import com.microsoft.azure.storage.file.CloudFileShare;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class UploadCSV {

    // Configure the connection string with your values
    public static final String storageConnectionString =
        "DefaultEndpointsProtocol=http;" +
        "AccountName=<storage account name>;" +
        "AccountKey=<storage key>";

    public static void main(String[] args) {
        try {
            CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);

            // Create the Azure Files client.
            CloudFileClient fileClient = storageAccount.createCloudFileClient();

            // Get a reference to the file share.
            CloudFileShare share = fileClient.getShareReference("test");

            // Get a reference to the root directory for the share.
            CloudFileDirectory rootDir = share.getRootDirectoryReference();

            // Get a reference to the file you want to upload.
            CloudFile file = rootDir.getFileReference("test.csv");

            // Upload the content directly from an in-memory stream (no temporary file).
            byte[] content = "aaa".getBytes(StandardCharsets.UTF_8);
            file.upload(new ByteArrayInputStream(content), content.length);
            System.out.println("upload success");
        } catch (Exception e) {
            // Output the stack trace.
            e.printStackTrace();
        }
    }
}
The file is then uploaded into the account successfully.
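Since the question asks about generating the CSV from a Java object, here is a minimal sketch of that step (the Person class, its fields, and the toCsv helper are assumptions for illustration only):
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical data class, used only to illustrate building CSV content in memory.
class Person {
    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Render a list of Person objects as CSV bytes; nothing is written to disk.
    static byte[] toCsv(List<Person> people) {
        StringBuilder sb = new StringBuilder("name,age\n");
        for (Person p : people) {
            sb.append(p.name).append(',').append(p.age).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
The resulting bytes could then replace the literal "aaa" above, e.g. file.upload(new ByteArrayInputStream(csvBytes), csvBytes.length).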
You could also refer to these threads:
1. Can I upload a stream to Azure blob storage without specifying its length upfront?
2. Upload blob in Azure using BlobOutputStream
Hope it helps you.

Java insert into MongoDB not working

I followed several different tutorials for this, but every time more or less nothing happens. Since I had problems with "ClassNotFoundException" I used the MongoDB driver jar file suggested in this question:
Stackoverflow Topic
I have a very simple Java project with a class Test running the main method to connect to my database "local" and to the collection "Countries". According to several tutorials, the data should be inserted as defined in the code. But when I check the collection on the command line or in Studio 3T it is still empty. There are some unused imports due to several tests before.
import org.bson.Document;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

public class Test {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        try {
            MongoClient connection = new MongoClient("localhost", 27017);
            DB db = connection.getDB("local");
            DBCollection coll = db.getCollection("Countries");
            BasicDBObject doc = new BasicDBObject("title", "MongoDB")
                .append("name", "Germany")
                .append("population", "82 000 000");
            coll.insert(doc);
            System.out.print("Test");
        } catch (Exception e) {
            System.out.print(e);
            System.out.print("Test");
        }
    }
}
The output is the following:
Usage : [--bucket bucketname] action
where action is one of:
list : lists all files in the store
put filename : puts the file filename into the store
get filename1 filename2 : gets filename1 from store and sends to filename2
md5 filename : does an md5 hash on a file in the db (for testing)
I don't get why the insert is not working and, in addition, why the System.out.print calls are not showing up. The getDB method is also struck through in Eclipse, saying "The method getDB(String) from the type Mongo is deprecated", which I do not really understand. I hope someone can help me to get the code working. Mongod.exe is running in the background.
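As a side note on the deprecation warning (a sketch only, not a diagnosis of why nothing is inserted): the non-deprecated API path, which the imports above already include, uses getDatabase and MongoCollection<Document> instead of getDB and DBCollection:
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

public class InsertExample {
    public static void main(String[] args) {
        MongoClient connection = new MongoClient("localhost", 27017);
        try {
            // getDatabase replaces the deprecated getDB.
            MongoDatabase db = connection.getDatabase("local");
            MongoCollection<Document> coll = db.getCollection("Countries");
            Document doc = new Document("title", "MongoDB")
                .append("name", "Germany")
                .append("population", "82 000 000");
            coll.insertOne(doc);
            System.out.println("Inserted: " + doc.toJson());
        } finally {
            connection.close();
        }
    }
}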

How to programmatically put data to the Google Appengine database from remote executable?

I would like to pre-fill and periodically push data to the Google App Engine database.
I would like to write programs in Java and Python that connect to my GAE service and upload data to my database.
How can I do that?
Thanks
Please use the Remote API for doing this programmatically.
In Python, you can first configure appengine_console.py as described here
Once you have that, you can launch the Python shell and run the following commands:
$ python appengine_console.py yourapp
>>> import yourdbmodelclassnamehere
>>> m = yourdbmodelclassnamehere(x='', y='')
>>> m.put()
And here is code for the Java version, which is self-explanatory (directly borrowed from the Remote API page in the GAE docs):
package remoteapiexample;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.remoteapi.RemoteApiInstaller;
import com.google.appengine.tools.remoteapi.RemoteApiOptions;
import java.io.IOException;

public class RemoteApiExample {
    public static void main(String[] args) throws IOException {
        String username = System.console().readLine("username: ");
        String password =
            new String(System.console().readPassword("password: "));
        RemoteApiOptions options = new RemoteApiOptions()
            .server("<your app>.appspot.com", 443)
            .credentials(username, password);
        RemoteApiInstaller installer = new RemoteApiInstaller();
        installer.install(options);
        try {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            System.out.println("Key of new entity is " +
                ds.put(new Entity("Hello Remote API!")));
        } finally {
            installer.uninstall();
        }
    }
}
