We have close to a million records in one DB table, and when inserting a record we save a huge XML document in a CLOB column. We are now planning to retire the CLOB column because it occupies a lot of space and is causing performance issues. As part of this, we plan to move the CLOB data to an S3 bucket in AWS: each CLOB will become a JSON file in S3, and the application will read/write from the bucket going forward. The problem we are facing is that, for the existing million records, we need to move the CLOB data to the S3 bucket. How can we move such a huge dataset? Iterating over each record in the DB and publishing it to S3 from code is very time consuming. Any suggestions in this regard?
One approach to moving the CLOB data to S3 for the existing million records is to use a tool that can do the job in bulk. Here are a few suggestions:
AWS Database Migration Service: This is a managed service that can migrate your data to and from various data sources, including Amazon S3. You can use the service to extract data from your database, convert it to JSON format, and load it to an S3 bucket.
AWS Glue: This is a fully managed ETL (extract, transform, load) service that can move data between different data sources. You can use Glue to create an ETL job that extracts data from your database, converts it to JSON format, and loads it to an S3 bucket.
Apache Spark: You can use Apache Spark to read data from your database and write it to an S3 bucket. Spark is a distributed computing framework that can handle large datasets and parallel processing. You can write a Spark job that reads data from your database, converts it to JSON format, and saves it to an S3 bucket (a Spark sketch follows at the end of this answer).
Scripting: If you prefer to write your own code, you can use a language like Python or Java to read data from your database and write it to an S3 bucket, using a database driver to read the rows and the AWS SDK for Python or Java to write the objects to S3.
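For the scripting option, here is a rough Java sketch using JDBC and the AWS SDK for Java v2 (table, column, bucket and connection details are illustrative placeholders; the fetch size and thread count would need tuning for your environment). The main ideas are to stream the result set rather than loading it all at once, and to parallelise the S3 uploads:

import java.sql.Clob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ClobToS3Migrator {
    public static void main(String[] args) throws Exception {
        ExecutorService uploaders = Executors.newFixedThreadPool(16); // parallel S3 uploads
        try (S3Client s3 = S3Client.create();
             Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//db-host:1521/SERVICE",      // placeholder connection
                     "app_user", System.getenv("DB_PASSWORD"));
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, payload_clob FROM big_table")) {     // placeholder table/column
            ps.setFetchSize(500); // stream rows instead of materialising a million at once
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    Clob clob = rs.getClob("payload_clob");
                    String json = clob.getSubString(1, (int) clob.length());
                    // One S3 object per record, uploaded on a background thread
                    uploaders.submit(() -> s3.putObject(
                            PutObjectRequest.builder()
                                    .bucket("my-archive-bucket")
                                    .key("records/" + id + ".json")
                                    .build(),
                            RequestBody.fromString(json)));
                }
            }
            uploaders.shutdown();
            uploaders.awaitTermination(2, TimeUnit.HOURS);
        }
    }
}

Run it in batches (for example by id range) so a failure partway through can be resumed, and record which ids have been migrated so the CLOB column can be dropped safely afterwards.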
Regardless of the approach you choose, make sure to test it on a small subset of your data before running it on the full dataset. Also consider the time and cost of moving such a large dataset to S3; the migration may take a while to complete and can incur additional charges.
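For the Spark option mentioned above, a minimal Java sketch might look like the following (connection details, table and bucket names are placeholders; it assumes the Oracle JDBC driver and the S3A connector are available to the cluster, and it writes JSON-lines part files rather than one object per record):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClobToS3SparkJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("clob-to-s3").getOrCreate();

        // Partitioned JDBC read so the million rows are split across executors
        Dataset<Row> rows = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE") // placeholder
                .option("dbtable", "big_table")                            // placeholder
                .option("user", "app_user")
                .option("password", System.getenv("DB_PASSWORD"))
                .option("partitionColumn", "id")
                .option("lowerBound", "1")
                .option("upperBound", "1000000")
                .option("numPartitions", "32")
                .load();

        // Each row becomes one JSON line; Spark writes the output directly to S3
        rows.write().mode("overwrite").json("s3a://my-archive-bucket/clob-export/");
        spark.stop();
    }
}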
Related
I am using the STORAGE_WRITE_API method in Dataflow to write data into BigQuery through a batch pipeline. It is causing issues: sometimes the job gets stuck and does not load data into BigQuery. It works with small tables, but with large tables it starts giving problems without throwing any errors.
I tried the same code with the default write method and it ran properly with both small and large tables.
So I wanted to know whether the STORAGE_WRITE_API method is recommended for batch pipelines or not.
// Batch pipeline write to BigQuery using the Storage Write API
rows.apply(BigQueryIO.writeTableRows()
        .withJsonSchema(tableJsonSchema)
        .to(String.format("project:SampleDataset.%s", tableName))
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
);
STORAGE_WRITE_API is recommended for both batch and streaming according to the documentation:
The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. It combines streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery in real time or to batch process an arbitrarily large number of records and commit them in a single atomic operation.
Advantages of using the Storage Write API:
Exactly-once delivery semantics. The Storage Write API supports exactly-once semantics through the use of stream offsets. Unlike the tabledata.insertAll method, the Storage Write API never writes two messages that have the same offset within a stream, if the client provides stream offsets when appending records.
Stream-level transactions. You can write data to a stream and commit the data as a single transaction. If the commit operation fails, you can safely retry the operation.
Transactions across streams. Multiple workers can create their own streams to process data independently. When all the workers have finished, you can commit all of the streams as a transaction.
Efficient protocol. The Storage Write API is more efficient than the older insertAll method because it uses gRPC streaming rather than REST over HTTP. The Storage Write API also supports binary formats in the form of protocol buffers, which are a more efficient wire format than JSON. Write requests are asynchronous with guaranteed ordering.
Schema update detection. If the underlying table schema changes while the client is streaming, then the Storage Write API notifies the client. The client can decide whether to reconnect using the updated schema, or continue to write to the existing connection.
Lower cost. The Storage Write API has a significantly lower cost than the older insertAll streaming API. In addition, you can ingest up to 2 TB per month for free.
There are many advantages for both batch and streaming pipelines.
For batch mode, it is more efficient than the file-based batch load method.
To understand this odd behaviour, you need to check all the available logs:
the job log in the Dataflow UI
the worker log in the Dataflow UI
the Diagnostics tab in the Dataflow UI
Cloud Logging with a filter on dataflow_step
Also use the latest Apache Beam version if possible (2.43.0 at the time of writing).
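If the batch job still stalls after checking the logs and upgrading, one setting you could experiment with on the same write transform (assuming a recent Beam version where the setter is available) is pinning the number of Storage Write API streams explicitly, for example:

rows.apply(BigQueryIO.writeTableRows()
        .withJsonSchema(tableJsonSchema)
        .to(String.format("project:SampleDataset.%s", tableName))
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        // Illustrative value; tune the stream count for your table size and worker count
        .withNumStorageWriteApiStreams(4)
);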
Looking for an example of how to build a file in Cloud Storage dynamically. Below is my use case:
The Java application will query BigQuery for data.
Using pagination in BigQuery, data will be pulled one page window at a time.
After getting the data from BQ, each chunk will be persisted to Cloud Storage.
After all chunks have been uploaded, the file upload is completed.
The challenge here is that Cloud Storage objects are immutable, so once you have created an object in GCS you can no longer reopen it; you can only overwrite the whole file.
I tried exploring the streaming and resumable upload features, and based on my understanding they require the file to be ready prior to uploading.
If this is not possible, my only option is to upload each chunk as a separate file and use the Cloud Storage compose feature to merge the chunks into one file. This is very costly, given that you need to make multiple requests to GCS just to complete one file.
If your final file format is CSV, JSONL (newline-delimited JSON), Avro or Parquet, you can use the table export feature. Only one file will be generated if you export less than 1 GB.
The Java application queries BigQuery and sinks the result into a temporary table:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
The Java application then performs a table export of that temporary table.
That's all.
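For reference, a minimal sketch of the export step with the BigQuery Java client (dataset and table names follow the example above; the bucket and object name are placeholders, and the format here is newline-delimited JSON):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class TempTableExport {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Export the temporary table created above to a single GCS object
        // (it stays a single file as long as the export is under ~1 GB)
        TableId tempTable = TableId.of("myproject", "mydataset", "mytemptable");
        ExtractJobConfiguration extractConfig = ExtractJobConfiguration
                .newBuilder(tempTable, "gs://my-export-bucket/result.json") // placeholder URI
                .setFormat("NEWLINE_DELIMITED_JSON")
                .build();

        Job job = bigquery.create(JobInfo.of(extractConfig)).waitFor();
        if (job == null || job.getStatus().getError() != null) {
            throw new RuntimeException("Export failed: "
                    + (job == null ? "job no longer exists" : job.getStatus().getError()));
        }
    }
}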
I have been searching for how to load data into BigQuery programmatically from Google Cloud Storage. I have done this manually, by taking a backup of one Kind to Google Cloud Storage and dumping it into a BigQuery table, and I was able to retrieve the data in Android as well. The only problem I am facing is that I want to upload the data into the BigQuery table programmatically.
What are the various methods to achieve this?
You can use the BigQuery Java API to load data from Cloud Storage into BigQuery.
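A minimal sketch of such a load job with the google-cloud-bigquery Java client (dataset, table and GCS URI are placeholders; the example assumes newline-delimited JSON, so if your file is a Datastore backup the format option would differ):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsToBigQueryLoad {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        TableId table = TableId.of("my_dataset", "my_table");   // placeholder
        String sourceUri = "gs://my-bucket/backup/data.json";   // placeholder

        LoadJobConfiguration loadConfig = LoadJobConfiguration
                .newBuilder(table, sourceUri)
                .setFormatOptions(FormatOptions.json()) // newline-delimited JSON
                .setAutodetect(true)                    // or supply an explicit schema
                .build();

        Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();
        if (job == null || job.getStatus().getError() != null) {
            throw new RuntimeException("Load failed: "
                    + (job == null ? "job no longer exists" : job.getStatus().getError()));
        }
    }
}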
If you need to process the data beforehand, depending on the scale, you can use Google Dataflow to process it and then load or stream the results into BigQuery.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
You can also use BigQuery to query files in Cloud Storage directly, without loading them, via external data sources.
https://cloud.google.com/bigquery/external-data-sources
What are the options for indexing large amounts of data from an Oracle DB into an Elasticsearch cluster? The requirement is to index 300 million records one time into multiple indexes, plus incremental updates of roughly 1 million changes every day.
I have tried the JDBC plugin for the Elasticsearch river/feeder; both seem to run inside, or require, a locally running Elasticsearch instance. Please let me know if there is a better option for running the Elasticsearch indexer as a standalone job (probably Java based). Any suggestions will be very helpful.
Thanks.
We use ES as a reporting DB, and when new records are written to SQL we take the following steps to get them into ES:
Write the primary key into a queue (we use RabbitMQ).
A consumer picks up the primary key (when it has capacity), queries the relational DB for the info it needs, and then writes the data into ES.
This process works great because it handles both new data and old data. For the old data, just write a quick script that pushes the 300M primary keys into Rabbit and you're done!
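A rough sketch of the consumer side of this pattern in Java (queue, table, column and index names are made up for illustration; it assumes the RabbitMQ Java client, an Oracle JDBC driver and the Elasticsearch high-level REST client are on the classpath):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class PrimaryKeyIndexer {
    public static void main(String[] args) throws Exception {
        java.sql.Connection db = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/SERVICE", "app_user",
                System.getenv("DB_PASSWORD"));
        RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-host", 9200, "http")));

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-host");
        Connection mq = factory.newConnection();
        Channel channel = mq.createChannel();
        channel.queueDeclare("es-index-queue", true, false, false, null);

        // Each message is just a primary key: look the row up and index it into ES
        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String pk = new String(delivery.getBody(), StandardCharsets.UTF_8);
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT name, status FROM orders WHERE id = ?")) {
                ps.setLong(1, Long.parseLong(pk));
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        Map<String, Object> doc = new HashMap<>();
                        doc.put("name", rs.getString("name"));
                        doc.put("status", rs.getString("status"));
                        es.index(new IndexRequest("orders").id(pk).source(doc),
                                RequestOptions.DEFAULT);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace(); // in practice: retry or dead-letter the message
            }
        };
        channel.basicConsume("es-index-queue", true, onMessage, consumerTag -> { });
    }
}

The same consumer handles both the one-time backfill and the daily incremental changes, since both arrive as primary keys on the queue.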
There are many integration options. I've listed a few below to give you some ideas; the right solution is really going to depend on your specific resources and requirements, though.
Oracle GoldenGate will look at the Oracle DB transaction logs and can feed changes in real time to ES.
An ETL tool, for example Oracle Data Integrator, could run on a schedule and pull data from your DB, transform it and send it to ES.
Create triggers in the Oracle DB so that data updates can be written to ES using a stored procedure, or use the trigger to write flags to a "changes" table that some external process (e.g. a Java application) monitors and uses to extract data from the Oracle DB.
Get the application that writes to the Oracle DB to also feed ES. Ideally your application and the Oracle DB should be loosely coupled; do you have an integration platform that can feed the messages to both ES and Oracle?
I have created an app that holds a large set of data in the form of XML files inside the documents folder. The data is large and growing day by day, so I am planning to move it into a SQLite DB. I also want it moved to a SQLite DB for security purposes. I have around 1000 XML files currently, and that may grow in the future. My main question is: I want all the data inside the XML files to be moved into a SQLite DB by a backend system (.NET Framework or Java), and can I then push this complete database to the iPhone using a web service? That way no XML parsing happens on the iPhone, because I have heard XML parsing is more resource intensive than reading from a SQLite DB on the iPhone. Is this a feasible solution, or is there a better approach?
Don't transport the entire set of data each time. Have the iOS client request only the changes since it last synced, and have it update its local database. Processing multiple XML documents should be fine so long as the app can synchronize in the background while the user continues to use it.
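A very small sketch of the "changes since last sync" idea on the backend side, in Java (table and column names are made up; it assumes each record carries an updated_at timestamp the backend can filter on):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class DeltaSync {
    // Returns the payloads that changed after the client's last successful sync
    public static List<String> changesSince(Connection db, Instant lastSync) throws Exception {
        List<String> changed = new ArrayList<>();
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT payload_xml FROM documents WHERE updated_at > ?")) {
            ps.setTimestamp(1, Timestamp.from(lastSync));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    changed.add(rs.getString("payload_xml"));
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://backend-host/appdb", "app_user", System.getenv("DB_PASSWORD"))) {
            // The iOS client sends its last sync time; only newer documents go back over the wire
            List<String> delta = changesSince(db, Instant.parse("2023-01-01T00:00:00Z"));
            System.out.println(delta.size() + " documents changed since the last sync");
        }
    }
}

The web service would expose something like this, and the iOS client would apply the returned documents to its local SQLite database and store the new sync timestamp.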