Looking for an example of how to build a file in Cloud Storage dynamically. Below is my use case:
The Java application will query BigQuery for data.
Using pagination in BigQuery, the data will be pulled one page window at a time.
After getting the data from BQ, each chunk will be persisted to Cloud Storage.
After all chunks have been uploaded, the file upload will be completed.
The challenge here is that Cloud Storage objects are immutable, so once you have created an object in GCS you can no longer reopen it unless you overwrite the same file.
I tried exploring the streaming and resumable upload features, but based on my understanding they need the file to be ready prior to uploading.
If this is not possible, my only option is to upload each chunk as a different file and use the Cloud Storage compose feature to merge these chunks into one file. This is very costly, given that you need to make multiple requests to GCS just to complete one file.
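For reference, this is roughly what that compose fallback looks like with the Cloud Storage Java client (bucket and object names are placeholders; a single compose call accepts at most 32 source objects):

import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.ComposeRequest;
import com.google.cloud.storage.StorageOptions;
import java.util.List;

public class ComposeChunks {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        String bucket = "my-bucket"; // placeholder
        // Chunks previously uploaded as separate objects.
        List<String> chunkNames = List.of("export/chunk-0", "export/chunk-1");

        // Merge the chunk objects into one final object (max 32 sources per call).
        ComposeRequest request = ComposeRequest.newBuilder()
            .setTarget(BlobInfo.newBuilder(bucket, "export/final.csv").build())
            .addSource(chunkNames)
            .build();
        storage.compose(request);

        // Optionally clean up the chunk objects afterwards.
        for (String name : chunkNames) {
            storage.delete(bucket, name);
        }
    }
}

Each chunk upload, the compose call, and the cleanup deletes are all separate requests, which is exactly the overhead described above.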
If your final file format is CSV, JSONL (newline-delimited JSON), Avro, or Parquet, you can use the table export feature. Only one file will be generated if you export less than 1 GB.
The Java application queries BigQuery and sinks the result into a temporary table:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
The Java application then performs a table export.
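A minimal sketch of the export step with the BigQuery Java client (project, dataset, table, and bucket names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportTempTable {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Export the temporary table created above to Cloud Storage as a single CSV file.
        TableId table = TableId.of("myproject", "mydataset", "mytemptable");
        ExtractJobConfiguration config =
            ExtractJobConfiguration.newBuilder(table, "gs://my-bucket/myexport.csv")
                .setFormat("CSV")
                .build();

        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor(); // blocks until the extract job finishes
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}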
That's all.
Related
We have close to a million records in one DB table, and while inserting a record into the table we save a huge XML as a CLOB (one column has type CLOB). We are now planning to retire the CLOB column, as it occupies a lot of space and is causing some performance issues. As part of this, we plan to move the data in the CLOB column to an S3 bucket in AWS. Basically, the CLOB data will become a JSON file in the S3 bucket, and the application will read/write from the S3 bucket in the future. The problem we are facing is that, for the existing million records, we need to move the CLOB data to the S3 bucket. How can we move such a huge dataset? Iterating over each record from the DB and publishing it to the S3 bucket is very time consuming from a code perspective. Any suggestions in this regard?
One approach to moving the CLOB data to S3 for the existing million records is to use a tool that can do this task in bulk. Here are a few suggestions:
AWS Database Migration Service: This is a managed service that can migrate your data to and from various data sources, including Amazon S3. You can use the service to extract data from your database, convert it to JSON format, and load it to an S3 bucket.
AWS Glue: This is a fully managed ETL (extract, transform, load) service that can move data between different data sources. You can use Glue to create an ETL job that extracts data from your database, converts it to JSON format, and loads it to an S3 bucket.
Apache Spark: You can use Apache Spark to read data from your database and write it to an S3 bucket. Spark is a distributed computing framework that can handle large datasets and parallel processing. You can write a Spark job that reads data from your database, converts it to JSON format, and saves it to an S3 bucket.
Scripting: If you prefer to write your own code, you can use a language like Python or Java to read the data from your database and write it to an S3 bucket, using the AWS SDK for Python or Java to talk to S3 and your database driver to read the records.
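For the scripting option, a rough Java sketch assuming a JDBC-accessible database, a table with record_id and payload_clob columns, and a placeholder XML-to-JSON step (all names and the connection string are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ClobToS3 {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/mydb"; // placeholder
        String bucket = "my-clob-bucket";                        // placeholder

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
             S3Client s3 = S3Client.create()) {

            Statement stmt = conn.createStatement();
            stmt.setFetchSize(1000); // stream rows instead of loading the whole table
            ResultSet rs = stmt.executeQuery("SELECT record_id, payload_clob FROM my_table");

            while (rs.next()) {
                long id = rs.getLong("record_id");
                String xml = rs.getString("payload_clob"); // most drivers read CLOBs as strings
                String json = convertXmlToJson(xml);       // hypothetical XML-to-JSON helper

                // One JSON object per record in the bucket.
                s3.putObject(
                    PutObjectRequest.builder()
                        .bucket(bucket)
                        .key("records/" + id + ".json")
                        .contentType("application/json")
                        .build(),
                    RequestBody.fromString(json));
            }
        }
    }

    // Placeholder: plug in your own XML-to-JSON conversion here.
    private static String convertXmlToJson(String xml) {
        return xml;
    }
}

A single-threaded loop like this will be slow for a million rows; in practice you would partition the table by ID range and run several of these workers in parallel.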
Regardless of the approach you choose, make sure to test it on a small subset of your data before running it on the full dataset. Also, consider the cost and the time it takes to move such a huge dataset to S3, as it may take a while to complete and could incur additional costs.
I have a general question about the architecture I should use for my specific problem.
I have a .TSV file with some information, and my task is to create a REST API app that consumes this .TSV file and exposes 3 REST API endpoints. Each endpoint will return JSON data processed from the .TSV file.
My question is: should I create a POST method that uploads the TSV file, save it (e.g. in the session), and then do the logic in the API endpoints?
Or should I POST the content of the TSV file as JSON in every request to the specific endpoint?
I don't know how to glue it all together.
There is no requirement for a DB. The program will be tested just with numerous requests through the API, and I don't know how to process or store the .TSV content in my app so that one user could call all three endpoints sequentially over the same data without re-uploading the TSV file.
It's better to upload the file and then do the processing on the server. The file is uploaded in one request, which is better than sending its content with every request.
I believe the solution will depend on the size of the file. Storing the file in memory may not be a good approach if the file is very large. Saving the file in a session may not be good either, because if you need to scale your service in the future you will not be able to. Even storing the file in a /tmp directory can be a bad approach, because the solution is still not scalable.
It would be a good idea to use a storage service like AWS S3, Google Firebase, or any similar offering. When one of your three REST endpoints is called, your application will verify whether that file has already been processed, read it, do whatever processing you want, and save the result to your S3 bucket (if you don't want to keep the processed files, you can use a retention policy on S3 to delete them after X period of time).
Only after this does it return the result. As you can see, this is a synchronous solution.
If the file processing needs a lot of CPU and takes a long time, you will need an asynchronous solution. Instead of processing the files directly when the REST API is called, you would create another application that reads the file from S3, processes it, and saves the result, all asynchronously. Your REST API would then only fetch the processed result from S3 and return it.
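A rough sketch of the synchronous variant, assuming Spring Boot and an S3-backed store; the controller shape, paths, and the FileStore/TsvProcessor helpers are purely illustrative:

import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

@RestController
public class TsvController {

    private final FileStore store;        // e.g. a thin wrapper around the S3 client
    private final TsvProcessor processor; // parses TSV text into the JSON views

    public TsvController(FileStore store, TsvProcessor processor) {
        this.store = store;
        this.processor = processor;
    }

    // 1) Upload the TSV once; the client gets back an id to use on the other endpoints.
    @PostMapping("/files")
    public Map<String, String> upload(@RequestParam("file") MultipartFile file) throws Exception {
        String id = UUID.randomUUID().toString();
        store.save(id, file.getBytes()); // persist to S3 (or similar) so any instance can serve it
        return Map.of("fileId", id);
    }

    // 2) Each of the three endpoints reads the stored file and returns its processed JSON view.
    @GetMapping("/files/{id}/report-a")
    public List<Map<String, Object>> reportA(@PathVariable String id) {
        return processor.reportA(new String(store.load(id)));
    }
}

// Illustrative interfaces only; back them with S3 and your parsing logic.
interface FileStore {
    void save(String id, byte[] content);
    byte[] load(String id);
}

interface TsvProcessor {
    List<Map<String, Object>> reportA(String tsvContent);
}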
I have been searching for a way to load data into BigQuery programmatically from Google Cloud Storage. I have done this manually by taking a backup of one Kind into Google Cloud Storage and dumping it into a BigQuery table, and I was able to retrieve the data in Android as well. The only problem I am facing is that I want to upload data into the BigQuery table programmatically.
What are the various methods to achieve this?
You can use the BigQuery Java API to load data from Cloud Storage into BigQuery.
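For example, a minimal load-job sketch with the BigQuery Java client (dataset, table, and gs:// URI are placeholders, and the file is assumed to be CSV):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadFromGcs {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Load a CSV file from Cloud Storage into a BigQuery table.
        TableId table = TableId.of("mydataset", "mytable");
        LoadJobConfiguration config =
            LoadJobConfiguration.newBuilder(table, "gs://my-bucket/data.csv")
                .setFormatOptions(FormatOptions.csv())
                .setAutodetect(true) // let BigQuery infer the schema
                .build();

        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor();
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}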
If you need to process the data beforehand, depending on the scale, you can use Google Dataflow to process it and then load or stream the data into BigQuery.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
You can also use BigQuery to query files in Cloud Storage directly, without loading them, via external data sources.
https://cloud.google.com/bigquery/external-data-sources
My Apache Spark application takes various input files and stores the results and logs in other files. The input files are provided along with the application, which is supposed to run on the Amazon cloud (EMR seemed preferable to EC2).
Now, I know that I'm supposed to create an uber-jar containing my input files and the application that accesses them. However, how do I retrieve the generated files from the cloud once the execution finishes?
As additional info, the files are created and written using relative paths from the code.
Assuming you mean that you want to access the output generated by the Spark application outside the cluster, the usual thing to do is to write it to S3. Then you may of course read the data directly from S3 from outside the EMR cluster.
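A minimal sketch with Spark's Java API, assuming the results end up in a Dataset and the bucket name is a placeholder; on EMR the s3:// scheme is handled by EMRFS, so writing there usually needs no extra configuration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteResultsToS3 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("my-spark-app")
            .getOrCreate();

        // Placeholder for whatever actually produces your results.
        Dataset<Row> results = spark.read().text("input/data.txt");

        // Write the output to S3 instead of a relative local path,
        // so it is still reachable after the EMR cluster terminates.
        results.write()
            .mode("overwrite")
            .csv("s3://my-output-bucket/results/"); // placeholder bucket/prefix

        spark.stop();
    }
}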
I have created an app that will hold a large set of data in the form of XML files inside its documents folder. The data size is large and growing day by day, so I am planning to move it to a SQLite DB. I also want it moved to a SQLite DB for security purposes. I currently have around 1000 XML files, and the number may grow in the future. My primary question is: I want all the data inside the XML files to be moved into a SQLite DB using a backend system (.NET Framework or Java); can I then push this complete database to the iPhone using a web service, so that no XML parsing happens on the iPhone? I ask because I heard XML parsing is more resource intensive than reading from a SQLite DB on the iPhone. Is this a feasible solution, or is a better approach available?
Don't transport the entire set of data each time. Have the iOS client request only the changes since it last synced, and have it update its local database. Processing multiple XML documents should be fine so long as the app can synchronize in the background while the user continues to use it.
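As an illustration of the delta-sync idea on the backend, a rough sketch assuming a Java service with Spring Boot and an updated_at column on the records table (all names are placeholders):

import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SyncController {

    private final JdbcTemplate jdbc;

    public SyncController(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // The iOS client passes the timestamp of its last successful sync and
    // receives only the rows created or modified after that point.
    @GetMapping("/records/changes")
    public List<Map<String, Object>> changesSince(@RequestParam("since") long sinceEpochMillis) {
        return jdbc.queryForList(
            "SELECT id, payload, updated_at FROM records WHERE updated_at > ?",
            Timestamp.from(Instant.ofEpochMilli(sinceEpochMillis)));
    }
}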