Is there any limit on loading data into Google BigQuery from a local file via the API?
The Google BigQuery documentation mentions that, for the Web UI, a local file must be <= 10 MB and 16,000 rows. Does the same limit apply to the API?
There is no BigQuery API to load local files. Local file load is available via the bq command or the Web UI - and I believe what happens when you do this is that the file is simply uploaded to GCS on your behalf and then a normal API load job from GCS is run - you can see it clearly in the UI. But because Google wants to keep a reasonable user experience in the Web UI / bq command, additional, much stricter limits apply to uploading "local" files.
My recommendation is to go the GCS route for loading big files (see the sketch after the limits below)
(https://cloud.google.com/bigquery/docs/loading-data-cloud-storage)
An important point - it is free (compared with streaming, where you pay for the streamed data)
The limits are as follows (from https://cloud.google.com/bigquery/quotas)
Load jobs per table per day — 1,000 (including failures)
Maximum columns per table — 10,000
Maximum size per load job — 15 TB across all input files for CSV, JSON, and Avro
Maximum number of files per load job — 10 Million total files including all files matching all wildcard URIs
Maximum file size for CSV and JSON — 4 GB compressed, 5 TB uncompressed
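For reference, a minimal sketch of the GCS load path using the google-cloud-bigquery Java client (not from the docs above; the dataset, table, bucket and file names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsLoadSketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder dataset/table and GCS URI - replace with your own.
        TableId tableId = TableId.of("my_dataset", "my_table");
        String sourceUri = "gs://my-bucket/big_file.csv";

        LoadJobConfiguration loadConfig =
            LoadJobConfiguration.newBuilder(tableId, sourceUri)
                .setFormatOptions(FormatOptions.csv())
                .build();

        // Submit the load job and block until it finishes.
        Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();

        if (job.getStatus().getError() != null) {
            System.out.println("Load failed: " + job.getStatus().getError());
        } else {
            System.out.println("Load completed.");
        }
    }
}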
There are no special limits for local file uploads; the 10 MB and 16,000-row limit applies only to the UI. But I don't recommend uploading huge local files.
400 Bad Request
Max file size is 32000000 bytes. File "WEB-INF/lib/gwt-user.jar" is 32026261 bytes.
I've been deploying this app for years without issues and this file (gwt-user.jar) has been part of this deployment (it has not been updated for 2 years). Anybody have any ideas as to what could have changed?
There have been no recent changes related to file size in App Engine. According to the official documentation, the limit for each uploaded file is 32 megabytes.
Deployments
An application is limited to 10,000 uploaded files per version. Each file is limited to a maximum size of 32 megabytes. Additionally, if the total size of all files for all versions exceeds the initial free 1 gigabyte, then there will be a $0.026 per GB per month charge.
I would suggest to:
Make sure the WAR file contains only the essential libraries required for the application to start.
Use the Blobstore for the other dependencies of your App Engine app (split up the necessary libraries).
I have two questions here. We store a huge number of XML files in a Postgres database, as bytea. For users with few files there is no problem and we can handle all the files.
A user can download all of his XMLs in a zip file. We then retrieve all the files from the database (around 5,000 XML files of about 15 KB each), zip them, and return the archive to the frontend.
The problem is that for that number of files we run into GC overhead errors, and the system sometimes goes down.
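(A simplified sketch of roughly what we do today - the real table, column and method names differ; the point is that every file ends up as a byte[] in memory before and while zipping.)

import java.io.ByteArrayOutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

byte[] buildZipForUser(Connection conn, long userId) throws Exception {
    // 1. Load every XML for the user into memory (~5000 files x ~15 KB).
    List<String> names = new ArrayList<>();
    List<byte[]> contents = new ArrayList<>();
    try (PreparedStatement ps = conn.prepareStatement(
             "SELECT file_name, xml_content FROM user_files WHERE user_id = ?")) {
        ps.setLong(1, userId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                names.add(rs.getString("file_name"));
                contents.add(rs.getBytes("xml_content")); // bytea maps to byte[]
            }
        }
    }

    // 2. Zip everything into another in-memory buffer and return it to the frontend.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bos)) {
        for (int i = 0; i < names.size(); i++) {
            zip.putNextEntry(new ZipEntry(names.get(i)));
            zip.write(contents.get(i));
            zip.closeEntry();
        }
    }
    return bos.toByteArray();
}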
Is there a better way to handle these files?
If it is OK, how can we avoid the GC overhead when retrieving all the files?
Thanks in advance!
We have some batch jobs that process gzipped files that are ~10 GB compressed and ~30 GB uncompressed.
Processing this in Java takes an unreasonable amount of time, and we are looking for a more effective way to do it. If we use TextIO or the native Java SDK for GCS to download the file, it takes more than 8 hours to process, and the reason is that it cannot scale out for some reason. Most likely it won't split the file since it is gzipped.
If I unzip the file and process the uncompressed file, the job takes roughly 10 minutes, so on the order of 100 times as fast.
I can totally understand that it might take some extra time to process a gzipped file, but 100 times as long is too much.
You're correct that gzipped files are not splittable, so Dataflow has no way to parallelize reading each gzipped input file. Storing uncompressed in GCS is the best route if it's possible for you.
Regarding the 100x performance difference: how many worker VMs did your pipeline scale up to in the uncompressed vs. compressed versions? If you have a job ID, we can look it up internally to investigate further.
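(Not the original pipeline - just a minimal Beam 2.x Java sketch of the two read paths, with placeholder bucket and file names. With GZIP the whole file is a single unsplittable input read by one worker, while the uncompressed read can be split across workers.)

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();

// Splittable: Dataflow can assign ranges of the file to different workers.
PCollection<String> uncompressed =
    p.apply("ReadPlain", TextIO.read().from("gs://my-bucket/data.csv"));

// Not splittable: the single .gz file is decompressed sequentially by one worker.
PCollection<String> gzipped =
    p.apply("ReadGzip", TextIO.read()
        .from("gs://my-bucket/data.csv.gz")
        .withCompression(Compression.GZIP));

p.run();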
I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
// return a list of paths to the small files
List<String> paths = getAllPaths();
// read up to 100,000 small files at once into memory
Dataset<SmallFileWrapper> ds = sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))   // DataFrameReader.parquet takes varargs, not a List
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
Problem
The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that number of files. It takes 38 seconds to read 490 small files, and 266 seconds to read 3,420 files. I suppose it would take far longer to read 100,000 files.
Questions
Will HAR or sequence files speed up Apache Spark batch read of 10k-100k of small files? Why?
Will HAR or sequence files slow down persisting of that small files? Why?
P.S.
Batch read is the only operation required for these small files; I don't need to read them by id or anything else.
From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?
wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file
Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html
RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala
A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files in a single Mapper; it can be used in Hive with properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.
Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory - you can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing
Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
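(A minimal Java sketch of the wholeTextFiles() route described above, for text files; the directory path and partition count are placeholders.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("small-files-batch-read");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Each record is (file path, entire file content); CombineFileInputFormat
// packs many small files into a modest number of partitions.
int minPartitions = 32;  // placeholder - tune for your cluster
JavaPairRDD<String, String> files =
    jsc.wholeTextFiles("hdfs:///data/small-files-dir", minPartitions);

long count = files.count();  // force a full batch read, e.g. to measure timing
jsc.stop();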
I have a Struts 1 web application that needs to upload fairly large files (>50 MBytes) from the client to the server. I'm currently using its built-in org.apache.struts.upload.DiskMultipartRequestHandler to handle the HTTP POST multipart/form-data requests. It's working properly, but it's also very very slow, uploading at about 100 KBytes per second.
Downloading the same large files from server to client occurs greater than 10 times faster. I don't think this is just the difference between the upload and download speeds of my ISP because using a simple FTP client to transfer the file to the same server takes less than 1/3 the time.
I've looked at replacing the built-in DiskMultipartRequestHandler with the newer org.apache.commons.fileupload package, but I'm not sure how to modify this to create the MultipartRequestHandler that Struts 1 requires.
Barend commented below that there is a 'bufferSize' parameter that can be set in web.xml. I increased the size of the buffer to 100 KBytes, but it didn't improve the performance. Looking at the implementation of DiskMultipartRequestHandler, I suspect that its performance could be limited because it reads the stream one byte at a time looking for the multipart boundary characters.
Is anyone else using Struts to upload large files?
Has anyone customized the default DiskMultipartRequestHandler supplied with Struts 1?
Do I just need to be more patient while uploading the large files? :-)
The page StrutsFileUpload on the Apache wiki contains a bunch of configuration settings you can use. The one that stands out for me is the default buffer size of 4096 bytes. If you haven't already, try setting this to something much larger (but not excessively large as a buffer is allocated for each upload). A value of 2MB seems reasonable. I suspect this will improve the upload rate a great deal.
Use Apache Commons FileUpload;
this gives more flexibility for uploading files. You can configure the maximum upload file size and a temporary file location for swapping the file to disk (this improves performance).
Please see this link: http://commons.apache.org/fileupload/using.html
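(A minimal Commons FileUpload sketch along those lines, outside of Struts' MultipartRequestHandler plumbing; the size limits and directories are placeholders.)

import java.io.File;
import java.util.List;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

// Items above the threshold are swapped to the temp directory instead of
// being buffered entirely in memory.
DiskFileItemFactory factory = new DiskFileItemFactory();
factory.setSizeThreshold(1024 * 1024);            // 1 MB in-memory threshold (placeholder)
factory.setRepository(new File("/tmp/uploads"));  // temp location for swapped files (placeholder)

ServletFileUpload upload = new ServletFileUpload(factory);
upload.setSizeMax(200L * 1024 * 1024);            // reject requests larger than 200 MB (placeholder)

// 'request' is the incoming HttpServletRequest for the multipart POST.
List<FileItem> items = upload.parseRequest(request);
for (FileItem item : items) {
    if (!item.isFormField()) {
        item.write(new File("/uploads", item.getName()));  // persist the uploaded file
    }
}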