I have a heavily I/O-bound Beam pipeline (Java). On Google Cloud Dataflow I use the pipeline option setNumberOfWorkerHarnessThreads(16) to get 16 threads running on every virtual CPU. I'm trying to port the same pipeline to Spark, but I can't find an equivalent option there. I've tried doing my own threading, but that appears to cause problems on the SparkRunner: the @ProcessElement method of the DoFn returns, but the output to the ProcessContext gets called later, when the thread completes. (I get weird ConcurrentModificationExceptions with stack traces inside Beam rather than in user code.)
Is there an equivalent to that setting on Spark?
I'm not aware of an equivalent setting on Spark. If you want to do your own threading, you'll have to ensure that output is only ever called from the same thread that invokes @ProcessElement or @FinishBundle. You can do this by starting a thread pool that reads work from one queue and writes results to another: in @ProcessElement, push the element onto the work queue and drain the result queue to the context's output, and drain it again in @FinishBundle.
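The pattern above can be sketched with plain java.util.concurrent, leaving Beam out so it runs on its own. Here processElement, finishBundle, and teardown are hypothetical stand-ins for a DoFn's @ProcessElement, @FinishBundle, and @Teardown methods, and the output list stands in for ProcessContext.output; only the calling thread ever touches it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedDoFnSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(16);
    // Worker threads only ever append to this queue; they never touch `output`.
    private final ConcurrentLinkedQueue<String> done = new ConcurrentLinkedQueue<>();
    private final List<Future<?>> inFlight = new ArrayList<>();

    // Called once per element, always on the bundle thread.
    public void processElement(String element, List<String> output) {
        // Hand the slow I/O off to the pool; the result lands on the `done` queue.
        inFlight.add(pool.submit(() -> done.add(slowIo(element))));
        drain(output); // emit whatever has completed so far, on this thread
    }

    // Called at bundle end, on the same thread as processElement.
    public void finishBundle(List<String> output) {
        for (Future<?> f : inFlight) {
            try {
                f.get(); // block until all in-flight work has finished
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        inFlight.clear();
        drain(output); // emit the stragglers
    }

    public void teardown() {
        pool.shutdown();
    }

    private void drain(List<String> output) {
        String r;
        while ((r = done.poll()) != null) {
            output.add(r); // stand-in for c.output(r)
        }
    }

    private static String slowIo(String in) {
        return in.toUpperCase(); // placeholder for the real I/O call
    }
}
```

Because the pool threads only append to the concurrent `done` queue and the bundle thread alone drains it into the output, the context is never touched from a worker thread, which is the invariant the ConcurrentModificationExceptions point to.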
Related
I have a Dataflow job with a fan-out of steps, each of which writes its result to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to detect when the FileIO step has completed, so I can run Java code that loads the entire content of the folder into a BigQuery table.
I know I could do it per written file with Cloud Functions and a Pub/Sub notification, but I'd prefer to do it only once, on completion of the entire folder.
Thanks!
There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline and, on the pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay further execution until your pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can verify whether the pipeline completed successfully from the result of waitUntilFinish, and from there you can load the contents of the folders into BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline, so if that step relies on the elements in your pipeline it will be tougher.
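As a sketch of that control flow, with the Beam calls stubbed out so it can run standalone: the Supplier stands in for pipeline.run().waitUntilFinish(), the Runnable for your (hypothetical) BigQuery load step, and the local State enum mirrors Beam's PipelineResult.State values:

```java
import java.util.function.Supplier;

public class RunThenLoad {
    // Mirrors the relevant PipelineResult.State values from Beam.
    enum State { DONE, FAILED, CANCELLED }

    // Returns true when the follow-up load was actually triggered.
    static boolean runThenLoad(Supplier<State> runAndWaitUntilFinish,
                               Runnable loadFolderToBigQuery) {
        // In real code: PipelineResult.State state = pipeline.run().waitUntilFinish();
        State state = runAndWaitUntilFinish.get();
        if (state == State.DONE) {          // only load when the pipeline succeeded
            loadFolderToBigQuery.run();     // e.g. kick off a BigQuery load job here
            return true;
        }
        return false;
    }
}
```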
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult that allows you to get a PCollection containing all filenames of the written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that can write all those files to BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...)
result.getPerDestinationOutputFilenames().apply(...)
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.
@Daniel Oliveira suggested an approach you could follow, but in my opinion it is not the best way.
Two reasons why I beg to differ:
Narrow scope for handling job failures: consider a situation where your Dataflow job succeeds but your load to BigQuery fails. Because of this tight coupling, you won't be able to re-run just the second job.
Performance of the second job becomes a bottleneck: in a production scenario, as your file sizes grow, your load job will become a bottleneck for other dependent processes.
Since, as you already mentioned, you cannot write directly to BQ in the same job, I suggest the following approach:
Create another Beam job for loading all the files into BQ. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer using the Dataflow Java Operator or the Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and set job2.setUpstream(job1), so the load job runs only after the Dataflow job succeeds. Please refer to the Airflow documentation here.
I hope this helped
I'm getting a StackOverflowError on my Beam workers due to running out the thread stack, and because it's deep within the running of a SqlTransform it's not straightforward to reduce the number of calls being made.
Is it possible to change the JVM thread stack size for my workers, either through Google Cloud Dataflow or Beam's own pipeline options?
I don't think there's an easy way to do this.
If the issue is the stack trace being truncated by Cloud Logging, it might be possible to catch the exception yourself and inspect it instead of just logging it.
If the issue is that the JVM's default thread stack size isn't enough, I don't think there's a way to change that on Dataflow today, unfortunately.
I have a fairly simple streaming Dataflow pipeline reading from Pub/Sub and writing to BigQuery using BATCH_LOADS (with Streaming Engine enabled). We do have one version of this pipeline working, but it seems very fragile: simple additions to the code seem to tip it over, and the worker process starts to eat up memory.
Sometimes the Java heap fills up, gets java.lang.OutOfMemoryError, dumps the heap, and jhat shows the heap is full of Windmill.Message objects.
More often, the machine gets really slow (kernel starts swapping), then the kernel OOM killer kills Java.
Today I have further evidence that might help debug this issue: a live worker (compute.googleapis.com/resource_id: "1183238143363133621") that started swapping but managed to come out of that state without crashing. The worker logs show that the Java heap is using 1GB (total memory), but when I ssh into the worker, "top" shows the Java process is using 3.2GB.
What could be causing Java to use so much memory outside of its heap?
I am using Beam 2.15's PubsubIO and a clone of Beam's BigQueryIO with some modifications. I have tried moving to a larger machine type, but it only delays the failure. The pipeline still eventually fills up its memory when the Pub/Sub subscription has a large backlog.
EDIT:
Some more details: the memory issues seem to happen earlier in the pipeline than BigQueryIO. There are two steps between PubsubIO.Read and BigQueryIO.Write: Parse and Enhance. Enhance uses a side input, so I suspect fusion is not being applied to merge those two steps. Triggers are very slow to fire (why?), so Enhance is slow to start because the side input's Combine.Globally is delayed by about 3 minutes, and even after it is ready, WriteGroupedRecords is sometimes called 10 minutes after I know the data was ready. When it is called, it is often with far more than 3 minutes' worth of data. Often, especially when the pipeline is just starting, the Parse step will pull close to 1,000,000 records from Pub/Sub. Once Enhance starts working, it quickly processes those 1,000,000 rows and turns them into TableRows. Then it pulls more and more data from Pub/Sub, continuing for 10 minutes without WriteGroupedRecords being called. It seems like the runner is favoring the earlier pipeline steps (maybe because of the sheer number of elements in the backlog) instead of firing window triggers that would activate the later steps (and side inputs) as soon as possible.
I'm new in spark streaming and I have a general question relating to its usage. I'm currently implementing an application which streams data from a Kafka topic.
Is it a common scenario to run the application as a one-off batch, for example at the end of the day, collecting all the data from the topic and doing some aggregation, transformation, and so on?
That would mean that after starting the app with spark-submit, all of this work is performed in one batch and then the application shuts down. Or is Spark Streaming built to run endlessly and permanently, processing data in continuous batches?
You can use the Kafka Streams API and set a window time to perform aggregation and transformation over the events in your topic one batch at a time. For more information about windowing, check https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#windowing
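To make the windowing idea concrete without a Kafka cluster, here is a minimal sketch of tumbling-window bucketing in plain Java (the record layout is made up for the example): each record's timestamp is mapped to the window starting at timestamp - (timestamp % windowSize), and values in the same window are summed, which is essentially what a Kafka Streams windowed aggregate does continuously as records arrive.

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {
    // records: pairs of {timestampMillis, value}. Returns windowStart -> sum.
    static Map<Long, Long> sumByWindow(long windowSizeMs, long[][] records) {
        Map<Long, Long> totals = new TreeMap<>();
        for (long[] rec : records) {
            long ts = rec[0];
            long value = rec[1];
            // Tumbling-window assignment: each timestamp falls in exactly one bucket.
            long windowStart = ts - (ts % windowSizeMs);
            totals.merge(windowStart, value, Long::sum);
        }
        return totals;
    }
}
```

With one-minute windows, records at 1s and 59s land in the window starting at 0 and a record at 61s lands in the window starting at 60s; Kafka Streams expresses the same grouping with its TimeWindows-based DSL.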
I have a use case where I need to monitor the number of datastore operations and also the time consumed by a particular block of Java code that is running in the task queue.
As of now I am using Appstats, but it does not display the operations performed by the task (Java code) running in the task queue. I also need to know the execution time of particular code blocks within that same task-queue code, using some kind of monitoring.
Please suggest other tools I could use for the above requirement.