I'm new to Spark Streaming and have a general question about its usage. I'm currently implementing an application that streams data from a Kafka topic.
Is it a common scenario to use the application to run a batch only once, for example at the end of the day, collecting all the data from the topic, doing some aggregation and transformation, and so on?
That means that after starting the app with spark-submit, all of this would be performed in one batch and then the application would shut down. Or is Spark Streaming built to run endlessly, permanently streaming data in continuous batches?
You can use the Kafka Streams API and fix a time window to perform aggregation and transformation over the events in your topic, one batch at a time. For more information about windowing, see https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#windowing
I am practicing file reading through the Flink batch processing mechanism on a Windows 10 machine.
I downloaded flink-1.7.2-bin-hadoop24-scala_2.12.tgz from Flink's official site and executed start-cluster.bat.
I uploaded the jar through Flink's UI and was able to execute the job, but the job finished in a matter of seconds.
I want to keep the job running continuously so that I can test my use case.
Can you suggest possible ways to achieve this?
In Flink, batch jobs run until all of their input has been processed, at which point they have finished and are terminated. If you want continuous processing, then you should either
use some deployment automation (outside of Flink) to arrange for new batch jobs to be created as needed, or
implement a streaming job
In your case it sounds like you might be looking for the FileProcessingMode.PROCESS_CONTINUOUSLY option on StreamExecutionEnvironment.readFile -- see the docs for more info.
I have a heavily I/O-bound (Java) Beam pipeline. On Google Cloud Dataflow I use the pipeline option "setNumberOfWorkerHarnessThreads(16);" to get 16 threads running on every virtual CPU. I'm trying to port that same pipeline to run on Spark, and I can't find an equivalent option there. I've tried doing my own threading, but that appears to cause problems on the SparkRunner, since the ProcessElement part of the DoFn returns but the output to the ProcessContext gets called later, when the thread completes. (I get weird ConcurrentModificationExceptions with stack traces that are in Beam rather than in user code.)
Is there an equivalent to that setting on Spark?
I'm not aware of an equivalent setting on Spark, but if you want to do your own threading you'll have to ensure that output is only ever called from the same thread that invokes ProcessElement or FinishBundle. You can do this by starting a thread pool that reads from one queue and writes to another. In your ProcessElement you push to the first queue and drain the second to the context's output, and you drain again in FinishBundle.
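To make the queue idea concrete, here is a minimal plain-Java sketch with no Beam dependency. The names QueueBackedWorker, process, finish and the squaring function are hypothetical and only for illustration. ExecutorCompletionService stands in for the two queues (the pool's work queue and a queue of completed results). In a real DoFn you would presumably call process from ProcessElement and finish from FinishBundle, passing the context's output as the emit callback, but I have not tested this on the SparkRunner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical helper: worker threads compute results but never call emit. */
class QueueBackedWorker<I, O> {
    private final ExecutorService pool;
    private final CompletionService<O> completed;
    private final Function<I, O> slowCall;
    private int inFlight = 0; // only read and written by the calling thread

    QueueBackedWorker(int threads, Function<I, O> slowCall) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.completed = new ExecutorCompletionService<>(pool);
        this.slowCall = slowCall;
    }

    /** Submit one element, then emit whatever has already finished (non-blocking). */
    void process(I element, Consumer<O> emit) throws Exception {
        completed.submit(() -> slowCall.apply(element));
        inFlight++;
        Future<O> done;
        while ((done = completed.poll()) != null) {
            inFlight--;
            emit.accept(done.get()); // get() rethrows a worker failure here
        }
    }

    /** Block until every submitted element has been emitted. */
    void finish(Consumer<O> emit) throws Exception {
        while (inFlight > 0) {
            Future<O> done = completed.take();
            inFlight--;
            emit.accept(done.get());
        }
    }

    void shutdown() {
        pool.shutdown();
    }
}

class QueueBackedWorkerDemo {
    /** Squares 1..5 on four threads; all emits happen on the calling thread. */
    static List<Integer> run() {
        List<Integer> emitted = new ArrayList<>();
        QueueBackedWorker<Integer, Integer> worker = new QueueBackedWorker<>(4, x -> x * x);
        try {
            for (int i = 1; i <= 5; i++) {
                worker.process(i, emitted::add);
            }
            worker.finish(emitted::add);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
        Collections.sort(emitted); // completion order is not deterministic
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because emit is only ever invoked inside process and finish, the results reach the output on the caller's thread, which is the property the answer asks for. Results come out in completion order, not input order.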
I have a Dataflow job with a fan-out of steps, each of which writes its results to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step is completed in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and a Pub/Sub notification, but I'd prefer to do it only once, when the entire folder is complete.
Thanks!
There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline and on your pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay execution until after your pipeline is complete, as follows:
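PipelineResult.State state = pipeline.run().waitUntilFinish();
// state == PipelineResult.State.DONE means the pipeline completed successfully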
You can verify whether the pipeline completed successfully based on the result of waitUntilFinish and from there you can load the contents of the folders to BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline so if you rely on the elements in your pipeline for that step it will be tougher.
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult that allows you to get a PCollection containing all filenames of the written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that can write all those files to BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...);
result.getPerDestinationOutputFilenames().apply(...);
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.
@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I beg to differ with him:
Narrow scope for handling job failures: Consider a situation where your Dataflow job succeeded but your BigQuery load job failed. Because of the tight coupling, you won't be able to re-run the second job on its own.
Performance of the second job will become a bottleneck: In a production scenario, as your file sizes grow, your load job will become a bottleneck for other dependent processes.
As you already mentioned, you cannot write directly to BQ in the same job, so I suggest the following approach:
Create another Beam job for loading all the files to BQ. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer using the Dataflow Java Operator or Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and make the load job depend on the first job, e.g. job2.set_upstream(job1), where job1 is the Dataflow job and job2 is the load job. Please refer to the Airflow documentation here
I hope this helped
I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we might have had a similar problem in one of our dataflows: we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple: we modified our TextIO.read transform to use wildcards instead of individual file names,
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, tho.
I have a Cassandra cluster with two nodes and a replication factor of 2 in the same datacenter. The table has ~150 million rows and is continuously growing. Once a day I need to read every row, process it, and update the corresponding row in Cassandra.
Is there any better approach to do this?
Is there any way to divide all the rows into chunks and have each chunk processed in parallel by its own thread?
Cassandra version: 2.2.1
java version: openjdk 1.7
You should have a look at Spark. Using the Spark Cassandra Connector allows you to read data from Cassandra from multiple Spark nodes that can be deployed additionally on the Cassandra nodes or in a separate cluster. Data is read, processed and written back in parallel by running a Spark job, which can also be scheduled for daily execution.
As your data size is constantly growing, it would probably also make sense to look into Spark Streaming, allowing you to continually process and update your data, just based on the new data coming in. This would prevent reprocessing the same data over and over again, but it of course depends on your use-case if that's an option for you.
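On the question of dividing the rows into chunks by hand: as far as I know, the connector parallelizes reads by splitting the token range, and the same idea can be sketched in plain Java. This is only an illustration under assumptions, not the connector's implementation. It assumes the default Murmur3Partitioner (tokens are signed 64-bit values), and TokenRangeSplitter, the table name and the partition key column are hypothetical. Each [start, end] pair could be bound to the generated query and handed to a separate thread.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits the signed 64-bit token range into contiguous chunks. */
class TokenRangeSplitter {

    /** Returns `chunks` inclusive [start, end] pairs covering Long.MIN_VALUE..Long.MAX_VALUE. */
    static List<long[]> split(int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        // Width of the whole range; BigInteger avoids long overflow.
        BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        BigInteger n = BigInteger.valueOf(chunks);

        List<long[]> ranges = new ArrayList<long[]>();
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= chunks; i++) {
            long end = min.add(span.multiply(BigInteger.valueOf(i)).divide(n)).longValue();
            ranges.add(new long[] {start, end});
            start = end + 1; // wraps around only after the last chunk, where it is unused
        }
        return ranges;
    }

    /** CQL with two bind markers for the start and end token of one chunk. */
    static String queryFor(String table, String partitionKey) {
        return "SELECT * FROM " + table
            + " WHERE token(" + partitionKey + ") >= ?"
            + " AND token(" + partitionKey + ") <= ?";
    }

    public static void main(String[] args) {
        // "my_keyspace.my_table" and "id" are placeholder names.
        System.out.println(queryFor("my_keyspace.my_table", "id"));
        for (long[] range : split(4)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

Equal token ranges only give roughly equal row counts if the partition keys hash evenly, and with two nodes the work still lands on the same two machines, so this mainly helps keep each query small.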