I have a Cassandra cluster with two nodes and a replication factor of 2 in the same datacenter. The table holds ~150 million rows and is continuously growing; once a day I need to read each row, process it, and update the corresponding row in Cassandra.
Is there a better approach to do this?
Is there a way to divide all the rows into chunks and have each chunk processed in parallel by a separate thread?
Cassandra version: 2.2.1
Java version: OpenJDK 1.7
You should have a look at Spark. The Spark Cassandra Connector allows you to read data from Cassandra on multiple Spark nodes, which can be co-located with the Cassandra nodes or deployed as a separate cluster. The data is read, processed, and written back in parallel by a Spark job, which can also be scheduled for daily execution.
As your data size is constantly growing, it would probably also make sense to look into Spark Streaming, which lets you continually process and update your data based only on the new data coming in. This would prevent reprocessing the same data over and over again, though whether that's an option of course depends on your use case.
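To illustrate the idea, here is a minimal sketch of such a daily job using the connector's Java API. The contact point, keyspace ("my_ks"), table ("my_table"), column names ("id", "value"), and the processing step are all assumptions you would replace with your own schema and logic (the anonymous `Function` form is used so it also compiles on Java 7):

```java
import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

public class DailyCassandraUpdateJob {

    // Simple bean matching the assumed table layout (columns "id" and "value").
    public static class Record implements Serializable {
        private String id;
        private String value;
        public Record() {}
        public Record(String id, String value) { this.id = id; this.value = value; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("daily-cassandra-update")
                .set("spark.cassandra.connection.host", "10.0.0.1"); // one of your Cassandra nodes (placeholder)
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The connector splits the table by token ranges, so the read is
        // automatically distributed across the executors.
        JavaRDD<Record> rows = javaFunctions(sc)
                .cassandraTable("my_ks", "my_table", mapRowTo(Record.class));

        // Per-row processing, executed in parallel on the workers.
        JavaRDD<Record> updated = rows.map(new Function<Record, Record>() {
            @Override
            public Record call(Record r) {
                return new Record(r.getId(), process(r.getValue()));
            }
        });

        // Write the updated rows back to the same table.
        javaFunctions(updated)
                .writerBuilder("my_ks", "my_table", mapToRow(Record.class))
                .saveToCassandra();

        sc.stop();
    }

    private static String process(String value) {
        return value.trim(); // placeholder for the real business logic
    }
}
```

Scheduling the spark-submit of this job once a day (cron, Oozie, etc.) then gives you the daily read-process-update cycle without writing your own chunking and threading code.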
Related
I'm new to Spark Streaming and I have a general question about its usage. I'm currently implementing an application that streams data from a Kafka topic.
Is it a common scenario to use the application to run a single batch, for example at the end of the day, collecting all the data from the topic, doing some aggregation and transformation, and so on?
That would mean that after starting the app with spark-submit, all this work is performed in one batch and then the application shuts down. Or is Spark Streaming built to run endlessly and process the stream in continuous batches?
You can use the Kafka Streams API and define a window time, so the aggregations and transformations over the events in your topic are performed one window at a time. For more information about windowing, see https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#windowing
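As a minimal sketch of that approach: the application below counts events per key in 24-hour tumbling windows. The topic names ("events", "daily-counts"), the String serdes, and the count aggregation are assumptions; replace them with your own aggregation and transformation logic.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyWindowApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-aggregation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");

        // Group by key and count the events per 24-hour tumbling window.
        events.groupByKey()
              .windowedBy(TimeWindows.of(Duration.ofHours(24)))
              .count()
              .toStream()
              // Unwrap the windowed key and emit the counts to an output topic.
              .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count.toString()))
              .to("daily-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

The application keeps running, but each window is only emitted once it closes, which gives you the "one batch per day" behaviour without stopping and restarting the job.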
I am new to Apache Spark.
I have a requirement to read millions (~5 million) of records from an Oracle database, do some processing on these records, and write the processed records to a file.
At present this is done in Java, and the process works as follows:
- the records in the DB are categorized into different subsets based on some data criteria
- in the Java process, 4 threads run in parallel
- each thread reads a subset of the records, processes them, and writes the processed records to a new file
- finally, all these files are merged into a single file.
Still, it takes around half an hour to complete the whole process.
So I would like to know whether Apache Spark could make this process faster: read millions of records from the Oracle DB, process them, and write them to a file?
If Spark can make this faster, what is the best approach to implement it in my process? Also, will it be effective in a non-clustered environment too?
Appreciate the help.
Yes, you can do that using Spark; it's built for distributed processing! See http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
You should use a well-configured Spark cluster to achieve this. Performance is something you fine-tune, for example by adding more worker nodes as required.
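As a minimal sketch of the JDBC approach from that guide: the table name "ORDERS", the numeric partition column "ORDER_ID", the bounds, the connection details, and the filter/select step are all placeholders for your own schema and processing. The Oracle JDBC driver must be on the Spark classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleToFile {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-export")
                .getOrCreate();

        // Parallel JDBC read: Spark issues one query per partition, splitting
        // ORDER_ID into 8 ranges between lowerBound and upperBound.
        Dataset<Row> orders = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder connection
                .option("dbtable", "ORDERS")
                .option("user", "scott")
                .option("password", "secret")
                .option("partitionColumn", "ORDER_ID")
                .option("lowerBound", "1")
                .option("upperBound", "5000000")
                .option("numPartitions", "8")
                .load();

        // Example processing step: filter and project only the columns you need.
        Dataset<Row> processed = orders.filter("STATUS = 'OPEN'")
                                       .select("ORDER_ID", "CUSTOMER_ID", "AMOUNT");

        // Write the result; coalesce(1) mirrors the "merge into one file" step,
        // drop it if multiple part files are acceptable.
        processed.coalesce(1)
                 .write()
                 .option("header", "true")
                 .csv("/tmp/processed_orders");

        spark.stop();
    }
}
```

In local (non-clustered) mode this still parallelizes across the cores of one machine, much like your 4-thread setup, so the bigger win comes once you can spread the partitions over several worker nodes.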
We have set up a Thrift Server with Spark 2.0 in Mesos client mode.
When trying to query one 170 MB Parquet file (select * from the table), it always fails with a Java OutOfMemoryError (Java heap space),
even though there are a couple of executors/workers and the executors' tasks complete successfully (as seen in the Spark UI).
The query finally completes successfully once the JVM memory is increased to 25 GB and the Spark driver memory gets 21 GB! The bottleneck seems to be the driver memory itself.
Kryo serialization is used (spark.kryoserializer.buffer.max=1024m); the files are stored in an S3 bucket; YARN is not used.
- Why does the driver consume that much memory for such a simple query?
- What other parameters/configuration settings can help to support large data sets and concurrent JDBC connections?
Thanks.
Q1: Parquet files are compressed; when loaded into memory the data is decompressed. What's more, Java objects, including strings, carry per-object overhead, and if you have lots of small strings the cost can be considerable.
Q2: Not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch results batch by batch.
As @paul said, don't trust the file size.
Parquet is a columnar storage file format, so retrieving data with "*" is really not a good idea, but it is well suited to group-by queries.
The driver's role is to manage the worker executors and then give you the query result at the end, so all your data will be collected on the driver.
Try limiting your query and specifying some fields rather than *.
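For illustration, here is a minimal sketch of a JDBC client doing exactly that against the Thrift Server. The host, port, table name "events", column names, and the WHERE/LIMIT values are assumptions; the point is to project only the needed columns, cap the result size, and fetch in batches so the driver never has to materialize the whole table at once.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerClient {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver, used to talk to the Spark Thrift Server.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://thrift-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Hint to stream results back in smaller batches instead of one big block.
            stmt.setFetchSize(1000);

            // Project specific columns and bound the result instead of "select *".
            ResultSet rs = stmt.executeQuery(
                    "SELECT event_id, event_time FROM events "
                    + "WHERE event_time > '2017-01-01' LIMIT 1000");
            while (rs.next()) {
                System.out.println(rs.getString("event_id") + " " + rs.getString("event_time"));
            }
        }
    }
}
```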
I am using Spring Batch to process a huge amount of data (150 GB) and produce a 60 GB output file. I am using a vertical scaling approach with 15 threads (step partitioning approach).
The job execution details are stored in an in-memory database. CPU utilization is high because everything runs on a single machine and the file size is huge, but the server has a good configuration (a 32-core processor) and I am using 10 GB of memory for this process.
My question is: if I move this to a separate database, will it reduce CPU utilization? Also, is using an in-memory database in production a bad choice/decision?
Regards,
Shankar
When you talk about moving from the in-memory DB to a separate DB, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect CPU usage to drop a lot. Depending on your chunk size, far more CPU is needed for your data processing than for updating the batch runtime tables.
Whether using an in-memory DB in production is a good decision depends on your needs. Two points to consider (see the sketch after this list for switching to a persistent metadata store):
- You lose any information written to the batch runtime tables. That information can be helpful for debugging or simply as a kind of history, but you can also "persist" it in log files.
- You will not be able to implement a restartable job. This could be an issue if your job takes hours to complete, but for a job that only reads from a file, writes to a file, and finishes within a couple of minutes, it is not really a problem.
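If you do decide to keep the runtime tables in a persistent database, a minimal sketch looks like the following: with @EnableBatchProcessing, providing a DataSource bean makes Spring Batch store the job/step execution metadata there instead of in the in-memory map-based repository. The MySQL URL, credentials, and driver are assumptions; the standard BATCH_* schema scripts ship with spring-batch-core.

```java
import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchMetadataConfig {

    @Bean
    public DataSource dataSource() {
        // Placeholder connection details; the target schema must contain the
        // standard BATCH_* tables (e.g. created from schema-mysql.sql).
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://dbhost:3306/batchmeta");
        ds.setUsername("batch");
        ds.setPassword("secret");
        return ds;
    }
}
```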
My abysmally rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data stored in structured files on HDFS, and that these tools (again, Flume, Sqoop, etc.) are responsible for importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.
To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes/hours old, because importing big data from these disparate systems into HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.
Say we have an app that has a real-time constraint of making a decision within 500 ms of something occurring. Say we have a massive stream of data being imported into HDFS, and because the data is so big it takes, say, 3 seconds just to get the data onto HDFS. Then say the MR job responsible for making the decision takes 200 ms. Because loading the data takes so long, we've already blown our time constraint, even though the MR job itself would finish inside the given window.
Is there a solution for this kind of big data problem?
Have a look at the Apache Spark Streaming API, and also at Apache Storm; both can be used for real-time stream processing instead of batch MR over HDFS.
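As a minimal sketch of the Spark Streaming option: the job below processes data in 500 ms micro-batches directly from the stream, so there is no separate "land it on HDFS first" step. The socket source at "stream-host:9999" and the "ALERT" filter are assumptions standing in for your real source (e.g. Kafka) and decision logic.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LowLatencyStream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("low-latency-decisions");

        // Micro-batch interval of 500 ms: each batch of incoming records is
        // processed as soon as the interval closes.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(500));

        // Placeholder source: lines arriving on a TCP socket.
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("stream-host", 9999);

        // Placeholder "decision" logic: flag records that match a condition.
        JavaDStream<String> alerts = lines.filter(line -> line.contains("ALERT"));
        alerts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```

Whether the end-to-end latency actually stays under 500 ms still depends on your ingest path and processing time per batch; for hard real-time guarantees a record-at-a-time engine such as Storm may be the better fit.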