I'm currently working with Spark Streaming, the data volume will be huge, and I have the following scenario.
Streamed data is processed every 2 minutes. During some transformations I need to validate against data that may arrive in the next batch, i.e. 2 minutes later. In such cases I need to hold these particular records in memory (or a disk-plus-memory combination) so that they can be compared in the next batch, 2 minutes later.
Neither accumulators nor broadcast variables will help in my case. What would be the best approach here?
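For reference, a rough sketch of carrying records from one micro-batch into the next with Spark's updateStateByKey (this is only one possible approach, not necessarily the best; the socket source, the key extraction, and the comparison logic are placeholders):

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class CarryOverToNextBatch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("carry-over-to-next-batch");
        // 2-minute micro-batches, matching the scenario above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(2));
        jssc.checkpoint("/tmp/spark-checkpoints"); // required for stateful transformations

        // Placeholder source; replace with the real input stream
        JavaPairDStream<String, String> events = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(line -> new Tuple2<>(line.split(",")[0], line));

        // State = the record seen for this key in the previous batch.
        // Validate the new values against it, then keep the latest value
        // so the next batch (2 minutes later) can do the same comparison.
        Function2<List<String>, Optional<String>, Optional<String>> carryForward =
                (newValues, previous) -> {
                    if (previous.isPresent()) {
                        // compare newValues with previous.get() here
                    }
                    return newValues.isEmpty()
                            ? previous
                            : Optional.of(newValues.get(newValues.size() - 1));
                };

        JavaPairDStream<String, String> state = events.updateStateByKey(carryForward);
        state.print();

        jssc.start();
        jssc.awaitTermination();
    }
}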
Intention: I have a bounded data set (exposed via a REST API, 10k records) to be written to BigQuery with some additional steps. As my data set is bounded, I've implemented the BoundedSource interface to read records in my Apache Beam pipeline.
Problem: all 10k records are read in one shot (and written to BigQuery in one shot as well). But I want to query a small part (for example 200 rows), process it, save it to BigQuery, and then query the next 200 rows.
I've found that I can use windowing with bounded PCollections, but windows are created on a time basis (every 10 seconds, for example) and I want them on a record-count basis (every 200 records).
Question: How can I implement the described splitting into batches/windows of 200 records? Am I missing something?
My question is similar to this one, but it wasn't answered.
Given a PCollection of rows, you can use GroupIntoBatches to batch these up into a PCollection of sets of rows of a given size.
As for reading your input in an incremental way, you can use the split method of BoundedSource to shard your read into several pieces which will then be read independently (possibly on separate workers). For a bounded pipeline, this will still happen in its entirety (all 10k records read) before anything is written, but need not happen on a single worker. You could also insert a Reshuffle to decouple the parallelism between your read and your write.
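For example, a minimal sketch of the GroupIntoBatches step (the Create input is just a stand-in for your BoundedSource read, and the single constant key is an assumption; shard the key if you want more parallelism on the write side):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BatchesOf200 {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder input; in your pipeline this would be the read from your BoundedSource
    PCollection<String> rows = p.apply(Create.of("row-1", "row-2", "row-3"));

    rows
        // GroupIntoBatches works on KV pairs, so attach a key first
        .apply(WithKeys.<Integer, String>of(0).withKeyType(TypeDescriptors.integers()))
        .apply(GroupIntoBatches.<Integer, String>ofSize(200))
        .apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
          @ProcessElement
          public void process(@Element KV<Integer, Iterable<String>> batch) {
            // write this batch of up to 200 rows to BigQuery here
          }
        }));

    p.run().waitUntilFinish();
  }
}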
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the result to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is then produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON documents.
Anyway, I'm building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something similar?
While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, you can take a hit on latency. Usually in this kind of scenario you have to find an optimal bulk size. I follow these steps:
1) Build a streaming bulk insert (so stream, but still gather more than one record, or in your case more than one JSON document, at a time).
2) Experiment with several bulk sizes, for example 10, 100, 1,000 and 10,000, and plot them in a quick graph (see the sketch after this list). Run enough records to see whether performance degrades over time: it can be that a bulk size of 10 is extremely fast per record, but that there is an incremental insert overhead (as is the case in SQL Server with primary key maintenance, for example). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and perhaps try out 3 values between the best values from step 2.
Then use the final result as your optimal stream bulk insertion size.
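As a rough illustration of step 2, here is a small harness that pushes the same total number of records at several bulk sizes and prints the throughput (indexBulk is a placeholder for your real Elasticsearch bulk call):

import java.util.ArrayList;
import java.util.List;

public class BulkSizeExperiment {

    static final int TOTAL_RECORDS = 1_000_000;   // same total for every run

    public static void main(String[] args) {
        for (int bulkSize : new int[] {10, 100, 1_000, 10_000}) {
            long start = System.nanoTime();
            List<String> buffer = new ArrayList<>(bulkSize);
            for (int i = 0; i < TOTAL_RECORDS; i++) {
                buffer.add("{\"id\":" + i + "}");        // stand-in for your JsonDocument
                if (buffer.size() == bulkSize) {
                    indexBulk(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) indexBulk(buffer);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("bulk size %6d -> %.0f docs/s%n", bulkSize, TOTAL_RECORDS / seconds);
        }
    }

    // Placeholder: replace with a real bulk index call against your cluster,
    // e.g. building a _bulk request from jsonDocs and executing it.
    static void indexBulk(List<String> jsonDocs) {
    }
}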
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and adjust your bulk sizes maybe one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
I am designing an API in Java with the Spring Framework that will read a flat file containing 100K records and compare them with values fetched from the database. If the DB values are present among the file values, they will be updated in the database.
The concern in the entire process is performance.
I have a maximum of 7 minutes to perform the entire processing of 100K records.
I am looking to use a caching mechanism to fetch all the data from the database into a cache bean. The cache will be refreshed every 30 minutes or 1 hour.
Second, we will read the file, compare its values with the values in the cache, and store the matched values in another cache.
Third, we will update the values from the second cache to the database using a threading mechanism.
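Roughly, this is the shape I have in mind (a sketch only; the class name, file delimiter, and thread count are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class FileVsDbReconciler {

    // Cache 1: DB key -> DB value, loaded up front and refreshed every 30-60 minutes
    private final Map<String, String> dbCache = new ConcurrentHashMap<>();

    // Cache 2: matched records waiting to be written back to DB2
    private final Map<String, String> matchedCache = new ConcurrentHashMap<>();

    private final ExecutorService updatePool = Executors.newFixedThreadPool(8);

    public void process(Path flatFile) throws Exception {
        // Step 2: stream the 100K-record file, keep only records whose key exists in the cache
        try (Stream<String> lines = Files.lines(flatFile)) {
            lines.map(line -> line.split("\\|"))                   // assumed delimiter
                 .filter(cols -> dbCache.containsKey(cols[0]))
                 .forEach(cols -> matchedCache.put(cols[0], cols[1]));
        }

        // Step 3: partition the matches and update DB2 from several threads
        List<Map.Entry<String, String>> matches = new ArrayList<>(matchedCache.entrySet());
        int chunk = Math.max(1, matches.size() / 8);
        for (int i = 0; i < matches.size(); i += chunk) {
            List<Map.Entry<String, String>> slice =
                    matches.subList(i, Math.min(i + chunk, matches.size()));
            updatePool.submit(() -> updateDatabase(slice));
        }
        updatePool.shutdown();
    }

    private void updateDatabase(List<Map.Entry<String, String>> batch) {
        // batched UPDATE against DB2 here, e.g. with JdbcTemplate.batchUpdate
    }
}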
I need some opinions on this design. Does it look good?
Any advice to improve the design is welcome.
P.S.: The database in use is DB2 hosted on mainframe systems.
Thanks
Nirmalya
I'm using MongoDB as a cache right now. The application is fed 3 CSVs overnight, and the CSVs keep getting bigger because new products are added all the time. Right now I've reached 5 million records and it took about 2 hours to process everything. As the cache is refreshed every day, it will become impractical to refresh the data.
For example:
CSV 1
ID, NAME
1, NAME!
CSV 2
ID, DESCRIPTION
1, DESC
CSV 3
ID, SOMETHING_ELSE
1, SOMETHING_ELSE
The application reads CSV 1 and puts it in the database. Then CSV 2 is read; if there is new information, it is added to the same document, otherwise a new record is created. The same logic applies to CSV 3. So one document will get different attributes from different CSVs, hence the upsert. After everything is done, all the documents will be indexed.
Right now the first 1 million documents go in relatively quickly, but I can see that performance degrades considerably over time. I'm guessing it's because of the upsert, as MongoDB has to find the document and update its attributes, or otherwise create it. I'm using the Java driver and MongoDB 2.4. Is there any way I could improve this, or even do batch upserts with the MongoDB Java driver?
What do you mean by 'after everything is done then all the documents will be indexed'?
If it is because you want to add additional indexes, doing it at the end is debatable, but fine.
If you have absolutely no indexes, then this is likely your issue.
You want to ensure that all the inserts/upserts you are doing use an index. You can run one command with .explain() to see whether an index is being used appropriately.
You need an index, otherwise you are scanning 1 million documents for each insert/update.
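For example, a minimal sketch with the 3.x Java driver (the older driver you may be on has equivalent ensureIndex and update-with-upsert calls; the collection and field names are taken from your CSV example): index the lookup field once, then do each CSV row as an indexed upsert.

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class IndexedUpsert {
    public static void main(String[] args) {
        MongoCollection<Document> products = MongoClients.create("mongodb://localhost")
                .getDatabase("cache")
                .getCollection("products");

        // Without this, every upsert scans the whole collection.
        products.createIndex(Indexes.ascending("ID"));

        // One row from CSV 2: add DESCRIPTION to the document with ID 1,
        // creating the document if CSV 1 hasn't produced it yet.
        products.updateOne(
                Filters.eq("ID", 1),
                Updates.set("DESCRIPTION", "DESC"),
                new UpdateOptions().upsert(true));
    }
}

The same 3.x driver also offers collection.bulkWrite(...) with UpdateOneModel and upsert(true) if you want to send these in batches of a few thousand instead of one at a time.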
Also, can you give more details about your application?
are you going to do the import in 3 phases only once, or will you do frequent updates?
do CSV2 and CSV3 modify a large percentage of the documents?
do the modifications of CSV2 and CSV3 add or replace documents?
what is the average size of your documents?
Let's assume you are doing a lot of updates on the same documents many times; for example, CSV2 and CSV3 both update the same documents. Instead of importing CSV1, then doing updates for CSV2, then another set of updates for CSV3, you may want to simply keep the documents in your application's memory, apply all the updates in memory, and then push the documents to the database. That assumes you have enough RAM for the operation; otherwise you will be hitting the disk again.
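A sketch of that in-memory merge, reusing the field names from your CSV example (it assumes the merged documents fit in RAM and uses the 3.x Java driver; file names are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MergeThenInsert {
    public static void main(String[] args) throws IOException {
        Map<String, Document> byId = new LinkedHashMap<>();

        merge(byId, "csv1.csv", "NAME");
        merge(byId, "csv2.csv", "DESCRIPTION");
        merge(byId, "csv3.csv", "SOMETHING_ELSE");

        MongoCollection<Document> products = MongoClients.create("mongodb://localhost")
                .getDatabase("cache").getCollection("products");

        // One insertMany of fully built documents instead of millions of upserts
        products.insertMany(new ArrayList<>(byId.values()));
    }

    // Reads "ID,VALUE" lines and sets the given field on the document for that ID.
    static void merge(Map<String, Document> byId, String file, String field) throws IOException {
        for (String line : Files.readAllLines(Paths.get(file))) {
            if (line.startsWith("ID")) continue;               // skip header
            String[] cols = line.split(",", 2);
            byId.computeIfAbsent(cols[0].trim(), id -> new Document("ID", id))
                .append(field, cols[1].trim());
        }
    }
}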
I have a big list of over 20,000 items to fetch from the DB and process daily in a simple console-based Java app.
What is the best way to do that? Should I fetch the list in small sets and process them, or should I fetch the complete list into an array and process it? Keeping it all in an array means a large memory requirement.
Note: There is only one column to process.
Processing means I have to pass the string in that column somewhere else as a SOAP request.
The 20,000 items are strings of length 15.
It depends. 20,000 is not really a big number. If you are only processing 20,000 short strings or numbers, the memory requirement isn't that large. But if it's 20,000 images, that is a bit larger.
There's always a tradeoff. Multiple chunks of data mean multiple trips to the database, but a single trip means more memory. Which is more important to you? Also, can your data be chunked, or do you need, for example, record 1 in order to process record 1000?
These are all things to consider. Hopefully they help you decide which design is best for you.
Correct me if I am wrong: fetch it little by little, and also provide a rollback operation for it.
If the job can be done at the database level, I would do it using SQL scripts. Should this be impossible, I recommend loading small pieces of your data with two columns: the ID column and the column that needs to be processed.
This will give you better performance during the process, and if you have any crashes you will not lose all the processed data. In the case of a crash, however, you will need to know which records have been processed and which have not; this can be done with a third column or by saving the last processed ID after each round.
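A minimal sketch of that approach with plain JDBC (the connection URL, table, column, and checkpoint details are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedProcessor {
    private static final int CHUNK_SIZE = 500;

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:db2://host:50000/mydb", "user", "pass")) {
            long lastId = loadCheckpoint();                     // 0 on the first run
            while (true) {
                int fetched = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM items WHERE id > ? ORDER BY id FETCH FIRST "
                                + CHUNK_SIZE + " ROWS ONLY")) {
                    ps.setLong(1, lastId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            sendSoapRequest(rs.getString("payload"));   // the per-row processing
                            lastId = rs.getLong("id");
                            fetched++;
                        }
                    }
                }
                saveCheckpoint(lastId);                         // resume point after a crash
                if (fetched < CHUNK_SIZE) break;                // no more rows
            }
        }
    }

    static long loadCheckpoint() { return 0; }                  // e.g. read from a checkpoint table
    static void saveCheckpoint(long id) { }                     // e.g. UPDATE checkpoint SET last_id = ?
    static void sendSoapRequest(String value) { }               // placeholder for the SOAP call
}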