Talend Parallelization and Java Scope - java

I am creating a job to build a complex multi-level document for MongoDB from relational data.
I read 'product' records in from Oracle.
I have a tJavaRow in which I use the MongoDB API to create a product document (BasicDBObject) from the product details coming in. I store this document in the globalMap (call this 'product_doc'), as I need to embed a sub-document in it later in the sub job.
I use a tFlowToIterate to store the product_id in the globalMap.
I then have another Oracle input which uses the product_id from the globalMap as a parameter in the SQL, giving me the many side of the relationship to products (call these 'product_orders').
I build a Java List of 'product_order' documents and write the List to the globalMap; let's call this 'product_orders'.
I then add the 'product_orders' List as a sub-document of the 'product' document in a tJava component, write 'product' to MongoDB and move on to the next product row from Oracle.
It is more complex than this, creating a 5-level hierarchy, but this is the basic idea. The problem is that it takes 3 hours to run.
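For reference, the per-product logic in the tJavaRow/tJava components looks roughly like the sketch below (a simplification using the old mongo-java-driver API; variable and column names such as productDoc, input_row.product_id and order_id are illustrative assumptions, not the real job code, and the imports would go in the components' advanced settings).

// tJavaRow after the product Oracle input: build the product document
// and reset the list that will collect this product's order sub-documents
BasicDBObject productDoc = new BasicDBObject();
productDoc.put("product_id", input_row.product_id);
productDoc.put("name", input_row.name);
globalMap.put("product_doc", productDoc);
globalMap.put("product_orders", new ArrayList<BasicDBObject>());

// tJavaRow after the product_orders Oracle input: add one order sub-document per row
List<BasicDBObject> orders = (List<BasicDBObject>) globalMap.get("product_orders");
orders.add(new BasicDBObject("order_id", input_row.order_id));

// tJava at the end of the sub job: embed the list and write the document
BasicDBObject product = (BasicDBObject) globalMap.get("product_doc");
product.put("orders", globalMap.get("product_orders"));
// 'product' is then written to MongoDB (e.g. via tMongoDBOutput or DBCollection.insert())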
So, I want to set the job to run parallelized, so each product row from Oracle gets dispatched onto a new thread, round-robin style.
However, I have a heavy dependency on the globalMap for storing objects for later use in the flow, and I know the threads will trample all over each other. I assume each thread shares the same variable scope across the sub job...
I think I can identify the thread ID using the global variable "tCollector_1_THREAD_ID" in the globalMap.
So I had considered doing this when I add documents/objects into the globalMap:
globalMap.put("product_doc_" + globalMap.get("tCollector_1_THREAD_ID"), product_doc);
That way, everything I put in the globalMap is thread-specific and tagged... but I don't know how tCollector_1_THREAD_ID gets populated; if it is in the globalMap then surely each thread can trample over this value as well?
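In full, the idea was to tag both the put and the later get with the thread ID, along these lines (productDoc is just a placeholder name for the document built earlier):

// tJavaRow on the product flow: store the document under a thread-tagged key
String tid = String.valueOf(globalMap.get("tCollector_1_THREAD_ID"));
globalMap.put("product_doc_" + tid, productDoc);

// a later tJava running on the same thread: read it back with the same tag
String tid2 = String.valueOf(globalMap.get("tCollector_1_THREAD_ID"));
BasicDBObject doc = (BasicDBObject) globalMap.get("product_doc_" + tid2);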
It didn't work...I was getting a load of Null Errors.
So I guess my question is about variable scope and use of globalMap when using tJavaRow components in a parallelized data flow, when you need to maintain references in each thread.
---- UPDATE ------
For clarity: if you look at this page, it states you can get the thread ID from the variable tCollector_1_Thread_ID. But it gets that variable from the globalMap.
Surely the globalMap is a global variable, so how can the multiple threads not all be changing this value all the time and interfering with each other?
https://help.talend.com//pages/viewpage.action?pageId=265114338

Here are a few approaches that I am using successfully for parallel executions:
If possible, create another job that you run in parallel; that helps with understanding the tasks:
In this example I use the "Use or register a shared DB connection" feature, so I re-use my connections. tJavaFlex just contains a simple try { } catch block, so I can handle/hide the errors.
GpOutput uses a connection that was created outside of the threads.
I prefer the approach where I create a separate job and use context parameters to pass information to it.
Here you can see how the globalMap is used.
I found that the tPartitioner/tDepartitioner components in Talend are very hard to use. I prefer more controlled ways of handling parallel execution, such as a loop that splits the workload across 20 parallel threads:
" WHERE mod(num,20) = " + context.i

Related

Consequences of using StepExecutionContext/JobExecutionContext to share a HashMap with large values

I have a requirement in which I retrieve values in one reader of a step using SQL statements and then make the same request in the next reader.
I do not want to make another request if the data has already been fetched in the first reader; instead I want to pass that collection (possibly a HashMap) to the next step.
For this I have gone through the following link on SO:
How can we share data between the different steps of a Job in Spring Batch?
In many of the comments it is mentioned that 'data must be short'.
It is also mentioned in one response that these contexts are good for sharing strings or simple values, but not for sharing collections or huge amounts of data.
By passing that HashMap, I believe it is the reference to the HashMap that gets passed.
It would be good to know the possible consequences of passing it beforehand, and any better alternative approach.
Passing data between steps is indeed done via the execution context. However, you should be careful about the size of the data you put in the execution context, as it is persisted between steps.
I do not want to make another request if the data is already fetched in the First reader and pass that collection (possibly a HashMap) to next step
You can read the data from the database only once and put it in a cache. The second reader can then get the data from the cache. This would be faster than reading the data from the database a second time.
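A minimal sketch of that cache idea, assuming a plain singleton bean shared by both steps within the same JVM (the class name and methods are illustrative, not from the question):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// registered as a singleton Spring bean and injected into both steps
public class LookupCache {
    private final Map<String, Object> data = new ConcurrentHashMap<>();

    public void put(String key, Object value) { data.put(key, value); }
    public Object get(String key) { return data.get(key); }
    public boolean isEmpty() { return data.isEmpty(); }
}

The reader of the second step checks the cache first and only queries the database if it is empty. Unlike the execution context, this is not persisted to the job repository, so it keeps large collections out of the metadata tables, but it also will not survive a JVM restart.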
Hope this helps.

Are batchlets the correct way of implementing ETL steps in JavaEE Batch?

I am studying the Java EE Batch API (JSR-352) in order to test the feasibility of replacing our current ETL tool with our own solution using this technology.
My goal is to build a job in which I:
get some (dummy) data from a datasource in step1,
some other data from other data-source in step2 and
merge them in step3.
I would like to process each item and, rather than write it to a file, send it to the next step, and also store the information for further use. I could do that using batchlets and jobContext.setTransientUserData().
I think I am not getting the concepts right: as far as I understand, JSR-352 is meant for this kind of ETL task, but it has two types of steps: chunks and batchlets. Chunks are "3-phase steps", in which one reads, processes and writes the data. Batchlets are tasks that are not performed on each item of the data, but only once (such as calculating totals, sending emails and so on).
My problem is that my solution is not correct if I consider the definition of batchlets.
How could one implement this kind of job using the Java EE Batch API?
I think you are better off using a chunk rather than a batchlet to implement ETLs. Typical chunk processing with a datasource looks something like the following:
ItemReader#open(): open a cursor (create the Connection, Statement and ResultSet) and save them as instance variables of the ItemReader.
ItemReader#readItem(): create and return an object that contains the data of one row, using the ResultSet.
ItemReader#close(): close the JDBC resources.
ItemProcessor#processItem(): do the calculation, then create and return an object which contains the result.
ItemWriter#writeItems(): save the calculated data to the database. Open the Connection and Statement, invoke executeUpdate() and close them.
As for your situation, I think you have to choose the data source that can be considered the primary one and open a cursor for it in ItemReader#open(), then fetch the other data in ItemProcessor#processItem() for each item.
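A minimal sketch of such a cursor-based reader (the table, the query and the datasource JNDI name are assumptions for illustration):

import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.annotation.Resource;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;
import javax.sql.DataSource;

@Named
public class PrimaryDataReader extends AbstractItemReader {

    @Resource(lookup = "jdbc/etlSource")   // assumed datasource name
    private DataSource dataSource;

    private Connection connection;
    private PreparedStatement statement;
    private ResultSet resultSet;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        connection = dataSource.getConnection();
        statement = connection.prepareStatement("SELECT id, payload FROM primary_table");
        resultSet = statement.executeQuery();
    }

    @Override
    public Object readItem() throws Exception {
        if (!resultSet.next()) {
            return null;                    // null signals that the input is exhausted
        }
        // in a real job you would return a proper value object instead of an array
        return new Object[] { resultSet.getLong("id"), resultSet.getString("payload") };
    }

    @Override
    public void close() throws Exception {
        resultSet.close();
        statement.close();
        connection.close();
    }
}

The ItemProcessor would then look up the matching rows from the second data source for each item, and the ItemWriter would persist the merged result in batches.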
I also recommend reading these useful examples of chunk processing:
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-1/
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-2/
My blog entries about JBatch and chunk processing:
http://www.nailedtothex.org/roller/kyle/category/JBatch

Selecting multiple data from the database at the same time

I am creating a java application that needs to collect loads of data, process the data into objects and return the objects as a list.
All of the data collected is from different tables in my database (some are joined but all of them are different SQL calls)
I was thinking of getting this data through different threads, but since multiple threads cannot use the same connection to access data in the database, I would have to create a new connection for each of these threads.
My question is: what is the best way to access and process multiple sets of data from the database at the same time?
If you have enough memory I would use a full second-level cache that syncs with the database; using a cache makes it much faster. If you don't have enough memory on the server/client, you can cache your query on the SQL server with a table that holds all the values from your query and gets updated every second.
Otherwise you can use a thread pool whose threads insert the query results into a shared result object.
I am using the Spring framework. Suppose there is a ModelBean class in which the fields are declared; name is one of the fields declared in ModelBean.
public class CurriculumReportDaoImpl extends JdbcDaoSupport {
    public List<ModelBean> fetchList() {
        String query = "";
        List<ModelBean> tempList = new ArrayList<ModelBean>();
        List<Map<String, Object>> records = getJdbcTemplate().queryForList(query);
        for (Map<String, Object> result : records) {
            /* create a new instance of ModelBean */
            ModelBean model = new ModelBean();
            /* "name" is the column name fetched from the db */
            model.setName(result.get("name").toString());
            /* now add the model to tempList */
            tempList.add(model);
        }
        return tempList;
    }
}
If you have many connections, you can create several lists like this and set them into the ModelBean class.
I think this will help you.
Ideally, the database should be designed in such a way that you can get all the related data in a single query (maybe with JOINs). Not sure if that is possible to achieve in your case.
There are three viable options. You have already tried the first one, creating multiple threads and fetching the data; I will just add an input to that approach which you may try in case it optimizes things: create data fetcher threads to fetch the data from the different tables and one data processor thread to process the data as it is fetched by the fetcher threads (see the sketch after this answer).
or
The second approach is to create a stored procedure, which will run directly on the database and can do some of the data processing for you. This avoids the need to create too many threads and do a lot of processing in your Java code.
or
Mix both the approaches to achieve the best results.
Good luck!
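For the fetcher-threads approach, a minimal sketch might look like this; the JDBC URL, credentials and queries are placeholders, each worker opens its own connection, and a single thread processes the shared results afterwards (a dedicated consumer thread draining the queue while the fetchers run is a straightforward variation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        // placeholder connection details and queries
        String url = "jdbc:oracle:thin:@//dbhost:1521/service";
        List<String> queries = Arrays.asList(
                "SELECT id, name FROM table_a",
                "SELECT id, total FROM table_b");

        Queue<Map<String, Object>> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());

        for (String sql : queries) {
            pool.submit(() -> {
                // one connection per thread, since a connection cannot be shared
                try (Connection con = DriverManager.getConnection(url, "user", "pwd");
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {
                    ResultSetMetaData meta = rs.getMetaData();
                    while (rs.next()) {
                        Map<String, Object> row = new HashMap<>();
                        for (int i = 1; i <= meta.getColumnCount(); i++) {
                            row.put(meta.getColumnLabel(i), rs.getObject(i));
                        }
                        results.add(row);   // thread-safe shared collection
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        // turn 'results' into domain objects here, on a single thread
    }
}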

Java - multithreaded access to a local value store which is periodically cleared

I'm hoping for some advice or suggestions on how best to handle multi threaded access to a value store.
My local value storage is designed to hold onto objects which are currently in use. If the object is not in use then it is removed from the store.
A value is pumped into my store via thread1, its entry into the store is announced to listeners, and the value is stored. Values coming in on thread1 will either be totally new values or updates for existing values.
A timer is used to periodically remove any value from the store which is not currently in use and so all that remains of this value is its ID held locally by an intermediary.
Now, an active element on thread2 may wake up and try to access a set of values by passing a set of value IDs which it knows about. Some values will be stored already (great) and some may not (sadface). Those values which are not already stored will be retrieved from an external source.
My main issue is that items which have not already been stored and are currently being queried for may arrive on thread1 before the query is complete.
I'd like to try and avoid locking access to the store whilst a query is being made as it may take some time.
It seems that you are looking for some sort of cache. Did you try investigating existing cache implementations? Maybe one of them will do.
For example, Guava's cache implementation seems to cover a lot of your requirements: http://code.google.com/p/guava-libraries/wiki/CachesExplained
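As a rough sketch of how that could map onto the scenario above (Value, id, value and loadFromExternalSource() are placeholders for your stored type, your keys and the slow external lookup, not real API from the question):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// entries that have not been touched for a while are evicted, which mirrors
// the timer that removes values no longer in use
LoadingCache<String, Value> store = CacheBuilder.newBuilder()
        .expireAfterAccess(10, TimeUnit.MINUTES)
        .build(new CacheLoader<String, Value>() {
            @Override
            public Value load(String id) throws Exception {
                return loadFromExternalSource(id);   // placeholder for the slow fetch
            }
        });

// thread1 pushes new values or updates:
store.put(id, value);

// thread2 asks for a value by ID; if it is missing, the loader fetches it, and
// two threads asking for the same missing ID share a single load rather than
// locking the whole store
Value v = store.getUnchecked(id);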

Counting Unique Users using Mapreduce for Java Appengine

I'm trying to count the number of unique users per day on my Java App Engine app. I have decided to use the MapReduce framework (mapreduce.appspot.com) for Java App Engine to do this calculation offline. I've managed to create a MapReduce job that goes through all of my entities, each of which represents a single user's session event. I can use a simple counter as well. I have several questions though:
1) How do I only increment a counter once for each user ID? I am currently mapping over entities which contain a user ID property, but many of these entities may contain the same user ID, so how do I only count it once?
2) Once I have the results of the job stored in these counters, how can I persist them to the datastore? I see the results of the counters on the MapReduce status page, but I want these results automatically persisted to the datastore.
Ideas?
I haven't actually used the MapReduce functionality yet, but my theoretical understanding is that you can write to the datastore from within your mapper. You could create an entity kind called something like UniqueCount and insert one entity every time your mapper sees an ID that it hasn't seen before; then you can count how many unique IDs you have. In fact, you can just update a counter every time you find a new unique entity. You may want to google "sharded counter" for hints on creating a counter in the datastore that can handle high write throughput.
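A rough sketch of that idea using the low-level datastore API (the UniqueCount kind, the day/userId properties and the surrounding method are assumptions, not part of the MapReduce framework itself):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

// called from the mapper for each session entity (method and properties are assumptions)
void markUserSeen(String userId, String day) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    String keyName = day + "_" + userId;        // one marker entity per user per day
    try {
        ds.get(KeyFactory.createKey("UniqueCount", keyName));
        // already recorded for this day, nothing to do
    } catch (EntityNotFoundException e) {
        Entity seen = new Entity("UniqueCount", keyName);
        seen.setProperty("userId", userId);
        seen.setProperty("day", day);
        ds.put(seen);
        // first time this user is seen today; bump a sharded counter here if you
        // want a running total (use a transaction if the count must be exact)
    }
}

Because the key name is derived from the day and user ID, a racing duplicate put just overwrites the same marker entity, so counting the UniqueCount entities afterwards still gives the correct number of unique users even without a transaction.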
Eventually, when they finish the Reduce functionality, I imagine this whole task will become pretty trivial.
