Parallel processing with a loop construct in Java

I am building a crawler: a Java web app in which users can define crawl jobs that extract and store information from specific websites.
As part of this work, there is a 'loop' construct. It has a list portion, which is evaluated initially and typically represents a list of values. After that comes the loop body, which is executed once for each item in that list.
Note that a loop construct can be nested inside another loop construct, and so on.
The problem is that a list can sometimes contain millions of rows of data, and the body must be executed for each row. The body has a start index, an upper bound for the index, and increments by one.
For a single-level loop, I want to first calculate the list value and store it in the database. Then, instead of executing the body in one go, I want to split it into different sections so that different sections of the list are processed in parallel.
However, how do I split up a job for an n-level loop (i.e. one loop within another, and so on)?
Is there a recommended way of doing this kind of processing? Any tutorial or guide you could point me to would be very helpful.

I suggest packing the processing logic for one element of the list into a Runnable or Callable and passing these to an Executor for execution. This will run the tasks in parallel on different worker threads. Of course, how "parallel" this really is depends on how many cores your machine has.
If each element of the list can be processed completely independently of all the others, this is the way I would go, instead of messing around with Threads myself and dividing the list into sublists.
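A minimal sketch of this approach. The process step and the item list here are invented placeholders standing in for the real per-element crawl logic, which the question does not show:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListProcessing {

    // Hypothetical per-element crawl/extract step (placeholder)
    static String process(String item) {
        return "processed:" + item;
    }

    // Submit one Callable per element to a fixed pool and collect results in order
    static List<String> processAll(List<String> items) throws Exception {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> futures = new ArrayList<>();
        for (String item : items) {
            futures.add(executor.submit(() -> process(item)));  // runs on a worker thread
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());                 // get() blocks until the task is done
        }
        executor.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(List.of("a", "b", "c")));
        // prints [processed:a, processed:b, processed:c]
    }
}
```

Because the futures are collected in submission order, results come back in list order even though the tasks run concurrently.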

From your description, I gather that you are fetching the source code of some website and scraping data from it.
XPath and regular expressions suit this kind of task best. Use JSoup for it (it's open source); it helps a lot.
As far as parallelization is concerned, you can use JSoup's select, getElementById, and getElementsByClass methods, and then simply iterate over the results in pairs, for example:
for (int i = 0; i < length; i += 2) {
    // fetch element i
    if (i + 1 < length) {
        // fetch element i + 1
    }
}
Hope this helps: http://jsoup.org

This sounds like a great candidate for the Java 7 fork/join framework.
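For illustration, a minimal fork/join sketch that recursively splits an index range, processes the halves in parallel, and combines the results. The summing task and the threshold are assumptions chosen for the example, not part of the question:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Core fork/join pattern: split the range, fork one half, compute the other
public class RangeTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int from, to;

    public RangeTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {           // small enough: process directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;              // otherwise split in half
        RangeTask left = new RangeTask(data, from, mid);
        RangeTask right = new RangeTask(data, mid, to);
        left.fork();                            // left half runs asynchronously
        return right.compute() + left.join();   // right half here, then combine
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long total = new ForkJoinPool().invoke(new RangeTask(data, 0, data.length));
        System.out.println(total); // prints 49995000
    }
}
```

The same split-and-combine shape fits the crawler's case: a leaf task would process one section of the stored list instead of summing numbers.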

Let's say you create 3 threads: T1, T2, and T3, and the following is the looping construct, for example:
for(int i=0; i<100; i++)
{
for(int j=0; j<100; j++)
{
for(int k=0; k<100; k++)
{
// do some processing.
}
}
}
Modify the increment of the outermost loop to i += numberOfThreads; in this case, i += 3.
Each thread then starts from a different initial value of i:
For T1: i = 0
For T2: i = 1
For T3: i = 2
The inner loops still run in full for every value of i, and the loop limits can be handled similarly.
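A sketch of this interleaving scheme, with a shared counter standing in for the real loop body, to show that the threads together still cover every (i, j, k) combination exactly once:

```java
import java.util.concurrent.atomic.AtomicLong;

public class StridedLoops {
    // Each thread takes every nThreads-th value of the outer index i,
    // starting at its own offset; the inner loops run in full for every i.
    static long run(int nThreads, int limit) throws InterruptedException {
        AtomicLong iterations = new AtomicLong();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int start = t;                          // T1 -> 0, T2 -> 1, T3 -> 2
            threads[t] = new Thread(() -> {
                for (int i = start; i < limit; i += nThreads) {   // stride by thread count
                    for (int j = 0; j < limit; j++) {
                        for (int k = 0; k < limit; k++) {
                            iterations.incrementAndGet(); // stand-in for the real work
                        }
                    }
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return iterations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(3, 100)); // prints 1000000: all 100^3 iterations covered
    }
}
```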

Related

Running a test method multiple times without using loop

The scenario: I fetch four pieces of data from open API endpoints, from a JSON array in the first output. I keep the data in four ArrayLists: account number, commission amount, revenue amount, and mark-up realized amount. Up to this point it works fine. Now I have to take the data at the first index of each list and use it in a test method; in the same way I would take the second- and third-index data from the lists and use them in the same test method. The test method goes to the UI and validates. Right now I use a loop that executes the test body three times, taking data three times from the ArrayLists. Is there a different way to avoid the loop inside the test method, so that the test method executes three times, taking each index value from the ArrayLists?
@Test(dependsOnMethods = "getTransactionRecordDetailsForSpecificAccountNumberByDateRange")
public void verifyTransactionDetailsFromOpenAPIOnSFObject() throws InterruptedException, AWTException {
    for (int i = 0; i < 3; i++) {
        launchApp();
        homepage = new HomePage();
        companyaccountpage = new CompanyAccountPage();
        homepage.navigateToCompanyAccountDetailsPage(li_AcctNo.get(i));
        companyaccountpage.navigateToTransactionTableDetails();
        Assert.assertEquals(companyaccountpage.getValueOfTotalComissionAmountFromTransactionTable(actual_tsDate),
                li_comAmt.get(i));
        Assert.assertEquals(companyaccountpage.getValueOfTotalMarkUpRealizedAmountTransactionTable(actual_tsDate),
                li_markupAmt.get(i));
        Assert.assertEquals(companyaccountpage.getValueOfTotalRevenueAmountFromTransactionTable(actual_tsDate),
                li_revenueAmt.get(i));
        fd.quit();
    }
}
Perhaps three different test cases would be another, more logical way to do this, since you are testing different details. If you are worried about code duplication, you can extract the loop body into a separate method and call it from each test case.

Iterate through each element in java to see if it's displayed

I am new to Java and would like assistance on the best way to do the following:
I want to grab all of the elements that belong to a class and count how many of them are displayed. I have set up the code with an int counter, the findElements call, and the assertion. My question is: how do you write a for loop in Java that checks whether each element in the list is displayed and, if so, adds 1 to the count?
int testNumber = 0;
List<WebElement> testItems = _driver.findElements(By.className("test"));
//Loop each element from the list and if it's displayed then +1 for test Number
Assert.assertEquals(testNumber, 4,
"Mismatch between the number of test items displayed");
Use the method WebElement#isDisplayed (documentation) to get the display status of an element.
All you need to do is set up a counter like int counter = 0, then loop over all the elements with for (WebElement e : testItems) { ... }. Inside the loop, test if (e.isDisplayed()) { ... }, and inside this increase the counter with counter++. All in all:
int counter = 0;
for (WebElement e : testItems) {
if (e.isDisplayed()) {
counter++;
}
}
This variant uses the enhanced for-loop. You can of course use other loop variants too.
Using streams, this can be made a compact expression (note that count() returns a long):
long count = testItems.stream()
        .filter(WebElement::isDisplayed)
        .count();
Note that, depending on the website, WebElements can quickly go stale if they are not used soon after they were located. This is indicated by a StaleElementReferenceException: the website has rebuilt itself (completely or partially), so your element was retrieved from an old state of the page and is now invalid.
If you experience this, read the display status of each element directly after finding it, rather than after finding all the other elements. This decreases the time between locating and accessing a single WebElement.

Nesting parallelizations in Spark? What's the right approach?

NESTED PARALLELIZATIONS?
Let's say I am trying to do the equivalent of "nested for loops" in Spark. Something like in a regular language: say I have a routine in the inner loop that estimates Pi the way the Pi Average Spark example does (see Estimating Pi):
iLimit = 1000; jLimit = 10^6; counter = 0.0;
for (int i = 0; i < iLimit; i++)
    for (int j = 0; j < jLimit; j++)
        counter += PiEstimator();
estimateOfAllAverages = counter / iLimit;
Can I nest parallelize calls in Spark? I am trying and have not worked out the kinks yet. Would be happy to post errors and code but I think I am asking a more conceptual question about whether this is the right approach in Spark.
I can already parallelize a single Spark Example / Pi Estimate, now I want to do that 1000 times to see if it converges on Pi. (This relates to a larger problem we are trying to solve, if something closer to MVCE is needed I'd be happy to add )
BOTTOM LINE QUESTION: I just need someone to answer directly: is this the right approach, using nested parallelize calls? If not, please advise something specific, thanks! Here's pseudo-code for what I think the right approach would be:
// use accumulator to keep track of each Pi Estimate result
sparkContext.parallelize(arrayOf1000, slices).map{ Function call
sparkContext.parallelize(arrayOf10^6, slices).map{
// do the 10^6 thing here and update accumulator with each result
}
}
// take average of accumulator to see if all 1000 Pi estimates converge on Pi
BACKGROUND: I had asked this question before and got a general answer, but it did not lead to a solution; after some waffling I decided to post a new question with a different characterization. I also tried asking on the Spark user mailing list, but no dice there either. Thanks in advance for any help.
This is not even possible, as SparkContext is not serializable. If you want a nested for loop, then your best option is to use cartesian:
val nestedForRDD = rdd1.cartesian(rdd2)
nestedForRDD.map { case (rdd1TypeVal, rdd2TypeVal) =>
  // do your inner-nested evaluation code here
}
Keep in mind that, just like a double for loop, this comes at a size cost.
In the Pi example, you can get the same answer from the nested for loop by doing a single loop through the process i * j times, summing over all of them, and then dividing by j at the end. If you have steps that you want to apply in the outer loop, do them within the loop, but create different groups by assigning specific keys to each inner-loop group. Without knowing what kinds of things you want to do in the outer loop, it's hard to give an example here.
For the simple case of just averaging to improve convergence, it's relatively easy: instead of doing the nested loop, just make an RDD with i * j elements and then apply the function to each element.
This might look like the following (with PySpark):
(f is whatever function you want to apply; remember that it will be passed each element in the RDD, so define f with an input parameter even if you don't use it in your function.)
x = RandomRDDs.uniformRDD(sc, i*j)
function_values = x.map(f)
from operator import add
sum_of_values = function_values.reduce(add)
averaged_value = sum_of_values / j  # if you are only averaging over the outer loop
If you want to perform actions in the outer loop, I'd assign an index (zipWithIndex) and then create a key using the index modulo j. Then each distinct key would be a single virtual inner-loop cycle, and you can use operators like aggregateByKey, foldByKey, or reduceByKey to perform actions only on those records. This will probably take a bit of a performance hit if the different keys are distributed to different partitions.
An alternative would be to repartition the rdd onto j partitions and then use a foreachPartition function to apply a function to each partition.
A third option would be to run the inner loop j times in parallel, concatenate the results into one distributed file, and then do the outer loop operations after reading this into Spark.
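The flattening idea from the first option can be illustrated in plain Java. The estimator here is an invented deterministic stand-in (a real one would be a Monte Carlo Pi sample); it exists only to show that the nested and flat loop shapes produce the same average, which is why the flat version can be handed to Spark as one collection of i * j elements:

```java
public class FlatVsNested {
    // Deterministic stand-in for a per-trial estimator (invented for this sketch)
    static double estimate(int trial) {
        return (trial % 7) + 1.0;
    }

    // Nested version: average over all i * j estimates
    static double nestedAverage(int iLimit, int jLimit) {
        double sum = 0;
        for (int i = 0; i < iLimit; i++)
            for (int j = 0; j < jLimit; j++)
                sum += estimate(i * jLimit + j);
        return sum / (iLimit * jLimit);
    }

    // Flat version: one loop over i * j trials -- same result, and trivially
    // parallelizable as a single distributed collection of i * j elements
    static double flatAverage(int iLimit, int jLimit) {
        double sum = 0;
        for (int t = 0; t < iLimit * jLimit; t++)
            sum += estimate(t);
        return sum / (iLimit * jLimit);
    }

    public static void main(String[] args) {
        System.out.println(nestedAverage(1000, 100) == flatAverage(1000, 100)); // prints true
    }
}
```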
No, you can't.
SparkContext is only accessible from the Spark driver node. The inner parallelize() calls would try to use the SparkContext from the worker nodes, which do not have access to it.

where to initiate object in java

This question relates to my answer to another question of mine.
The original question is here.
Can anyone explain why the bad code fails in the way explained in the comments? (By the way, it is pseudocode.)
// bad code
ResultSet rs = getDataFromDBMS("Select * from [tableName];");
//temp uses a collection member within it to hold a list of column names to data value pairs (hashmap<String,String>)
Object temp = new objectToHoldInfoFromResultSet();
// loop over the result set
while (rs.next)// for each row in the result set
{
for (int i = 1; i <= rs.getNumberColums; i++) {
temp.add(infoAboutColumn);
}
temp.printInfo();// always prints correct (ie current) info
//the print just takes the hashmap member and asks for a
//toString() on it
anotherObject(makeUseOf(temp));// always uses info from first
//iteration. Essentially grabs the hashMap member of temp, and puts
//this into another collection of type
//HashMap< HashMap<String,String>, temp> see the linked question
//for more detail.
}
// Seemingly each loop into the while the temp.doSomethingToData(); uses
// the temp object created in the first iteration
// good code
ResultSet rs = getDataFromDBMS("Select * from [tableName];");
// loop over the result set
while (rs.next)// for each row in the result set
{
Object temp = new objectToHoldInfoFromResultSet();// moving
// declaration
// of temp into
// the while
// loop solves
// the problem.
for (int i = 1; i <= rs.getNumberColums; i++) {
temp.add(infoAboutColumn);
}
temp.printInfo();// always prints correct (ie current) info
anotherObject(makeUseOf(temp));// also uses the correct (ie current)
// info.
}
We can't reliably answer this without knowing what temp.printInfo() and makeUseOf() are doing. It is easy to implement them to behave the way you describe though.
When you instantiate temp outside the loop, you will be using the same object throughout all iterations of the loop. Thus it is possible for it to gather data from each iteration (e.g. into a collection). And then it is possible for methods to get data accumulated in the current iteration, as well as from any previous iteration, which may cause problems if it was not intended.
So let's assume temp contains a collection and in each iteration a column from the resultset is added to it. Now if temp.printInfo() is implemented to print info about the last element in this collection, while makeUseOf() is implemented to do something with the first element in the collection, you get the behaviour you observed.
OTOH when you instantiate temp inside the loop, you will get a new, distinct object in each iteration, which won't "remember" any data from earlier iterations. Thus even with the implementations of temp.printInfo() and makeUseOf() outlined above, you will get correct results.
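A minimal runnable sketch of this difference, using a hypothetical holder class with an internal collection (the names are invented, since the real objectToHoldInfoFromResultSet is not shown):

```java
import java.util.ArrayList;
import java.util.List;

public class HolderScope {
    // Hypothetical stand-in for objectToHoldInfoFromResultSet
    static class Holder {
        final List<String> info = new ArrayList<>();
        void add(String s) { info.add(s); }
    }

    // Holder created outside the loop: it keeps accumulating rows
    static int sharedHolderSize(int rows) {
        Holder temp = new Holder();
        for (int row = 0; row < rows; row++) {
            temp.add("row" + row);
        }
        return temp.info.size();
    }

    // Holder created inside the loop: each iteration gets a fresh object
    static int freshHolderSize(int rows) {
        int lastSize = 0;
        for (int row = 0; row < rows; row++) {
            Holder temp = new Holder();
            temp.add("row" + row);
            lastSize = temp.info.size();
        }
        return lastSize;
    }

    public static void main(String[] args) {
        System.out.println(sharedHolderSize(3)); // prints 3: all rows retained
        System.out.println(freshHolderSize(3));  // prints 1: only the current row
    }
}
```

Any method that reads "the first element" from the shared holder will keep returning data from the first iteration, which matches the behaviour described in the question.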
I am not sure why the one is good and the other bad. My guess (without knowing the behaviour of objectToHoldInfoFromResultSet and the other methods) is the following:
In the first version, the objectToHoldInfoFromResultSet (the class name should be capitalised) is created once, and every time
temp.add(infoAboutColumn);
is called, new record data is added to the same object. I would guess that this info should be cleared for each record; otherwise you get a lot of duplication, i.e. (1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6) instead of (1 2 3 4 5 6). Re-initialising the holder object takes care of the duplication.
Without knowing more about the various proprietary objects, there is not much more I can tell you.

Array access optimization

I have a 10x10 array in Java, some of whose items are not used, and I need to traverse all elements as part of a method. Which would be better:
Go through all elements with two for loops and check for null to avoid errors, e.g.
for(int y=0;y<10;y++){
for(int x=0;x<10;x++){
if(array[x][y]!=null)
//perform task here
}
}
Or would it be better to keep a list of all the used addresses, say an ArrayList of points?
Or something different I haven't mentioned?
I look forward to any answers :)
Any solution you try needs to be tested in controlled conditions resembling as much as possible the production conditions. Because of the nature of Java, you need to exercise your code a bit to get reliable performance stats, but I'm sure you know that already.
This said, there are several things you may try, which I've used to optimize my Java code with success (though not on the Android JVM).
for(int y=0;y<10;y++){
for(int x=0;x<10;x++){
if(array[x][y]!=null)
//perform task here
}
}
should in any case be reworked into the following, so that the inner loop walks along a single row (better locality of reference for Java's arrays of arrays):
for(int x=0;x<10;x++){
for(int y=0;y<10;y++){
if(array[x][y]!=null)
//perform task here
}
}
Often you will get a performance improvement from caching the row reference. Let us assume the array is of type Foo[][]:
for(int x=0;x<10;x++){
final Foo[] row = array[x];
for(int y=0;y<10;y++){
if(row[y]!=null)
//perform task here
}
}
Using final on local variables was supposed to help the JVM optimize the code, but I think modern JIT compilers can in many cases figure out on their own whether a variable is changed in the code. On the other hand, hoisting the declaration out of the loop may sometimes be more efficient, although this takes us firmly into the realm of micro-optimizations:
Foo[] row;
for(int x=0;x<10;x++){
row = array[x];
for(int y=0;y<10;y++){
if(row[y]!=null)
//perform task here
}
}
If you don't need to know the element's indices in order to perform the task on it, you can write this as:
for(final Foo[] row: array){
    for(final Foo elem: row){
        if(elem!=null)
            //perform task here
    }
}
Another thing you may try is to flatten the array and store the elements in Foo[] array, ensuring maximum locality of reference. You have no inner loop to worry about, but you need to do some index arithmetic when referencing particular array elements (as opposed to looping over the whole array). Depending on how often you do it, it may or not be beneficial.
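A small sketch of the flattening idea, with the index arithmetic x * cols + y standing in for array[x][y] (the 10x10 size is taken from the question; the contents are invented):

```java
public class FlatGrid {
    // Count non-null cells in a flattened rows x cols grid
    static int countNonNull(String[] flat) {
        int n = 0;
        for (String e : flat) {            // single loop, single bounds check per access
            if (e != null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        final int rows = 10, cols = 10;
        // One contiguous array instead of String[10][10]: maximum locality of reference
        String[] flat = new String[rows * cols];
        flat[3 * cols + 7] = "item";       // the equivalent of grid[3][7]
        System.out.println(countNonNull(flat)); // prints 1
    }
}
```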
Since most of the elements will be non-null, keeping them in a sparse structure is not beneficial for you, as you lose locality of reference.
Another cost is the null test. The test itself is cheap, but the conditional statement following it is not: you get a branch in the code and lose time on mispredicted branches. What you can do is use a "null object", on which the task can still be performed but amounts to a no-op or something equally benign. Depending on the task you want to perform, this may or may not work for you.
Hope this helps.
You're better off using a List than an array, especially since you may not use the whole set of data. This has several advantages:
You're not checking for nulls, and you cannot accidentally try to use a null object.
It is more memory efficient, in that you're not allocating memory which may never be used.
For a hundred elements, it's probably not worth using any of the classic sparse array implementations. However, you don't say how sparse your array is, so profile it and see how much time you spend skipping null items compared to whatever processing you're doing.
(As Tom Hawtin - tackline mentions) When using an array of arrays, you should try to loop over the members of each inner array rather than looping over the same index of different arrays. Not all algorithms allow you to do that, though.
for ( int x = 0; x < 10; ++x ) {
for ( int y = 0; y < 10; ++y ) {
if ( array[x][y] != null )
//perform task here
}
}
or
for ( Foo[] row : array ) {
for ( Foo item : row ) {
if ( item != null )
//perform task here
}
}
You may also find it better to use a null object rather than testing for null, depending on the complexity of the operation you're performing. Don't use the polymorphic version of the pattern (a polymorphic dispatch will cost at least as much as a test and branch), but if you are summing properties, having an object that reports a zero is probably faster on many CPUs:
double sum = 0;
for ( Foo[] row : array ) {
for ( Foo item : row ) {
sum += item.value();
}
}
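The summing loop above can be sketched end to end with an invented Foo and a shared zero-valued sentinel, so the loop needs no null test at all:

```java
import java.util.Arrays;

public class NullObjectSum {
    static class Foo {
        private final double v;
        Foo(double v) { this.v = v; }
        double value() { return v; }
    }

    // Shared sentinel: acts like "no item" and contributes zero to the sum
    static final Foo EMPTY = new Foo(0.0);

    static double sum(Foo[][] array) {
        double sum = 0;
        for (Foo[] row : array) {
            for (Foo item : row) {
                sum += item.value();       // no null test, no branch to mispredict
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Foo[][] array = new Foo[10][10];
        for (Foo[] row : array) {
            Arrays.fill(row, EMPTY);       // unused cells hold the null object
        }
        array[2][5] = new Foo(4.0);
        array[7][1] = new Foo(6.0);
        System.out.println(sum(array));    // prints 10.0
    }
}
```

One shared EMPTY instance keeps the memory cost negligible, since the sentinel is immutable.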
As to what applies on Android, I'm not sure; again, you need to test and profile for any optimisation.
Holding an ArrayList of points would be "over engineering" the problem. You have a multi-dimensional array; the best way to iterate over it is with two nested for loops. Unless you can change the representation of the data, that's roughly as efficient as it gets.
Just make sure you go in row order, not column order.
Depends on how sparse/dense your matrix is.
If it is sparse, you better store a list of points, if it is dense, go with the 2D array. If in between, you can have a hybrid solution storing a list of sub-matrices.
This implementation detail should be hidden within a class anyway, so your code can also anytime convert between any of these representations.
I would discourage you from settling on any of these solutions without profiling with your real application.
I agree that an array with a null test is the best approach, unless you expect sparsely populated arrays.
Reasons for this:
1. It is more memory efficient for dense arrays (a list needs to store the index as well).
2. It is more computationally efficient for dense arrays (you only need to compare the retrieved value to null, instead of also fetching the index from memory).
Also, a small suggestion: in Java especially, you are often better off faking a multi-dimensional array with a 1D array where possible (for square/rectangular arrays in 2D). Bounds checking then happens only once per iteration instead of twice. I am not sure whether this still applies in the Android VMs, but it has traditionally been an issue. Regardless, you can ignore it if the loop is not a bottleneck.
