Threading a recursive function - java

I have this recursive function that finds hrefs on a URL and adds them all to a global list. This is done synchronously and takes a long time. I have tried to do this with threading but have failed to get all threads to write to the one list. Could someone please show me how to do this with threading?
private static void buildList (String BaseURL, String base){
try{
Document doc = Jsoup.connect(BaseURL).get();
org.jsoup.select.Elements links = doc.select("a");
for(Element e: links){
//only if this URL has not been visited yet
if(!urls.contains(e.attr("abs:href"))){
//eliminates pictures and pdfs
if(!e.attr("abs:href").contains(".jpg")){
if(!e.attr("abs:href").contains("#")){
if(!e.attr("abs:href").contains(".pdf")){
//makes sure it doesn't leave the website
if(e.attr("abs:href").contains(base)){
urls.add(e.attr("abs:href"));
System.out.println(e.attr("abs:href"));
//recursive call
buildList(e.attr("abs:href"),base);
}
}
}
}
}
}
} catch(IOException ex) {
}
//to print out all urls.
/*
* for(int i=0;i<urls.size();i++){
* System.out.println(urls.get(i));
* }
*/
}

This is a great use case for ForkJoin. It'll provide excellent concurrency with very simple code.
For the set of URLs already parsed, use Collections.synchronizedSet(new HashSet<String>()).
You can also create a ForkJoinPool larger than the number of cores you have, since network I/O is involved (the common usage expects each thread to be doing work ~100% of the time).
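A minimal sketch of the ForkJoin suggestion above, assuming Java 8 or later. CrawlTask, crawl and fetchLinks are hypothetical names that do not appear in the question: fetchLinks stands in for the Jsoup call plus the .jpg/.pdf/# and base filters, so the sketch runs without a network. The parallelism argument is only an illustrative parameter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.function.Function;

// Hypothetical sketch: one task per URL, all tasks share one synchronized set.
class CrawlTask extends RecursiveAction {
    private final String url;
    private final Set<String> visited;                        // shared, thread-safe
    private final Function<String, List<String>> fetchLinks;  // assumption: returns the filtered absolute hrefs of a page

    CrawlTask(String url, Set<String> visited, Function<String, List<String>> fetchLinks) {
        this.url = url;
        this.visited = visited;
        this.fetchLinks = fetchLinks;
    }

    @Override
    protected void compute() {
        List<CrawlTask> children = new ArrayList<>();
        for (String link : fetchLinks.apply(url)) {
            // add() returns false if the URL is already present, so the
            // "contains" check and the insert happen as one atomic step.
            if (visited.add(link)) {
                children.add(new CrawlTask(link, visited, fetchLinks));
            }
        }
        invokeAll(children); // run the child tasks in the pool and wait for them
    }

    static Set<String> crawl(String start, Function<String, List<String>> fetchLinks, int parallelism) {
        Set<String> visited = Collections.synchronizedSet(new HashSet<String>());
        visited.add(start);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new CrawlTask(start, visited, fetchLinks));
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

The important change from the question's code is replacing the separate contains() and add() calls with a single add(), since two threads could otherwise both pass the contains() check for the same URL.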

Use any collection from the java.util.concurrent package to store the values you get from the different threads. ArrayBloac
You can use fork/join once you have broken your problem down into a divide-and-conquer algorithm.

Related

Guaranteeing order of file content when fetched through multi threading

Suppose there are 100 files numbered from 1 to 100 and you need to read these files in parallel using multi-threading. Is there any way to print the content of these files in order, i.e. 1-100?
Yes, provided you can hold the contents of all of them in memory.
The basic idea is to submit the reads in file order, store the Future for each one as you go, and then get the values from the futures in the order they were created.
// Fill this with the file paths, in the order they should be printed.
List<String> filePathsInOrder = new ArrayList<>();
List<Future<String>> fileOutputsInOrder = new ArrayList<>();
for (String filePath : filePathsInOrder) {
fileOutputsInOrder.add(CompletableFuture.supplyAsync(() -> {
try {
return Files.readString(Paths.get(filePath));
}
catch (IOException e) {
throw new RuntimeException(e);
}
}));
}
for (Future<String> fileOutput : fileOutputsInOrder){
System.out.println(fileOutput.get());
}
You would of course need to take care of subtleties like exception handling in case your reads fail, etc. This is not done above, as it is beyond the scope of this question.
Yes, of course. You can create a String array of 100 elements and fill the element of proper index, so, if you read file 55, then you set the 54th String (remember, indexing starts from 0) to it. If you wait for all threads to be finished, then you can just loop this array and print its contents. You can also decide not to wait. In that case you can have a numeric n value (initialized to -1) which would denote which was the last file successfully printed and upon each thread end you could print out the files you can at that point.
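A rough sketch of the "don't wait" variant described above, under some assumptions: OrderedPrinter, deliver and readAll are made-up names, the reader function stands in for the actual file read, and the output is collected in a list instead of being printed so that the order is easy to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.IntFunction;

// Hypothetical sketch: workers fill slots by index, output advances only over contiguous slots.
class OrderedPrinter {
    private final String[] slots;      // slots[i] holds the content of file i + 1
    private int lastPrinted = -1;      // index of the last slot already emitted
    private final List<String> output = new ArrayList<>(); // stands in for System.out

    OrderedPrinter(int count) {
        slots = new String[count];
    }

    // Called by a worker when its read finishes; emits every slot that is now next in line.
    synchronized void deliver(int index, String content) {
        slots[index] = content;
        while (lastPrinted + 1 < slots.length && slots[lastPrinted + 1] != null) {
            lastPrinted++;
            output.add(slots[lastPrinted]);
            slots[lastPrinted] = null; // the content is no longer needed once emitted
        }
    }

    synchronized List<String> output() {
        return new ArrayList<>(output);
    }

    // reader is an assumption: it maps a zero-based index to that file's content.
    static List<String> readAll(int count, IntFunction<String> reader, int threads) {
        OrderedPrinter printer = new OrderedPrinter(count);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < count; i++) {
            final int index = i;
            pool.submit(() -> printer.deliver(index, reader.apply(index)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return printer.output();
    }
}
```

Because deliver() is synchronized, the counter and the array are only ever touched by one thread at a time, which is what makes the "print what you can" step safe.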

Retaining the stack position of a recursive function between calls

This question is general, but I feel it is best explained with a specific example. Let's say I have a directory with many nested sub directories and in some of those sub directories there are text files ending with ".txt". A sample structure could be:
dir1
dir2
file1.txt
dir3
file2.txt
file3.txt
I'd be interested if there were a way in Java to build a method that could be called to return the successive text files:
TextCrawler crawler = new TextCrawler(new File("dir1"));
File textFile;
textFile = crawler.nextFile(); // value is file1.txt
textFile = crawler.nextFile(); // value is file2.txt
textFile = crawler.nextFile(); // value is file3.txt
Here is the challenge: No internal list of all the text files can be saved in the crawler object. That is trivial. In that case you'd simply build into the initialization a method that recursively builds the list of files.
Is there a general way of pausing a recursive method so that when it is called again it returns to the specific point in the stack where it left? Or will we have to write something that is specific to each situation and solutions necessarily have to vary for file crawlers, org chart searches, recursive prime finders, etc.?
If you want a solution that works on any recursive function, you can accept a Consumer object. It may look something like this:
public void recursiveMethod(Consumer<TreeNode> func, TreeNode node){
if(node.isLeafNode()){
func.accept(node);
} else{
//Perform recursive call
}
}
For a bunch of files, it might look like this:
public void recursiveMethod(Consumer<File> func, File curFile){
if(curFile.isFile()){
func.accept(curFile);
} else{
for(File f : curFile.listFiles()){
recursiveMethod(func, f);
}
}
}
You can then call it with:
File startingFile;
//Initialize startingFile as pointing to a directory
recursiveMethod((File file)->{
//Do something with file
}, startingFile);
Adapt as necessary.
I think the state should be saved while you return from your recursive function, then you need to restore the state as you call that recursive function again. There is no generic way to save such a state, however a template can probably be created. Something like this:
class Crawler<T> {
LinkedList<T> innerState;
Callback<T> callback;
constructor Crawler(T base,Callback<T> callback) {
innerState=new LinkedList<T>();
innerState.push(base);
this.callback=callback; // I want functions passed here
}
T recursiveFunction() {
T base=innerState.pop();
T result=return recursiveInner(base);
if (!result) innerState.push(base); // full recursion complete
return result;
}
private T recursiveInner(T element) {
ArrayList<T> c=callback.getAllSubElements(element);
T d;
for each (T el in c) {
if (innerState.length()>0) {
d=innerState.pop();
c.skipTo(d);
el=d;
if (innerState.length()==0) el=c.getNext();
// we have already processed "d", if full inner state is restored
}
T result=null;
if (callback.testFunction(el)) result=el;
if ((!result) && (callback.recursiveFunction(el))) result=recursiveInner(el); // if we can recurse on this element, go for it
if (result) {
// returning true, go save state
innerState.push(el); // push current local state to "stack"
return result;
}
} // end foreach
return null;
}
}
interface Callback<T> {
boolean testFunction(T element);
boolean recursiveFunction(T element);
ArrayList<T> getAllSubElements(T element);
}
Here, skipTo() is a method that modifies the iterator on c to point to the provided element. Callback<T> is a means of passing functions to the class to be used as condition checkers: say, "Is T a folder?" for the recursion check and "Is T a *.txt?" for the return check; getAllSubElements should also belong here. The for each loop comes from a lack of knowledge of how to work with modifiable iterators in Java; please adapt it to actual code.
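One way to turn the "save and restore the state" idea into compilable Java is to keep an explicit stack of iterators, so the stack object plays the role of the call stack. This is a sketch of a swapped-in technique, not the template above made runnable; LazyCrawler, children and matches are hypothetical names, and it assumes that children returns an empty list for anything that cannot be descended into.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: a lazy depth-first walk whose position survives between calls.
class LazyCrawler<T> {
    private final Deque<Iterator<T>> stack = new ArrayDeque<>();
    private final Function<T, List<T>> children; // assumption: empty list for leaves
    private final Predicate<T> matches;          // e.g. "name ends with .txt"

    LazyCrawler(T root, Function<T, List<T>> children, Predicate<T> matches) {
        this.children = children;
        this.matches = matches;
        stack.push(children.apply(root).iterator());
    }

    // Returns the next matching element, or null once the walk is finished.
    T next() {
        while (!stack.isEmpty()) {
            Iterator<T> top = stack.peek();
            if (!top.hasNext()) {
                stack.pop();     // this level is exhausted: "return" to the parent
                continue;
            }
            T element = top.next();
            if (matches.test(element)) {
                return element;  // the iterators on the stack remember where we stopped
            }
            stack.push(children.apply(element).iterator()); // "recurse" into the element
        }
        return null;
    }
}
```

For files, children could list a directory (and return an empty list for plain files) and matches could test for the ".txt" suffix. Only one iterator per directory level on the current path is held, not a list of all text files.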
The only way I can think of that would meet your exact requirement would be to perform the recursive tree walk in a separate thread, and have that thread deliver results back to the main thread one at a time. (For simplicity you could use a bounded queue for the delivery, but it is also possible to implement it using wait / notify, a lock object and a single shared reference variable.)
In Python, for example, this would be a good fit for coroutines. Unfortunately, Java doesn't have a direct equivalent.
I should add that using threads is likely to incur significant overhead in synchronization and thread context switching. Using a queue will reduce this to a degree, provided that the rates of "producing" and "consuming" are well matched.
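A minimal sketch of the separate-thread approach with a bounded queue, under stated assumptions: ThreadedCrawler, children and matches are invented names, children is assumed to return an empty list for leaves, and a private sentinel object marks the end of the walk. Error handling in the walker thread is left out.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: an ordinary recursive walk runs in its own thread and
// hands results over through a queue of capacity 1.
class ThreadedCrawler<T> {
    private static final Object END = new Object();  // sentinel: no more results
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1);
    private final Function<T, List<T>> children;     // assumption: empty list for leaves
    private final Predicate<T> matches;
    private boolean finished;

    ThreadedCrawler(T root, Function<T, List<T>> children, Predicate<T> matches) {
        this.children = children;
        this.matches = matches;
        Thread walker = new Thread(() -> {
            try {
                walk(root);
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        walker.setDaemon(true); // do not keep the JVM alive for an abandoned walk
        walker.start();
    }

    // Plain recursion; put() blocks until the consumer has taken the previous result.
    private void walk(T node) throws InterruptedException {
        for (T child : children.apply(node)) {
            if (matches.test(child)) {
                queue.put(child);
            } else {
                walk(child);
            }
        }
    }

    // Returns the next result, or null once the walk is finished.
    @SuppressWarnings("unchecked")
    T next() {
        if (finished) {
            return null;
        }
        try {
            Object item = queue.take();
            if (item == END) {
                finished = true;
                return null;
            }
            return (T) item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

The recursive method itself stays untouched, which is the appeal of this design; the cost is the thread and the hand-over on every result, as noted above.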

make wait function do something on and on if it fails

I am a complete beginner at writing tests with Selenium and Java.
I know that I can use the code below to try to click on a web element twice (or as many time as I want):
for(int i=0;i<2;i++){
try{
wait.until(ExpectedConditions.visibilityOfElementLocated
(By.xpath("//button[text()='bla bla ..']"))).click();
break;
}catch(Exception e){ }
}
but I was wondering if there is anything like passing a variable to the wait function to make it retry i times by itself, something like:
wait.until(ExpectedConditions.visibilityOfElementLocated
(By.xpath("//button[text()='bla bla ..']"),2)).click();
For example, here 2 would mean "try it two times if it fails". Do we have such a thing?
Take a look at FluentWait, I think this will cover your use case specifying appropriate timeout and polling interval.
https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/support/ui/FluentWait.html
If you can't find something in the set of ExpectedConditions that does what you are wanting you can always write your own.
The WebDriverWait.until method can be passed either a com.google.common.base.Function or com.google.common.base.Predicate. If you create your own Function implementation then it's good to note that any non-null value will end the wait condition. For Predicate the apply method simply needs to return true.
Armed with that I do believe there's very little you can't do with this API. The feature you're asking about probably does not exist out of the box, but you have full capability to create it.
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/Function.html
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/Predicate.html
Best of Luck.
Untested Snippet
final By locator = By.xpath("");
Predicate<WebDriver> loopTest = new Predicate<WebDriver>(){
@Override
public boolean apply(WebDriver t) {
int tryCount = 0;
WebElement element = null;
while (tryCount < 2) {
tryCount++;
try {
element = ExpectedConditions.visibilityOfElementLocated(locator).apply(t);
//If we get this far then the element resolved. Break loop.
break;
} catch (org.openqa.selenium.TimeoutException timeout) {
//FIXME LOG IT
}
}
return element != null;
}
};
WebDriverWait wait; // must be initialized with your WebDriver and a timeout before use
wait.until(loopTest);
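The "try it n times" behaviour asked about can also be written as a small helper in plain Java, independent of Selenium. This is only a sketch: Retry and attempt are made-up names, not part of any Selenium or Guava API, and the helper retries on any RuntimeException, which may be broader than you want.

```java
import java.util.function.Supplier;

// Hypothetical helper, not a Selenium API: repeat an action a fixed number of times.
class Retry {
    // Runs the action up to maxAttempts times and returns the first successful result.
    static <T> T attempt(int maxAttempts, Supplier<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed: surface the last error
    }
}
```

With the names from the question, a call might look like Retry.attempt(2, () -> wait.until(ExpectedConditions.visibilityOfElementLocated(locator))).click(); that line is untested here and assumes wait and locator are already set up.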

Variable Declaration in multi thread usage, Java [Memory Leak Issue]

I'm building a crawler using Jsoup Library in Java.
The code structure is as follows:
public static BoneCP connectionPool = null;
public static Document doc = null;
public static Elements questions = null;
static
{
// Connection Pool Created here
}
In the MAIN method, I've called getSeed() method from 10 different threads.
The getSeed() method selects 1 random URL from the database and forwards it to processPage() method.
The processPage() method connects to the URL passed from getSeed() method using jSoup library and extracts all the URLs from it and further adds them all to database.
This process goes on for 24x7.
The problem is:
In processPage() method, it first connects to the URL sent from getSeed() method using:
doc = Jsoup.connect(URL)
And then, for each URL that is found by visiting that particular URL, a new connection is made again by jSoup.
questions = doc.select("a[href]");
for(Element link: questions)
{
doc_child = Jsoup.connect(link.attr("abs:href"))
}
Now, if I declare the doc and questions variables as global variables and null them after all the processing in the processPage() method, it solves the memory leak problem, but the other threads stop because doc and questions get nulled while they are still in use. What should I do next?
Using static fields, particularly for that kind of state, screams "wrong design", and based on your description the code is not thread-safe at all. I don't know why you think you have a memory leak at hand, but whatever it is, it will be easier to diagnose once things are in order.
What I would say is, try getting something working based on something like this:
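class YieldLinks implements Callable<Set<URI>>{
    final URI seed;
    YieldLinks(URI seed){
        this.seed = seed;
    }
    // One possible body (an assumption; the original answer left call() out).
    // Everything is held in local variables, so nothing is shared between threads.
    @Override
    public Set<URI> call() throws IOException {
        Set<URI> found = new HashSet<>();
        Document doc = Jsoup.connect(seed.toString()).get();
        for(Element link : doc.select("a[href]")){
            found.add(URI.create(link.attr("abs:href")));
        }
        return found;
    }
}
public static void main(String[] args) throws Exception {
    // seeds: the starting URIs, e.g. loaded from the database
    Set<URI> links = new HashSet<>();
    for(URI uri : seeds){
        YieldLinks yieldLinks = new YieldLinks(uri);
        links.addAll(yieldLinks.call());
    }
}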
Once this single threaded thing works ok, you could look at adding threads.
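When it is time to add threads, one option is to hand the same kind of Callable to an ExecutorService and merge the results in the calling thread. This is a sketch under assumptions: ParallelLinks, collect and extractLinks are illustrative names, and extractLinks stands in for the Jsoup work on a single page.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: one task per seed, no static or shared mutable state.
class ParallelLinks {
    static Set<URI> collect(List<URI> seeds, Function<URI, Set<URI>> extractLinks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Set<URI>>> tasks = new ArrayList<>();
            for (URI seed : seeds) {
                tasks.add(() -> extractLinks.apply(seed)); // each task works on its own locals
            }
            Set<URI> links = new HashSet<>();              // written only by the calling thread
            for (Future<Set<URI>> result : pool.invokeAll(tasks)) {
                links.addAll(result.get());
            }
            return links;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting links", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page could not be processed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Since every document and element list lives in a local variable of one task, it becomes unreachable as soon as that task returns, and no thread can null out another thread's data.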

Java Server Client, shared variable between threads

I am working on a project to create a simple auction server that multiple clients connect to. The server class implements Runnable and so creates a new thread for each client that connects.
I am trying to have the current highest bid stored in a variable that can be seen by each client. I found answers saying to use AtomicInteger, but when I used it with methods such as atomicVariable.intValue() I got null pointer exception errors.
How can I manipulate the AtomicInteger without getting this error, or is there another way to have a shared variable that is relatively simple?
Any help would be appreciated, thanks.
Update
I have the AtomicInteger working. The problem now is that only the most recent client to connect to the server seems to be able to interact with it. The other clients just sort of freeze.
Would I be correct in saying this is a problem with locking?
Well, most likely you forgot to initialize it:
private final AtomicInteger highestBid = new AtomicInteger();
However working with highestBid requires a great deal of knowledge to get it right without any locking. For example if you want to update it with new highest bid:
public boolean saveIfHighest(int bid) {
int currentBid = highestBid.get();
while (currentBid < bid) {
if (highestBid.compareAndSet(currentBid, bid)) {
return true;
}
currentBid = highestBid.get();
}
return false;
}
or in a more compact way:
for(int currentBid = highestBid.get(); currentBid < bid; currentBid = highestBid.get()) {
if (highestBid.compareAndSet(currentBid, bid)) {
return true;
}
}
return false;
You might wonder why it is so hard. Imagine two threads (requests) bidding at the same time. The current highest bid is 10. One is bidding 11, the other 12. Both threads compare their bid with the current highestBid and see that theirs is bigger. Now the second thread happens to go first and updates it to 12. Unfortunately, the first request then steps in and reverts it to 11 (because it has already checked the condition).
This is a typical race condition that you can avoid either by explicit synchronization or by using atomic variables with implicit compare-and-set low-level support.
Seeing the complexity introduced by the much more performant lock-free atomic integer, you might want to resort to classic synchronization (here highestBid is a plain int field):
public synchronized boolean saveIfHighest(int bid) {
if (highestBid < bid) {
highestBid = bid;
return true;
} else {
return false;
}
}
I wouldn't look at the problem like that. I would simply store all the bids in a ConcurrentSkipListSet, which is a thread-safe SortedSet. With the correct implementation of compareTo(), which determines the ordering, the first element of the Set will automatically be the highest bid.
Here's some sample code:
public class Bid implements Comparable<Bid> {
String user;
int amountInCents;
Date created;
@Override
public int compareTo(Bid o) {
if (amountInCents == o.amountInCents) {
return created.compareTo(o.created); // earlier bids sort first
}
return Integer.compare(o.amountInCents, amountInCents); // larger bids sort first
}
}
public class Auction {
private SortedSet<Bid> bids = new ConcurrentSkipListSet<Bid>();
public Bid getHighestBid() {
return bids.isEmpty() ? null : bids.first();
}
public void addBid(Bid bid) {
bids.add(bid);
}
}
Doing this has the following advantages:
Automatically provides a bidding history
Allows a simple way to save any other bid info you need
You could also consider this method:
/**
* @param bid
* @return true if the bid was successful
*/
public boolean makeBid(Bid bid) {
if (bids.isEmpty()) {
bids.add(bid);
return true;
}
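// compareTo() sorts larger bids first, so a result >= 0 means the new bid does not beat the current highest
if (bid.compareTo(bids.first()) >= 0) {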
return false;
}
bids.add(bid);
return true;
}
Using an AtomicInteger is fine, provided you initialise it as Tomasz has suggested.
What you might like to think about, however, is whether all you will literally ever need to store is just the highest bid as an integer. Will you never need to store associated information, such as the bidding time, user ID of the bidder etc? Because if at a later stage you do, you'll have to start undoing your AtomicInteger code and replacing it.
I would be tempted from the outset to set things up to store arbitrary information associated with the bid. For example, you can define a "Bid" class with the relevant field(s). Then on each bid, use an AtomicReference to store an instance of "Bid" with the relevant information. To be thread-safe, make all the fields on your Bid class final.
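A minimal sketch of the AtomicReference idea just described. ImmutableBid and HighestBid are hypothetical names, and the two fields are only examples of the "relevant information"; the update uses the same compare-and-set loop that Tomasz shows for AtomicInteger.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable bid: all fields final, as recommended above.
final class ImmutableBid {
    final String user;
    final int amountInCents;

    ImmutableBid(String user, int amountInCents) {
        this.user = user;
        this.amountInCents = amountInCents;
    }
}

// Hypothetical holder for the current highest bid.
class HighestBid {
    // Holds null until the first bid arrives.
    private final AtomicReference<ImmutableBid> highest = new AtomicReference<>();

    // Returns true if the bid became the new highest bid.
    boolean saveIfHighest(ImmutableBid bid) {
        while (true) {
            ImmutableBid current = highest.get();
            if (current != null && current.amountInCents >= bid.amountInCents) {
                return false; // not higher than what is already stored
            }
            if (highest.compareAndSet(current, bid)) {
                return true;
            }
            // Another thread changed the reference in between: re-read and retry.
        }
    }

    ImmutableBid get() {
        return highest.get();
    }
}
```

Because a whole ImmutableBid is swapped in at once, readers always see an amount together with the user who placed it, never a mix of two bids.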
You could also consider using an explicit Lock (e.g. see the ReentrantLock class) to control access to the highest bid. As Tomasz mentions, even with an AtomicInteger (or AtomicReference: the logic is essentially the same) you need to be a little careful about how you access it. The atomic classes are really designed for cases where they are very frequently accessed (as in thousands of times per second, not every few minutes as on a typical auction site). They won't really give you any performance benefit here, and an explicit Lock object might be more intuitive to program with.
