I have a list of about 10 different URLs from which I need to fetch content. I have written a program that fetches the content of a single URL, but I am unable to extend it to multiple URLs.
I've studied lots of tutorials on threads in Java but I'm unable to find an answer.
In my case, the URLs are like www.example1.com, www.example2.com, www.example3.com, www.example4.com.
I want to create a thread for each URL and run them all at the same time.
public class HtmlParser {

    public static int searchedPageCount = 0, skippedPageCount = 0, productCount = 0;

    public static void main(String[] args) {
        List<String> URLs = new LinkedList<String>();
        long t1 = System.currentTimeMillis();
        URLs.add("www.example.com");
        int i = 0;
        for (ListIterator iterator = URLs.listIterator(); i < URLs.size();) {
            i++;
            System.out.println("While loop");
            List<String> nextLevelURLs = processURL(URLs.get(iterator.nextIndex()));
            for (String URL : nextLevelURLs) {
                if (!URLs.contains(URL)) {
                    System.out.println(URL);
                    iterator.add(new String(URL));
                }
            }
            System.out.println(URLs.size());
        }
        System.out.println("Total products found: " + productCount);
        System.out.println("Total searched page: " + searchedPageCount);
        System.out.println("Total skipped page: " + skippedPageCount);
        long t2 = System.currentTimeMillis();
        System.out.println("Total time taken: " + (t2 - t1) / 60000);
    }

    public static List<String> processURL(String URL) {
        List<String> nextLevelURLs = new ArrayList<String>();
        try {
            searchedPageCount++;
            // System.out.println("Current URL: " + URL);
            Elements products = Jsoup.connect(URL).timeout(60000).get().select("div.product");
            for (Element product : products) {
                System.out.println(product.select(" a > h2").text());
                System.out.println(product.select(" a > h3").text());
                System.out.println(product.select(".product > a").attr("href"));
                System.out.println(product.select(".image a > img").attr("src"));
                System.out.println(product.select(".price").text());
                System.out.println();
                productCount++;
            }
            // System.out.println("Total products found until now: " + productCount);
            Elements links = Jsoup.connect(URL).timeout(60000).get().select("a[href]");
            for (Element link : links) {
                URL = link.attr("href");
                if (URL.startsWith("http://www.example.com/")) {
                    // System.out.println("URLs added.");
                    nextLevelURLs.add(URL);
                } else {
                    skippedPageCount++;
                    // System.out.println("URL skipped: " + URL);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return nextLevelURLs;
    }
}
Unfortunately, there is no way to start two threads at the same time.
Let me explain better: first of all, the sequence thread1.start(); thread2.start(); executes thread1.start() first and thread2.start() after it. That only means thread1 is scheduled before thread2, not that it actually starts running first. Each of these calls takes a fraction of a second, so the fact that they are in sequence cannot be noticed by a human observer.
Moreover, Java threads are scheduled, i.e. assigned to be eventually executed. Even if you have a multi-core CPU, you cannot be sure that 1) the threads run in parallel (other system processes may interfere) and 2) both threads start right after the start() method is called.
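To make the point concrete, here is a minimal, self-contained sketch (the class and task names are only illustrative): the two start() calls are issued back to back, but the order in which the output lines appear is decided by the scheduler, not by the order of the calls.
public class StartOrderDemo {
    public static void main(String[] args) {
        Runnable task1 = () -> System.out.println("task1 on " + Thread.currentThread().getName());
        Runnable task2 = () -> System.out.println("task2 on " + Thread.currentThread().getName());

        // start() only hands each thread to the scheduler; either task may print first
        new Thread(task1).start();
        new Thread(task2).start();
    }
}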
But you can run multiple threads in this way:
new Thread(thread1).start();
new Thread(thread2).start();
Basically, create a class that implements Runnable and put the code that deals with one URL in it. In your main class, for each URL, construct an instance with the information it needs (e.g. the URL) and then start it, as in the sketch below.
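A minimal sketch of that idea (UrlTask and the urls list are illustrative names; the run() body is where your parsing code would go):
public class UrlTask implements Runnable {
    private final String url;

    public UrlTask(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        // the code that deals with one URL (e.g. your Jsoup calls) goes here
        System.out.println("Processing " + url + " on " + Thread.currentThread().getName());
    }
}

// in main, one thread per URL:
for (String url : urls) {
    new Thread(new UrlTask(url)).start();
}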
There are plenty of sites that teach how to write multi-threaded Java.
First of all, the code you pasted is hard to work with because it is written as one monolithic procedure. You need to turn it into an object-oriented form and then extend Thread (or implement Runnable), like this:
public class URLProcessor extends Thread {
    private String url;

    public URLProcessor(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        // your business logic to parse the site with "this.url" here
    }
}
And then use the main entry point to start one thread per URL:
public static void main(String[] args) {
    List<String> allmyurls = null; // get multiple urls from somewhere
    for (String url : allmyurls) {
        URLProcessor p = new URLProcessor(url);
        p.start();
    }
}
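If you also need to wait until every URL has been processed (for example, to print totals afterwards), here is a minimal sketch building on the same URLProcessor (it assumes main declares throws InterruptedException):
List<URLProcessor> workers = new ArrayList<>();
for (String url : allmyurls) {
    URLProcessor p = new URLProcessor(url);
    p.start();
    workers.add(p);
}
for (URLProcessor p : workers) {
    p.join(); // blocks until that URL has been fully processed
}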
Related
I have different sources of data that I want to request in parallel (each request is an HTTP call and may be pretty time consuming), but I'm going to use only one response from these requests. So I kind of prioritize them: if the first response is invalid I'm going to check the second one, if it's also invalid I want to use the third, etc.
But I want to stop processing and return the result as soon as I receive the first correct response.
To simulate the problem I created the following code, where I'm trying to use Java parallel streams. The problem is that I receive the final result only after processing all requests.
public class ParallelExecution {

    private static Supplier<Optional<Integer>> testMethod(String strInt) {
        return () -> {
            Optional<Integer> result = Optional.empty();
            try {
                result = Optional.of(Integer.valueOf(strInt));
                System.out.printf("converted string %s to int %d\n",
                        strInt, result.orElse(null));
            } catch (NumberFormatException ex) {
                System.out.printf("CANNOT CONVERT %s to int\n", strInt);
            }
            try {
                int randomValue = result.orElse(10000);
                TimeUnit.MILLISECONDS.sleep(randomValue);
                System.out.printf("converted string %s to int %d in %d milliseconds\n",
                        strInt, result.orElse(null), randomValue);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            return result;
        };
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        System.out.println("Starting program: " + start.toString());

        List<Supplier<Optional<Integer>>> listOfFunctions = new ArrayList<>();
        for (String arg : args) {
            listOfFunctions.add(testMethod(arg));
        }

        Integer value = listOfFunctions.parallelStream()
                .map(function -> function.get())
                .filter(optValue -> optValue.isPresent())
                .map(val -> {
                    System.out.println("************** VAL: " + val);
                    return val;
                })
                .findFirst().orElse(null).get();

        Instant end = Instant.now();
        Long diff = end.toEpochMilli() - start.toEpochMilli();
        System.out.println("final value:" + value + ", worked during " + diff + "ms");
    }
}
So when I execute the program using the following command:
$java ParallelExecution dfafj 34 1341 4656 dfad 245df 5767
I want to get the result "34" as soon as possible (after around 34 milliseconds), but in fact I'm waiting for more than 10 seconds.
Could you help me find the most efficient solution for this problem?
ExecutorService#invokeAny looks like a good option.
List<Callable<Optional<Integer>>> tasks = listOfFunctions
        .stream()
        .<Callable<Optional<Integer>>>map(f -> f::get)
        .collect(Collectors.toList());

ExecutorService service = Executors.newCachedThreadPool();
// invokeAny throws InterruptedException and ExecutionException, so declare or handle them
Optional<Integer> value = service.invokeAny(tasks);
service.shutdown();
I converted your List<Supplier<Optional<Integer>>> into a List<Callable<Optional<Integer>>> to be able to pass it to invokeAny; you could also build Callables from the start. Then I created an ExecutorService and submitted the tasks.
The result of the first successfully completed task is returned as soon as it is available; the remaining tasks end up interrupted.
You also may want to look into CompletionService.
List<Callable<Optional<Integer>>> tasks = Arrays
        .stream(args)
        .<Callable<Optional<Integer>>>map(arg -> () -> testMethod(arg).get())
        .collect(Collectors.toList());

final ExecutorService underlyingService = Executors.newCachedThreadPool();
final ExecutorCompletionService<Optional<Integer>> service =
        new ExecutorCompletionService<>(underlyingService);
tasks.forEach(service::submit);

Optional<Integer> value = service.take().get();
underlyingService.shutdownNow();
You can use a queue to put your results in:
private static void testMethod(String strInt, BlockingQueue<Integer> queue) {
    // your code, but instead of returning anything:
    result.ifPresent(queue::add);
}
and then call it with
BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(); // the shared result queue (declaration added for completeness)
for (String s : args) {
    CompletableFuture.runAsync(() -> testMethod(s, queue));
}
Integer result = queue.take(); // blocks until the first valid result arrives; throws InterruptedException
Note that this will only handle the first result, as in your sample.
I have tried it using CompletableFutures and the anyOf method, which returns when any one of the futures completes. The key to stopping the other tasks is to provide your own executor service to the CompletableFutures and shut it down when required.
public static void main(String[] args) {
    Instant start = Instant.now();
    System.out.println("Starting program: " + start.toString());

    CompletableFuture<Optional<Integer>>[] completableFutures = new CompletableFuture[args.length];
    ExecutorService es = Executors.newFixedThreadPool(args.length, r -> {
        Thread t = new Thread(r);
        t.setDaemon(false);
        return t;
    });

    for (int i = 0; i < args.length; i++) {
        completableFutures[i] = CompletableFuture.supplyAsync(testMethod(args[i]), es);
    }

    CompletableFuture.anyOf(completableFutures)
            .thenAccept(res -> {
                System.out.println("Result - " + res + ", Time Taken : "
                        + (Instant.now().toEpochMilli() - start.toEpochMilli()));
                es.shutdownNow();
            });
}
PS: It will throw InterruptedExceptions, which you can catch and ignore rather than printing the stack trace. Also, your thread pool size should ideally be the same as the length of the args array.
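For instance, the sleep inside testMethod could swallow the interrupt instead of printing a stack trace (a sketch against the question's testMethod, not part of the original answer):
try {
    TimeUnit.MILLISECONDS.sleep(randomValue);
} catch (InterruptedException e) {
    // expected when shutdownNow() cancels the losing tasks; nothing to print
}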
I have some problems with my web application.
I can't paste all my code here (it's too big and I have difficulty reproducing the error), but this is my issue.
I have an object that contains a collection. I use a BlockingQueue to share this object between some threads; the second kind of thread is a servlet.
When I put my object in the queue, the collection is not empty and I can display its elements.
But when I take the same element out, the collection's size is not zero, yet it doesn't give me any elements.
NB: I don't have problems getting the object from the queue. My problem is with its attribute of type Collection, which shows strange behaviour.
A big part of the code:
public class HttpCollectionConsumer extends JCasAnnotator_ImplBase {

    private static BlockingQueue<Answer> queue = new LinkedBlockingQueue<>();
    private static boolean hasNext = true;

    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
    }

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        edu.cmu.lti.oaqa.type.input.Question q = TypeUtil.getQuestion(jcas);
        System.out.println("get Text " + q.getText());
        Question question = new Question(q.getId(), q.getText());
        Focus focus = TypeUtil.getFocus(jcas);
        Collection<LexicalAnswerType> types = TypeUtil.getLexicalAnswerTypes(jcas);

        Answer a = new Answer();
        a.setQuestion(question);
        a.setFocus(focus);
        a.setTypes(types);
        try {
            System.out.println("identifiant : ( " + a + " ) types " + a.getTypes().iterator().next());
            System.out.println("the answer type is not empty : " + a.getTypes().iterator().hasNext());
            synchronized (this) {
                queue.put(a);
                Thread.sleep(1000);
            }
            System.out.println("putting finished ");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static synchronized void put(Answer question) throws InterruptedException {
        System.out.println("new answer : " + question);
        queue.offer(question);
    }

    public static synchronized Answer take() throws InterruptedException {
        Answer a = queue.take();
        Thread.sleep(2000);
        System.out.println(" someone takes ( " + a + " ) , remaining: " + queue.size());
        System.out.println("the answer type is not empty : " + a.getTypes().iterator().hasNext());
        return a;
    }

    public static synchronized BlockingQueue<Answer> getQueue() {
        return queue;
    }

    public static synchronized void stop() {
        hasNext = false;
    }
}
Does someone know why?
You can only take an item from a Queue once: first in, first out. If you want to access an object more than once, or access it directly, you should use a List and expose that as a Collection.
Also, if you share your object between threads through the queue, only one of the threads will be able to access it, because once it has been processed by one thread it is no longer in the queue.
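A minimal sketch of that point (illustrative names, not the asker's classes):
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueTakeDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.put("only element");

        String first = queue.take();       // removes the element from the queue
        System.out.println(first);         // prints "only element"
        System.out.println(queue.size());  // prints 0; a second take() would block forever
    }
}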
I'm using Java to download the HTML contents of websites whose URLs are stored in a database. I'd like to put their HTML into the database, too.
I'm using Jsoup for this purpose:
public String downloadHTML(String byLink) {
    String htmlInPage = "";
    try {
        Document doc = Jsoup.connect(byLink).get();
        htmlInPage = doc.html();
    } catch (org.jsoup.UnsupportedMimeTypeException e) {
        // process this and some other exceptions
    }
    return htmlInPage;
}
I'd like to download websites concurrently and use this function:
public void downloadURL(int websiteId, String url,
                        String categoryName, ExecutorService executorService) {
    executorService.submit((Runnable) () -> {
        String htmlInPage = downloadHTML(url);
        System.out.println("Category: " + categoryName + " " + websiteId + " " + url);
        String insertQuery =
                "INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
        dbUtils.query(insertQuery, websiteId, htmlInPage);
    });
}
dbUtils is my class based on Apache Commons DbUtils. Details are here: http://pastebin.com/iAKXchbQ
And I'm using everything mentioned above in the following way (the List<Object[]> details are explained on pastebin, too):
public static void main(String[] args) {
    DbUtils dbUtils = new DbUtils("host", "db", "driver", "user", "pass");
    List<String> categoriesList =
            Arrays.asList("weapons", "planes", "cooking", "manga");
    String sql = "SELECT lw.id, lw.website_url, category_name " +
            "FROM list_of_websites AS lw JOIN list_of_categories AS lc " +
            "ON lw.category_id = lc.id " +
            "where category_name = ? ";
    ExecutorService executorService = Executors.newFixedThreadPool(10);

    for (String category : categoriesList) {
        List<Object[]> sitesInCategory = dbUtils.select(sql, category);
        for (Object[] entry : sitesInCategory) {
            int websiteId = (int) entry[0];
            String url = (String) entry[1];
            String categoryName = (String) entry[2];
            downloadURL(websiteId, url, categoryName, executorService);
        }
    }
    executorService.shutdown();
}
I'm not sure if this solution is correct, but it works. Now I want to modify the code so that it saves HTML not for all websites in my database, but only for a fixed number of them in each category.
For example, download and save the HTML of 50 websites from the "weapons" category, 50 from "planes", etc. I don't think it makes sense to do this purely in SQL: selecting 50 sites per category doesn't mean we will save all of them, because of possibly incorrect syntax and connection problems.
I've tried to create a separate class implementing Runnable with the fields counter and maxWebsitesPerCategory, but these variables aren't updated. Another idea was to create a field Map<String,Integer> sitesInCategory instead of the counter, put each category there as a key and increment its value until it reaches maxWebsitesPerCategory, but that didn't work either. Please help me!
P.S.: I'd also be grateful for any recommendations about my implementation of the concurrent downloading (I haven't worked with concurrency in Java before and this is my first attempt).
How about this?
for (String category : categoriesList) {
    dbUtils.select(sql, category).stream()
            .limit(50)
            .forEach(entry -> {
                int websiteId = (int) entry[0];
                String url = (String) entry[1];
                String categoryName = (String) entry[2];
                downloadURL(websiteId, url, categoryName, executorService);
            });
}
sitesInCategory has been replaced with a stream of at most 50 elements, then your code is run on each entry.
EDIT
In regard to the comments, I've gone ahead and restructured a bit; you can modify/implement the contents of the methods I've suggested.
public void werk(Queue<Object[]> q, ExecutorService executorService) {
    executorService.submit(() -> {
        try {
            Object[] o = q.remove();
            try {
                String html = downloadHTML(o); // this takes one of your object arrays and returns the text of an html page
                insertIntoDB(html);            // this is the code in the latter half of your downloadURL method
            } catch (/* narrow exception type indicating download failure */ Exception e) {
                werk(q, executorService);      // retry with the next entry in this category's queue
            }
        } catch (NoSuchElementException e) {
            // queue exhausted: nothing left to try in this category
        }
    });
}
^^^ This method does most of the work.
for (String category : categoriesList) {
    Queue<Object[]> q = new ConcurrentLinkedQueue<>(dbUtils.select(sql, category));
    IntStream.range(0, 50).forEach(i -> werk(q, executorService));
}
^^^ this is the for loop in your main
Now each category tries to download 50 pages; upon failure to download a page, it moves on and tries another one. In this way you will either download 50 pages or have attempted every page in the category.
I have multiple threads running in my thread pool. Each thread reads a huge file and returns the data from that file in a List.
The code looks like:
class Writer {
    ArrayList<Integer> finalListWhereDataWillBeWritten = new ArrayList<Integer>();

    // (sketch) submit one read task per query
    for (query q : allQueries) { // all the read queries to read files
        threadPool.submit(new GetDataFromFile(fileName, filePath));
    } // all the read queries have been submitted.
}
Now, I know that the following section of code has to occur somewhere in my code, but I don't know where to place it.
If I place it just after submit() inside the for loop, it won't add anything, because each file is huge and may not have finished processing yet.
synchronized (finalListWhereDataWillBeWritten) {
    // process the data obtained from a single file and add it to the target list
    finalListWhereDataWillBeWritten.addAll(dataFromSingleThread);
}
So can anyone please tell me where to place this chunk of code, and what else I need to make sure of so that critical-section problems do not occur?
class GetDataFromFile implements Runnable<List<Integer>> {
    private String fileName;
    private String filePath;

    public List<Integer> run() {
        // code for streaming the file fileName
        return dataObtainedFromThisFile;
    }
}
And do I need to use the wait()/notifyAll() methods in my code, given that I'm only reading data from files in parallel threads and placing it in a shared List?
Instead of reinventing the wheel you should simply implement Callable<List<Integer>> and submit it to the JDK's standard ExecutorService. Then, as the futures complete, you collect the results into the list.
final ExecutorService threadPool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final List<Future<List<Integer>>> futures = new ArrayList<>();

for (query q : allQueries) {
    futures.add(threadPool.submit(new GetDataFromFile(fileName, filePath)));
}
for (Future<List<Integer>> f : futures) {
    finalListWhereDataWillBeWritten.addAll(f.get()); // get() blocks until that file has been read
}
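For reference, a sketch of what GetDataFromFile could look like as a Callable (the file-reading body is left as a placeholder):
class GetDataFromFile implements Callable<List<Integer>> {
    private final String fileName;
    private final String filePath;

    GetDataFromFile(String fileName, String filePath) {
        this.fileName = fileName;
        this.filePath = filePath;
    }

    @Override
    public List<Integer> call() throws Exception {
        List<Integer> data = new ArrayList<>();
        // stream the file at filePath/fileName here and fill `data`
        return data;
    }
}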
And this is all assuming you are below Java 8. With Java 8 you would of course use a parallel stream:
final List<Integer> finalListWhereDataWillBeWritten =
        allQueries.parallelStream()
                  .flatMap(q -> getDataFromFile(q.fileName, q.filePath).stream())
                  .collect(toList());
UPDATE: Please consider the answer provided by Marko, which is far better.
If you want to ensure that your threads all complete before you work on your list, do the following:
import java.util.List;
import java.util.Vector;

public class ThreadWork {
    public static void main(String[] args) {
        int count = 5;
        Thread[] threads = new ListThread[count];
        List<String> masterList = new Vector<String>();

        for (int index = 0; index < count; index++) {
            threads[index] = new ListThread(masterList, "Thread " + (index + 1));
            threads[index].start();
        }

        while (isOperationRunning(threads)) {
            // do nothing
        }

        System.out.println("Done!! Print Your List ...");
        for (String item : masterList) {
            System.out.println("[" + item + "]");
        }
    }

    private static boolean isOperationRunning(Thread[] threads) {
        boolean running = false;
        for (Thread thread : threads) {
            if (thread.isAlive()) {
                running = true;
                break;
            }
        }
        return running;
    }
}

class ListThread extends Thread {
    private static String items[] = { "A", "B", "C", "D" };
    private List<String> list;
    private String name;

    public ListThread(List<String> masterList, String threadName) {
        list = masterList;
        name = threadName;
    }

    public void run() {
        for (int i = 0; i < items.length; ++i) {
            randomWait();
            String data = "Thread [" + name + "][" + items[i] + "]";
            System.out.println(data);
            list.add(data);
        }
    }

    private void randomWait() {
        try {
            Thread.sleep((long) (3000 * Math.random()));
        } catch (InterruptedException x) {
        }
    }
}
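A note on the busy-wait loop above: it spins and burns CPU until the threads finish. An alternative, sketched here as a suggestion rather than part of the original answer, is to join() each thread instead (main would then need to declare or catch InterruptedException):
for (Thread thread : threads) {
    thread.join(); // blocks until this thread has finished
}
System.out.println("Done!! Print Your List ...");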
I am working on a project in which I have three datacenters - DC1, DC2 and DC3.
In DC1 I have two machines (machineA and machineB), in DC2 I have two machines (machineC and machineD), and in DC3 I have two machines again (machineE and machineF).
Each machine's URL in each datacenter looks like this, and it returns a string as the response:
http://machineName:8080/textbeat
For DC1-
http://machineA:8080/textbeat
http://machineB:8080/textbeat
For DC2-
http://machineC:8080/textbeat
http://machineD:8080/textbeat
For DC3-
http://machineE:8080/textbeat
http://machineF:8080/textbeat
Here is the response string I generally see after hitting the URL of any particular machine:
state: READY server_uptime: 12462125 data_syncs: 29
Problem statement:
Now I need to iterate over all the machines in each datacenter, hit the URL, and then extract data_syncs from the response. This has to be done every minute.
If machineA's data_syncs stays zero continuously for a period of 5 minutes, then I would like to print DC1 and machineA; similarly for machineB and the other datacenters.
The logic I was thinking of:
Ping each individual machine in each datacenter and extract the data_syncs value; if it is zero, increment that machine's counter by one.
Then try again after one minute; if the value is still zero, increment the same counter again by one.
If the counter reaches 5 (that is, 5 minutes) and the value was zero continuously, then add this machine and its datacenter name to my map.
But suppose the value was zero for three consecutive tries and became non-zero on the fourth try; then the counter for that machine gets reset to zero and the process starts over for that machine.
Below is my map, in which I put a datacenter and its machines if they have met the above condition:
final Map<String, List<String>> holder = new LinkedHashMap<String, List<String>>();
Here the key is the datacenter name and the value is the list of machines in that datacenter that have met the condition.
Below is the code I came up with to solve the above problem, but it doesn't work the way it is supposed to. I believe my counter is shared across all the machines, which is not what I want.
public class MachineTest {

    private static int counter = 0;
    private final static ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    public static void main(String[] args) {
        final ScheduledFuture<?> taskUtility = scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    generalUtility();
                } catch (Exception ex) {
                    // log an exception
                }
            }
        }, 0, 1L, TimeUnit.MINUTES);
    }

    protected static void generalUtility() {
        try {
            final Map<String, List<String>> holder = new LinkedHashMap<String, List<String>>();
            List<String> datacenters = Arrays.asList("DC1", "DC2", "DC3");
            for (String datacenter : datacenters) {
                LinkedList<String> machines = new LinkedList<String>();
                List<String> childrenInEachDatacenter = getMachinesInEachDatacenter(datacenter);
                for (String hosts : childrenInEachDatacenter) {
                    String host_name = hosts;
                    String url = "http://" + host_name + ":8080/textbeat";
                    // execute the url and populate the MachineMetrics object
                    MachineMetrics metrics = GeneralUtilities.getMetricsOfMachine(host_name, url);
                    if (metrics.getDataSyncs().equalsIgnoreCase("0")) {
                        counter++;
                        if (counter == 5) {
                            machines.add(hosts);
                        }
                    }
                }
                if (!machines.isEmpty()) {
                    holder.put(datacenter, machines);
                }
            }
            if (!holder.isEmpty()) {
                // log the datacenter and its machines as our criteria is met
                System.out.println(holder);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Below method will return a list of machines given the name of a datacenter
    private static List<String> getMachinesInEachDatacenter(String datacenter) {
        // this will return a list of machines for a given datacenter
    }
}
And here is my MachineMetrics class -
public class MachineMetrics {
    private String machineName;
    private String dataSyncs;

    // getters and setters
}
Is this possible to do using a ScheduledExecutorService, since this is not a one-time process? It has to be done repeatedly.
Basically, for each machine, if data_syncs is 0 continuously for a period of 5 minutes, then I need to log that datacenter and its machines.
public class Machine {

    private String dataCenter;
    private String machineName;
    private String hostname;
    private int zeroCount = 0;

    // getters and setters, except for zeroCount
    // constructor with dataCenter, machineName and hostname as args

    public boolean isEligibleForLogging(String dataSyncs) {
        if (dataSyncs.equals("0")) {
            zeroCount++;
        } else {
            zeroCount = 0;
        }
        if (zeroCount >= 5) { // five consecutive zero readings, i.e. five minutes
            zeroCount = 0;
            return true;
        }
        return false;
    }
}
static List<Machine> machines = new ArrayList<Machine>();

static {
    Machine machine1 = new Machine("DC1", "name1", "hostname1");
    machines.add(machine1);
    // repeat the above two lines for each machine
}
protected static void generalUtility() {
    try {
        for (Machine machine : machines) {
            String host_name = machine.getHostname();
            String url = "http://" + host_name + ":8080/textbeat";
            String dataSyncs = null; // execute the url and extract data_syncs from the response
            if (machine.isEligibleForLogging(dataSyncs)) {
                System.out.println(machine.getMachineName() + ... + machine.getDataCenter() + ... + dataSyncs......);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
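To run this every minute, here is a minimal sketch along the lines of the ScheduledExecutorService setup already in the question (same period and initial delay); because each Machine now carries its own zeroCount, the counters are no longer shared between machines:
public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    // one thread is enough here because a single run pings each machine in turn
    scheduler.scheduleAtFixedRate(() -> {
        try {
            generalUtility();
        } catch (Exception ex) {
            // log the exception; if it escapes, the scheduled task stops running
        }
    }, 0, 1, TimeUnit.MINUTES);
}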