I am writing a link collector that gathers links from a specified number of pages. To make it more efficient I am using a fixed-size thread pool. Because I am really a newbie in the multithreading area, I am having trouble fixing some issues. My idea is that every thread does the same thing: connect to a page and collect every URL, then add those URLs to a queue for the next thread.
But this doesn't work. Right now the program first analyzes the base URL and adds the URLs found on it. But originally I wanted to seed the work with only LinksToVisit.add(baseurl) and run it with the thread pool; in that version the main loop keeps polling the queue while the threads have added nothing new, so the head of the queue is null, and I don't know why :(
I tried an ArrayBlockingQueue as well, with no success. Pre-analyzing the base URL is not a good solution either, because when the base URL contains, for example, only one link, the collector doesn't follow it. So I think I am going about it the wrong way or missing something important. I am using Jsoup as the HTML parser. Thanks for any answers.
Source (unnecessary methods removed):
package collector;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;
import java.util.Map.Entry;
import java.util.concurrent.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Collector {
private String baseurl;
private int links;
private int cvlinks;
private double time;
private int chcount;
private static final int NTHREADS = Runtime.getRuntime().availableProcessors()*2;
private ConcurrentLinkedQueue<String> LinksToVisit = new ConcurrentLinkedQueue<String>();
private ConcurrentSkipListMap<String, Double> SortedCharMap = new ConcurrentSkipListMap<String, Double>();
private ConcurrentHashMap<String, Double> CharMap = new ConcurrentHashMap<String, Double>();
public Collector(String url, int links) {
this.baseurl = url;
this.links = links;
this.cvlinks = 0;
this.chcount = 0;
try {
Document html = Jsoup.connect(url).get();
if(cvlinks != links){
Elements collectedLinks = html.select("a[href]");
for(Element link:collectedLinks){
if(cvlinks == links) break;
else{
String current = link.attr("abs:href");
if(!current.equals(url) && current.startsWith(baseurl)&& !current.contains("#")){
LinksToVisit.add(current);
cvlinks++;
}
}
}
}
AnalyzeDocument(html, url);
} catch (IOException e) {
e.printStackTrace();
}
CollectFromWeb();
}
private void AnalyzeDocument(Document doc,String url){
String text = doc.body().text().toLowerCase().replaceAll("[^a-z]", "").trim();
chcount += text.length();
String chars[] = text.split("");
CharCount(chars);
}
private void CharCount(String[] chars) {
for(int i = 1; i < chars.length; i++) {
if(!CharMap.containsKey(chars[i]))
CharMap.put(chars[i],1.0);
else
CharMap.put(chars[i], CharMap.get(chars[i]).doubleValue()+1);
}
}
private void CollectFromWeb(){
long startTime = System.nanoTime();
ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
CollectorThread[] workers = new CollectorThread[this.links];
for (int i = 0; i < this.links; i++) {
if(!LinksToVisit.isEmpty()){
int j = i+1;
System.out.println("Collecting from "+LinksToVisit.peek()+" ["+j+"/"+links+"]");
//Runnable worker = new CollectorThread(LinksToVisit.poll());
workers[i] = new CollectorThread(LinksToVisit.poll());
executor.execute(workers[i]);
}
else break;
}
executor.shutdown();
while (!executor.isTerminated()) {}
SortedCharMap.putAll(CharMap);
this.time =(System.nanoTime() - startTime)*10E-10;
}
class CollectorThread implements Runnable{
private Document html;
private String url;
public CollectorThread(String url){
this.url = url;
try {
this.html = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void run() {
if(cvlinks != links){
Elements collectedLinks = html.select("a[href]");
for(Element link:collectedLinks){
if(cvlinks == links) break;
else{
String current = link.attr("abs:href");
if(!current.equals(url) && current.startsWith(baseurl)&& !current.contains("#")){
LinksToVisit.add(current);
cvlinks++;
}
}
}
}
AnalyzeDocument(html, url);
}
}
}
Instead of using the LinksToVisit queue, just call executor.execute(new CollectorThread(current)) directly from CollectorThread.run(). The ExecutorService has its own internal queue of tasks which it will run as threads become available.
The other problem here is that calling shutdown() after adding the first set of URLs to the queue will prevent new tasks from being added to the executor. You can fix this by instead making the executor shut down when it has emptied its queue:
class Queue extends ThreadPoolExecutor {
Queue(int nThreads) {
super(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>());
}
@Override
protected void afterExecute(Runnable r, Throwable t) {
if(getQueue().isEmpty()) {
shutdown();
}
}
}
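Combining the two fixes, CollectorThread would fetch the page inside run() rather than in its constructor, and resubmit new links straight to the executor. A minimal sketch, assuming Collector holds a field such as Queue executor = new Queue(NTHREADS) that is seeded with executor.execute(new CollectorThread(baseurl)); the cvlinks counting is omitted here:
class CollectorThread implements Runnable {
    private final String url;
    CollectorThread(String url) { this.url = url; }
    @Override
    public void run() {
        try {
            // Fetch on the worker thread, not in the constructor,
            // so the submitting thread is never blocked by network I/O.
            Document html = Jsoup.connect(url).get();
            for (Element link : html.select("a[href]")) {
                String current = link.attr("abs:href");
                if (!current.equals(url) && current.startsWith(baseurl) && !current.contains("#")) {
                    executor.execute(new CollectorThread(current)); // resubmit directly
                }
            }
            AnalyzeDocument(html, url);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}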
Is there any nice way to print the progress in a Kafka Streams app? I feel that my app is falling behind, and I want a nice way to show the progress of processing the events in my app.
Out of the box, not within the Streams API.
You're more than welcome to import methods that ConsumerGroupCommand.scala uses to get the group lag and calculate / print from there.
Or you can externally install a tool like Burrow or Remora, which have REST APIs for accessing lag information.
I wrote the following class to help me print the lag/progress easily:
package util;
import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListConsumerGroupOffsetsResult;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;
import java.util.stream.Collectors;
@Slf4j
public class LagLogger implements AutoCloseable {
private ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(1);
private String topic;
private String consumerGroupName;
private int logDelayInMilliSeconds;
private Properties kafkaStreamsProperties;
private boolean closed;
private AdminClient adminClient;
public LagLogger(String topic, String consumerGroupName, Properties kafkaStreamProperties, int logDelayInMilliSeconds) {
this.topic = topic;
this.kafkaStreamsProperties = kafkaStreamProperties;
this.logDelayInMilliSeconds = logDelayInMilliSeconds;
this.consumerGroupName = consumerGroupName;
adminClient = AdminClient.create(LagLogger.this.kafkaStreamsProperties);
}
public class LagVisualizerTask implements AutoCloseable, Runnable {
public LagVisualizerTask() {
}
public void run() {
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(LagLogger.this.consumerGroupName);
// Current offsets.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = null;
try {
topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
// all topic partitions.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// list of end offsets for each partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
.collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append(topic+": ");
for (var entry : topicPartitionOffsetAndMetadataMap.entrySet()) {
if (entry.getKey().topic().equals(LagLogger.this.topic)) {
long current_offset = entry.getValue().offset();
long end_offset = 0;
try {
end_offset = listOffsetsResult.partitionResult(entry.getKey()).get().offset();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
stringBuilder.append(current_offset);
stringBuilder.append(" --> ");
stringBuilder.append(end_offset);
stringBuilder.append(" ("+String.format("%.2f", ((double)current_offset/end_offset)*100) +"%)");
stringBuilder.append(" / ");
}
}
log.info(stringBuilder.toString());
}
public void close() {
closed = true;
}
}
public LagVisualizerTask startNewLagVisualizerTask() {
LagVisualizerTask lagVisualizerTask = new LagVisualizerTask();
scheduledExecutorService.scheduleWithFixedDelay(lagVisualizerTask,0, LagLogger.this.logDelayInMilliSeconds, TimeUnit.MILLISECONDS);
return lagVisualizerTask;
}
public void close() {
if (scheduledExecutorService != null) {
scheduledExecutorService.shutdownNow();
scheduledExecutorService = null;
}
}
}
Which can be used as follows:
LagLogger lagVisualizer = new LagLogger(INPUT_TOPIC_NAME,APPLICATION_ID,configuration.getKafkaStreamsProperties(),DELY_BETWEEN_LOGS);
lagVisualizer.startNewLagVisualizerTask();
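Both LagLogger and LagVisualizerTask are AutoCloseable, so it's worth stopping the logger when the app shuts down:
// Stops the periodic logging task (note that the AdminClient created in
// the constructor above is never closed by this class).
lagVisualizer.close();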
This is my first post on Stack Overflow, so please go easy on me! I made this web scraper as a final project in my CS course last semester. I was able to pass with it; however, it always bothered me how slowly my program ran compared to others in the class. My program took 11 hours to gather 10,000 emails, whereas my friend's took 5 minutes. I couldn't figure out why! I even tried seeing what's wrong with a Java profiler, and it just showed me that my threads are waiting. I don't know how to fix that or why it only affected me. I really want to learn how to properly use threads, so I'm asking you guys.
My CPU is an i7 7700K, so there shouldn't be a problem there, and I have gigabit internet. So it's definitely the way I coded my program. Here is the main class:
import java.net.MalformedURLException;
import java.net.URL;
import java.sql.*;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Main {
static int EMAIL_MAX_COUNT = 10_000;
static int MAX_VISITS = 5;
static final Set<String> emails = Collections.synchronizedSet(new HashSet<>(10_000));
static Set<String> linksToVisit = Collections.synchronizedSet(new HashSet<>(20_000));
static Set<String> linksFilter = Collections.synchronizedSet(new HashSet<>(20_000));
static Set<String> linksVisited = Collections.synchronizedSet(new HashSet<>(10_000));
static Map<String, Set<String>> maxLinksVisited = Collections.synchronizedMap(new HashMap<>());
public static void main(String[] args) {
ExecutorService pool = Executors.newFixedThreadPool(200);
linksToVisit.add("https://www.touro.edu/");//starts with touro.edu
while (!linksToVisit.isEmpty() && emails.size() <= EMAIL_MAX_COUNT) {
String link;
synchronized (linksToVisit) {
link = linksToVisit.stream().findFirst().get();
System.out.println(link);
linksToVisit.remove(link);
}
if (hasTooManyVisits(link)) {
link = "";
}
if (!(link.equals(""))) {
linksVisited.add(link);
pool.execute(new WebScraper(link));
}
}
pool.shutdownNow();
}
private static boolean hasTooManyVisits(String link) {
try {
URL currentURL = new URL(link);
String host = currentURL.getHost();
int startIndex = 0;
int nextIndex = host.indexOf('.');
int lastIndex = host.lastIndexOf('.');
while (nextIndex < lastIndex) {
startIndex = nextIndex + 1;
nextIndex = host.indexOf('.', startIndex);
}
synchronized (maxLinksVisited) {
if (startIndex > 0) {
Set<String> tempSet = maxLinksVisited.get(host.substring(startIndex));
if (tempSet == null) {
tempSet = new HashSet<>();
maxLinksVisited.put(host.substring(startIndex), tempSet);
}
tempSet.add(link);
maxLinksVisited.put(host.substring(startIndex), tempSet);
if (maxLinksVisited.get(host.substring(startIndex)).size() >= MAX_VISITS) {
return true;
}
} else {
Set<String> tempSet = maxLinksVisited.get(host);
if (tempSet == null) {
tempSet = new HashSet<>();
maxLinksVisited.put(host, tempSet);
}
tempSet.add(link);
maxLinksVisited.put(host, tempSet);
if (maxLinksVisited.get(host).size() >= MAX_VISITS) {
return true;
}
}
}
} catch (MalformedURLException e) {
return false;
}
return false;
}
}
All it really does is set up the initial part of the program and create the threads. Here is the WebScraper class:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WebScraper implements Runnable {
String currentUrl;
String[] randomFileExtensions = {"png", "jpg", "gif", "pdf", "mp3", "css", "mp4", "mov", "7z", "zip", "mkv", "avi", "jpeg"};//common files
WebScraper(String url) {
this.currentUrl = url;
run();//for some reason it's needed
}
@Override
public void run() {
try {
try { // double try block so the program doesn't stop on errors
Document doc = Jsoup.connect(currentUrl).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
.referrer("http://www.google.com").get();
Pattern emailPattern = Pattern.compile("[\\w\\d._]+@[\\w\\d]+\\.[\\w]{2,3}");
Matcher emailMatcher = emailPattern.matcher(doc.toString());
while (emailMatcher.find()) {//find and add emails
String email = emailMatcher.group();
if (Arrays.stream(randomFileExtensions).parallel().noneMatch(email::contains)) {//filter for any files that are not emails
Main.emails.add(emailMatcher.group());
}
}
synchronized (Main.linksFilter) {
Main.linksFilter.addAll(doc.select("a[href]").eachAttr("abs:href"));//find and add all links on the page
for (String randomFileExtension : randomFileExtensions) {
Main.linksFilter.removeIf(s -> s.contains(randomFileExtension));//filter links for any files
}
synchronized (Main.linksToVisit) {
Main.linksFilter.removeAll(Main.linksVisited);
Main.linksToVisit.addAll(Main.linksFilter);
Main.linksFilter.clear();
}
}
} catch (IOException e) {
e.printStackTrace();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
One thing that bothers me is that the program will stop after touro.edu if I take the run() call out of the constructor. I don't know why; the program should call it automatically...
In conclusion, I just want to know what I did wrong. Please help me understand, and thank you in advance!
First, you shouldn't call pool.shutdownNow(); I suggest calling pool.shutdown() instead.
They are different; you can read the Java documentation.
Also, when you call java.util.Collections#synchronizedSet(java.util.Set<T>), the returned Set is already thread-safe, so you don't need to add synchronized blocks.
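For reference, the usual graceful-shutdown idiom looks like this (a minimal sketch; the 60-second timeout is an arbitrary choice, and it assumes java.util.concurrent.TimeUnit is imported):
pool.shutdown(); // stop accepting new tasks; already-queued tasks still run
try {
    if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
        pool.shutdownNow(); // interrupt still-running tasks as a last resort
    }
} catch (InterruptedException e) {
    pool.shutdownNow();
    Thread.currentThread().interrupt(); // preserve the interrupt status
}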
I can't execute the Runnables (delayed tasks in a queue) that are returned in a list of Runnables after invoking shutdownNow() on a ScheduledThreadPoolExecutor object.
I've tried a few approaches: you can get the list size, get one of the Runnable objects itself, and invoke the isDone() query, but I haven't managed to run them.
CAN they be executed, and HOW, if possible?
See the code below. Thank you.
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadPoolExecutor;
public class ExecuteExisitingDelayedTasksAfterShutdownPolicy1 {
private static int count = 0;
private static class Task implements Runnable {
private String name;
public Task(String name) {
this.name = name;
count++;
}
@Override
public void run() {
try {
Thread.sleep(500);
} catch (InterruptedException e) {
return;
}
System.out.printf("\n%s: " + getName(), Thread.currentThread().getName());
}
public String getName() {
return name;
}
}
public static void main(String[] args) throws InterruptedException, ExecutionException {
ScheduledThreadPoolExecutor stpe = new ScheduledThreadPoolExecutor(10, new ThreadPoolExecutor.DiscardPolicy());
stpe.setExecuteExistingDelayedTasksAfterShutdownPolicy(true);
List<Runnable> queue = null;
for (int i = 0; i < 100; i++) {
stpe.execute(new Task("Task " + count));
if (i == 50) {
Thread.sleep(1000);
queue = stpe.shutdownNow();
System.out.print("\nQueue SIZE: " + queue.size());
}
}
Thread.sleep(3000);
System.out.print("\n" + queue.get(0));
@SuppressWarnings("unchecked")
FutureTask<Task> ftask = (FutureTask<Task>) queue.get(0);
ExecutorService ses = Executors.newSingleThreadExecutor();
/*
 * None of the following works: tasks returned in the queue are likely
 * to be unrunnable.
 */
ftask.get().run();
System.out.println(ftask.get().name);
ses.execute(ftask);
queue.get(0).run();
}
}
I have a program that should make really fast HTTP requests. The requests should be made asynchronously so that they don't block the main thread.
So I have created a queue which is watched by 10 separate threads that make HTTP requests. If something is inserted into the queue, the first thread that gets the data makes the request and processes the result.
The queue gets filled with thousands of items, so multithreading is really necessary to get the responses as fast as possible.
Since I have a lot of code, I'll give a short example.
main class
package fasthttp;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
public class FastHTTP {
public static void main(String[] args) {
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
LinkedBlockingQueue queue = new LinkedBlockingQueue();
queue.add("http://www.lennar.eu/ip.php");//for example
executor.execute(new HTTPworker(queue));
}
}
}
HTTPworker class
package fasthttp;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.LinkedBlockingQueue;
public class HTTPworker implements Runnable {
private final LinkedBlockingQueue queue;
public HTTPworker(LinkedBlockingQueue queue) {
this.queue = queue;
}
private String getResponse(String url) throws IOException {
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
StringBuilder response;
try (BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()))) {
String inputLine;
response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
}
return response.toString();
}
@Override
public void run() {
while (true) {
try {
String data = (String) queue.take();
String response = getResponse(data);
//Do something with response
System.out.println(response);
} catch (InterruptedException | IOException ex) {
//Handle exception
}
}
}
}
Is there a better or faster way to make thousands of HTTP requests and process the responses asynchronously? Speed and performance are what I'm after.
Answering my own question: I tried Apache's asynchronous HTTP client, but after a while I started using Ning's async client and I am happy with it.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import org.apache.http.client.methods.HttpGet;
import java.util.Iterator;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
public class RestService {
private final static Executor executor = Executors.newCachedThreadPool();
private final static CloseableHttpClient closeableHttpClient = HttpClientBuilder.create().build();
public static String sendSyncGet(final String url) {
return sendAsyncGet(List.of(url)).get(0);
}
public static List<String> sendAsyncGet(final List<String> urls){
List<GetRequestTask> tasks = urls.stream().map(url -> new GetRequestTask(url, executor)).collect(Collectors.toList());
List<String> responses = new ArrayList<>();
while(!tasks.isEmpty()) {
for(Iterator<GetRequestTask> it = tasks.iterator(); it.hasNext();) {
final GetRequestTask task = it.next();
if(task.isDone()) {
responses.add(task.getResponse());
it.remove();
}
}
//if(!tasks.isEmpty()) Thread.sleep(100); //avoid tight loop in "main" thread
}
return responses;
}
private static class GetRequestTask {
private final FutureTask<String> task;
public GetRequestTask(String url, Executor executor) {
GetRequestWork work = new GetRequestWork(url);
this.task = new FutureTask<>(work);
executor.execute(this.task);
}
public boolean isDone() {
return this.task.isDone();
}
public String getResponse() {
try {
return this.task.get();
} catch(Exception e) {
throw new RuntimeException(e);
}
}
}
private static class GetRequestWork implements Callable<String> {
private final String url;
public GetRequestWork(String url) {
this.url = url;
}
public String getUrl() {
return this.url;
}
public String call() throws Exception {
return closeableHttpClient.execute(new HttpGet(getUrl()), new BasicResponseHandler());
}
}
}
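A hypothetical usage sketch (the example.com URL is a placeholder; note that sendAsyncGet collects responses in completion order, not submission order):
// Fire several GETs concurrently and print the response bodies.
List<String> bodies = RestService.sendAsyncGet(List.of(
        "http://www.lennar.eu/ip.php",
        "http://example.com"));
bodies.forEach(System.out::println);
// Or a single blocking call:
String body = RestService.sendSyncGet("http://example.com");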
I'm a new French user on Stack Overflow and I have a problem ^^
I use the Jsoup HTML parser to parse an HTML page. That part works, but I can't parse more than one URL at a time.
This is my code:
First, the class for parsing a web page:
package test2;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public final class Utils {
public static Map<String, String> parse(String url){
Map<String, String> out = new HashMap<String, String>();
try
{
Document doc = Jsoup.connect(url).get();
doc.select("img").remove();
Elements denomination = doc.select(".AmmDenomination");
Elements composition = doc.select(".AmmComposition");
Elements corptexte = doc.select(".AmmCorpTexte");
for(int i = 0; i < denomination.size(); i++)
{
out.put("denomination" + i, denomination.get(i).text());
}
for(int i = 0; i < composition.size(); i++)
{
out.put("composition" + i, composition.get(i).text());
}
for(int i = 0; i < corptexte.size(); i++)
{
out.put("corptexte" + i, corptexte.get(i).text());
System.out.println(corptexte.get(i));
}
} catch(IOException e){
e.printStackTrace();
}
return out;
}//End of parse method
public static void excelizer(int fileId, Map<String, String> values){
try
{
FileOutputStream out = new FileOutputStream("C:/Documents and Settings/c.bon/git/clinsearch/drugs/src/main/resources/META-INF/test/fichier2.xls" );
Workbook wb = new HSSFWorkbook();
Sheet mySheet = wb.createSheet();
Row row1 = mySheet.createRow(0);
Row row2 = mySheet.createRow(1);
String entete[] = {"CIS", "Denomination", "Composition", "Form pharma", "Indication therapeutiques", "Posologie", "Contre indication", "Mise en garde",
"Interraction", "Effet indesirable", "Surdosage", "Pharmacodinamie", "Liste excipients", "Incompatibilité", "Duree conservation",
"Conservation", "Emballage", "Utilisation Manipulation", "TitulaireAMM"};
for (int i = 0; i < entete.length; i++)
{
row1.createCell(i).setCellValue(entete[i]);
}
Set<String> set = values.keySet();
int rowIndexDenom = 1;
int rowIndexCompo = 1;
for(String key : set)
{
if(key.contains("denomination"))
{
mySheet.createRow(1).createCell(1).setCellValue(values.get(key));
rowIndexDenom++;
}
else if(key.contains("composition"))
{
row2.createCell(2).setCellValue(values.get(key));
rowIndexDenom++;
}
}
wb.write(out);
out.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
second class
package test2;
public final class Task extends Thread {
private static int fileId = 0;
private int id;
private String url;
public Task(String url)
{
this.url = url;
id = fileId;
fileId++;
}
@Override
public void run()
{
Utils.excelizer(id, Utils.parse(url));
}
}
the main class (entry point)
package test2;
import java.util.ArrayList;
public class Main {
public static void main(String[] args)
{
ArrayList<String> urls = new ArrayList<String>();
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=61266250&typedoc=R");
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=66207341&typedoc=R");
for(String url : urls)
{
new Task(url).run();
}
}
}
When the data is copied to my Excel file, the second URL doesn't work.
Can you help me solve my problem, please?
Thanks
I think it's because your main() exits before your second thread has a chance to do its job. You should wait for all spawned threads to complete using Thread.join(). Or, better yet, create an ExecutorService and use awaitTermination(...) to block until all URLs are parsed.
EDIT: See some examples here: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html
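For instance, a minimal sketch of the ExecutorService approach for the Task class above (the pool size and timeout are arbitrary choices, it assumes the java.util.concurrent imports, and main would need to declare throws InterruptedException):
ExecutorService pool = Executors.newFixedThreadPool(2);
for (String url : urls) {
    pool.execute(new Task(url)); // Task extends Thread, so it is a Runnable
}
pool.shutdown(); // no new tasks accepted; already-submitted ones still run
pool.awaitTermination(5, TimeUnit.MINUTES); // block until all URLs are parsed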