I have created a password cracker in Java that cracks passwords from a text file list. It generates a dictionary of pairs: the hashed word and the original word. I am looking for a way to speed up the program: it should still read all of the words from the file, but then use multithreading to generate the hashes. How can I break up the list of words into four separate partitions that multiple threads can then operate on in the createDictionary method? Here is what I have so far:
public class Main {
private static final String FNAME = "words.txt";
private final static String PASSWDFNAME = "passwd.txt";
private static Map<String, String> dictionary = new HashMap<>();
public static void main(String[] args) {
// Create dictionary of plain / hashed passwords from list of words
System.out.println("Generating dictionary ...");
long start = System.currentTimeMillis();
createDictionary(FNAME);
System.out.println("Generated " + dictionary.size() + " hashed passwords in dictionary");
long stop = System.currentTimeMillis();
System.out.println("Elapsed time: " + (stop - start) + " milliseconds");
// Read password file, hash plaintext passwords and lookup in dictionary
System.out.println("\nCracking password file ...");
start = System.currentTimeMillis();
crackPasswords(PASSWDFNAME);
stop = System.currentTimeMillis();
System.out.println("Elapsed time: " + (stop - start) + " milliseconds");
}
private static void createDictionary(String fname) {
// Read in list of words
List<String> words = new ArrayList<>();
try (Scanner input = new Scanner(new File(fname));) {
while (input.hasNext()) {
String s = input.nextLine();
if (s != null && s.length() >= 4) {
words.add(s);
}
}
} catch (FileNotFoundException e) {
System.out.println("File " + FNAME + " not found");
e.printStackTrace();
System.exit(-1);
}
// Generate dictionary from word list
for (String word : words) {
generateHashes(word);
}
}
private static void crackPasswords(String fname) {
File pfile = new File(fname);
try (Scanner input = new Scanner(pfile);) {
while (input.hasNext()) {
String s = input.nextLine();
String[] t = s.split(",");
String userid = t[0];
String hashedPassword = t[1];
String password = dictionary.get(hashedPassword);
if (password != null) {
System.out.println("CRACKED - user: "+userid+" has password: "+password);
}
}
} catch (FileNotFoundException ex) {
System.out.println(ex.getMessage());
ex.printStackTrace();
System.exit(-1);
}
}
private static void generateHashes(String word) {
// Convert word to lower case, generate hash, store dictionary entry
String s = word.toLowerCase();
String hashedStr = HashUtils.hashPassword(s);
dictionary.put(hashedStr, s);
// Capitalize word, generate hash, store dictionary entry
s = s.substring(0, 1).toUpperCase() + s.substring(1);
hashedStr = HashUtils.hashPassword(s);
dictionary.put(hashedStr, s);
}
}
It's very simple, check this out:
public static void main(String[] args) {
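// Fill this list with the words read from the file before starting the threads
List<String> words = new ArrayList<>();
List<Thread> threads = new ArrayList<>();
int numThreads = 4;
int threadsSlice = words.size() / numThreads;
for(int i = 0; i < numThreads; i++) {
// The last thread also takes the remainder when the size is not divisible by numThreads
int right = (i == numThreads - 1) ? words.size() : (i + 1) * threadsSlice;
Thread t = new Thread(new WorkerThread(i * threadsSlice, right, words));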
t.start();
threads.add(t);
}
threads.forEach(t -> {
try {
t.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
});
}
static class WorkerThread implements Runnable {
private final int left;
private final int right;
private final List<String> words;
public WorkerThread(int left, int right, List<String> words) {
this.left = left;
this.right = right;
this.words = words;
}
@Override
public void run() {
for (int i = left; i < right; i++) {
generateHashes(words.get(i));
}
}
}
This code creates 4 threads. Each one scans one partition of your list and applies the generateHashes method to every word in that partition.
You can also keep the word list in a shared field (for example a static field, as you already do with dictionary) instead of passing it to each thread through the constructor.
Also make sure the dictionary that generateHashes writes to is a ConcurrentMap (for example a ConcurrentHashMap), because a plain HashMap is not safe for concurrent writes.
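The same partitioning idea can also be sketched with an ExecutorService. This is only an illustration under assumptions: the class name PartitionedDictionary and the methods build and hash are hypothetical, and hash is a SHA-256 stand-in for HashUtils.hashPassword, which the question does not show. The last slice absorbs the remainder, and all results go into a ConcurrentHashMap.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedDictionary {

    // Stand-in for HashUtils.hashPassword (assumption: the real hash is not shown).
    static String hash(String s) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, bytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static Map<String, String> build(List<String> words, int numThreads) throws Exception {
        Map<String, String> dictionary = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            int slice = words.size() / numThreads;
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < numThreads; i++) {
                int from = i * slice;
                // The last partition runs to the end so the remainder is not dropped.
                int to = (i == numThreads - 1) ? words.size() : from + slice;
                List<String> part = words.subList(from, to);
                pending.add(pool.submit(() -> {
                    for (String word : part) {
                        String lower = word.toLowerCase();
                        dictionary.put(hash(lower), lower);
                        String capitalized = lower.substring(0, 1).toUpperCase() + lower.substring(1);
                        dictionary.put(hash(capitalized), capitalized);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the partition and surface any exception it threw
            }
        } finally {
            pool.shutdown();
        }
        return dictionary;
    }

    public static void main(String[] args) throws Exception {
        List<String> words = List.of("alpha", "bravo", "charlie", "delta", "echo1");
        System.out.println(build(words, 4).size());
    }
}
```

Waiting on each Future plays the role of t.join() in the thread-based version, with the difference that an exception inside a worker is rethrown to the caller instead of being lost.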
I am trying to make a program that takes input from the user, searches through a 2D array, and prints the matching entry if the input matches data in the array. So, basically, if the user types in VA, it should output Virginia. I am reading data from a binary file that has 2 rows of data. The 1st row contains the 2-letter abbreviations for all the states and the 2nd row contains the state names. For example: VA Virginia, then on a new line FL Florida, and so on. Below is what I have so far. The readStateFile() method works fine. I just need some help with the getState method.
public static void main(String[] args) throws IOException {
try {
int age = getAge();
String[][] states = readStateFile();
String state = getState(states);
int ZIPCode = getZIPcode();
System.out.printf("\nAge:\t\t%d\n", age);
System.out.printf("Address:\t%s %s\n\n", ZIPCode, state);
System.out.println("Your survey is complete. " + "Your participation has been valuable.");
} catch (CancelledSurveyException e) {
System.out.println(e.getMessage());
} finally {
System.out.println("Thank you for your time.");
}
}
private static String getState(String[][] states) throws IOException {
states = readStateFile();
String in = "";
String[][] abb;
abb = states;
System.out.println("Please enter the 2 letter state abbreviation or 'q' to quit: ");
Scanner st = new Scanner(System.in);
in = st.next();
if (in.equals("q")) {
System.out.println("Your survey was cancelled.\n" + "Thank you for your time.");
System.exit(0);
}
if (abb.equals(states)) {
for (int i = 0; states[0][i] != null; i++) {
if (abb.equals(states[0][i])) {
for (int state = 1; state <= 100; state++) {
System.out.println(states[0][i]);
}
}
}
} else {
System.out.println("You've entered an invalid state abbreviation.");
}
return in;
}
private static String[][] readStateFile() throws IOException {
String states[][] = new String[50][50];
try {
FileInputStream fstream = new FileInputStream("states copy.bin");
DataInputStream inputFile = new DataInputStream(fstream);
for (int i = 0, j = i + 1; i < 50; i++) {
states[i][0] = inputFile.readUTF();
states[i][j] = inputFile.readUTF();
// System.out.println(states);
}
inputFile.close();
return states;
} catch (EOFException e) {
System.out.println("Survey Cancelled");
}
return states;
}
Instead of using a multidimensional array, it might be more helpful to use a HashMap.
Each abbreviation is used as a key, and the name of the state can be found using that key as a lookup. Illustrated below:
public static void main(final String[] args)
{
try
{
final Map<String, String> states = readStateFile();
// Display the contents of the file
// for (final Map.Entry<String, String> s : states.entrySet())
// {
// System.out.println(s.getKey() + " = " + s.getValue());
// }
final String state = getState(states);
final int age = getAge();
final int postalCode = getZIPcode();
System.out.println();
System.out.printf("Age:\t\t%d\n", Integer.valueOf(age));
System.out.printf("Address:\t%s %s\n\n", Integer.valueOf(postalCode), state);
System.out.println("Your survey is complete. Your participation has been valuable.");
}
catch (final IOException ex)
{
System.out.println(ex.getMessage());
}
System.out.println("Thank you for your time.");
}
private static String getState(final Map<String, String> states)
{
System.out.println("Please enter the 2 letter state abbreviation or 'q' to quit: ");
final StringBuilder sb = new StringBuilder();
try (final Scanner st = new Scanner(System.in))
{
final String stateAbbrev = st.next().toUpperCase(Locale.getDefault());
if ("Q".equals(stateAbbrev))
{
System.out.println("Your survey was cancelled." + System.lineSeparator() + "Thank you for your time.");
System.exit(0);
}
if (states.containsKey(stateAbbrev))
{
final String stateName = states.get(stateAbbrev);
sb.append(stateName);
}
else
{
System.out.println("You've entered an invalid state abbreviation: " + stateAbbrev);
}
}
return sb.toString();
}
private static Map<String, String> readStateFile() throws IOException
{
final List<String> lines = Files.readAllLines(Paths.get("C:/states copy.bin"));
// Get a list of items, with each item separated by any whitespace character
final String[] stateAbbrev = lines.get(0).split("\\s");
final String[] stateNames = lines.get(1).split("\\s");
final Map<String, String> states = new HashMap<>();
for (int i = 0; i < stateAbbrev.length; i++)
{
states.put(stateAbbrev[i], stateNames[i]);
}
return states;
}
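Since the question reads its file with DataInputStream.readUTF, a variant that keeps that binary format might look like the sketch below. It assumes the records alternate between abbreviation and name, which is how the question's loop reads them. The class name StateLookup and the method readStates are hypothetical, and the sample data is built in memory instead of read from "states copy.bin". Multi-word names such as "New York" stay intact here, which splitting on whitespace would not guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class StateLookup {

    // Reads alternating abbreviation/name records until the stream ends.
    static Map<String, String> readStates(InputStream in) throws IOException {
        Map<String, String> states = new HashMap<>();
        try (DataInputStream data = new DataInputStream(in)) {
            while (true) {
                String abbreviation = data.readUTF();
                String name = data.readUTF();
                states.put(abbreviation, name);
            }
        } catch (EOFException endOfFile) {
            // Expected: readUTF signals the end of the data with EOFException.
        }
        return states;
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory sample in the assumed layout.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("VA");
            out.writeUTF("Virginia");
            out.writeUTF("NY");
            out.writeUTF("New York");
        }
        Map<String, String> states = readStates(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(states.get("NY")); // prints "New York"
    }
}
```

For the real file, passing new FileInputStream("states copy.bin") to readStates should work in the same way, provided the file really has this layout.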
private static List<A> compute(Path textFile, String word) {
List<A> results = new ArrayList<A>();
try {
Files.lines(textFile).forEach(line -> {
BreakIterator it = BreakIterator.getWordInstance();
it.setText(line.toString());
int start = it.first();
int end = it.next();
while (end != BreakIterator.DONE) {
String currentWord = line.toString().substring(start, end);
if (Character.isLetterOrDigit(currentWord.charAt(0))) {
if (currentWord.equals(word)) {
results.add(new WordLocation(textFile, line));
break;
}
}
start = end;
end = it.next();
}
});
} catch (IOException e) {
e.printStackTrace();
}
return results;
}
How can I get the number of the line on which the word was found?
I want to compute it with a stream, inside a lambda.
Do you have any idea?
public class Try {
public static void main(String[] args) {
Path path = Paths.get("etc/demo.txt");
List<String> result = compute(path, "Test");
result.stream().forEach(s -> System.out.println(s));
}
private static List<String> compute(Path textFilePath, String wordToFind) {
List<String> results = new ArrayList<String>();
// Added position and initialized with 0
int[] position = new int[]{0};
try {
Files.lines(textFilePath).forEach(line -> {
BreakIterator it = BreakIterator.getWordInstance();
it.setText(line.toString());
int start = it.first();
int end = it.next();
// Increment position by 1 for each line
position[0] += 1;
while (end != BreakIterator.DONE) {
String currentWord = line.toString().substring(start, end);
if (Character.isLetterOrDigit(currentWord.charAt(0))) {
if (currentWord.equals(wordToFind)) {
results.add("File Path: " + textFilePath + ", Found Word: " + wordToFind + ", Line: " + position[0]);
break;
}
}
start = end;
end = it.next();
}
});
} catch (IOException e) {
e.printStackTrace();
}
return results;
}
}
demo.txt:
Stream1
Review
Stream
2020-10-10 10:00
Test
0.0
admin HOST Test
Stream2
Review
Output:
Note:
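This is an example for your reference; it collects the results in a List<String>.
I added int[] position = new int[]{0}; and position[0] += 1; so that the line numbers can be displayed. Note that this counter is only reliable as long as the stream is processed sequentially.
In the example above, Test appears on lines 5 and 7.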
You can use a LineNumberReader to get the line number. That would look something like this:
private static List<A> compute(Path textFile, String word) {
List<A> results = new ArrayList<A>();
try (final LineNumberReader reader = new LineNumberReader(new FileReader(textFile.toFile()))) {
String line;
while ((line = reader.readLine()) != null) {
BreakIterator it = BreakIterator.getWordInstance();
it.setText(line);
int start = it.first();
int end = it.next();
final int lineNumber = reader.getLineNumber(); // here is your linenumber
while (end != BreakIterator.DONE) {
String currentWord = line.substring(start, end);
if (Character.isLetterOrDigit(currentWord.charAt(0))) {
if (currentWord.equals(word)) {
results.add(new WordLocation(textFile, line));
break;
}
}
start = end;
end = it.next();
}
}
} catch (IOException e) {
e.printStackTrace();
}
return results;
}
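If a stream-based version is preferred, one option is to stream over the line indices instead of the lines themselves. The sketch below is an assumption-laden illustration, not the approach from the answers above: it swaps BreakIterator for a regular expression with word boundaries (\b), which may behave differently for search terms that start or end with punctuation, and the names LineNumberSearch and findLines are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LineNumberSearch {

    // Returns the 1-based numbers of all lines that contain the word as a whole word.
    static List<Integer> findLines(List<String> lines, String word) {
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return IntStream.range(0, lines.size())
                .filter(i -> wholeWord.matcher(lines.get(i)).find())
                .map(i -> i + 1)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, List.of("Stream1", "Review", "Test", "admin HOST Test"));
        System.out.println(findLines(Files.readAllLines(file), "Test")); // prints [3, 4]
        Files.delete(file);
    }
}
```

Because the index is part of the stream, no mutable counter is needed, at the cost of reading all lines into memory first.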
I would like to create 5 million CSV files. I have been waiting for almost 3 hours, but the program is still running. Can somebody give me some advice on how to speed up the file generation?
After the 5 million files have been generated, I have to upload them to an S3 bucket.
It would be even better if someone knows how to generate these files on AWS, so that we can move the files to the S3 bucket directly and sidestep the network speed issue. (I have just started learning AWS, and there is a lot to learn.)
The following is my code.
public class ParallelCsvGenerate implements Runnable {
private static AtomicLong baseID = new AtomicLong(8160123456L);
private static ThreadLocalRandom random = ThreadLocalRandom.current();
private static ThreadLocalRandom random2 = ThreadLocalRandom.current();
private static String filePath = "C:\\5millionfiles\\";
private static List<String> headList = null;
private static String csvHeader = null;
public ParallelCsvGenerate() {
headList = generateHeadList();
csvHeader = String.join(",", headList);
}
@Override
public void run() {
for(int i = 0; i < 1000000; i++) {
generateCSV();
}
}
private void generateCSV() {
StringBuilder builder = new StringBuilder();
builder.append(csvHeader).append(System.lineSeparator());
for (int i = 0; i < headList.size(); i++) {
if(i < headList.size() - 1) {
builder.append(i % 2 == 0 ? generateRandomInteger() : generateRandomStr()).append(",");
} else {
builder.append(i % 2 == 0 ? generateRandomInteger() : generateRandomStr());
}
}
String fileName = String.valueOf(baseID.addAndGet(1));
File csvFile = new File(filePath + fileName + ".csv");
FileWriter fileWriter = null;
try {
fileWriter = new FileWriter(csvFile);
fileWriter.write(builder.toString());
fileWriter.flush();
} catch (Exception e) {
System.err.println(e);
} finally {
try {
if(fileWriter != null) {
fileWriter.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
private static List<String> generateHeadList() {
List<String> headList = new ArrayList<>(20);
String baseFiledName = "Field";
for(int i = 1; i <=20; i++) {
headList.add(baseFiledName + i);
}
return headList;
}
/**
* generate a number in range of 0-50000
* @return
*/
private Integer generateRandomInteger() {
return random.nextInt(0,50000);
}
/**
* generate a string length is 5 - 8
* @return
*/
private String generateRandomStr() {
int strLength = random2.nextInt(5, 8);
String str="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int length = str.length();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < strLength; i++) {
builder.append(str.charAt(random.nextInt(length)));
}
return builder.toString();
}
Main
ParallelCsvGenerate generate = new ParallelCsvGenerate();
Thread a = new Thread(generate, "A");
Thread b = new Thread(generate, "B");
Thread c = new Thread(generate, "C");
Thread d = new Thread(generate, "D");
Thread e = new Thread(generate, "E");
a.run();
b.run();
c.run();
d.run();
e.run();
Thanks for your advice, everyone. I have refactored the code and generated 3.8 million files in 2.8 hours, which is much better.
Refactored code:
public class ParallelCsvGenerate implements Callable<Integer> {
private static String filePath = "C:\\5millionfiles\\";
private static String[] header = new String[]{
"FIELD1","FIELD2","FIELD3","FIELD4","FIELD5",
"FIELD6","FIELD7","FIELD8","FIELD9","FIELD10",
"FIELD11","FIELD12","FIELD13","FIELD14","FIELD15",
"FIELD16","FIELD17","FIELD18","FIELD19","FIELD20",
};
private String fileName;
public ParallelCsvGenerate(String fileName) {
this.fileName = fileName;
}
@Override
public Integer call() throws Exception {
try {
generateCSV();
} catch (IOException e) {
e.printStackTrace();
}
return 0;
}
private void generateCSV() throws IOException {
CSVWriter writer = new CSVWriter(new FileWriter(filePath + fileName + ".csv"), CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER);
String[] content = new String[]{
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr()
};
writer.writeNext(header);
writer.writeNext(content);
writer.close();
}
}
Main
public static void main(String[] args) {
System.out.println("Start generate");
long start = System.currentTimeMillis();
ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(8, 8,
0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>());
List<ParallelCsvGenerate> taskList = new ArrayList<>(3800000);
for(int i = 0; i < 3800000; i++) {
taskList.add(new ParallelCsvGenerate(i+""));
}
try {
List<Future<Integer>> futures = threadPoolExecutor.invokeAll(taskList);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Success");
long end = System.currentTimeMillis();
System.out.println("Using time: " + (end-start));
}
You could write directly into the file, without first building the whole content in one StringBuilder. (I think this is the biggest time and memory bottleneck here: builder.toString().)
You could generate each file in parallel.
(Little tweaks:) Omit the ifs inside the loop.
if(i < headList.size() - 1) is not needed if you restructure the loop and handle one field outside of it.
The i % 2 == 0 check can be eliminated by stepping the loop by two (i += 2) and doing more work in each iteration (i -> int, i + 1 -> string).
If applicable, prefer append(char) to append(String): append(',') is better than append(",").
...
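The tweaks above might be sketched as follows. This is only an illustration: the class CsvRowWriter and its methods are hypothetical, the row goes to any Writer instead of a StringBuilder, and it assumes an even field count of at least 2 and string lengths of 5 to 8 characters inclusive, as the question's Javadoc states.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.concurrent.ThreadLocalRandom;

public class CsvRowWriter {

    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static void writeInt(Writer out, ThreadLocalRandom random) throws IOException {
        out.write(Integer.toString(random.nextInt(0, 50000)));
    }

    static void writeStr(Writer out, ThreadLocalRandom random) throws IOException {
        int length = random.nextInt(5, 9); // upper bound is exclusive, so 5 to 8 characters
        for (int c = 0; c < length; c++) {
            out.append(LETTERS.charAt(random.nextInt(LETTERS.length())));
        }
    }

    // Writes one row of (int, string) pairs; assumes fieldCount is even and at least 2.
    static void writeRow(Writer out, int fieldCount) throws IOException {
        ThreadLocalRandom random = ThreadLocalRandom.current(); // looked up on the calling thread
        // The first pair is written outside the loop, so the loop needs no branch for commas.
        writeInt(out, random);
        out.append(',');
        writeStr(out, random);
        for (int i = 2; i < fieldCount; i += 2) {
            out.append(',');
            writeInt(out, random);
            out.append(',');
            writeStr(out, random);
        }
        out.write(System.lineSeparator());
    }

    public static void main(String[] args) throws IOException {
        StringWriter row = new StringWriter();
        writeRow(row, 20);
        System.out.print(row);
    }
}
```

To write a real file, the StringWriter in main could be replaced by a BufferedWriter, for example one obtained from Files.newBufferedWriter, so that the row is streamed to disk without an intermediate String.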
You can use the Fork/Join framework (Java 7 and above) to run your work in parallel and make use of the multiple cores of your CPU.
Here is an example.
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;
import java.util.stream.LongStream;
public class ForkJoinAdd extends RecursiveTask<Long> {
private final long[] numbers;
private final int start;
private final int end;
public static final long threshold = 10_000;
public ForkJoinAdd(long[] numbers) {
this(numbers, 0, numbers.length);
}
private ForkJoinAdd(long[] numbers, int start, int end) {
this.numbers = numbers;
this.start = start;
this.end = end;
}
@Override
protected Long compute() {
int length = end - start;
if (length <= threshold) {
return add();
}
ForkJoinAdd firstTask = new ForkJoinAdd(numbers, start, start + length / 2);
firstTask.fork(); //start asynchronously
ForkJoinAdd secondTask = new ForkJoinAdd(numbers, start + length / 2, end);
Long secondTaskResult = secondTask.compute();
Long firstTaskResult = firstTask.join();
return firstTaskResult + secondTaskResult;
}
private long add() {
long result = 0;
for (int i = start; i < end; i++) {
result += numbers[i];
}
return result;
}
public static long startForkJoinSum(long n) {
long[] numbers = LongStream.rangeClosed(1, n).toArray();
ForkJoinTask<Long> task = new ForkJoinAdd(numbers);
return new ForkJoinPool().invoke(task);
}
}
You can use this example as a starting point.
If you want to read more about it, "Guide to the Fork/Join Framework in Java" (Baeldung) and "Fork/Join" (The Java™ Tutorials) can help you understand the framework and design your app better.
Good luck.
Remove the for(int i = 0; i < 1000000; i++) loop from the run method (leave a single generateCSV() call).
Create 5 million ParallelCsvGenerate objects.
Submit them to a ThreadPoolExecutor.
Converted main:
public static void main(String[] args) {
ThreadPoolExecutor ex = (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
for(int i = 0; i < 5000000; i++) {
ParallelCsvGenerate generate = new ParallelCsvGenerate();
ex.submit(generate);
}
ex.shutdown();
}
It takes roughly 5 minutes to complete on my laptop (4 physical cores with hyperthreading, SSD drive).
EDIT:
I've replaced FileWriter with AsynchronousFileChannel using the following code:
Path file = Paths.get(filePath + fileName + ".csv");
try(AsynchronousFileChannel asyncFile = AsynchronousFileChannel.open(file,
StandardOpenOption.WRITE,
StandardOpenOption.CREATE)) {
asyncFile.write(ByteBuffer.wrap(builder.toString().getBytes()), 0);
} catch (IOException e) {
e.printStackTrace();
}
This gave a 30% speedup.
I believe that the main bottleneck is the hard drive and filesystem itself. Not much more can be achieved here.
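One caveat with the AsynchronousFileChannel snippet, as far as I understand the API: write returns a Future immediately, and the documentation of AsynchronousChannel.close says that outstanding operations complete with AsynchronousCloseException when the channel is closed. Leaving the try-with-resources block right after calling write may therefore cut the write short. A minimal sketch that waits for completion before closing, with the hypothetical names AsyncWriteDemo and writeFully:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;

public class AsyncWriteDemo {

    // Writes the text starting at position 0 and blocks until every byte has been written.
    static void writeFully(Path file, String text)
            throws IOException, InterruptedException, ExecutionException {
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                file, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
            long position = 0;
            while (buffer.hasRemaining()) {
                // get() waits for this write and returns the number of bytes it wrote.
                position += channel.write(buffer, position).get();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("async", ".csv");
        writeFully(file, "FIELD1,FIELD2\n1,abc\n");
        System.out.print(Files.readString(file));
        Files.delete(file);
    }
}
```

Whether the speedup reported above holds once each write is awaited is something the sketch does not measure.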
I am reading a file that contains disease names and their remedies. I want to save each name as a key and its remedies in a set as the value. How can I achieve that? There seem to be some problems in my code.
public static HashMap<String,Set<String>> disease = new HashMap <> ();
public static void main(String[] args) {
Scanner fin = null;
try {
fin = new Scanner (new File ("diseases.txt"));
while (fin.hasNextLine()) {
HashSet <String> remedies = null;
String [] parts = fin.nextLine().split(",");
int i = 1;
while (fin.hasNext()) {
remedies.add(parts[i].trim());
i++;
}
disease.put(parts[0],remedies);
}
fin.close();
}catch(Exception e) {
System.out.println("Error: " + e.getMessage());
}
finally {
try {fin.close();} catch(Exception e) {}
}
Set <String> result = disease.get("thrombosis");
display(result);
public static <T> void display (Set<T> items) {
if (items == null)
return;
int LEN = 80;
String line = "[";
for (T item:items) {
line+= item.toString() + ",";
if (line.length()> LEN) {
line = "";
}
}
System.out.println(line + "]");
}
Here is my code.
cancer,pain,swelling,bleeding,weight loss
gout,pain,swelling
hepatitis A,discoloration,malaise,tiredness
thrombosis,high heart rate
diabetes,frequent urination
And here is what the text file contains.
In your code, you haven't initialized the remedies HashSet (that's why it throws a NullPointerException at line 14).
The second issue: i is incremented on every iteration, but you never check it against the size of your parts array (i < parts.length).
I edited your code:
Scanner fin = null;
try {
fin = new Scanner(new File("diseases.txt"));
while (fin.hasNextLine()) {
HashSet<String> remedies = new HashSet<String>();
String[] parts = fin.nextLine().split(",");
int i = 1;
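// Only the array bound is checked: fin.hasNext() is false after the last line has been read,
// which would leave that line's set empty
while (parts.length > i) {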
remedies.add(parts[i].trim());
i++;
}
disease.put(parts[0], remedies);
}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Scanner;
import java.io.File;
import java.util.Set;
public class Solution {
public static HashMap<String, Set<String>> disease = new HashMap<>();
public static void main(String[] args) {
Scanner fin = null;
try {
fin = new Scanner (new File("diseases.txt"));
while (fin.hasNextLine()) {
HashSet <String> remedies = new HashSet<>();
String [] parts = fin.nextLine().split(",");
for (int i=1; i < parts.length; i++) {
remedies.add(parts[i].trim());
}
disease.put(parts[0],remedies);
}
fin.close();
}catch(Exception e) {
System.out.println("Error: " + e.getMessage());
}
finally {
try {fin.close();} catch(Exception e) {}
}
Set <String> result = disease.get("thrombosis");
display(result);
}
public static <T> void display(Set<T> items) {
if (items == null)
return;
int LEN = 80;
String line = "[";
for (T item : items) {
line += item.toString() + ",";
if (line.length() > LEN) {
line = "";
}
}
System.out.println(line + "]");
}
}
Here is the full working code. As @Pratik pointed out, you forgot to initialize the HashSet, which is why the NullPointerException was thrown.
You have a few issues here:
There is no need for the inner while loop (while (fin.hasNext()) {). Instead, use for (int i = 1; i < parts.length; i++).
HashSet <String> remedies = null; means the set is not initialized, so we cannot put items in it. It needs to change to: HashSet<String> remedies = new HashSet<>();
It is better practice to close() the file in the finally part.
The display method discards the accumulated line once it grows longer than 80 characters, so that part of the output is lost before printing.
It is better to use a StringBuilder when appending strings.
So the corrected code would be:
import java.io.File;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
public class TestSOCode {
public static HashMap<String,Set<String>> disease = new HashMap<>();
private static int LINE_LENGTH = 80;
public static void main(String[] args) {
Scanner fin = null;
try {
fin = new Scanner(new File("diseases.txt"));
while (fin.hasNextLine()) {
HashSet<String> remedies = new HashSet<>();
String[] parts = fin.nextLine().split(",");
disease.put(parts[0], remedies);
for (int i = 1; i < parts.length; i++) {
remedies.add(parts[i].trim());
}
}
} catch (Exception e) {
System.out.println("Error: " + e.getMessage());
} finally {
try {
fin.close();
} catch (Exception e) {
System.out.println("Error when closing file: " + e.getMessage());
}
}
Set<String> result = disease.get("thrombosis");
display(result);
}
public static <T> void display (Set<T> items) {
if (items == null)
return;
StringBuilder line = new StringBuilder("[");
int currentLength = 1; // start from 1 because of the '[' char
for (T item:items) {
String itemStr = item.toString();
line.append(itemStr).append(",");
currentLength += itemStr.length() + 1; // itemStr length plus the ',' char
if (currentLength >= LINE_LENGTH) {
line.append("\n");
currentLength = 0;
}
}
// replace last ',' with ']'
line.replace(line.length() - 1, line.length(), "]");
System.out.println(line.toString());
}
}
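For comparison, the parsing and display steps can also be written more compactly. The sketch below is only an illustration with hypothetical names (DiseaseParser, parse); it works on lines that are already in memory, uses a LinkedHashSet so the values keep their file order, and prints with String.join, which avoids the trailing comma without any manual length tracking.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class DiseaseParser {

    // Maps the first comma-separated field of each line to the set of remaining fields.
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> result = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            Set<String> values = Arrays.stream(parts)
                    .skip(1)
                    .map(String::trim)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            result.put(parts[0].trim(), values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "cancer,pain,swelling,bleeding,weight loss",
                "thrombosis,high heart rate");
        Map<String, Set<String>> disease = parse(lines);
        System.out.println("[" + String.join(",", disease.get("cancer")) + "]");
        // prints "[pain,swelling,bleeding,weight loss]"
    }
}
```

With the real file, the lines could come from Files.readAllLines(Paths.get("diseases.txt")), which also removes the need to close a Scanner by hand.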
I keep getting an error in my program after it crawls the first 2 URLs: "Exception in thread "AWT-EventQueue-0" java.lang.StringIndexOutOfBoundsException: String index out of range: 0". The first couple of URLs are crawled as I want, and I get the text from them using a method in another class. The other class could be the problem; I don't know. Please have a look at my code and see what's happening.
package WebCrawler;
import java.util.Scanner;
import java.util.ArrayList;
import static TextAnalyser.Textanalyser.analyse;
public class Crawler {
public static void main(String[] args) {
// java.util.Scanner input = new java.util.Scanner(System.in);
// System.out.print("Enter a URL: ");
// String url = input.nextLine();
crawler("http://www.port.ac.uk/"); // Traverse the Web from the a starting url
}
public static void crawler(String startingURL) {
ArrayList<String> listOfPendingURLs = new ArrayList<String>();
ArrayList<String> listOfTraversedURLs = new ArrayList<String>();
listOfPendingURLs.add(startingURL);
while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
String urlString = listOfPendingURLs.remove(0);
if (!listOfTraversedURLs.contains(urlString)) {
listOfTraversedURLs.add(urlString);
String text = urlString;
text = ReadTextfromURL.gettext(text);
text = analyse(text);
System.out.println("text : " + text);
System.out.println("Craw " + urlString);
for (String s: getSubURLs(urlString)) {
if (!listOfTraversedURLs.contains(s)) {
listOfPendingURLs.add(s);
}
}
}
}
}
public static ArrayList<String> getSubURLs(String urlString) {
ArrayList <String> list = new ArrayList<String>();
try {
java.net.URL url = new java.net.URL(urlString);
Scanner input = new Scanner(url.openStream());
int current = 0;
while (input.hasNext()) {
String line = input.nextLine();
current = line.indexOf("http:", current);
while (current > 0) {
int endIndex = line.indexOf("\"", current);
if (endIndex > 0) { // Ensure that a correct URL is found
list.add(line.substring(current, endIndex));
current = line.indexOf("http:", endIndex);
} else {
current = -1;
}
}
}
} catch (Exception ex) {
System.out.println("Error: " + ex.getMessage());
}
return list;
}
}