How many times the word is used on the html page

I have a method that should return an integer which is the number of uses of the searchWord in the text of an HTML document:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
if (bodyText.toLowerCase().contains(searchWord.toLowerCase())){
count++;
}
return count;
}
But my method always returns count=1, even if the word is used several times. I understand that the error should be obvious, but I’m stuck and I don’t see it.

You are currently checking only once whether the text contains the search word, so the count will always be either 0 or 1. To find the total count, call String#indexOf(str, fromIndex) in a loop until it returns -1. The second argument is the index to start searching from, so pass the position just after the previous match.
public int searchForWord(String searchWord) {
int count = 0;
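if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return count; // nothing to search; avoids a NullPointerException below
}
System.out.println("Searching for the word " + searchWord + "...");
// lower-case both sides to keep the case-insensitive matching from the question
String bodyText = this.htmlDocument.body().text().toLowerCase();
String needle = searchWord.toLowerCase();
// each search restarts one past the previous match; the loop body is intentionally empty
for(int idx = -1; (idx = bodyText.indexOf(needle, idx + 1)) != -1; count++);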
return count;
}

According to the Java docs String#contains:
Returns true if and only if this string contains the specified sequence of char values.
You're asking if the word you're looking for is contained in the document, which it is.
You could:
Split the text on words (splitting it by spaces) and then count how many times it appears
Iterate the String using String#indexOf starting on index 0 and then from last index you found until the end of the String.
Iterate the String using contains but starting from a certain index (doing this logic yourself).
I'd go for the 2nd approach as it seems like the easiest one.
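For comparison, the first option in the list above (splitting the text into words) could be sketched as follows. The class and method names are made up for illustration, and the text is assumed to be plain whitespace-separated words. Unlike indexOf, this counts whole words only, so "java" does not match inside "javascript", and a token with punctuation attached (such as "java,") is not counted either.

```java
final class WholeWordCounter {
    // Counts whitespace-separated tokens that equal searchWord, ignoring case.
    static int countWholeWords(String text, String searchWord) {
        if (text == null || searchWord == null || searchWord.isEmpty()) {
            return 0;
        }
        int count = 0;
        // "\\s+" splits on runs of spaces, tabs and line breaks
        for (String token : text.trim().split("\\s+")) {
            if (token.equalsIgnoreCase(searchWord)) {
                count++;
            }
        }
        return count;
    }
}
```

In the method from the question this would be called as countWholeWords(this.htmlDocument.body().text(), searchWord).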

These are only conditional statements; you aren't looping through the HTML text. Therefore, if it finds an instance of searchWord in bodyText, it increments count once and then exits the method with a value of 1. I suggest putting the text into an array, looping through it, and counting matches that way. For individual characters you could use something like this:
char[] bodyTextA = bodyText.toCharArray();
Or keep it in a String array and split it on a space, a newline, or whatever criteria you have. Example using whitespace:
// puts "Hello", "I'm", "your", and "String" into their own slots in the array
String str = "Hello I'm your String";
String[] split = str.split("\\s+");

Your issue here is that the if statement checks whether the text contains the word and then increments your count variable. So even if the text contains the word multiple times, your logic is basically: if it contains it at all, increase count by one. You will have to rewrite your code to check for multiple occurrences of the word. There are many ways to go about this: you could loop through the entire body text, you could split the body text into an array of words and check that, or you could remove the search word from the text each time you find it and keep checking until the text no longer contains the search word.
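A one-shot variant of the last idea (removing the search word) can be sketched like this: remove every occurrence at once and divide the lost length by the word's length. The names are illustrative only, and the count is of non-overlapping occurrences, so "aa" is found once in "aaa".

```java
final class RemovalCounter {
    // Counts non-overlapping, case-insensitive occurrences of searchWord in text
    // by comparing the text length before and after removing every occurrence.
    static int countByRemoval(String text, String searchWord) {
        if (text == null || searchWord == null || searchWord.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = searchWord.toLowerCase();
        // String.replace treats its arguments as literal text, not as a regex
        String stripped = haystack.replace(needle, "");
        return (haystack.length() - stripped.length()) / needle.length();
    }
}
```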

You can use indexOf(str, fromIndex), starting each search just after the last match found:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return count; // nothing to search; avoids a NullPointerException below
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
int index = -1; // start at -1 so the first search begins at index 0
while ((index = bodyText.indexOf(searchWord, index + 1)) != -1) {
count++;
}
return count;
}

Related

Reading a file -- pairing a String and int value -- with multiple split lines

I am working on an exercise with the following criteria:
"The input consists of pairs of tokens where each pair begins with the type of ticket that the person bought ("coach", "firstclass", or "discount", case-sensitively) and is followed by the number of miles of the flight."
The list can be paired -- coach 1500 firstclass 2000 discount 900 coach 3500 -- and this currently works great. However, when the String and int value are split like so:
firstclass 5000 coach 1500 coach
100 firstclass
2000 discount 300
it breaks entirely. I am almost certain that it has something to do with me using this format (not full)
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ");
while(token.hasMoreTokens())
{
String ticketClass = token.nextToken().toLowerCase();
int count = Integer.parseInt(token.nextToken());
...
}
}
because it will always read the first value as a String and the second value as an integer. I am very lost on how to keep track of one or the other while going to read the next line. Any help is truly appreciated.
Similar (I think) problems:
Efficient reading/writing of key/value pairs to file in Java
Java-Read pairs of large numbers from file and represent them with linked list, get the sum and product of each pair
Reading multiple values in multiple lines from file (Java)

If you can afford to read the text file in all at once as a very long String, simply use the built-in String.split() with the regex \\s+, like so
String[] tokens = fileAsString.split("\\s+");
This will split the input file into tokens, assuming the tokens are separated by one or more whitespace characters (a whitespace character covers newline, space, tab, and carriage return). Even and odd tokens are ticket types and mile counts, respectively.
If you absolutely have to read line by line and use StringTokenizer, one solution is to count the number of tokens in the previous line. If that number is odd, the first token of the current line has a different type from the first token of the previous line. Once you know the starting type of the current line, simply alternate types from there.
int tokenCount = 0;
boolean startingType = true; // true for String, false for integer
boolean currentType;
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ");
startingType = startingType ^ (tokenCount % 2 == 1); // if tokenCount is odd, the XOR ^ operator will flip the starting type of this line
tokenCount = 0;
while(token.hasMoreTokens())
{
tokenCount++;
currentType = startingType ^ (tokenCount % 2 == 0); // alternating between types in current line
if (currentType) {
String ticketClass = token.nextToken().toLowerCase();
// do something with ticketClass here
} else {
int mileCount = Integer.parseInt(token.nextToken());
// do something with mileCount here
}
...
}
}

I found another way to do this problem without using either the StringTokenizer or the regex...admittedly I had trouble with the regular expressions haha.
I declare these outside of the try-catch block because I want to use them in both my finally statement and return the points:
int points = 0;
ArrayList<String> classNames = new ArrayList<>();
ArrayList<Integer> classTickets = new ArrayList<>();
Then inside my try statement, I declare the index variable because I won't need it outside of this block. It increases each time a new element is read. Elements at even indices (the 1st, 3rd, and so on) are read as ticket classes and elements at odd indices are read as the numbers paired with them:
try
{
int index = 0;
// read till the file is empty
while(fileScanner.hasNext())
{
// first entry is the ticket type
if(index % 2 == 0)
classNames.add(fileScanner.next());
// second entry is the number of points
else
classTickets.add(Integer.parseInt(fileScanner.next()));
index++;
}
}
You can either catch it here like this or use throws NoSuchElementException in your method declaration -- As long as you catch it on your method call
catch(NoSuchElementException noElement)
{
System.out.println("<###-NoSuchElementException-###>");
}
Then down here, loop through the number of elements. See which flight class it is and multiply the ticket count respectively and return the points outside of the block:
finally
{
for(int i = 0; i < classNames.size(); i++)
{
switch(classNames.get(i).toLowerCase())
{
case "firstclass": // 2 points for first
points += 2 * classTickets.get(i);
break;
case "coach": // 1 point for coach
points += classTickets.get(i);
break;
default:
// discount gets nothing
}
}
}
return points;
The regex seems like the most convenient way, but this was more intuitive to me for some reason. Either way, I hope the variety will help out.

simply use the built-in String.split() - #bui
I was finally able to wrap my head around regular expressions, but \s+ was not being recognized for some reason. It kept giving me this error message:
Invalid escape sequence (valid ones are \b \t \n \f \r " ' \ )Java(1610612990)
So when I went through with those characters instead, I was able to write this:
int points = 0, multiplier = 0, tracker = 0;
while(fileScanner.hasNext())
{
String read = fileScanner.next().split(
"[\b \t \n \f \r \" \' \\ ]")[0];
if(tracker % 2 == 0)
{
if(read.toLowerCase().equals("firstclass"))
multiplier = 2;
else if(read.toLowerCase().equals("coach"))
multiplier = 1;
else
multiplier = 0;
}else
{
points += multiplier * Integer.parseInt(read);
}
tracker++;
}
This code goes one entry at a time instead of reading a whole array void of whitespace as a work-around for that error message I was getting. If you could show me what the code would look like with String[] tokens = fileAsString.split("\s+"); instead I would really appreciate it :)
you need to add another "\" before "\s" to escape the slash before "s" itself – #bui
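With the backslash doubled, the split-based version requested above might look like the sketch below. It assumes the whole file has already been read into a String (for example with Files.readString on Java 11 or later) and reuses the scoring from the loop above; the class and method names are made up for illustration.

```java
final class TicketPoints {
    // Scores pairs of tokens: firstclass counts double, coach counts once,
    // anything else (such as discount) counts for nothing.
    static int computePoints(String fileAsString) {
        // "\\s+" in Java source is the regex \s+, i.e. one or more whitespace characters
        String[] tokens = fileAsString.trim().split("\\s+");
        int points = 0;
        // tokens[i] is a ticket class and tokens[i + 1] is its number,
        // regardless of where the line breaks fell in the file
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            String ticketClass = tokens[i].toLowerCase();
            int amount = Integer.parseInt(tokens[i + 1]);
            if (ticketClass.equals("firstclass")) {
                points += 2 * amount;
            } else if (ticketClass.equals("coach")) {
                points += amount;
            }
        }
        return points;
    }
}
```

Because the split ignores line breaks, the input that is broken across lines in the question is handled the same way as the single-line input.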

How to read and return the second index of an array if first index matches a string?

I'm trying to write a translate method using the following parameters. However, every time I run the method it skips the first if statement and goes right to the second for loop.
/**
Translates a word according to the data in wordList then matches the case.
The parameter wordList contains the mappings for the translation. The data is
organized in an ArrayList containing String arrays of length 2. The first
cell (index 0) contains the word in the original language, called the key,
and the second cell (index 1) contains the translation.
It is assumed that the items in the wordList are sorted in ascending order
according to the keys in the first cell.
@param word
The word to translate.
@param wordList
An ArrayList containing the translation mappings.
@return The mapping in the wordList with the same case as the original. If no
match is found in wordList, it returns a string of Config.LINE_CHAR of the same length as word.
*/
public static String translate(String word, ArrayList<String[]> wordList) {
String newWord = "";
int i = 0;
for (i = 0; i < wordList.size(); i++) {
word = matchCase(wordList.get(i)[0], word); //make cases match
if (word.equals(wordList.get(i)[0])) { //check each index at 0
newWord = wordList.get(i)[1]; //update newWord to skip second for loop
return wordList.get(i)[1];
}
}
if (newWord == "") {
for (i = 0; i < word.length(); i++) {
newWord += Config.LINE_CHAR;
}
}
return newWord;
}
For the files I'm running, each word should have a translated word, so no Config.LINE_CHAR should be printed. But this is the only thing that prints. How do I fix this?

You are initializing newWord to the value "". The only time newWord can possibly change is in the first loop, where it is promptly followed by a return statement, exiting your method. The only way your if statement can be reached is if you didn't return during the first loop, so if it reaches that if statement, then newWord must be unchanged since its initial assignment of "".
Some unrelated advice: you should use the equals method when comparing strings, for example if ("".equals(newWord)). With ==, you're comparing the references of the two String objects (whether they are the same object) rather than their values.
You may need to share your matchCase method to ensure all bugs are addressed, though.
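One possible culprit, which cannot be confirmed without seeing matchCase, is that the loop reassigns word on every iteration, so later comparisons no longer use the original input. The sketch below sidesteps that by never modifying word and comparing with equalsIgnoreCase. It is only an illustration: the lineChar parameter is a hypothetical stand-in for Config.LINE_CHAR, and matching the case of the result is left out because matchCase is not shown.

```java
import java.util.ArrayList;

final class TranslateSketch {
    // Returns the translation for word, or a run of lineChar as long as word
    // when no key matches. The input word is never reassigned.
    static String translate(String word, ArrayList<String[]> wordList, char lineChar) {
        for (String[] entry : wordList) {
            // entry[0] is the key, entry[1] is the translation
            if (entry[0].equalsIgnoreCase(word)) {
                return entry[1];
            }
        }
        // no match: build the placeholder, one lineChar per character of word
        StringBuilder filler = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            filler.append(lineChar);
        }
        return filler.toString();
    }
}
```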

Java: Removing duplicate words & substrings of words in java

Recently I have come up against a question at school which I am not able to tackle.
I need to remove duplicate words in an input string which consists of words. The main issue here is that the requirement states that I cannot use arrays or regular expressions.
E.g.
userInput = "this is a test testing is fun really fun"
the first "is" is a duplicate of "this" as it is a substring
the second "is" is a duplicate of the first "is"
"testing" is not a duplicate of "test" as it is not an exact match
therefore the output comes out as - "this a test testing fun really"
How would one actually achieve this without using arrays or regular expressions? Without them, it seems impossible to split the words up by the white spaces and dynamically create a String in Java.

I didn't compile this code, but I think it should work.
Let me know if it helps you solve your problem.
public String solve(String input) {
String ret = "";
int pos = 0;
while(pos<input.length()) {
// find next position of space
int next = input.indexOf(' ',pos);
// space not exists, skip next to end of string
if(next==-1) next = input.length();
// take 1 word from input
String word = input.substring(pos,next);
// check if word exists in previous result
if(ret.indexOf(word)==-1) {
if(ret.length() > 0) ret += " ";
// append word to ret
ret += word;
}
pos = next + 1;
}
return ret;
}

How to explode a string on a hyphen in Java?

I have a task which involves creating a program that reads text from a text file, produces a word count from it, and lists the occurrences of each word used in the file. I managed to remove punctuation from the word count but I'm really stumped on this:
I want Java to see the string "hello-funny-world" as 3 separate strings and store them in my array list. This is what I have so far. With this section of code I am having issues: "hello funny world" is still seen as one string:
while (reader.hasNext()){
String nextword2 = reader.next();
String nextWord3 = nextword2.replaceAll("[^a-zA-Z0-9'-]", "");
String nextWord = nextWord3.replace("-", " ");
int apcount = 0;
for (int i = 0; i < nextWord.length(); i++){
if (nextWord.charAt(i)== 39){
apcount++;
}
}
int i = nextWord.length() - apcount;
if (wordlist.contains(nextWord)){
int index = wordlist.indexOf(nextWord);
count.set(index, count.get(index) + 1);
}
else{
wordlist.add(nextWord);
count.add(1);
if (i / 2 * 2 == i){
wordlisteven.add(nextWord);
}
else{
wordlistodd.add(nextWord);
}
}

This can work for you ....
List<String> items = Arrays.asList("hello-funny-world".split("-"));

Assuming that your separator is '-',
I would suggest using Java's plain split():
String name="this-is-string";
String arr[]=name.split("-");
System.out.println("Here " +arr.length);
You will then be able to iterate through this array using a for loop.
Hope this helps.
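Applied to the word-count loop from the question, the split could be done per token, as in the sketch below. It is an illustration under assumptions: it swaps the two parallel lists (wordlist and count) for a single Map, keeps the cleanup regex from the question, and takes the text as a String instead of reading a file; the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

final class HyphenWordCount {
    // Counts words, treating each hyphen-separated part as a separate word.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (Scanner reader = new Scanner(text)) {
            while (reader.hasNext()) {
                // same cleanup as in the question: keep letters, digits, ' and -
                String cleaned = reader.next().replaceAll("[^a-zA-Z0-9'-]", "");
                // split instead of replacing "-" with a space, so each part is counted on its own
                for (String part : cleaned.split("-")) {
                    if (!part.isEmpty()) {
                        counts.merge(part, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }
}
```

Replacing "-" with a space, as in the question, leaves one String that merely contains spaces; splitting produces separate String values that can each be added to a list or map.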

Print words which occurs more than once from a string

I am trying to find and print the words in a string that occur more than once, and it almost works. However, I am fighting with a small problem: the words are printed out twice, since they occur twice in the sentence. I want them printed only once.
This is my code:
public class Main {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String sentence = "is this a sentence or is this not ";
String[] myStringArray = sentence.split(" "); //Split the sentence by space.
int[] count = new int[myStringArray.length];
for (int i = 0; i < myStringArray.length; i++){
for (int j = 0; j < myStringArray.length; j++){
if (myStringArray[i].matches(myStringArray[j]))
count[i]++;
//else break;
}
}
for (int i = 0; i < myStringArray.length; i++) {
if (count[i] > 1)
System.out.println("1b. - Tokens that occurs more than once: " + myStringArray[i] + "\n");
}
}
}

You can try for (int i = 0; i < myStringArray.length; i+=2) instead.

Break on the first match, after incrementing. Then it won't also increment the second match.

Your code has a few problems.
It compares each of the n words with every word in the list, so it performs n^2 comparisons.
If a word occurs twice, both of its positions end up with a count of 2, which is why it is printed twice.
What you need to keep track of is the set of words you have already seen, and check whether each new word you encounter has been seen before.
If one word occurred 3 times in your sentence, each of its 3 positions would have a count of 3. That count is redundant when stored per token; it only needs to be stored once per word.
All this can be done easily if you know how a Map works.
Here is an implementation that would work.
import java.util.HashMap;
import java.util.Map;
public class Main {
public static void main(String[] args) {
String sentence = "is this a sentence or is this not ";
String[] myStringArray = sentence.split("\\s"); //Split the sentence by space.
Map <String, Integer> wordOccurrences = new HashMap <String, Integer> (myStringArray.length);
for (String word : myStringArray)
if (wordOccurrences.containsKey(word)) // Map has containsKey, not contains
wordOccurrences.put(word, wordOccurrences.get(word) + 1);
else wordOccurrences.put(word, 1);
for (String word : wordOccurrences.keySet())
if (wordOccurrences.get(word) > 1)
System.out.println("1b. - Tokens that occurs more than once: " + word + "\n");
}
}

We want to find the repeated words in an input string, so I suggest the following fairly simple approach:
Create a HashMap instance. The key (String) will be the word and the value (Integer) will be the frequency of its occurrence.
Split the string using the split("\\s") method to get an array of words.
Introduce an Integer 'frequency' variable with initial value 0.
Iterate over the string array and check the frequency of each element (word): if the key is not yet in the map, add it with a frequency of 1; if the key (word) already exists, only increment its frequency by 1.
You are now left with each word and its frequency.
For example, if the input string is "We are getting dirty as this earth is getting polluted. We must stop it.",
the map will be
{ ("We",2), ("are",1), ("getting",2), ("dirty",1), ("as",1), ("this",1), ("earth",1), ("is",1), ("polluted.",1), ("must",1), ("stop",1), ("it.",1) }
Now you know what the next step is and how to use it. I agree with Kaushik.
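The steps above, including the final filtering step, might be sketched as follows. This is one possible reading, not the answerer's own code: it uses a LinkedHashMap so the words come out in the order they were first seen, and Map.merge in place of the separate 'frequency' variable. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RepeatedWords {
    // Returns each word that occurs more than once, listed once,
    // in order of first appearance.
    static List<String> findRepeated(String sentence) {
        Map<String, Integer> frequency = new LinkedHashMap<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // inserts 1 for a new word, otherwise adds 1 to the stored count
                frequency.merge(word, 1, Integer::sum);
            }
        }
        List<String> repeated = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            if (entry.getValue() > 1) {
                repeated.add(entry.getKey());
            }
        }
        return repeated;
    }
}
```

As in the example map above, punctuation stays attached to the words, so "polluted." and "polluted" would be counted separately.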
