Java: Removing duplicate words & substrings of words in java

Recently I came up against a question in school that I am not able to tackle.
I need to remove duplicate words from an input string that consists of words. The main issue is that the requirement states that I cannot use arrays or regular expressions.
E.g.
userInput = "this is a test testing is fun really fun"
the first "is" is a duplicate of "this" as it is a substring
the second "is" is a duplicate of the first "is"
"testing" is not a duplicate of "test" as it is not an exact match
therefore the output comes out as - "this a test testing fun really"
How would one achieve this without using arrays or regular expressions? Without them, it seems impossible to split the words on whitespace and build up a String dynamically in Java.

I didn't compile this code, but I think it should work.
Let me know if it helps you solve your problem.
public String solve(String input) {
String ret = "";
int pos = 0;
while(pos<input.length()) {
// find next position of space
int next = input.indexOf(' ',pos);
// space not exists, skip next to end of string
if(next==-1) next = input.length();
// take 1 word from input
String word = input.substring(pos,next);
// check if word exists in previous result
if(ret.indexOf(word)==-1) {
if(ret.length() > 0) ret += " ";
// append word to ret
ret += word;
}
pos = next + 1;
}
return ret;
}
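For anyone who wants to run it, here is a minimal sketch of the same idea built on a StringBuilder, so the result is not re-created on every concatenation. The class and method names are made up for illustration, and it assumes StringBuilder is allowed under the "no arrays" rule.

```java
// Hypothetical helper class; the names are illustrative, not from the original answer.
final class DuplicateWordFilter {

    // Keeps a word only if it does not already occur as a substring of the output so far.
    static String removeDuplicates(String input) {
        StringBuilder result = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            // find the end of the current word (next space, or end of input)
            int next = input.indexOf(' ', pos);
            if (next == -1) {
                next = input.length();
            }
            String word = input.substring(pos, next);
            // An empty word (from consecutive spaces) is always found at index 0, so it is skipped.
            if (result.indexOf(word) == -1) {
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(word);
            }
            pos = next + 1;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("this is a test testing is fun really fun"));
        // prints: this a test testing fun really
    }
}
```

Note that the check is a plain substring search over everything kept so far, which is what makes "is" a duplicate of "this".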

Related

How many times the word is used on the html page

I have a method that should return an integer which is the number of uses of the searchWord in the text of an HTML document:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
if (bodyText.toLowerCase().contains(searchWord.toLowerCase())){
count++;
}
return count;
}
But my method always returns count=1, even if the word is used several times. I understand that the error should be obvious, but I’m stuck and I don’t see it.

You are currently only checking once whether the text contains the search word, so the count will always be either 0 or 1. To find the total count, keep calling String#indexOf(str, fromIndex) in a loop for as long as the word is found, using the second argument to set the index from which the next search starts.
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
for(int idx = -1; (idx = bodyText.indexOf(searchWord, idx + 1)) != -1; count++);
return count;
}
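One thing to watch: the original method compared lowercased strings, while the loop above is case sensitive. A rough standalone sketch of a case-insensitive count follows. The class and method names are hypothetical, it works on a plain String instead of the HTML document, and it counts non-overlapping matches (the loop above advances by one character, so it would also count overlapping ones).

```java
import java.util.Locale;

// Hypothetical helper; not part of the crawler class from the question.
final class WordCounter {

    // Counts non-overlapping, case-insensitive occurrences of searchWord in text.
    static int countOccurrences(String text, String searchWord) {
        if (text == null || searchWord == null || searchWord.isEmpty()) {
            return 0; // nothing sensible to count
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = searchWord.toLowerCase(Locale.ROOT);
        int count = 0;
        int idx = haystack.indexOf(needle);
        while (idx != -1) {
            count++;
            // continue searching right after the match that was just counted
            idx = haystack.indexOf(needle, idx + needle.length());
        }
        return count;
    }
}
```

The null and empty checks are an assumption about what the caller wants; the original method would throw on a null document instead.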

According to the Java docs String#contains:
Returns true if and only if this string contains the specified sequence of char values.
You're asking if the word you're looking for is contained in the document, which it is.
You could:
Split the text on words (splitting it by spaces) and then count how many times it appears
Iterate the String using String#indexOf starting on index 0 and then from last index you found until the end of the String.
Iterate the String using contains but starting from a certain index (doing this logic yourself).
I'd go for the 2nd approach as it seems like the easiest one.

These are only conditional statements; you aren't looping through the HTML text. Therefore, if it finds an instance of searchWord in bodyText, it increments count once and then exits the method with a value of 1. I suggest looping through every word in the HTML, adding it to an array, and counting it that way using something like this:
char[] bodyTextA = bodyText.toCharArray();
Or keep it in a String array and split it by a space, a new line, or whatever criteria you have. Example using spaces:
// puts Hello, I'm, your, and String into their own slots in the array
String str = "Hello I'm your String";
String[] split = str.split("\\s+");
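Building on the split idea, a small sketch of counting whole words could look like the following. The names are invented for illustration. Unlike a substring search, this only counts tokens that equal the search word, so "testing" does not count as "test". Punctuation attached to a word is not stripped here.

```java
// Hypothetical helper illustrating the split-and-count approach.
final class WholeWordCounter {

    // Counts whitespace-separated tokens that equal searchWord, ignoring case.
    static int countWholeWords(String text, String searchWord) {
        int count = 0;
        // trim first so leading whitespace does not produce an empty first token
        for (String token : text.trim().split("\\s+")) {
            if (token.equalsIgnoreCase(searchWord)) {
                count++;
            }
        }
        return count;
    }
}
```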

Your issue here is that the if statement checks whether the text contains the word and then increments your count variable. So even if the text contains the word multiple times, your logic is basically: if it contains it at all, increase count by one. You will have to rewrite your code to check for multiple occurrences of the word. There are many ways to go about this: you could loop through the entire body text, you could split the body text into an array of words and check that, or you could remove the search word from the text each time you find it and keep checking until the text no longer contains it.

You can use indexOf(String, int) with the index of the last found word:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
int index = -1; // start at -1 so that a match at position 0 is also counted
while ((index = bodyText.indexOf(searchWord, index + 1)) != -1) {
count++;
}
return count;
}

Split long lines and Indent and output as so

I have code to remove duplicate words from a string. Let's say I have:
This is serious serious work. I apply the code and get: This is serious work
This is the code:
return Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" "));
Now I want to add a new constraint: if the string/line is longer than 78 characters, break and indent it where it makes sense so the line does not run longer than 78 characters. Example:
This one is a very long line that runs off the right side because it is longer than 78 characters long
It should then be
This one is a very long line that runs off the right side because it is longer
than 78 characters long
I can't find a solution to this. It was brought to my attention that my question is a possible duplicate, but I can't find my answer there. I need to be able to indent.

You could create a StringBuilder off of the String and then insert a newline and tab at the last word break after 78 characters. You can find the last word break to insert the newline/tab by getting the substring of the first 78 characters, and then finding the index of the last space:
StringBuilder sb = new StringBuilder(Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" ")));
if(sb.length() > 78) {
int lastWordBreak = sb.substring(0, 78).lastIndexOf(" ");
sb.insert(lastWordBreak , "\n\t");
}
return sb.toString();
Output:
This one is a very long line that runs off the right side because it longer
than 78 characters
Also, your Stream does not do what you want it to. Yes, it removes duplicate words, but it removes all of them, not just the consecutive ones. So for the String:
This is a great sentence. It is a great example.
It would remove the duplicate is, great and a, and return
This is a great sentence. It example.
To only remove consecutive duplicate words you can look at the following solution:
Removing consecutive duplicates words out of text using Regex and displaying the new text
Alternatively, you could write your own by splitting the text into words and comparing each word to the one next to it to remove the consecutive duplicate words.
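A rough sketch of that split-and-compare idea, with invented names and assuming words are separated by spaces only:

```java
// Hypothetical helper; removes a word only when it repeats the word directly before it.
final class ConsecutiveDuplicates {

    static String removeConsecutiveDuplicates(String input) {
        StringBuilder result = new StringBuilder();
        String previous = null;
        // split on one or more spaces so repeated spaces do not yield empty words
        for (String word : input.trim().split(" +")) {
            if (word.equals(previous)) {
                continue; // same as the word just kept, so drop it
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(word);
            previous = word;
        }
        return result.toString();
    }
}
```

With this, "This is serious serious work" becomes "This is serious work", while the "great sentence" example above is left untouched because none of its repeated words are adjacent.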

Instead of using
Collectors.joining(" ")
it is possible to write a custom collector that adds new lines and indentation at proper places.
Let's introduce a LineWrapper class, which contains indent and limit fields:
public class LineWrapper {
private final int limit;
private final String indent;
The default constructor sets the fields to reasonable default values.
Note how the indent starts with a new line character.
public LineWrapper() {
limit = 78;
indent = "\n ";
}
A custom constructor allows the client to specify limit and indent:
public LineWrapper(int limit, String indent) {
if (limit <= 0) {
throw new IllegalArgumentException("limit");
}
if (indent == null || !indent.matches("\\n *")) {
throw new IllegalArgumentException("indent");
}
this.limit = limit;
this.indent = indent;
}
Following is a regex used to split the input around one or more spaces. This makes sure that the split will not produce empty Strings:
private static final String SPACES = " +";
The apply method splits the input and collects the words into lines of the specified maximum length, indents the lines and removes duplicate consecutive words. Note how duplicates are not removed using the Stream.distinct method, since it also removes duplicates that are not consecutive.
public String apply(String input) {
return Arrays.stream(input.split(SPACES)).collect(toWrappedString());
}
The toWrappedString method returns a collector that accumulates the words in a new ArrayList, and uses the following methods:
addIfDistinct: to add the words to the ArrayList
combine: to merge two array lists
wrap: to split and indent the lines
Collector<String, ArrayList<String>, String> toWrappedString() {
return Collector.of(ArrayList::new,
this::addIfDistinct,
this::combine,
this::wrap);
}
The addIfDistinct adds the word to the accumulator ArrayList if it is different than the previous word.
void addIfDistinct(ArrayList<String> accumulator, String word) {
if (!accumulator.isEmpty()) {
String lastWord = accumulator.get(accumulator.size() - 1);
if (!lastWord.equals(word)) {
accumulator.add(word);
}
} else {
accumulator.add(word);
}
}
The combine method adds all words from the second ArrayList to the first one. It also makes sure that the first word of the second ArrayList does not duplicate the last word of the first ArrayList.
ArrayList<String> combine(ArrayList<String> words,
ArrayList<String> moreWords) {
List<String> other = moreWords;
if (!words.isEmpty() && !other.isEmpty()) {
String lastWord = words.get(words.size() - 1);
if (lastWord.equals(other.get(0))) {
other = other.subList(1, other.size());
}
}
words.addAll(other);
return words;
}
Finally, the wrap method appends all words to a StringBuilder, inserting the indent when the line length limit is reached:
String wrap(ArrayList<String> words) {
StringBuilder result = new StringBuilder();
if (!words.isEmpty()) {
String firstWord = words.get(0);
result.append(firstWord);
int lineLength = firstWord.length();
for (String word : words.subList(1, words.size())) {
//add 1 to the word length,
//to account for the space character
int len = word.length() + 1;
if (lineLength + len <= limit) {
result.append(' ');
result.append(word);
lineLength += len;
} else {
result.append(indent);
result.append(word);
//subtract 1 from the indent length,
//because the new line does not count
lineLength = indent.length() - 1 + word.length();
}
}
}
return result.toString();
}

How to find duplicates inside a string?

I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (String split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}

No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check that the remaining string length is an exact multiple of that word plus the separating comma (i.e. (length - foundLength) % (foundLength + 1) == 0).
Loop through the remainder of the string, checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches; bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
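Putting the steps together, a minimal sketch of the whole check might look like this. The class and method names are invented, and it assumes a string without any comma counts as "one value".

```java
// Hypothetical helper combining the length check and the single linear scan.
final class SingleValueCheck {

    static boolean hasOneValue(String s) {
        int comma = s.indexOf(',');
        if (comma == -1) {
            return true; // assumption: a lone value trivially matches itself
        }
        int period = comma + 1; // length of the first value plus its comma
        // the rest of the string must be made of whole "comma + value" chunks
        if ((s.length() - comma) % period != 0) {
            return false;
        }
        // compare every character with the one exactly one chunk earlier;
        // this checks the values and the commas in the same pass
        for (int i = period; i < s.length(); i++) {
            if (s.charAt(i) != s.charAt(i - period)) {
                return false;
            }
        }
        return true;
    }
}
```

No substrings or arrays are allocated, which matters when the method runs over a very large number of lines.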

Calling split might be expensive, especially on 100 GB of data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
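public static boolean stringHasOneValue(String string) {
    String separator = ",";
    int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
    if (firstSeparator == -1) {
        return true; // no comma: a single value trivially matches itself
    }
    String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
    int lengthOfIncrement = firstValue.length() + 1; // the value plus one for the comma
    // the rest of the string must consist of whole "comma + value" chunks
    if ((string.length() - firstValue.length()) % lengthOfIncrement != 0) {
        return false;
    }
    for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
        String currentValue = string.substring(i, i + firstValue.length());
        // check the separator before the value as well as the value itself
        if (string.charAt(i - 1) != ',' || !firstValue.equals(currentValue)) {
            return false;
        }
    }
    return true;
}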
Complexity is O(n), assuming Java's implementation of substring is efficient. If it is not, you can write your own substring method that takes the required number of characters from the String.

Just for fun, a one-liner:
(@Tim's answer is more efficient)
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));

Ignoring upper/lowercase strings

My goal is to change any form of the word "java" in a sentence to "JAVA". I've got everything done, but my code won't handle mixed cases, for example JaVa, JAva, etc. I know I am supposed to use toUpperCase and toLowerCase or equalsIgnoreCase, but I am not sure how to use them properly. I am not allowed to use replace or replaceAll; the teacher wants the substring method.
Scanner input=new Scanner(System.in);
System.out.println("Enter a sentence with words including java");
String sentence=input.nextLine();
String find="java";
String replace="JAVA";
String result="";
int n;
do{
n=sentence.indexOf(find);
if(n!=-1){
result =sentence.substring(0,n);
result=result +replace;
result = result + sentence.substring(n+find.length());
sentence=result;
}
}while(n!=-1);
System.out.println(sentence);
}
}

You can't do that using String.indexOf because it is case sensitive.
The simple solution is to use a regex with a case insensitive pattern; e.g.
Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(str).replaceAll(repl);
That also has the benefit of avoiding the messy string-bashing you are currently using to do the replacement.
In your example, your input string is also valid as a regex ... because it doesn't include any regex meta-characters. If it did, then the simple workaround is to use Pattern.quote(str) which will treat the meta-characters as literal matches.
It is also worth noting that String.replaceAll(...) is a "convenience method" for doing a regex replace on a string, though you can't use it for your example because it does case sensitive matching.
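As a small sketch of the regex route with quoting on both sides (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing a case-insensitive literal replace.
final class CaseInsensitiveReplace {

    static String replaceIgnoreCase(String input, String target, String replacement) {
        // Pattern.quote makes regex meta-characters in target match literally
        Pattern pattern = Pattern.compile(Pattern.quote(target), Pattern.CASE_INSENSITIVE);
        // Matcher.quoteReplacement keeps '$' and '\' in the replacement from being interpreted
        return pattern.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

The replacement string needs its own quoting because replaceAll treats `$1` and similar as group references.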
For the record, here is a partial solution that does the job by string-bashing. @ben - this is presented for you to read and understand ... not to copy. It is deliberately uncommented to encourage you to read it carefully.
// WARNING ... UNTESTED CODE
String input = ...
String target = ...
String replacement = ...
String inputLc = input.toLowerCase();
String targetLc = target.toLowerCase();
String result = "";
int pos = 0;
int pos2;
while ((pos2 = inputLc.indexOf(targetLc, pos)) != -1) {
    if (pos2 - pos > 0) {
        result += input.substring(pos, pos2);
    }
    result += replacement;
    pos = pos2 + target.length();
}
if (pos < input.length()) {
    result += input.substring(pos);
}
It would probably be more efficient to use a StringBuilder instead of a String for result.

Are you allowed to use toUpperCase()? Try this one:
Scanner input=new Scanner(System.in);
System.out.println("Enter a sentence with words including java");
String sentence=input.nextLine();
String find="java";
String replace="JAVA";
String result="";
result = sentence.toLowerCase();
result = result.replace(find,replace);
System.out.println(result);
}
reply with the result :))
Update : Based on
I've got everything done but my code won't read in mixed cases for
example:JaVa, JAva,etc.
you can use your code
Scanner input=new Scanner(System.in);
System.out.println("Enter a sentence with words including java");
String sentence=input.nextLine();
String find="java";
String replace="JAVA";
String result="";
int n;
do{
// convert the sentence to lowercase first so that any mix of upper and lower case is matched
sentence = sentence.toLowerCase();
n=sentence.indexOf(find);
if(n!=-1){
result =sentence.substring(0,n);
result=result +replace;
result = result + sentence.substring(n+find.length());
sentence=result;
}
}while(n!=-1);
System.out.println(sentence);
}
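Update 2: I moved the toLowerCase conversion outside the loop. Inside the loop it lowercases the already replaced "JAVA" again on every pass, so the loop never terminates.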
public static void main(String[] args){
String sentence = "Hello my name is JAva im a jaVa Man with a jAvA java Ice cream";
String find="java";
String replace="JAVA";
String result="";
int n;
// convert the sentence to lowercase first so that any mix of upper and lower case is matched
sentence = sentence.toLowerCase();
System.out.println(sentence);
do{
n=sentence.indexOf(find);
if(n!=-1){
result =sentence.substring(0,n);
result=result +replace;
result = result + sentence.substring(n+find.length());
sentence=result;
}
}while(n!=-1);
System.out.println(sentence);
}
RESULT
hello my name is java im a java man with a java java ice cream
hello my name is JAVA im a JAVA man with a JAVA JAVA ice cream

A quick solution would be to remove your do/while loop entirely and just use a case-insensitive regex with String.replaceAll(), like:
sentence = sentence.replaceAll("(?i)java", "JAVA");
System.out.println(sentence);
Or, more general and according to your variable namings:
sentence = sentence.replaceAll("(?i)" + find, replace);
System.out.println(sentence);
EDIT:
Based on your comments, if you need to use the substring method, here is one way.
First, since String.indexOf does case-sensitive comparisons, you can write your own case-insensitive method, let's call it indexOfIgnoreCase(). This method would look something like:
// Find the index of the first occurrence of the String find within the String str, starting from start index
// Return -1 if no match is found
int indexOfIgnoreCase(String str, String find, int start) {
for(int i = start; i + find.length() <= str.length(); i++) { // stop before substring would run past the end of str
if(str.substring(i, i + find.length()).equalsIgnoreCase(find)) {
return i;
}
}
return -1;
}
Then, you can use this method in the following manner.
You find the index of the word you need, then you add the portion of the String before this word (up to the found index) to the result, then you add the replaced version of the word you found, then you add the rest of the String after the found word.
Finally, you update the starting search index by the length of the found word.
String find = "java";
String replace = "JAVA";
int index = 0;
while(index + find.length() <= sentence.length()) {
index = indexOfIgnoreCase(sentence, find, index); // use the custom indexOf method here
if(index == -1) {
break;
}
sentence = sentence.substring(0, index) + // copy the string up to the found word
replace + // replace the found word
sentence.substring(index + find.length()); // copy the remaining part of the string
index += find.length();
}
System.out.println(sentence);
You could use a StringBuilder to make this more efficient, since the + operator creates a new String on each concatenation.
Furthermore, you could combine the logic in the indexOfIgnoreCase and the rest of the code in a single method like:
String find = "java";
String replace = "JAVA";
StringBuilder sb = new StringBuilder();
int i = 0;
while(i + find.length() <= sentence.length()) {
// if found a match, add the replacement and update the index accordingly
if(sentence.substring(i, i + find.length()).equalsIgnoreCase(find)) {
sb.append(replace);
i += find.length();
}
// otherwise add the current character and update the index accordingly
else {
sb.append(sentence.charAt(i));
i++;
}
}
sb.append(sentence.substring(i)); // append the rest of the string
sentence = sb.toString();
System.out.println(sentence);
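If you want this as a reusable method, here is one possible sketch. The class and method names are invented. It swaps the substring plus equalsIgnoreCase check for String.regionMatches with ignoreCase set to true, which does the same comparison without creating a temporary String; if your teacher insists on substring, keep the check from the loop above.

```java
// Hypothetical helper wrapping the combined loop in a method.
final class SubstringReplace {

    static String replaceIgnoreCase(String sentence, String find, String replace) {
        if (find.isEmpty()) {
            return sentence; // an empty search string would otherwise match at every position
        }
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i + find.length() <= sentence.length()) {
            // case-insensitive comparison of sentence[i, i + find.length()) with find
            if (sentence.regionMatches(true, i, find, 0, find.length())) {
                sb.append(replace);
                i += find.length();
            } else {
                sb.append(sentence.charAt(i));
                i++;
            }
        }
        sb.append(sentence.substring(i)); // the tail is too short to contain another match
        return sb.toString();
    }
}
```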

Regular expression for validating an answer to a question

Hey everyone,
I'm having a minor difficulty setting up a regular expression that checks a sentence entered by a user in a textbox against one or more keywords. Essentially, the keywords have to appear in order, and any number of characters or spaces may come before, between, and after them (i.e. if the keywords are "crow" and "feet", crow must be somewhere in the sentence before feet, so this statement should be valid: "blah blah sccui crow dsj feet "). The extra characters and, to some extent, the spaces are completely optional (I would like the keywords to have at least one space as a buffer at the beginning and end); the main concern is whether the keywords were entered in their proper order.
So far, my regular expression works when the keywords are inside a sentence but fails when only the answer itself is entered.
I have the regular expression used in the function below:
// Comparing an answer with the right solution
protected boolean checkAnswer(String a, String s) {
boolean result = false;
//Used to determine if the solution is more than one word
String temp[] = s.split(" ");
//If only one word or letter
if(temp.length == 1)
{
if (s.length() == 1) {
// check multiple choice questions
if (a.equalsIgnoreCase(s)) result = true;
else result = false;
}
else {
// check short answer questions
if ((a.toLowerCase()).matches(".*?\\s*?" + s.toLowerCase() + "\\s*?.*?")) result = true;
else result = false;
}
}
else
{
int count = temp.length;
//Regular expression used to
String regex=".*?\\s*?";
for(int i = 0; i<count;i++)
regex+=temp[i].toLowerCase()+"\\s*?.*?";
//regex+=".*?";
System.out.println(regex);
if ((a.toLowerCase()).matches(regex)) result = true;
else result = false;
}
return result;
}
Any help would greatly be appreciated.
Thanks.

I would go about this in a different way. Instead of trying to use one regular expression, why not use something similar to:
String answer = ... // get the user's answer
int crow = answer.indexOf("crow");
if( crow != -1 && crow < answer.indexOf("feet") ) { // indexOf returns -1 when "crow" is missing
// "correct" answer
}
You'll still need to tokenize the words in the correct answer, then check in a loop to see if the index of each word is less than the index of the following word.
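A rough sketch of that loop, with invented names and assuming the solution is a space-separated list of keywords:

```java
import java.util.Locale;

// Hypothetical helper: checks that every keyword of the solution appears in the answer, in order.
final class KeywordOrderCheck {

    static boolean containsInOrder(String answer, String solution) {
        String haystack = answer.toLowerCase(Locale.ROOT);
        int from = 0;
        for (String keyword : solution.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            // search only after the end of the previous keyword
            int idx = haystack.indexOf(keyword, from);
            if (idx == -1) {
                return false; // keyword missing, or it only occurs before the previous one
            }
            from = idx + keyword.length();
        }
        return true;
    }
}
```

Searching from the end of the previous match also covers keywords that occur more than once. Keep in mind this is a substring search, so "scarecrow" would count as "crow".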

I don't think you need to split the result on " ".
If I understand correctly, you should be able to do something like
String regex="^.*crow.*\\s+.*feet.*";
The problem with the above is that it will match "feetcrow feetcrow".
Maybe something like
String regex="^.*\\s+crow.*\\s+feet\\s+.*";
That will enforce that the word is there as opposed to just in a random block of characters.

Depending on the complexity Bill's answer might be the fastest solution. If you'd prefer a regular expression, I wouldn't look for any spaces, but word boundaries instead. That way you won't have to handle commas, dots, etc. as well:
String regex = "\\bcrow(?:\\b.*\\b)?feet\\b";
This should match "crow bla feet" as well as "crowfeet" and "crow, feet".
To match multiple words in a specific order, you could just join them together using '(?:\b.*\b)?', without requiring any additional sorting or checking.
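A sketch of building such a pattern from a list of keywords (the class and method names are hypothetical). Note that it uses Matcher.find() and not String.matches(), because matches() requires the whole input to match the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper that joins keywords with the (?:\b.*\b)? glue described above.
final class KeywordRegex {

    static Pattern inOrder(String... keywords) {
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < keywords.length; i++) {
            if (i > 0) {
                regex.append("(?:\\b.*\\b)?");
            }
            // quote each keyword so regex meta-characters are matched literally
            regex.append(Pattern.quote(keywords[i]));
        }
        regex.append("\\b");
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean matches(String answer, String... keywords) {
        // find() looks for the pattern anywhere in the answer
        return inOrder(keywords).matcher(answer).find();
    }
}
```

For example, `KeywordRegex.matches("crow, feet", "crow", "feet")` is true, while "feet crow" and "scarecrow feet" are rejected, the latter because of the leading word boundary.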

Following Bill's answer, I'd try this:
String input = ... // get user input
String key1 = "crow";
String key2 = "feet";
String[] tokens = input.split(" ");
List<String> list = Arrays.asList(tokens);
int first = list.indexOf(key1);
// key1 must be present and appear before key2
return first != -1 && first < list.indexOf(key2);
