Regex in java for finding duplicate consecutive words

I saw this as an answer for finding repeated words in a string. But when I use it, it treats "This" and "is" as the same word and deletes the "is".
Regex
"\\b(\\w+)\\b\\s+\\1"
Any idea why this is happening?
Here is the code that I am using for duplicate removal
public static String RemoveDuplicateWords(String input)
{
String originalText = input;
String output = "";
Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
//Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
if (!m.find())
output = "No duplicates found, no changes made to data";
else
{
while (m.find())
{
if (output == "")
output = input.replaceFirst(m.group(), m.group(1));
else
output = output.replaceAll(m.group(), m.group(1));
}
input = output;
m = p.matcher(input);
while (m.find())
{
output = "";
if (output == "")
output = input.replaceAll(m.group(), m.group(1));
else
output = output.replaceAll(m.group(), m.group(1));
}
}
return output;
}

Try this one:
String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
String input = "your string";
Matcher m = r.matcher(input);
while (m.find()) {
input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);
The Java regular expressions are explained very well in the API documentation of the Pattern class. After adding some spaces to indicate the different parts of the regular expression:
"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"
(?i)      turn on case-insensitive matching
\b        match a word boundary
[a-z]+    match a word of one or more letters;
          the parentheses capture the word as group 1
\b        match a word boundary
(?:       start a non-capturing group
\s+       match one or more whitespace characters
\1        back-reference to the first (captured) group,
          so the same word must appear again here
\b        match a word boundary
)+        end the non-capturing group and allow it
          to occur one or more times
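The pattern explained above can also be applied in a single call, letting the regex engine do the replacement instead of feeding the matched text back into replaceAll as a pattern. This is only a minimal sketch; the class and method names (DuplicateWords, collapse) are made up for illustration:

```java
import java.util.regex.Pattern;

class DuplicateWords {
    // Same pattern as explained above; (?i) already makes it case-insensitive,
    // so no extra flag is passed to compile().
    private static final Pattern REPEATED =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Replaces each run of a repeated word with its first occurrence ($1).
    static String collapse(String input) {
        return REPEATED.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This this is is a test test test")); // This is a test
    }
}
```

Because the replacement is the back-reference $1, the matched text is never re-interpreted as a regular expression.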

You should have used \b(\w+)\b\s+\b\1\b. Without the trailing \b, the back-reference \1 can match the beginning of a longer word, so "is island" is treated as a duplicate.
Hope this is what you want.
Update 1
The output that you get is the final string after removing duplicates:
import java.util.regex.*;
public class MyDup {
public static void main (String args[]) {
String input="This This is text text another another";
String originalText = input;
String output = "";
Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
System.out.println(m);
if (!m.find())
output = "No duplicates found, no changes made to data";
else
{
while (m.find())
{
if (output == "") {
output = input.replaceFirst(m.group(), m.group(1));
} else {
output = output.replaceAll(m.group(), m.group(1));
}
}
input = output;
m = p.matcher(input);
while (m.find())
{
output = "";
if (output == "") {
output = input.replaceAll(m.group(), m.group(1));
} else {
output = output.replaceAll(m.group(), m.group(1));
}
}
}
System.out.println("After removing duplicate the final string is " + output);
}
}
Run this code and look at the output; it should answer your question.
Note
In the output you are replacing each duplicate with a single word, aren't you?
When I put System.out.println(m.group() + " : " + m.group(1)); in the first if branch, I get text text : text, i.e. duplicates are being replaced by a single word.
else
{
while (m.find())
{
if (output == "") {
System.out.println(m.group() + " : " + m.group(1));
output = input.replaceFirst(m.group(), m.group(1));
} else {
Hope you got now what is going on... :)
Good Luck!!! Cheers!!!

The pattern below matches a word followed by any number of repetitions of itself.
Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
For example, "This is is my my my pal pal pal pal pal pal pal pal"
will output "This is my pal".
Also, a single "while (m.find())" loop is enough with this pattern.

\b(\w+)(\b\W+\1\b)*
Explanation:
\b : Any word boundary
(\w+) : One or more word characters (letter, digit, underscore), captured as group 1
Once a word is captured, the rest of the pattern looks for its repetitions.
( : Grouping starts
\b : Any word boundary
\W+ : One or more non-word characters
\1 : The same word as captured in group 1
\b : Word boundary, so the repetition is not matched when it is only the beginning of a longer word
)* : Grouping ends; the group may repeat zero or more times

I believe this is the regular expression you should be using to detect 2 consecutive words separated by any number of non-word characters:
Pattern p = Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

If Unicode is important, then you should use this:
Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
Pattern.MULTILINE + Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CHARACTER_CLASS)
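To see what the extra flag changes, here is a rough sketch that runs the same pattern with and without UNICODE_CHARACTER_CLASS on three Cyrillic words, the first two identical ("день день недели"). The class and method names are invented for the example, and the words are written as Unicode escapes only to keep the source ASCII:

```java
import java.util.regex.Pattern;

class UnicodeDuplicates {
    // Collapses runs of a repeated word using the pattern above with the given flags.
    static String collapse(String input, int flags) {
        return Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", flags)
                .matcher(input)
                .replaceAll("$1");
    }

    public static void main(String[] args) {
        // Two different Cyrillic words; the first one is repeated in the text.
        String word = "\u0434\u0435\u043d\u044c";
        String other = "\u043d\u0435\u0434\u0435\u043b\u0438";
        String text = word + " " + word + " " + other;

        int asciiFlags = Pattern.CASE_INSENSITIVE;
        int unicodeFlags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS;

        // Without the flag \w is ASCII-only, so nothing matches and the text is unchanged.
        System.out.println(collapse(text, asciiFlags).equals(text));                 // true
        // With the flag \w matches Unicode letters and the duplicate is collapsed.
        System.out.println(collapse(text, unicodeFlags).equals(word + " " + other)); // true
    }
}
```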

Also try this regex, which finds only repeated words:
(?i)\\b(\\w+)(\\b\\W+\\b\\1\\b){1,}

Related

Search substring in a string using regex

I'm trying to search a string for a set of words contained in an ArrayList (terms_1pers). Since the precondition is that there should be no letters directly before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, it writes to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}

To find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
//since the term you are adding is the second matchable portion, you're looking for group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters, upper and lower case, from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also used ^ and $ to anchor the match to the start and the end of the text, so the whole text must consist of these three parts. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
//since the term you are adding is the second matchable portion, you're looking for group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to generate the upper- and lower-case options for each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines, the first line is the regex string that's being compiled in the Pattern. "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$" This adjusted regex allows for letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
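As an alternative to building the alternation by hand, the same effect can usually be had with the Pattern.CASE_INSENSITIVE flag. The sketch below assumes the term should be taken literally, so it is wrapped in Pattern.quote (unlike the code above, where the final "." matches any character); the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

class CaseInsensitiveTerm {
    // True if text consists of the term (in any letter case),
    // optionally surrounded by non-letter characters.
    static boolean matchesTerm(String term, String text) {
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("This iS the teRm.", "123This is the term.")); // true
        System.out.println(matchesTerm("term", "A123Term5"));                         // false
    }
}
```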

There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like :term.. You need to use .find() to find partial matches
If you pass a literal string to a regex, you need to Pattern.quote it, or if it contains special chars, it will not get matched
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any letter just use \p{Alpha} or - if you plan to match any Unicode letter - \p{L}.
It is more logical to make the var variable a Boolean.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
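For experimenting with the lookaround idea outside the CSV loop, a self-contained sketch could look like this. The helper name containsTerm and the sample strings are assumptions made for the example, not part of the original code:

```java
import java.util.regex.Pattern;

class TermSearch {
    // True if term occurs in text with no Unicode letter directly before or after it.
    static boolean containsTerm(String text, String term) {
        Pattern p = Pattern.compile(
                "(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find(); // find() looks for a partial match
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("me, myself and I", "I")); // true
        System.out.println(containsTerm("Icon", "I"));             // false
        System.out.println(containsTerm("price: 5$ total", "5$")); // true, $ is quoted
    }
}
```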

You did not consider the case where the text before and after the term may contain letters,
so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
{
var="true";
break; //exit the loop
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}

Java Regex expression not working

I have a problem with a regex that is not working. I don't know what I am doing wrong. My code:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("\\btimetable:(.*);");
//also tried "timetable:(.*);" and "(\\btimetable:)(.*)(;)"
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("S:" + m.start() + ", E:" + m.end());
System.out.println("x: "+ test.substring(m.start(), m.end()));
}
Expected result:
(1) "timetable:xxxxxtimetable:"
(2) "timetable: fullihhghtO"
Thanks for any help.

A non-capturing group could be handy in our case:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("(?:\\btimetable:(.*?);)+"); // <-- here
Matcher m = p.matcher(test);
int i = 1;
while (m.find()) {
System.out.println(i + ") "+ m.group(1));
i++;
}
OUTPUT
1) xxxxxtimetable:
2) fullihhghtO
Regex explained:
(?:\\btimetable:(.*?);)+ wraps the whole expression in a non-capturing group (?:...), so the only capturing group is (.*?), which holds what we want to capture (everything between \btimetable: and the next ;). Pay special attention to the lazy quantifier .*?, which consumes the minimum possible number of characters before the ;. Without it, the regex falls back to the default "greedy" mode and consumes all the characters up to the last ; in the string!
Now, all that is relevant if you wanted to catch only the unique part, but if you wanted to catch the whole thing:
1) timetable:xxxxxtimetable:;
2) timetable: fullihhghtO;
It can be done easily by modifying the line with the regex to:
Pattern p = Pattern.compile("\\b(timetable:.*?;)+");
which is even simpler: only one capturing group (see that we still have to use the non-greedy mode!).
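The difference between the greedy and the lazy form is easy to check side by side. The following is only an illustrative sketch (the class name GreedyVsLazy and the helper captures are made up) that collects group 1 of every match for both variants:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns group 1 of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
        // Greedy: one match that runs up to the last ';'
        System.out.println(captures("\\btimetable:(.*);", test));
        // Lazy: one match per ';'
        System.out.println(captures("\\btimetable:(.*?);", test));
    }
}
```

The greedy call yields a single capture spanning both entries, while the lazy call yields two captures, one per entry (the second keeps its leading space).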

You don't need to use regex, a simple split would do it :
public static void main(String[] args) throws IOException {
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
String[] array = test.split(";");
String str1 = array[0].trim();
String str2 = array[1].trim();
System.out.println(str1 + "\n" + str2); //timetable:xxxxxtimetable:
//timetable: fullihhghtO
}

Regular expression to remove everything but words. java

This code doesn't seem to do the right job. It removes the spaces between the words!
input = scan.nextLine().replaceAll("[^A-Za-z0-9]", "");
I want to remove all extra spaces and all numbers or abbreviations from a string, keeping only words and this character: '.
For Example:
input: 34 4fF$##D one 233 r # o'clock 329riewio23
returns: one o'clock

public static String filter(String input) {
return input.replaceAll("[^A-Za-z0-9' ]", "").replaceAll(" +", " ");
}
The first replaceAll removes every character except letters, digits, the single quote, and spaces. The second collapses each run of one or more spaces into a single space. Note that digits are kept, so tokens such as 34 or 329riewio23 survive in the sample input.
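If whole tokens should be dropped rather than individual characters, one possible approach is to split on whitespace and keep only the tokens that look like words. This sketch assumes that a "word" means two or more letters or apostrophes (which is what drops the lone r in the sample); the names WordFilter and keepWords are invented here:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class WordFilter {
    // Keeps only whitespace-separated tokens made entirely of letters
    // and apostrophes, at least two characters long.
    static String keepWords(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(token -> token.matches("[A-Za-z']{2,}"))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(keepWords("34 4fF$##D one 233 r # o'clock 329riewio23")); // one o'clock
    }
}
```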

Your solution doesn't work because you don't remove numbers and you also remove the ' character.
Check out this solution:
Pattern pattern = Pattern.compile("(?:^| )([A-Za-z']{2,})(?= |$)");
String input = scan.nextLine();
Matcher matcher = pattern.matcher(input);
StringBuilder result = new StringBuilder();
while (matcher.find()) {
result.append(matcher.group(1)).append(' ');
}
System.out.println(result.toString().trim());
It looks for the beginning of the string or a space ((?:^| )) and then captures all the following characters ([A-Za-z']). However, it only takes the word if there are 2 or more characters ({2,}), and the word has to be followed by a space or the end of the string ((?= |$)).

If you want to just extract that time information use this regex group match:
input = scan.nextLine();
Pattern p = Pattern.compile("([a-zA-Z]{3,})\\s.*?(o'clock)");
Matcher m = p.matcher(input);
if (m.find()) {
input = m.group(1) + " " + m.group(2);
}
The regex is quite naive though, and will only work if the input is always of a similar format.

Extract every complete word that contains a certain substring

I'm trying to write a function that extracts each word from a sentence that contains a certain substring e.g. Looking for 'Po' in 'Porky Pork Chop' will return Porky Pork.
I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?
private static String foo()
{
String searchTerm = "Pizza";
String text = "Cheese Pizza";
String sPattern = "(?i)\b("+searchTerm+"(.+?)?)\b";
Pattern pattern = Pattern.compile ( sPattern );
Matcher matcher = pattern.matcher ( text );
if(matcher.find ())
{
String result = "-";
for(int i=0;i < matcher.groupCount ();i++)
{
result+= matcher.group ( i ) + " ";
}
return result.trim ();
}else
{
System.out.println("No Luck");
}
}

In Java, to pass the \b word boundary to the regex engine you need to write it as \\b in a string literal. A plain \b in a String literal represents the backspace character.
Judging by your example, you want to return all words that contain your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()), since the group loop iterates over the groups of a single match, not over all matches.
If your search term can contain special characters, you should probably use Pattern.quote(searchTerm).
In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume that you also want to find words that are identical to the searched substring. Although your regex will work fine for that, you can change the last part, (.+?)?), to \\w* and also add \\w* at the start if the substring should also be matched in the middle of a word (not only at the start).
So your code can look like
private static String foo() {
String searchTerm = "Pizza";
String text = "Cheese Pizza, Other Pizzas";
String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");
Pattern pattern = Pattern.compile(sPattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
result.append(matcher.group()).append(' ');
}
return result.toString().trim();
}

While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.
public List<String> doIt(final String inputString, final String term) {
final List<String> output = new ArrayList<String>();
final String[] parts = inputString.split("\\s+");
for(final String part : parts) {
if(part.indexOf(term) >= 0) { // >= 0 so that a match at the start of the word counts too
output.add(part);
}
}
return output;
}
Of course it is worth noting that doing this effectively makes two passes over your input String: the first finds the whitespace to split on, and the second looks through each split word for your substring.
If one pass is necessary though, the regex path is better.

I find nicholas.hauschild's answer to be the best.
However if you really wanted to use regex, you could do it as such:
String searchTerm = "Pizza";
String text = "Cheese Pizza";
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchTerm)
+ "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Pizza

The pattern should have been
String sPattern = "(?i)\\b("+searchTerm+"(?:.+?)?)\\b";
You want to capture the whole (pizza) string. The ?: ensures you don't capture a part of the string twice.

Try this pattern:
String searchTerm = "Po";
String text = "Porky Pork Chop oPod zzz llPo";
Pattern p = Pattern.compile("\\p{Alpha}+" + searchTerm + "|\\p{Alpha}+" + searchTerm + "\\p{Alpha}+|" + searchTerm + "\\p{Alpha}+");
Matcher m = p.matcher(text);
while(m.find()) {
System.out.println(">> " + m.group());
}

OK, here is a pattern in raw form (not Java string form; you must double the backslashes yourself):
(?i)\b[a-z]*po[a-z]*\b
And that's all.
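Written as a Java string literal, the raw pattern above needs its backslashes doubled. A possible sketch follows; the class and method names are hypothetical, and the search term is passed through Pattern.quote on the assumption that it should be matched literally:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsContaining {
    // Returns every word in text that contains part, ignoring case.
    static List<String> find(String text, String part) {
        Pattern p = Pattern.compile(
                "(?i)\\b[a-z]*" + Pattern.quote(part) + "[a-z]*\\b");
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("Porky Pork Chop", "po")); // [Porky, Pork]
    }
}
```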

Java and regular expression, substring

I'm totally lost when it comes to regular expressions.
I get generated strings like:
Your number is (123,456,789)
How can I filter out 123,456,789?

You can use this regex for extracting the number including the commas
\(([\d,]*)\)
The first captured group will have your match. Code will look like this
String subjectString = "Your number is (123,456,789)";
Pattern regex = Pattern.compile("\\(([\\d,]*)\\)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
String resultString = regexMatcher.group(1);
System.out.println(resultString);
}
Explanation of the regex
"\\(" + // Match the character “(” literally
"(" + // Match the regular expression below and capture its match into backreference number 1
"[\\d,]" + // Match a single character present in the list below
// A single digit 0..9
// The character “,”
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
")" +
"\\)" // Match the character “)” literally
This will get you started http://www.regular-expressions.info/reference.html

String str="Your number is (123,456,789)";
str = str.replaceAll(".*\\((.*)\\).*","$1");
or you can make the replacement a bit faster by doing:
str = str.replaceAll(".*\\(([\\d,]*)\\).*","$1");

try
"\\(([^)]+)\\)"
or
int start = text.indexOf('(')+1;
int end = text.indexOf(')', start);
String num = text.substring(start, end);
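Note that the indexOf variant throws a StringIndexOutOfBoundsException when the text contains no closing parenthesis, because indexOf returns -1. A guarded version might look like the sketch below; the class name ParenExtractor and the use of Optional are choices made for this example only:

```java
import java.util.Optional;

class ParenExtractor {
    // Returns the text between the first '(' and the next ')', if both exist.
    static Optional<String> between(String text) {
        int open = text.indexOf('(');
        if (open < 0) {
            return Optional.empty();
        }
        int close = text.indexOf(')', open + 1);
        if (close < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(open + 1, close));
    }

    public static void main(String[] args) {
        System.out.println(between("Your number is (123,456,789)").orElse("none")); // 123,456,789
        System.out.println(between("no parentheses here").orElse("none"));          // none
    }
}
```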

private void showHowToUseRegex()
{
final Pattern MY_PATTERN = Pattern.compile("Your number is \\((\\d+),(\\d+),(\\d+)\\)");
final Matcher m = MY_PATTERN.matcher("Your number is (123,456,789)");
if (m.matches()) {
Log.d("xxx", "0:" + m.group(0));
Log.d("xxx", "1:" + m.group(1));
Log.d("xxx", "2:" + m.group(2));
Log.d("xxx", "3:" + m.group(3));
}
}
You'll see the first group is the whole string, and the next 3 groups are your numbers.

String str = "Your number is (123,456,789)";
str = new String(str.substring(16,str.length()-1));
