java replaceAll and '+' match - java

I have some code set up to remove extra spaces between the words of a title:
String formattedString = unformattedString.replaceAll(" +"," ");
My understanding of this type of regex is that it will match as many spaces as possible before stopping. However, my strings that are coming out are not changing in any way. Is it possible that it's only matching one space at a time, and then replacing that with a space? Is there something to the replaceAll method, since it's doing multiple matches, that would alter the way this type of match would work here?

A better approach might be to use "\\s+" to match runs of all possible whitespace characters.
EDIT
Another approach might be to extract all matches for "\\b([A-Za-z0-9]+)\\b" and then join them using a space which would allow you to remove everything except for valid words and numbers.
If you need to preserve punctuation, use "(\\S+)" which will capture all runs of non-whitespace characters.
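The extract-and-join idea above can be sketched roughly as follows. The class name `TokenJoinSketch`, the helper `joinMatches`, and the sample title are made up for illustration; only the two regexes come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenJoinSketch {
    // Hypothetical helper: collects every match of the regex and joins the matches with one space.
    static String joinMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String title = "  The   quick\tbrown,  fox! ";
        // Words and numbers only: punctuation is dropped.
        System.out.println(joinMatches("\\b[A-Za-z0-9]+\\b", title)); // The quick brown fox
        // Runs of non-whitespace: punctuation is preserved.
        System.out.println(joinMatches("\\S+", title));               // The quick brown, fox!
    }
}
```

A side effect of joining matches is that leading and trailing whitespace disappears as well, which a plain replaceAll would leave behind as a single space.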

Are you sure your string contains spaces and not tabs? The following is a bit more "aggressive" on whitespace:
String formattedString = unformattedString.replaceAll("\\s+"," ");
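As a quick check of the two patterns, here is a minimal sketch with made-up sample strings (the class and method names are hypothetical). It suggests that " +" does collapse a whole run of spaces in one match, and that a string separated by tabs is what would come back unchanged.

```java
class CollapseWhitespaceSketch {
    // Collapses runs of the space character only.
    static String collapseSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    // Collapses runs of any whitespace (spaces, tabs, line breaks).
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String spaced = "Java    replaceAll   title";
        String tabbed = "Java\t\treplaceAll \t title";

        System.out.println(collapseSpaces(spaced));           // Java replaceAll title
        System.out.println(collapseSpaces(tabbed).equals(tabbed)); // true: only single spaces, nothing to collapse
        System.out.println(collapseWhitespace(tabbed));       // Java replaceAll title
    }
}
```

Since strings are immutable, the result also has to be assigned (as the question's code does); calling replaceAll without using its return value leaves the original untouched.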

All of the responses should work.
Both:
String formattedString = unformattedString.replaceAll(" +"," ");
or
String formattedString = unformattedString.replaceAll("\\s+"," ");
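Maybe your unformattedString spans multiple lines. In that case you can instantiate a Pattern object (note that Pattern.MULTILINE only changes how ^ and $ match, so \\s+ already matches line breaks without it):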
String unformattedString = " Hello \n\r\n\r\n\r World";
Pattern manySpacesPattern = Pattern.compile("\\s+",Pattern.MULTILINE);
Matcher formatMatcher = manySpacesPattern.matcher(unformattedString);
String formattedString = formatMatcher.replaceAll(" ");
System.out.println(unformattedString.replaceAll("\\s+", " "));
Or maybe unformattedString contains special characters; in that case you can experiment with the Pattern flags in the compile method.
Examples:
Pattern.compile("\\s+",Pattern.MULTILINE|Pattern.UNIX_LINES);
or
Pattern.compile("\\s+",Pattern.MULTILINE|Pattern.UNICODE_CASE);
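One guess at what "special characters" could mean here is a non-breaking space (U+00A0), which the default \s does not match. The sketch below assumes that case and uses Pattern.UNICODE_CHARACTER_CLASS, a flag that is not mentioned in the answer; the class and method names are hypothetical. It also compares the MULTILINE variant against a plain replaceAll.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceSketch {
    // Assumption: the troublesome characters are Unicode whitespace such as U+00A0.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern MULTILINE_WS =
            Pattern.compile("\\s+", Pattern.MULTILINE);

    // Default \s: space, tab, line feed, vertical tab, form feed, carriage return.
    static String collapseAscii(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space property.
    static String collapseUnicode(String s) {
        return UNICODE_WS.matcher(s).replaceAll(" ");
    }

    // MULTILINE only affects ^ and $, so this behaves like collapseAscii.
    static String collapseMultiline(String s) {
        return MULTILINE_WS.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String nbsp = "Hello\u00A0\u00A0World";
        System.out.println(collapseAscii(nbsp).equals(nbsp)); // true: nothing matched
        System.out.println(collapseUnicode(nbsp));            // Hello World

        String lines = " Hello \n\r\n\r\n\r World";
        System.out.println(collapseMultiline(lines).equals(collapseAscii(lines))); // true
    }
}
```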

Related

regex for tab doesnt work | java

I'm trying to do regex for splitting string when tab is spotted.
I used this :
String line = scan.nextLine();
String Splitted[] = line.split("\t");
but it doesn't work so currently I'm using (which is working for me) :
String line = scan.nextLine();
String Splitted[] = line.split("\\s\\s\\s\\s");
Do you guys have idea why I can't use the "\t" regex?
Yes, \t is a valid regex. In a Java string literal, "\t" is the tab character itself, while "\\t" passes the two-character regex escape \t to the regex engine; both end up matching a tab, so switching between them will not change the result. Since your workaround with four whitespace characters works, the input most likely contains runs of spaces rather than a real tab. Because you are processing user input, you never know what this "tab" really consists of (it could be a tab character or 4 spaces), so maybe you should just split at (\\t|\\s{2,}). Beware, this is a Java string literal, hence the double backslashes.
EDIT: In the answer above I assumed you don't want to split at single whitespace characters too, is that right? If you do want to split at single whitespace characters, you could just use \\s+ instead.
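A small sketch of the split variants discussed above, using made-up input lines (the class name and the helper `splitOnTabsOrWideGaps` are hypothetical). It treats a tab or a run of two or more whitespace characters as a separator, so single spaces inside a field survive.

```java
import java.util.Arrays;

class TabSplitSketch {
    // Splits at a tab, or at any run of two or more whitespace characters.
    static String[] splitOnTabsOrWideGaps(String line) {
        return line.split("\\t|\\s{2,}");
    }

    public static void main(String[] args) {
        String tabbed = "id\tname\tcity";
        System.out.println(Arrays.toString(tabbed.split("\t")));   // [id, name, city]
        System.out.println(Arrays.toString(tabbed.split("\\t")));  // [id, name, city]

        // No real tab in the input: splitting on a tab finds nothing to split at.
        String spaced = "id    name    city";
        System.out.println(spaced.split("\t").length);             // 1

        // Mixed separators; the single space in "New York" is kept.
        System.out.println(Arrays.toString(
                splitOnTabsOrWideGaps("New York    NY\t10001")));  // [New York, NY, 10001]
    }
}
```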

Java regex pattern replacing space followed by special character

I have below 2 strings
1. Economy / Coach
2. First Class
I want regex pattern in such a way,
Conditions :
If the space is followed by a special character (/), I need to remove the space. (Example: Economy / Coach to Economy/Coach)
If the space is not followed by a special character, the string should stay as it is. (Example: First Class to First class)
Expected output:
1. Economy/Coach
2. First class
Can anybody please help me to write this regex pattern?
Thanks in advance
The best way is to use the String#replaceAll method, but you have to store the result explicitly in a variable because strings are immutable. The replaceAll method returns a new instance and does not affect the original string.
Example :
String str = "Economy / Coach";
str = str.replaceAll("\\s+/\\s+", "/");
Note that the pattern "\\s+" matches each run of whitespace (tabs, spaces, etc.).
Simply replace occurrences of " / " with "/":
String replaced = string.replace(" / ", "/");
Note that this doesn't use regex (directly; it does internally, but that's just an implementation detail).
First check whether "/" is in the string or not.
If it is, then remove the whitespace.
if(testString.contains("/")){
str1 = testString.replaceAll("\\s+","");
}
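To compare the approaches above, here is a rough sketch with hypothetical names (`SlashSpacingSketch`, `tightenSlash`, `tightenSlashLoose`, `stripAllIfSlash`). The looser pattern "\\s*/\\s*" is an assumption added for inputs with whitespace on only one side of the slash; it is not from the answers.

```java
class SlashSpacingSketch {
    // Pattern from the answer: whitespace is required on both sides of the slash.
    static String tightenSlash(String s) {
        return s.replaceAll("\\s+/\\s+", "/");
    }

    // Assumed variant: whitespace on either side is optional.
    static String tightenSlashLoose(String s) {
        return s.replaceAll("\\s*/\\s*", "/");
    }

    // The contains-check approach: removes every whitespace character when a slash is present.
    static String stripAllIfSlash(String s) {
        return s.contains("/") ? s.replaceAll("\\s+", "") : s;
    }

    public static void main(String[] args) {
        System.out.println(tightenSlash("Economy / Coach"));              // Economy/Coach
        System.out.println(tightenSlash("First Class"));                  // First Class
        System.out.println(tightenSlashLoose("Economy /Coach"));          // Economy/Coach
        // Spaces between ordinary words are lost with the contains-check approach.
        System.out.println(stripAllIfSlash("Premium Economy / Coach"));   // PremiumEconomy/Coach
        System.out.println(tightenSlash("Premium Economy / Coach"));      // Premium Economy/Coach
    }
}
```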

Sentence split with<sup></sup>

I have the following sentence:
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.";
How do I retrieve all of the words in the sentence, expecting the following?
And
God
said
Let
there
be
light
and
there
was
light
First, get rid of any leading or trailing space:
.trim()
Then get rid of HTML entities (&...;):
.replaceAll("&.*?;", "")
& and ; are literal chars in Regex, and .*? is the non-greedy version of "any character, any number of times".
Next get rid of tags and their contents:
.replaceAll("<(.*?)>.*?</\\1>", "")
< and > will be taken literally again, .*? is explained above, (...) defines a capturing group, and \\1 references that group.
And finally, split on any sequence of non-letters:
.split("[^a-zA-Z]+")
[a-zA-Z] means all characters from a to z and A to Z, ^ inverts the match, and + means "once or more".
So everything together would be:
String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");
Note that this doesn't handle self-closing tags like <img src="a.png" />.
Also note that if you need full HTML parsing, you should think about letting a real engine parse it, as parsing HTML with Regex is a bad idea.
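Put together as a runnable sketch, the steps above might look like this. The class name `SentenceWordsSketch` and the method `words` are hypothetical; the curly quotes of the sample sentence are written as \u201C and \u201D escapes so the source file encoding does not matter.

```java
import java.util.Arrays;

class SentenceWordsSketch {
    // trim, drop entities, drop paired tags with their contents, then split on non-letters.
    static String[] words(String html) {
        return html.trim()
                .replaceAll("&.*?;", "")
                .replaceAll("<(.*?)>.*?</\\1>", "")
                .split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>\u201CLet there be light,\u201D and there was light.";
        // Prints: [And, God, said, Let, there, be, light, and, there, was, light]
        System.out.println(Arrays.toString(words(str)));
    }
}
```

The trim() call only removes leading whitespace here; if the text began with some other non-letter (a quote, for example), split would return an empty first element that would need to be skipped.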
You can use String.replaceAll(regex, replacement) with the regex [^A-Za-z]+ to keep only letters. On its own that would also keep the letters of the sup tag and the c inside it, which is why the first statement removes the tags and everything between them.
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.".replaceAll("<sup>[^<]</sup>", "");
String newstr = str.replaceAll("[^A-Za-z]+", " ");

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failed one: Not able to remove #news4buffalo:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Success one:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish
From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");
You can try in this way as well.
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
System.out.println(s.replaceAll("#[^\\s]*\\s+", ""));
// #[^\\s]* matches from # up to the next whitespace; \\s+ removes the whitespace that follows as well
The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";
You don't need to escape # so don't add \ before it like "\\#" (it confuses people).
Don't use a Matcher to check whether the string contains a part that should be replaced and then call replaceAll, because the string will be scanned a second time. Just call replaceAll from the start; if there is nothing to replace, it leaves the string unchanged. By the way, use replaceAll from the Matcher instance to avoid recompiling the Pattern.
A regex of the form foo||bar doesn't seem right. Regex uses a single pipe | to represent OR, so such a regex means foo OR the empty string OR bar. The empty string is a special case (every string contains it at the start, at the end, and between any two characters), so it can cause problems: "foo".replaceAll("|foo", "x") returns xfxoxox rather than replacing foo as a whole, because the empty alternative matches before f and prevents f from being used as the first character of foo.
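The effect of the empty alternative can be seen in a short sketch. The class name and the shortened sample strings are made up; the regex is the one from the question.

```java
class EmptyAlternativeSketch {
    // Regex from the question: "#\w+ " OR the empty string OR " #\w*".
    static final String ORIGINAL = "\\#\\w+ || #\\w*";

    static String strip(String s) {
        return s.replaceAll(ORIGINAL, "");
    }

    public static void main(String[] args) {
        // The empty alternative matches at every position before "foo" gets a chance.
        System.out.println("foo".replaceAll("|foo", "x")); // xfxoxox

        // "#news4buffalo" is followed by ':' instead of a space, so the first alternative
        // fails and the empty string matches; replacing it with "" changes nothing.
        String failed = "RT #news4buffalo: Police say";
        System.out.println(strip(failed).equals(failed));  // true

        // "#ZaslowShow " matches the first alternative; "#LetsGoFish" has no trailing space.
        System.out.println(strip("#ZaslowShow agree. #LetsGoFish")); // agree. #LetsGoFish
    }
}
```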
Anyway it seems that you would like to accept any #xxxx words so consider maybe something like "#\\w+" if you want to make sure that there will be at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)#, which checks that # is preceded by whitespace or is placed at the start of the string.
You can also remove the space after the word you want to remove (if there is any).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove characters that \\w does not cover (\\w matches only letters, digits and underscore), such as :, you can simply use \\S, which represents any non-whitespace character, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
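A minimal sketch of this final regex, compiled once and reused through a Matcher as suggested earlier in the answer. The class name `HashtagStripSketch`, the method `stripHashtags`, and the shortened sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class HashtagStripSketch {
    // Compiled once; # must be at the start of the string or preceded by whitespace.
    private static final Pattern HASHTAG = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    static String stripHashtags(String s) {
        return HASHTAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripHashtags("RT #news4buffalo: Police say")); // RT Police say
        // The space before the last hashtag is not part of the match, so it remains.
        System.out.println("[" + stripHashtags("#ZaslowShow couldn't agree more. #LetsGoFish") + "]");
        // prints: [couldn't agree more. ]
        // '#' inside a word is left alone because of the look-behind.
        System.out.println(stripHashtags("contact user#example.com"));     // contact user#example.com
    }
}
```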

Replace empty space wherever Regex matches in a string

I have been trying to solve this problem. I have a string which has a pattern. Eg.
CW1234 has been despatched to CW334545
i.e. the string can contain patterns starting with CW followed by any number of digits (at most 16).
I want to replace all these patterns with an empty string, so that the string will look like
has been despatched to
I have tried the following but it replaces only the first digit followed by the CW. I'm pretty new to java. Any insights would be of great help.
if(Pattern.matches(".*[C][W][0-9].*", str1)) {
Matcher m = Pattern.compile(".*[C][W][0-9].*").matcher(str1);
while(m.find()) {
str1 = str1.replaceAll("[C][W][0-9]", "");
}
}
System.out.println(str1);
You need a {n,m} quantifier on your digits to enforce the maximum number of digits. Also, for replacement purposes, you don't need to check beforehand whether the pattern is present. replaceAll replaces only where the pattern matches and otherwise leaves the string as it is.
So, remove all those Pattern and Matcher part, and change your regex to:
str1 = str1.replaceAll("CW\\d{0,16}", "");
If you want at least 1 digit, then make it {1,16}. No need to put C and W in different character classes. A character class with single character is as good as that character itself (given that it's not a special character). Also, you can use \\d instead of [0-9].
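As a quick illustration of the bounded quantifier, here is a sketch using the {1,16} variant (the class and method names are hypothetical). Note that the spaces around the removed codes stay in place, and that digits beyond the sixteenth are left behind.

```java
class OrderCodeSketch {
    // CW followed by 1 to 16 digits.
    static String stripCodes(String s) {
        return s.replaceAll("CW\\d{1,16}", "");
    }

    public static void main(String[] args) {
        String str1 = "CW1234 has been despatched to CW334545";
        System.out.println("[" + stripCodes(str1) + "]");        // [ has been despatched to ]
        System.out.println("[" + stripCodes(str1).trim() + "]"); // [has been despatched to]

        // Twenty digits: only the first sixteen are consumed by the match.
        System.out.println(stripCodes("CW12345678901234567890")); // 7890
    }
}
```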
You're needlessly constructing the pattern and matching the string several times.
str1 = str1.replaceAll("CW\\d+", "");
This is sufficient. All other code is redundant.
You can also opt to do the replace by hand if performance is a problem.
Your replaceAll is missing a +:
str1 = str1.replaceAll("[C][W][0-9]+", "");
The + will make the regex match any number of digits directly following CW.
Your regex is wrong. Try with:
String str1 = "CW1234";
str1 = str1.replaceAll("\\bCW\\d{0,16}\\b","");
if "CW1234" is a single token in a string, or with
String str1 = "CW1234";
str1 = str1.replaceAll("^CW\\d{0,16}$","");
if "CW1234" is the full string.
str1.replaceAll("CW[0-9\\s]*", "") does what you need, and it also removes the whitespace after the number.
On another note, the whole point of Pattern.compile() is that you compile the required expression once in the application and then use a matcher to find occurrences. So I think your usage is inappropriate (rather than incorrect).
Pattern pattern = Pattern.compile("CW[0-9\\s]*");
occurs only once in the code, and the pattern is then reused as
Matcher matcher = pattern.matcher(stringToMatch);
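The compile-once idea might be sketched like this, with the pattern held in a constant and a fresh Matcher created per input. The class name `CompiledPatternSketch`, the constant `CODE`, and the sample inputs are hypothetical.

```java
import java.util.regex.Pattern;

class CompiledPatternSketch {
    // Compiled a single time and shared by every call.
    private static final Pattern CODE = Pattern.compile("CW[0-9\\s]*");

    static String strip(String s) {
        return CODE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "CW1234 has been despatched to CW334545",
            "no codes here"
        };
        for (String input : inputs) {
            // First input prints [has been despatched to ], second is printed unchanged.
            System.out.println("[" + strip(input) + "]");
        }
    }
}
```

Because the character class includes \s, the whitespace after each code is removed together with it, while the space before the second code remains.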
