Using regex to strip titles (Mr., Dr., etc.) from a name - Java

I am trying to strip out the titles listed in the pattern below using a regex, but the replacement also removes characters I want to keep. Specifying a word boundary does not help in this case.
String name = "Dr.Dre" ;
Pattern p = Pattern.compile("(Mr.|MR.|Dr.|mr.|DR.|dr.|ms.|Ms.|MS.|Miss.|Mrs.|mrs.|miss.|MR|mr|Mr|Dr|DR|dr|ms|Ms|MS|miss|Miss|Mrs|mrs)"+"\\b");
Matcher m = p.matcher(name);
StringBuffer sb = new StringBuffer();
String namef = m.replaceAll("");
System.out.println(namef);
Input: Dr.Dre or Dr Dre or Dr. Dre
Expected output: Dre or Dre or Dre
Edit:
Thanks for the help, but I am still facing a small regex issue:
Program:
String name = "Dr. Dre" ;
Pattern p = Pattern.compile("(Mr\\.|MR\\.|Dr\\.|mr\\.|DR\\.|dr\\.|ms\\.|Ms\\.|MS\\.|Miss\\.|Mrs\\.|mrs\\.|miss\\.|MR|mr|Mr|Dr|DR|dr|ms|Ms|MS|miss|Miss|Mrs|mrs)"+"\\b");
Matcher m = p.matcher(name);
String namef = m.replaceAll("");
System.out.println(namef);
For the above program I receive this output:
. Dre
while the desired output is:
Dre

Dot in a regular expression means "any character". You need to escape it with a backslash, which in turn needs to be escaped in a string literal:
Pattern p = Pattern.compile("Mr\\.|MR\\.|Dr\\.|mr\\.|DR\\.|dr\\.|ms\\."); // etc
Note that you'll end up with a double space after removing "Dr." from "or Dr. Dre" though...
EDIT: A space after a dot doesn't count as a word boundary, because \\b only matches between a word character and a non-word character, and both "." and " " are non-word characters. That is why only "Dr" is removed from "Dr. Dre". If you change your pattern to use \\s instead of \\b, so that it consumes a single whitespace character, it works for "Dr. Dre" - but as noted in comments, it then fails for "Dr.Dre". You could either remove the word boundary entirely and add a space to the later parts of the pattern ("DR |Dr |" etc) or use (\\s|\\b), which works for the cases I tried it on but may well have other undesirable side-effects.
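As a rough sketch of another way to handle all three inputs, the title can be followed by either a literal dot or a word boundary, and then by any trailing whitespace. The class name TitleStripper, the shortened title list and the use of CASE_INSENSITIVE (which also accepts spellings such as "dR" that the original pattern does not list) are assumptions made for illustration, not part of the original code.

```java
import java.util.regex.Pattern;

class TitleStripper {
    // Hypothetical pattern: a title, then a literal dot OR a word boundary,
    // then optional whitespace. Longer titles come first in the alternation.
    private static final Pattern TITLE = Pattern.compile(
            "\\b(?:mrs|miss|mr|ms|dr)(?:\\.|\\b)\\s*",
            Pattern.CASE_INSENSITIVE);

    static String strip(String name) {
        return TITLE.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Dr.Dre", "Dr Dre", "Dr. Dre"}) {
            System.out.println(strip(s)); // prints "Dre" each time
        }
    }
}
```

The (?:\\.|\\b) part is what keeps the "Dr" inside "Dre" from being removed: after that "Dr" there is neither a dot nor a word boundary.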

The question is a bit unclear (you don't show the problematic output), but my guess is that the problem lies in the period character. The period has a special meaning in regex: it matches ANY character, so "Dr." matches "Dr" followed by any one character, including the "Dre" in "Dr.Dre". You have to escape it as "Dr\.", or, in a Java string literal where the backslash itself must be escaped, as "Dr\\."
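To make the difference concrete, here is a small sketch that removes every match of the unescaped and the escaped pattern from the same input. The class and method names (DotEscapeDemo, removeAll) are made up for this example.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Removes every match of the given regex from the input.
    static String removeAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Unescaped: "." matches any character, so "Dre" is removed as well.
        System.out.println("[" + removeAll("Dr.", "Dr.Dre") + "]");   // prints "[]"
        // Escaped: only the literal "Dr." is removed.
        System.out.println("[" + removeAll("Dr\\.", "Dr.Dre") + "]"); // prints "[Dre]"
    }
}
```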
Hope that helps!

Related

Can't get rid of the empty first line when I use regex for polynomials

Here is my code:
String a = "X^5+2X^2+3X^3+4X^4";
String exp[]=a.split("(|\\+\\d)[xX]\\^");
for(int i=0;i<exp.length;i++) {
System.out.println("exp: "+exp[i]+" ");
}
I am trying to get the output 5,2,3,4
but instead I got this:
exp:
exp:5
exp:2
exp:3
exp:4
I don't know where the empty first line comes from, and I cannot find a way to get rid of it. I tried other regexes and also tried compiling the pattern, but I still can't get rid of the first line. When I try a new string, "X+X^5+2X^2+3X^3+4X^4", the first line shows exp:X.
I also tried my regex in an online regex tester, and its answer is 5,2,3,4, but Eclipse gives an empty line and then 5,2,3,4. I need help figuring this out.
Try matching the exponents with a regex instead of splitting, e.g.:
String input = "X^5+2X^2+3X^3+4X^4";
Pattern pattern = Pattern.compile("\\^([0-9]+)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("exp: " + matcher.group(1));
}
It gives output:
exp: 5
exp: 2
exp: 3
exp: 4
How it works:
Pattern used: \^([0-9]+)
This matches a ^ followed by one or more digits (note the + sign). The caret (^) is prefixed with a backslash (\) because it has a special meaning in regular expressions - the beginning of the input - but in your case you just want an exact match of a ^ character.
We wrap the part we are interested in in a group so that we can refer to it later during matching. That means marking it with parentheses ( and ).
Then we want to put our pattern into a Java String. In a String literal, the \ character has a special meaning - it starts an escape sequence, e.g. "\n" represents a new line. So if we put our pattern into a String literal, we need to escape the \, and our pattern becomes "\\^([0-9]+)". Note the double \.
Next we iterate through all matches, getting group 1, which is our number. Note that the ^ character is not included in group 1 even though it is part of our pattern. That is because we used parentheses to mark the group we are looking for, which in our case contains only the digits.
Because you are using the split method, which looks for occurrences of the regex and, well.. splits the string at those positions. Your string starts with X^, which matches your regex, so the (empty) text before that first match becomes the first element of the result.
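If you would rather keep split, one possible workaround is simply to skip empty pieces. The sketch below reuses the regex from the question; the class and method names (SplitDemo, exponents) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Splits with the regex from the question and drops empty pieces.
    static List<String> exponents(String poly) {
        List<String> out = new ArrayList<>();
        for (String part : poly.split("(|\\+\\d)[xX]\\^")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "X^5+2X^2+3X^3+4X^4";
        // The match at index 0 produces an empty leading element.
        System.out.println(Arrays.toString(a.split("(|\\+\\d)[xX]\\^"))); // [, 5, 2, 3, 4]
        System.out.println(exponents(a));                                  // [5, 2, 3, 4]
    }
}
```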

Match starting and ending character using Java Matcher class

I want to get the words from a string that start with # and end with a space. I've tried Pattern.compile("#\\s*(\\w+)"), but it doesn't include characters like ' or :.
I want a solution that uses only pattern matching.
We can try matching using the pattern (?<=\\s|^)#\\S+, which matches any word starting with # followed by one or more non-whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(0));
}
#hashtag
#another
Note: The above solution has an edge case: it also pulls in punctuation that appears at the end of a hashtag. If you don't want this, we can rephrase the regex to match only certain characters, e.g. letters and numbers. But maybe this is not a concern for you.
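A sketch of that stricter variant, assuming that "certain characters" means the \\w class (letters, digits and underscore); the class name HashtagFinder is made up for this example. Note that this also excludes the ' and : characters the question asked for, so it only fits if trailing punctuation is the bigger problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // Same look-behind as above, but only word characters may follow the '#'.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\w+");

    static List<String> find(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The comma and the final dot are no longer part of the matches.
        System.out.println(find("Here is a #hashtag, and here is #another."));
        // prints [#hashtag, #another]
    }
}
```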
The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not a white space.
If you want to stop on the space character and not any white space change the \S to [^ ].
The ^ inside the brackets means it will negate whatever comes after it.
Pattern.compile("#\\s*([^ ]+)")
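The difference between the two patterns only shows up when the text contains whitespace other than a plain space. A small sketch (the class name CaptureDemo and the sample text are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Collects capture group 1 of every match.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "#a\tb #it's"; // a tab after "a", a space before the second tag
        // \S+ stops at any whitespace, so the tab ends the first capture ("a").
        System.out.println(captures("#\\s*(\\S+)", text));
        // [^ ]+ stops only at a space, so the tab stays inside the capture ("a<TAB>b").
        System.out.println(captures("#\\s*([^ ]+)", text));
    }
}
```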

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failing case (unable to remove #news4buffalo):
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Working case:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish
From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");
You can also try it this way:
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
System.out.println(s.replaceAll("#[^\\s]*\\s+", ""));
// [^\\s]* consumes everything up to the next whitespace; \\s+ then removes the whitespace that follows as well
The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";
You don't need to escape #, so don't add \ before it as in "\\#" (it confuses people).
Don't use a matcher to check whether the string contains the part to be replaced and then call replaceAll, because that scans the string a second time. Just call replaceAll straight away; if there is nothing to replace, it leaves the string unchanged. By the way, use replaceAll on the Matcher instance to avoid recompiling the Pattern.
A regex of the form foo||bar doesn't look right. Regex uses a single pipe | to represent OR, so such a regex means foo OR the empty string OR bar. Since the empty string is kind of special (every string contains an empty string at the start, at the end, and even between characters), it can cause problems: "foo".replaceAll("|foo", "x") returns xfxoxox instead of, for instance, "xxx", because matching the empty string before f prevented that f from being used as the first character of foo :/
Anyway, it seems that you want to match any #xxxx word, so consider something like "#\\w+" if you want to make sure there is at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)#, which checks that # is preceded by whitespace or is at the start of the string.
You can also remove the space after the word you removed (if there is one).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove characters other than the letters, digits and underscore covered by \\w, such as :, you can simply use \\S, which represents any non-whitespace character, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
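Putting these points together, a minimal sketch could compile the pattern once and call replaceAll on the Matcher, with no prior find() check. The class and method names (HashtagRemover, remove) are made up for this example.

```java
import java.util.regex.Pattern;

class HashtagRemover {
    // Compiled once and reused: '#' at the start of a word, everything up to
    // the next whitespace, plus at most one following whitespace character.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    static String remove(String rawContent) {
        // Leaves the string unchanged when nothing matches.
        return TAG.matcher(rawContent).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(remove("RT #news4buffalo: Police say a shooter fired"));
        // prints "RT Police say a shooter fired"
        System.out.println(remove("contact me at a#b.c"));
        // prints "contact me at a#b.c" (the '#' is not at the start of a word)
    }
}
```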

Java regex: find the correct pattern

I tried to use regex. My input looks like this:
STACK
blabla
OVER
blabla
STACK
vlvlv
OVER
and there may be another line at the end.
I wrote this pattern, which seems to work on sites that test regexes but doesn't work in Java.
"^(STACK(\n[^\n]+\n)OVER(\n[^\n]+(\n)?)?)+$"
What is the right pattern?
Thanks
Assuming that you want to check if your entire input can be matched with regex you can use something like
String data =
"STACK\r\n" +
"blabla\r\n" +
"OVER\r\n" +
"blabla\r\n" +
"STACK\r\n" +
"vlvlv\r\n" +
"OVER";
String regex ="(^STACK$((\r?\n|\r).+(\r?\n|\r))^OVER$((\r?\n|\r).+(\r?\n|\r)?)?+)+";
Pattern p = Pattern.compile(regex,Pattern.MULTILINE);
Matcher m = p.matcher(data);
System.out.println(m.matches());
I added the Pattern.MULTILINE flag so that ^ and $ match at the start and end of each line rather than, as they do by default, only at the start and end of the entire input.
Also, to require that STACK and OVER are the only word on their line, I surrounded them with ^ and $.
Another thing your regex didn't account for is that the line separator can also be \r\n or \r, so I changed it to reflect that.
The last thing I did was change [^\n] to . since they mean almost the same thing (dot doesn't match \r, while [^\n] does).
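Two of these points can be checked in isolation with a short sketch: the effect of Pattern.MULTILINE on ^ and $, and the difference between . and [^\n] when a \r is present. The class and method names (MultilineDemo, found) are made up for this example.

```java
import java.util.regex.Pattern;

class MultilineDemo {
    // True if the regex, compiled with the given flags, is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String data = "STACK\r\nblabla\r\nOVER";
        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(found("^OVER$", 0, data));                 // false
        // With MULTILINE, ^ and $ also match around line terminators.
        System.out.println(found("^OVER$", Pattern.MULTILINE, data)); // true
        // By default the dot does not match \r, but [^\n] does.
        System.out.println("a\rb".matches(".+"));     // false
        System.out.println("a\rb".matches("[^\n]+")); // true
    }
}
```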

What's wrong with this regex?

I need to match Twitter-Hashtags within an Android-App, but my code doesn't seem to do what it's supposed to.
What I came up with is:
ArrayList<String> tags = new ArrayList<String>(0);
Pattern p = Pattern.compile("\b#[a-z]+", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(tweet); // tweet contains the tweet as a String
while(m.find()){
tags.add(m.group());
}
The variable tweet contains a regular tweet including hashtags - but find() doesn't trigger. So I guess my regular expression is wrong.
Your regex fails because of the \b word boundary anchor. This anchor only matches between a non-word character and a word-character (alphanumeric character). So putting it directly in front of the # causes the regex to fail unless there is an alphanumeric character before the #! Your regex would match a hashtag in foobarfoo#hashtag blahblahblah but not in foobarfoo #hashtag blahblahblah.
Use #\w+ instead, and remember, inside a string, you need to double the backslashes:
Pattern p = Pattern.compile("#\\w+");
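Here is a quick sketch of how the word boundary behaves in front of #, using the question's pattern with the backslash doubled. The class and method names (BoundaryDemo, found) are made up for this example.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // True if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // A word character before '#' creates a boundary, so this matches.
        System.out.println(found("\\b#[a-z]+", "foobarfoo#hashtag"));  // true
        // A space and '#' are both non-word characters: no boundary, no match.
        System.out.println(found("\\b#[a-z]+", "foobarfoo #hashtag")); // false
        // Without the boundary the hashtag is found.
        System.out.println(found("#\\w+", "foobarfoo #hashtag"));      // true
    }
}
```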
Your pattern should be "#(\\w+)" if you are trying to just match the hash tag. Using this and the tweet "retweet pizza to #pizzahut", doing m.group() would give "#pizzahut" and m.group(1) would give "pizzahut".
Edit: Note that the HTML display is messing with the backslash escapes; you'll need two backslashes before the w in your Java string literal.
