Remove all non-word char except if &amp; or &apos; pattern - java

I am trying to clean a string of all non-word characters, except when they belong to an entity such as &amp;, i.e. a pattern like &[\w]+;
For example:
abc; => abc
abc &amp; => abc &amp;
abc& => abc
If I use string.replaceAll("\\W", "") it also removes the ; and the & in the second example, which I don't want.
Could a negative look-ahead give a quick regex solution to this problem?

First of all, I really like the question. Now, what you want is hard to do with look-arounds in a single replaceAll, because that would need a variable-length negative look-behind, which is not allowed. If it were allowed, this would not be that difficult.
Anyway, since that route is closed, you can use a little hack: first replace the closing semicolon of each entity reference with some character sequence that you are sure won't appear in the rest of the string, like XXX or anything similar. I know this is not strictly robust, but there is little way around it.
So, here's what you can try:
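String str = "a;b&amp;c &";
// 1) Protect entities: swap the closing ; for a placeholder that must not occur in the input
str = str.replaceAll("(&\\w+);", "$1XXX")
         // 2) Drop every & that does not start a protected entity, plus all other non-word chars
         .replaceAll("&(?!\\w+?XXX)|[^\\w&]", "")
         // 3) Restore the closing ; of each protected entity
         .replaceAll("(&\\w+)XXX", "$1;");
System.out.println(str); // prints ab&amp;c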
Explanation:
The first replaceAll turns a pattern like &amp; into &ampXXX, i.e. it swaps the closing ; for the placeholder sequence.
The second replaceAll removes any & that is not followed by \\w+XXX, and any character that is neither a word character nor &. This removes all the &'s that are not part of an &amp; kind of pattern, plus every other non-word character.
The third replaceAll turns XXX back into ;, restoring &amp; from &ampXXX.
To make this easier to understand, you can use the Pattern and Matcher classes instead. I would always prefer them whenever the replacement criteria are complex.
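import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "a;b&amp;c &";
// Match either a whole entity or a single non-word character
Pattern pattern = Pattern.compile("&\\w+;|[^\\w]");
Matcher matcher = pattern.matcher(str);
// StringBuffer works on every Java version; the StringBuilder overload needs Java 9+
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    String match = matcher.group();
    if (!match.matches("&\\w+;")) {
        // Lone non-word character: drop it
        matcher.appendReplacement(sb, "");
    } else {
        // Entity: keep it unchanged
        matcher.appendReplacement(sb, Matcher.quoteReplacement(match));
    }
}
matcher.appendTail(sb);
System.out.println(sb.toString());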
This one is similar to @Eric's code, but is a generalization of it. That one will only work for &amp;, and only once it is fixed to avoid the NullPointerException it throws.
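As a side note, a single replaceAll may still be enough if you skip look-arounds and instead capture the entity and put it back. This is only a sketch: the class and method names are made up for illustration, and it relies on Java inserting nothing for a group reference such as $1 when that group did not take part in the match.

```java
public class EntityCleaner {
    // Keeps entities such as &amp; (captured in group 1) and drops every other non-word character.
    static String clean(String input) {
        // When the \W alternative matches, group 1 did not participate, so "$1" inserts nothing.
        return input.replaceAll("(&\\w+;)|\\W", "$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("a;b&amp;c &")); // ab&amp;c
        System.out.println(clean("abc;"));        // abc
        System.out.println(clean("abc&"));        // abc
    }
}
```

Because the entity alternative comes first, it wins whenever both alternatives could match at an &.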

I'm not sure you can do this using a simple String.replaceAll. You should probably use a Pattern and Matcher to loop through the matches, effectively doing a manual search and replace. Something like the following code should do the trick.
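import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String replaceString(String origString) {
    // Group 1 holds the entity name; it is null when only the [^\w] alternative matched
    Pattern pattern = Pattern.compile("&(\\w+);|[^\\w]");
    Matcher matcher = pattern.matcher(origString);
    StringBuffer sb = new StringBuffer();
    while (matcher.find()) {
        // "amp".equals(...) is null-safe, so a lone & no longer throws
        if ("amp".equals(matcher.group(1))) {
            // Keep &amp; as it is
            matcher.appendReplacement(sb, Matcher.quoteReplacement(matcher.group()));
        } else {
            // Drop every other match
            matcher.appendReplacement(sb, "");
        }
    }
    matcher.appendTail(sb);
    return sb.toString();
}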

I would suggest you use a negative lookahead like this (JavaScript):
string.replace(/&(?!\w+;)/ig, '');
This removes every & that is not followed by word characters ending with a semicolon.
EDIT (Java):
string = string.replaceAll("&(?!\\w+;)", "");
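Here is a small sketch of what that lookahead does in Java on its own (the class name, method name and sample input are assumptions for illustration). Note that it only deals with stray ampersands; other non-word characters such as ; are left alone, so it covers just part of the original question.

```java
public class LookaheadDemo {
    // Removes every & that does not start something shaped like &name;
    static String stripStrayAmpersands(String input) {
        return input.replaceAll("&(?!\\w+;)", "");
    }

    public static void main(String[] args) {
        // The first & is followed by a space, so it goes; &amp; is left untouched.
        System.out.println(stripStrayAmpersands("abc& &amp; x;")); // abc &amp; x;
    }
}
```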

Related

Java Pattern matcher not matching for HTTP response code [duplicate]

I have this small piece of code
String[] words = {"{apf", "hum_", "dkoe", "12f"};
for (String s : words) {
    if (s.matches("[a-z]")) {
        System.out.println(s);
    }
}
Supposed to print
dkoe
but it prints nothing!!

Welcome to Java's misnamed .matches() method... It tries and matches ALL the input. Unfortunately, other languages have followed suit :(
If you want to see if the regex matches an input text, use a Pattern, a Matcher and the .find() method of the matcher:
Pattern p = Pattern.compile("[a-z]");
Matcher m = p.matcher(inputstring);
if (m.find()) {
    // the regex matched somewhere in the input
}
If what you want is indeed to see if an input only has lowercase letters, you can use .matches(), but you need to match one or more characters: append a + to your character class, as in [a-z]+. Or use ^[a-z]+$ and .find().
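To see the difference side by side, here is a minimal sketch using the words from the question (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    // True only if the WHOLE string consists of lowercase letters.
    static boolean wholeMatch(String s) {
        return s.matches("[a-z]+");
    }

    // True if a lowercase letter occurs ANYWHERE in the string.
    static boolean partialMatch(String s) {
        return Pattern.compile("[a-z]").matcher(s).find();
    }

    public static void main(String[] args) {
        String[] words = {"{apf", "hum_", "dkoe", "12f"};
        for (String s : words) {
            System.out.println(s + ": matches=" + wholeMatch(s) + ", find=" + partialMatch(s));
        }
    }
}
```

Only dkoe passes the whole-string check, while all four words contain at least one lowercase letter and therefore pass the find() check.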

[a-z] matches a single char between a and z. So, if your string was just "d", for example, then it would have matched and been printed out.
You need to change your regex to [a-z]+ to match one or more chars.

String.matches returns whether the whole string matches the regex, not just any substring.

Java's String.matches() tries to match the whole string.
That's different from Perl's match operator, which tries to find a matching part.
If you want to find a string with nothing but lowercase characters, use the pattern [a-z]+
If you want to find a string containing at least one lowercase character, use the pattern .*[a-z].*

Use this:
String[] words = {"{apf", "hum_", "dkoe", "12f"};
for (String s : words) {
    if (s.matches("[a-z]+")) {
        System.out.println(s);
    }
}

I have faced the same problem once:
Pattern ptr = Pattern.compile("^[a-zA-Z][\\']?[a-zA-Z\\s]+$");
The above failed!
Pattern ptr = Pattern.compile("(^[a-zA-Z][\\']?[a-zA-Z\\s]+$)");
The above worked with pattern within ( and ).
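For what it's worth, wrapping a pattern in parentheses only adds a capturing group and, as far as I know, should not change whether it matches. Here is a small sketch that checks the two patterns above side by side (the class name and sample inputs are my own assumptions). If the two really behaved differently in the original case, something else probably differed.

```java
public class GroupingCheck {
    // The two patterns from the post above; only the outer parentheses differ.
    static final String PLAIN = "^[a-zA-Z][\\']?[a-zA-Z\\s]+$";
    static final String GROUPED = "(^[a-zA-Z][\\']?[a-zA-Z\\s]+$)";

    // Reports whether both patterns agree on the given input.
    static boolean sameResult(String input) {
        return input.matches(PLAIN) == input.matches(GROUPED);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"O'Brien", "dkoe", "12f", "a"}) {
            System.out.println(s + " -> plain=" + s.matches(PLAIN)
                    + ", grouped=" + s.matches(GROUPED));
        }
    }
}
```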

Your regular expression [a-z] doesn't match dkoe since it only matches strings of length 1. Use something like [a-z]+.

You need a quantifier so that the pattern can cover the whole string (the capture group and the anchors below are optional with matches()). A corrected pattern looks like this:
String[] words = {"{apf", "hum_", "dkoe", "12f"};
for (String s : words) {
    if (s.matches("(^[a-z]+$)")) {
        System.out.println(s);
    }
}

You can make your pattern case insensitive by doing:
Pattern p = Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can be of any combinations, for ex. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
StringBuffer sb = new StringBuffer(); // collects the result
while (m.find()) {
    m.appendReplacement(sb, REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere on this forum, as I read that Java regex does not recognize \s).
The problem I am facing with this code is that the search stops at the first match and exits the while loop.

Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i)    make it case insensitive
^       anchor to start of string
(       start a group (this is the "re:")
\\s*    any amount of optional whitespace
re      "re"
\\s*    optional whitespace
:       ":"
\\s*    optional whitespace
)       end the group (the "re:" string)
+       one or more times
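Put together, the pattern explained above can be tried on the sample string from the question. This is just a sketch; the class and method names are made up for illustration.

```java
public class ReplyPrefixStripper {
    // Strips every leading "Re:" variant (any case, optional spaces) from the start only.
    static String stripReplyPrefixes(String subject) {
        return subject.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(stripReplyPrefixes(title)); // This is a Re: sample string.
    }
}
```

Because of the ^ anchor and the + on the group, all leading occurrences are consumed in one match, and the Re: in the middle of the sentence is never reached.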

In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
each alternative allows any number of e's (e*), any number of separators (\p{Z}*) and an optional colon (:?), so it matches strings like:
a separator, then Reee, a separator and a colon, or
just R followed by separators
which makes no sense for what you are trying to do.
You'd be better off with a regex like the following:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
    yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case-insensitivity part ;-) )

How to find and replace a substring?

For example, I have a string in which I must find and replace multiple substrings, all of which start with #, contain 6 symbols, end with ' and should not contain ). What do you think would be the best way of achieving that?
Thanks!
Edit:
Just one more thing I forgot: to make the replacement I need the substring itself, i.e. it gets replaced by a string generated from the substring being replaced.

yourNewText=yourOldText.replaceAll("#[^)]{6}'", "");
Or programmatically:
Matcher matcher = Pattern.compile("#[^)]{6}'").matcher(yourOldText);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb,
        // implement your custom logic here, matcher.group() is the found String
        someReplacement(matcher.group()));
}
matcher.appendTail(sb);
String yourNewString = sb.toString();
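A self-contained version of that loop might look like the sketch below. The body of someReplacement is purely hypothetical (it upper-cases the six inner characters and wraps them in brackets), since the question does not say how the new string is generated. The class name is made up as well.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenReplacer {
    // Hypothetical rule: drop the leading # and trailing ', upper-case the rest.
    static String someReplacement(String token) {
        String inner = token.substring(1, token.length() - 1);
        return "[" + inner.toUpperCase(Locale.ROOT) + "]";
    }

    static String replaceTokens(String text) {
        Matcher matcher = Pattern.compile("#[^)]{6}'").matcher(text);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // quoteReplacement guards against $ or \ in the generated text
            matcher.appendReplacement(sb,
                    Matcher.quoteReplacement(someReplacement(matcher.group())));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The second token contains ) and is therefore left alone.
        System.out.println(replaceTokens("id #abcdef' and #ab)def' end"));
        // id [ABCDEF] and #ab)def' end
    }
}
```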

Assuming you just know the substrings are formatted like you explained above, but not exactly which 6 characters, try the following:
String result = input.replaceAll("#[^\\)]{6}'", "replacement"); //pattern to replace is #+6 characters not being ) + '

You must use replaceAll with the right regular expression:
myString.replaceAll("#[^)]{6}'", "something")
If you need to replace with an extract of the matched string, use a match group, like this:
myString.replaceAll("#([^)]{6})'", "blah $1 blah")
The $1 in the second String refers to whatever the first parenthesized expression in the first String matched.
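A quick sketch of the group reference in action (the class name, method name, sample text and angle-bracket replacement are assumptions chosen only for illustration):

```java
public class GroupReference {
    // Replaces #xxxxxx' with <xxxxxx>, reusing the six captured characters via $1.
    static String wrap(String text) {
        return text.replaceAll("#([^)]{6})'", "<$1>");
    }

    public static void main(String[] args) {
        System.out.println(wrap("see #abcdef' here")); // see <abcdef> here
    }
}
```

Remember that replaceAll returns a new String, so the result has to be assigned or used; the original string is not modified.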

this might not be the best way to do it but...
youstring = youstring.replace("#something'", "new stringx");
youstring = youstring.replace("#something2'", "new stringy");
youstring = youstring.replace("#something3'", "new stringz");
//edited after reading comments, thanks

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
