replace substrings in string using regex groups - java

I can't find the correct way to remove substrings that equal "null" (case-insensitively) and replace them with an empty string in a huge input string, which contains many lines and uses ; as a separator.
To simplify, here is an example of what I am looking for:
Input string
Steve;nuLL;2;null\n
null;nullo;nUll;Marc\n
....
Expected Output
Steve;;2;\n
;nullo;;Marc\n
...
Code
Matcher matcher = Pattern.compile("(?i)(^|;)(null)(;|$)").matcher(dataStr);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, matcher.group(1) + "" + matcher.group(3));
}
return sb.toString();
Can this be solved by using regex?
EDIT:
With the Java code above, only the first match is ever replaced, not every occurrence in the line and in the data stream. For whatever reason, matcher.find() is only executed once.

return dataStr.replaceAll("(?smi)\\bnull\\b", "");
\b is the word boundary.
(?i) is an embedded flag: i = ignore case.
(?s) is DOTALL: . also matches newline characters. (It is not needed here, since the pattern contains no dot.)
(?m) is MULTILINE.
You forgot appendTail, which appends everything after the last replacement.
If the string contains more than one line, add the MULTILINE option so that ^ and $ also match at line boundaries. See the javadoc of Pattern.
while (matcher.find()) {
matcher.appendReplacement(sb, matcher.group(1) + "" + matcher.group(3));
}
matcher.appendTail(sb);
Alternatively, with a lambda (Java 9 and later):
String result = matcher.replaceAll(mr -> mr.group(1) + mr.group(3));
where mr is a freely named MatchResult provided by replaceAll.
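One thing to watch with the grouping pattern: (^|;)(null)(;|$) consumes the separator after each match, so in an input like null;null the second field has no unconsumed ; in front of it and is skipped. A minimal sketch that avoids this with lookarounds follows. The class and method names are made up for illustration, and it assumes a field should only be cleared when "null" is its entire content.

```java
import java.util.regex.Pattern;

class NullFieldCleaner {
    // Assumed rule: "null" (any case) that fills a whole field, i.e. sits
    // between a line start or ';' and a ';' or line end. The lookarounds do
    // not consume the separators, so adjacent null fields are all matched.
    private static final Pattern NULL_FIELD = Pattern.compile(
            "(?<=^|;)null(?=;|$)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    static String clean(String data) {
        // replaceAll handles the find/append loop and the tail internally
        return NULL_FIELD.matcher(data).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.print(clean("Steve;nuLL;2;null\nnull;nullo;nUll;Marc\n"));
    }
}
```

With the sample input from the question this leaves nullo alone and empties the other four fields.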

You probably want to replace null only when it is followed by a separator or the end of a line, like:
first.replaceAll("(?i)null(?=;|\\n|$)", "")

You don't need anything fancy:
str = str.replaceAll("(?i)\\bnull\\b", "");
(?i) means "ignore case". \b means "word boundary". Embedded newlines are irrelevant.
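Whether the word-boundary version is enough depends on the data. \b also fires next to spaces and punctuation inside a field, so a field such as "not null" would lose its last word. The sketch below compares both variants on a made-up input; the class name and the sample string are only for illustration.

```java
class BoundaryComparison {
    // Removes "null" wherever it stands as a separate word
    static String byWordBoundary(String s) {
        return s.replaceAll("(?i)\\bnull\\b", "");
    }

    // Removes "null" only when it is the whole ';'-separated field
    static String byWholeField(String s) {
        return s.replaceAll("(?im)(?<=^|;)null(?=;|$)", "");
    }

    public static void main(String[] args) {
        String sample = "x;not null;null-value;null";
        System.out.println(byWordBoundary(sample)); // x;not ;-value;
        System.out.println(byWholeField(sample));   // x;not null;null-value;
    }
}
```

If the fields never contain free text, the shorter \b pattern is fine.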


split string based on text qualifier regex java

I want to split a string based on a text qualifier, for example:
"1","10411721","MikeTison","08/11/2009","21/11/2009","2800.00","002934538","051","New York","10411720-002",".\Images\b.jpg",".\RTF\b.rtf"
Qualifier = "
Splitter = ,
I want to split the string on the splitter , but if the splitter occurs inside the qualifier " then it should be ignored, and the returned string should include the splitter.
The regular expression I am using is (?:|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
but this regular expression only returns commas. Please help me with this, as I am new to regular expressions.
Please note that if the string contains newline characters, i.e. \r\n, they should be ignored.
"1","10411","Muis","a","21/11/2009","2800.06","0029683778","03005136851","Awan","10411720-001",".\Images\a.jpg",".\RTF\a.rtf"
"2","08/10/2009","07:32","Call","On-Net","030092343242342376543","Monk","00:00","1.500","0.000","10.000","0.200"
"2","08/10/2009","02:50","Call","Off-Net","030092343242342376543","Une","08:00","1.500","2.000","20.000","3.500"
"2","09/10/2009","03:55","SMS","On-Net","030092343242342376543","Mink","00:00","1.500","0.000","5.000","100.500"
"2","09/10/2009","12:30","Call","Off-Net","030092343242342376543","Zog","01:01","3.500","3.000","70.000","6.500"
"2","09/10/2009","09:11","Call","On-Net","030092343242342376543","Monk","02:30","2.00","2.000","90.000","4.000"
Probably the easiest solution is not to search for the places to split, but to find the elements you want to return. In your case these elements
start with "
end with "
have no " inside.
So you can try something like
String data = "\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
Pattern p = Pattern.compile("\"([^\"]+)\"");
Matcher m = p.matcher(data);
while(m.find()){
System.out.println(m.group(1));
}
Output:
1
10411721
MikeTison
08/11/2009
21/11/2009
2800.00
002934538
051
New York
10411720-002
.\Images\b.jpg
.\RTF\b.rtf
You can split using this regex:
String[] arr = input.split( "(?=(([^\"]*\"){2})*[^\"]*$),+" );
This regex splits on commas that are outside double quotes, using a lookahead to make sure there is an even number of quotes after the comma.
Remove the first and the last character of the whole string. Then split with ","
String test = "\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
if (test.length() >= 2) // need both the leading and the trailing quote
    test = test.substring(1, test.length()-1);
System.out.println(Arrays.toString(test.split("\",\"")));
This works even if the string contains newline characters. Try it out:
String str="\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
System.out.println(Arrays.toString(str.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")));
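To see the lookahead split behave when a comma really does sit inside the qualifier, here is a small sketch. The class name, the helper method and the sample values are assumptions for illustration. It also strips the surrounding quotes afterwards, which the split alone does not do.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedFieldSplitter {
    // Split on commas that are followed by an even number of double quotes,
    // i.e. commas that are not inside a quoted field.
    private static final String OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        // limit -1 keeps trailing empty fields
        for (String f : line.split(OUTSIDE_QUOTES, -1)) {
            // Drop the surrounding qualifier if the field has one
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                f = f.substring(1, f.length() - 1);
            }
            out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("\"1\",\"New York, NY\",\"051\""));
    }
}
```

The lookahead rescans the rest of the string at every comma, so for very large inputs a real CSV parser is probably the better choice.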

Java Pattern / Matcher not finding word break

I am having trouble with Java Pattern and Matcher. I've included a very simplified example of what I'm trying to do.
I had expected the pattern ".\b" to find the last character of the first word (or "4" in the example), but as I step through the code, m.find() always returns false. What am I missing here?
Why does the following Java code always print out "Not Found"?
Pattern p = Pattern.compile(".\b");
Matcher m = p.matcher("102939384 is a word");
int ixEndWord = 0;
if (m.find()) {
ixEndWord = m.end();
System.out.println("Found: " + ixEndWord);
} else {
System.out.println("Not Found");
}
You need to escape the backslash in the regex: ".\\b"
Basically, in a Java string literal the backslash has to be escaped. So "\\" becomes the single character '\'.
So the string literal ".\\b" becomes the literal text .\b, which is what the Pattern receives.
To expand upon AntonH's comment: whenever you want the "\" character to appear in a regular expression, you have to escape it so that it survives in the string you are passing in.
As written, ".\b" is the string of a dot . followed by the backspace character represented by \b, whereas ".\\b" is the regex .\b.
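The difference is easy to check side by side. This sketch wraps the code from the question in a hypothetical helper, firstMatchEnd, and runs it with both string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBreakDemo {
    // Returns the end index of the first match, or -1 if there is none
    static int firstMatchEnd(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String text = "102939384 is a word";
        // ".\b" is a dot followed by a backspace character (U+0008)
        System.out.println(firstMatchEnd(".\b", text));  // -1
        // ".\\b" is the regex .\b (any character, then a word boundary)
        System.out.println(firstMatchEnd(".\\b", text)); // 9
    }
}
```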

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can be of any combinations, for ex. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while(m.find()){
m.appendReplacement(sb,REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere on this forum, as Java regex supposedly does not recognize \s).
The problem I am facing with this code is that the search stops at the first match and exits the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
Re* means an R followed by any number of e, and the colon is optional, so it matches strings like:
\p{Z}Reee\p{Z}: or
R\p{Z}\p{Z}\p{Z}
which make no sense for what you are trying to do.
You'd better use a regex like the following:
yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
    yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )
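For completeness, a compact sketch of the same idea with a precompiled pattern. Java's \s does work, but by default it only covers ASCII whitespace, so this version also accepts \p{Z} to catch things like a non-breaking space. The class name and that extra character class are additions for illustration, not something the answers above require.

```java
import java.util.regex.Pattern;

class SubjectCleaner {
    // One or more "re:" prefixes at the very start of the string, in any
    // case, with optional whitespace or Unicode separators around each part
    private static final Pattern LEADING_RE = Pattern.compile(
            "^(?:[\\s\\p{Z}]*re[\\s\\p{Z}]*:[\\s\\p{Z}]*)+",
            Pattern.CASE_INSENSITIVE);

    static String strip(String title) {
        // The pattern is anchored at the start, so at most one match exists
        return LEADING_RE.matcher(title).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Re: Re : Re : re : RE: This is a Re: sample string."));
    }
}
```

Because the + sits on the whole group, a single match swallows every leading prefix, which is why no loop is needed.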

Remove all non-word char except if & or &apos; pattern

I am trying to clean a string of all non-word characters except when they are part of an entity such as &amp;, i.e. the pattern to keep might be like &[\w]+;
For example:
abc; => abc
abc &amp; => abc &amp;
abc& => abc
If I use string.replaceAll("\\W", "") it also removes the ; and the & in the second example, which I don't want.
Could a negative look-ahead give a quick regex solution to this problem?
First of all, I really like the question. What you want is hard to do in a single replaceAll using lookarounds alone, because that would need a negative look-behind of variable length, which is not allowed. If it were allowed, it would not be that difficult.
Since that is not an option, you can use a little hack: first replace the trailing semicolon of each entity reference with some character sequence that you are sure does not occur in the rest of the string, like XXX. I know this is not elegant, but it is hard to avoid with this approach.
So, here's what you can try:
String str = "a;b&c &";
str = str.replaceAll("(&\\w+);", "$1XXX")
.replaceAll("&(?!\\w+?XXX)|[^\\w&]", "")
.replaceAll("(&\\w+)XXX", "$1;");
System.out.println(str);
Explanation:
The first replaceAll replaces a pattern like &amp; with &ampXXX, i.e. it substitutes the chosen sequence for the trailing ;.
The second replaceAll removes any & not followed by \\w+XXX, as well as any character that is neither a word character nor &. This removes all the &'s that are not part of an &amp; kind of pattern, plus every other non-word character.
The third replaceAll turns XXX back into ;, restoring &amp; from &ampXXX.
To make this easier to understand, you can instead use the Pattern and Matcher classes, which I always prefer when the replacement criteria are complex.
String str = "a;b&c &";
Pattern pattern = Pattern.compile("&\\w+;|[^\\w]");
Matcher matcher = pattern.matcher(str);
StringBuilder sb = new StringBuilder();
while (matcher.find()) {
String match = matcher.group();
if (!match.matches("&\\w+;")) {
matcher.appendReplacement(sb, "");
} else {
matcher.appendReplacement(sb, match);
}
}
matcher.appendTail(sb);
System.out.println(sb.toString());
This one is similar to @Eric's code, but generalizes it. That one only works for &amp;, and only once the NullPointerException it throws has been fixed.
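A shorter variant of the same alternation idea, offered as a sketch rather than a drop-in answer: capture the entity in group 1 and put $1 in the replacement. When the \W branch matches, group 1 did not participate and $1 expands to nothing, so the character is dropped. The class name is hypothetical, and the sample strings are assumptions about what the input looks like.

```java
class EntityAwareCleaner {
    // Either a whole entity reference (kept via $1) or a single non-word
    // character (dropped, because group 1 is empty for that branch)
    static String clean(String s) {
        return s.replaceAll("(&\\w+;)|\\W", "$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("abc;"));         // abc
        System.out.println(clean("abc &amp;"));    // abc&amp;
        System.out.println(clean("a;b&c &apos;")); // abc&apos;
    }
}
```

Note that the space before the entity is a non-word character too, so it is removed along with the rest.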
I'm not sure you can do this using a simple String.replaceAll. You should probably use a Pattern and Matcher to loop through the matches, effectively doing a manual search and replace. Something like the following code should do the trick.
public String replaceString(String origString) {
    // Backslashes must be doubled inside a Java string literal
    Pattern pattern = Pattern.compile("&(\\w+);|[^\\w]");
    Matcher matcher = pattern.matcher(origString);
    StringBuffer sb = new StringBuffer();
    while (matcher.find()) {
        // group(1) is null when the single-character alternative matched,
        // so compare against the constant to avoid a NullPointerException
        if ("amp".equals(matcher.group(1))) {
            matcher.appendReplacement(sb, matcher.group()); // keep &amp;
        } else {
            matcher.appendReplacement(sb, "");
        }
    }
    matcher.appendTail(sb);
    return sb.toString();
}
I would suggest you use a negative lookahead like this:
string.replace(/&(?!\w+;)/ig, '');
This removes every & that is not followed by word characters ending with a semicolon.
EDIT (Java):
string.replaceAll("&(?!\\w+;)", "");

How to find and replace a substring?

For example, I have a string in which I must find and replace multiple substrings, all of which start with #, contain 6 symbols, end with ' and should not contain ) ... what do you think would be the best way of achieving that?
Thanks!
Edit:
Just one more thing I forgot: to make the replacement I need that substring, i.e. it gets replaced by a string generated from the substring being replaced.
yourNewText=yourOldText.replaceAll("#[^)]{6}'", "");
Or programmatically:
Matcher matcher = Pattern.compile("#[^)]{6}'").matcher(yourOldText);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    // implement your custom logic in someReplacement; matcher.group() is the found String
    matcher.appendReplacement(sb, someReplacement(matcher.group()));
}
matcher.appendTail(sb);
String yourNewString = sb.toString();
Assuming you just know the substrings are formatted like you explained above, but not exactly which 6 characters, try the following:
String result = input.replaceAll("#[^\\)]{6}'", "replacement"); //pattern to replace is #+6 characters not being ) + '
You must use replaceAll with the right regular expression:
myString.replaceAll("#[^)]{6}'", "something")
If you need to replace with an extract of the matched string, use a match group, like this:
myString.replaceAll("#([^)]{6})'", "blah $1 blah")
The $1 in the second String refers to the first parenthesized group in the first String.
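If the replacement has to be computed from the matched text rather than just spliced in with $1, Matcher.replaceAll also accepts a function (Java 9 and later). The sketch below assumes the "6 symbols" are the characters between # and ', and the generator passed in is a stand-in for whatever logic produces the new string. The class and method names are hypothetical.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacer {
    // '#', six characters that are not ')', then a closing apostrophe
    private static final Pattern TOKEN = Pattern.compile("#([^)]{6})'");

    static String replaceTokens(String text, Function<String, String> generator) {
        // quoteReplacement keeps any '$' or '\' in the generated text literal
        return TOKEN.matcher(text).replaceAll(
                mr -> Matcher.quoteReplacement(generator.apply(mr.group(1))));
    }

    public static void main(String[] args) {
        System.out.println(replaceTokens("x #abcdef' y #12345)' z", s -> s.toUpperCase()));
    }
}
```

The second token in the sample is left untouched because one of its six characters is a ).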
this might not be the best way to do it but...
youstring = youstring.replace("#something'", "new stringx");
youstring = youstring.replace("#something2'", "new stringy");
youstring = youstring.replace("#something3'", "new stringz");
//edited after reading comments, thanks
