Replace a word that is not inside a string - java

I'm trying to replace a word in a file wherever it appears, except when it is contained in a quoted string.
So I should replace "this" in:
The test in this line consists in ...
But it should not match in:
The test "in this line" consist in ...
This is what I'm trying:
line.replaceAll( "\\s+this\\s+", " that ")
But it fails with this scenario so I tried using:
line.replaceAll( "[^\"]\\s+this\\s+", " that ")
But that doesn't work either.
Any help would be appreciated

This seems to work (in so far as I understand your requirements from the examples provided):
(?!.*\s+this\s+.*\")\s+this\s+
http://rubular.com/r/jZvR4XEbRf
You may need to adjust the escaping for java.
This is a bit better actually:
(?!\".*\s+this\s+)(?!\s+this\s+.*\")\s+this\s+

The only reliable way to do this is to search for EITHER a complete, quoted sequence OR the search term. You do this with one regex, and after each match you determine which one you matched. If it's the search term, you replace it; otherwise you leave it alone.
That means you can't use replaceAll(). Instead you have to use the appendReplacement() and appendTail() methods like replaceAll() itself does. Here's an example:
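import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
System.out.println(s);
// Alternative 1 consumes a whole quoted run; alternative 2 (group 1) is the search term.
Pattern p = Pattern.compile("\"[^\"]*\"|(\\bexample\\b)");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
    // Group 1 participates only when the term was matched outside quotes.
    if (m.start(1) != -1)
    {
        m.appendReplacement(sb, "REPLACE");
    }
}
// Quoted runs that were skipped are copied unchanged by the next append.
m.appendTail(sb);
System.out.println(sb.toString());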
output:
Replace this example. Don't replace "this example." Replace this example.
Replace this REPLACE. Don't replace "this example." Replace this REPLACE.
I'm assuming every quotation mark is significant and they can't be escaped--in other words, that you're working with prose, not source code. Escaped quotes can be dealt with, but it greatly complicates the regex.
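To give an idea of what handling escaped quotes might look like, here is a minimal sketch. It assumes a backslash escapes the character after it inside a quoted run, and the class and method names are made up for illustration. Only the first alternative changes compared with the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteReplace {
    // Alternative 1: a quoted run in which a backslash escapes the next character.
    // Alternative 2 (group 1): the search term, reachable only outside quoted runs.
    private static final Pattern P =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|(\\bexample\\b)");

    static String replaceOutsideQuotes(String s, String replacement) {
        Matcher m = P.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            if (m.start(1) != -1) {
                // quoteReplacement keeps $ and \ in the replacement literal
                m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                replaceOutsideQuotes("say \"an \\\"example\\\" here\" example", "REPLACE"));
    }
}
```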
If you really must use replaceAll(), there is a trick where you use a lookahead to assert that the match is followed by an even number of quotes. But it's really ugly, and for large texts you might find it prohibitively expensive, performance-wise.
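For reference, the lookahead trick might be sketched like this. It assumes quotes are balanced and never escaped, and the class name is hypothetical. The lookahead rescans the rest of the input for every candidate match, which is where the cost comes from.

```java
class EvenQuoteLookahead {
    // Match "example" only if the remainder of the input contains an even
    // number of double quotes, i.e. the match is not inside a quoted run.
    static final String REGEX =
            "\\bexample\\b(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String replaceOutsideQuotes(String s) {
        return s.replaceAll(REGEX, "REPLACE");
    }

    public static void main(String[] args) {
        String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
        System.out.println(replaceOutsideQuotes(s));
    }
}
```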

Related

Java Regex complex ID expression filtering

I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words longer than 4 characters and remove them. IDs must consist of both letters and numbers at the same time, in any order. They can have optional special symbols like :.- and are generally all uppercase, except for several cases where there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.
My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
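To make the role of the two lookaheads easier to see, here is a small sketch that applies the same pattern to single tokens with matches() instead of searching a whole sentence. The class and method names are made up for illustration. The first lookahead requires a digit somewhere in the run, the second requires a letter, and the final class enforces the minimum length of 5.

```java
import java.util.regex.Pattern;

class IdTokenCheck {
    // Characters an ID may contain; (?i) below also admits lowercase letters.
    private static final String CHARS = "[\\dA-Z().:-]";

    private static final Pattern ID = Pattern.compile(
            "(?i)(?=" + CHARS + "*\\d)"      // at least one digit in the run
            + "(?=" + CHARS + "*[A-Z])"      // at least one letter in the run
            + CHARS + "{5,}");               // five or more allowed characters

    static boolean looksLikeId(String token) {
        return ID.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"ACTDIK002", "(L1:3.CI)", "(DLCS)", "781-338-3000", "K-12"}) {
            System.out.println(t + " -> " + looksLikeId(t));
        }
    }
}
```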

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failed one: Not able to remove #news4buffalo:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Success one:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish
From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");
You can try in this way as well.
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
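System.out.println(s.replaceAll("#[^\\s]*\\s*", ""));
// #[^\\s]* consumes everything up to the next whitespace; \\s* also removes the spaces that follow.
// Use * rather than + here, otherwise a tag at the very end of the string (no trailing space) is left in place.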
The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";
You don't need to escape # so don't add \ before it like "\\#" (it confuses people).
Don't use a matcher to check whether the string contains a part that should be replaced and then call replaceAll, because that scans the string a second time. Just call replaceAll straight away; if there is nothing to replace, it leaves the string unchanged. By the way, use replaceAll from a Matcher instance to avoid recompiling the Pattern.
Regex in form foo||bar doesn't seem right. Regex uses only one pipe | to represent OR so such regex represents foo OR emptyString OR bar. Since empty String is kind of special (every string contains empty string at start, and at end, and even in between characters) it can cause some problems like "foo".replaceAll("|foo", "x") returns xfxoxox, instead of for instance "xxx" because consumption of empty string before f prevented it from being used as potential first character of foo :/
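A short sketch of that effect, with hypothetical class and method names: because the empty alternative sits before the last one and always succeeds, the original pattern never removes a tag that is followed by a colon.

```java
class EmptyAlternativeDemo {
    // The empty alternative comes first, so it wins at every position.
    static String emptyFirst(String s) {
        return s.replaceAll("|foo", "x");
    }

    // The pattern from the question: "#\w+ " OR empty OR " #\w*".
    static String doublePipe(String s) {
        return s.replaceAll("\\#\\w+ || #\\w*", "");
    }

    // A single pattern without the stray empty alternative.
    static String singlePattern(String s) {
        return s.replaceAll("#\\w+", "");
    }

    public static void main(String[] args) {
        System.out.println(emptyFirst("foo"));                       // xfxoxox
        System.out.println(doublePipe("RT #news4buffalo: Police"));  // unchanged
        System.out.println(singlePattern("RT #news4buffalo: Police"));
    }
}
```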
Anyway it seems that you would like to accept any #xxxx words so consider maybe something like "#\\w+" if you want to make sure that there will be at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)#, which checks that # is preceded by whitespace or sits at the start of the string.
You can also remove the space after the word you wanted to remove (if there is one).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove other characters beside alphabetic or numeric ones from \\w like : you can simply use \\S which represents non-whitespace-characters, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
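Putting these points together, a minimal sketch could look like the following. The class and method names are hypothetical. The Pattern is compiled once and reused through Matcher.replaceAll, with no separate find() check.

```java
import java.util.regex.Pattern;

class HashtagStripper {
    // # must start a word; \S* also swallows trailing punctuation such as ':'
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    static String strip(String s) {
        // Leaves the input unchanged when nothing matches.
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("RT #news4buffalo: Police say a shooter fired into a crowd"));
        System.out.println(strip("#ZaslowShow couldn't agree more. #LetsGoFish"));
    }
}
```

Because of the look-behind, text such as user#example is left alone, since the # there is not preceded by whitespace.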

java easy Regular expression

I have strings like "xxxxx?434334", "xxx?411112", "xxxxxxxxx?11113" and so on.
How do I take the right substring to retrieve "xxxxx" (everything that comes before the '?' character)?
return s.substring(0, s.indexOf('?'));
No need for a regex for that.
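One caveat: indexOf returns -1 when there is no '?', and substring(0, -1) then throws a StringIndexOutOfBoundsException. A small guarded variant might look like this. The class and method names are made up, and returning the whole string when no '?' is present is an assumption about the desired behaviour.

```java
class BeforeQuestionMark {
    static String prefix(String s) {
        int i = s.indexOf('?');
        // Assumption: with no '?' present, the whole string is the prefix.
        return i < 0 ? s : s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(prefix("xxxxx?434334"));
        System.out.println(prefix("no-question-mark"));
    }
}
```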
If you have a problem, use a regex. Now you have two problems.
str = str.replaceAll("[?].*", "");
In other words, "remove everything after, and including, the question mark character". The ? has to be enclosed in square brackets because otherwise it has a special meaning.
I agree with the other answers that you should avoid using a regex wherever possible, but if you did want to use one for this scenario you could use the following:
Pattern regex = Pattern.compile("([^\\?]*)\\?{1}");
Matcher m = regex.matcher(str);
if (m.find()) {
result = m.group(1);
}
where str is your input string.
EDIT:
Description of the regex: match any run of characters that are not a "?" and that is followed by a single "?".
The pattern ".*(?=\\?)" should work as well. ?= is a positive lookahead, which means the pattern matches everything that comes before a question mark, but not the question mark itself.
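A minimal sketch of the lookahead version in Java, where the backslash has to be doubled inside the string literal. The class and method names are hypothetical, and returning null on no match is just one possible choice. Note that .* is greedy, so with several question marks the match runs up to the last one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPrefix {
    private static final Pattern P = Pattern.compile(".*(?=\\?)");

    static String prefix(String s) {
        Matcher m = P.matcher(s);
        // Assumption: null signals that the input contains no '?'.
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(prefix("xxxxx?434334"));
        System.out.println(prefix("a?b?c"));
    }
}
```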

Java: regex - how do i get the first quote text

As a beginner with regex I believe I'm about to ask something too simple, but I'll ask anyway and hope you won't mind helping me.
Let's say I have a text like "hello 'cool1' word! 'cool2'"
and I want to get the first quote's text (which is 'cool1' without the ')
What should my pattern be? And when using a matcher, how do I guarantee it will remain the first quote and not the second?
(Please suggest a solution only with regex.)
Use this regular expression:
'([^']*)'
Use as follows: (ideone)
Pattern pattern = Pattern.compile("'([^']*)'");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Or this if you know that there are no new-line characters in your quoted string:
'(.*?)'
when using matcher, how do i guarantee it will remain the first quote and not the second?
It will find the first quoted string first because it searches from left to right. If you ask it for the next match, it will give you the second quoted string.
If you want to find first quote's text without the ' you can/should use Lookahead and Lookbehind mechanism like
(?<=').*?(?=')
for example
System.out.println("hello 'cool1' word! 'cool2'".replaceFirst("(?<=').*?(?=')", "ABC"));
//out -> hello 'ABC' word! 'cool2'
You could just split the string on quotes and get the second piece (which will be between the first and second quotes).
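A sketch of the split approach, with made-up class and method names. Passing -1 as the limit keeps trailing empty pieces, so the length check can tell a properly closed quote from a lone apostrophe.

```java
class SplitOnQuotes {
    static String firstQuoted(String s) {
        // Limit -1 keeps trailing empty strings, e.g. "hello 'a'" -> ["hello ", "a", ""]
        String[] parts = s.split("'", -1);
        // At least three pieces means there is an opening and a closing quote.
        return parts.length >= 3 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("hello 'cool1' word! 'cool2'"));
    }
}
```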
If you insist on regex, try this:
/^.*?'(.*?)'/
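Make sure dot-all mode is on (Pattern.DOTALL in Java, the s flag elsewhere) unless you know you'll never have newlines in your input; multiline mode only changes how ^ and $ behave. Then get the first subpattern (group 1) from the result and that will be your string.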
To support double quotes too:
/^.*?(['"])(.*?)\1/
Then get subpattern 2.
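In Java there are no /.../ delimiters, so the same idea might be sketched as below. The class and method names are hypothetical. The backreference \1 makes the closing quote match whichever kind opened the string, and DOTALL lets the quoted text span lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstQuotedEitherKind {
    // Group 1: the opening quote character; group 2: the quoted text.
    private static final Pattern P =
            Pattern.compile("^.*?(['\"])(.*?)\\1", Pattern.DOTALL);

    static String firstQuoted(String s) {
        Matcher m = P.matcher(s);
        // Assumption: null signals that no quoted text was found.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("hello 'cool1' word! 'cool2'"));
        System.out.println(firstQuoted("say \"it's\" now"));
    }
}
```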

Java replaceAll regex With Similar Result

Alright folks, my brain is fried. I'm trying to fix up some EMLs with bad boundaries by replacing the incorrect
--Boundary_([ArbitraryName])
lines with more proper
--Boundary_([ArbitraryName])--
lines, while leaving already correct
--Boundary_([ThisOneWasFine])--
lines alone. I've got the whole message in-memory as a String (yes, it's ugly, but JavaMail dies if it tries to parse these), and I'm trying to do a replaceAll on it. Here's the closest I can get.
//Identify boundary lines that do not end in --
String regex = "^--Boundary_\\([^\\)]*\\)$";
Pattern pattern = Pattern.compile(regex,
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(targetString);
//Store all of our unique results.
HashSet<String> boundaries = new HashSet<String>();
while (matcher.find())
    boundaries.add(matcher.group());
//Add "--" at the end of the Strings we found.
for (String boundary : boundaries)
targetString = targetString.replaceAll(Pattern.quote(boundary),
boundary + "--");
This has the obvious problem of replacing all of the valid
--Boundary_([WasValid])--
lines with
--Boundary_([WasValid])----
However, this is the only setup I've gotten to even perform the replacement. If I try changing Pattern.quote(boundary) to Pattern.quote(boundary) + "$", nothing is replaced. If I try just using matcher.replaceAll("$0--") instead of the two loops, nothing is replaced. What's an elegant way to achieve my aim and why does it work?
There's no need to iterate through the matches with find(); that's part of what replaceAll() does.
s = s.replaceAll("(?im)^--Boundary_\\([^\\)]*\\)$", "$0--");
The $0 in the replacement string is a placeholder for whatever the regex matched in this iteration.
The (?im) at the beginning of the regex turns on CASE_INSENSITIVE and MULTILINE modes.
You can try something like this:
String regex = "^--Boundary_\\([^\\)]*\\)(--)?$";
then see if the string ends with -- and replace only ones that don't.
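One way this check could be written is sketched below. The class and method names are made up for illustration. Group 1 is null exactly when the trailing -- is missing, so only those lines are extended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryFixer {
    private static final Pattern P =
            Pattern.compile("(?im)^--Boundary_\\([^\\)]*\\)(--)?$");

    static String fix(String s) {
        Matcher m = P.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Group 1 did not participate: the line lacks its closing "--".
            String line = m.group(1) == null ? m.group() + "--" : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(line));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(fix("--Boundary_([ArbitraryName])\n--Boundary_([ThisOneWasFine])--\n"));
    }
}
```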
Assuming all the strings are on their own line, this works:
"(?im)^--Boundary_\\([^)]*\\)$"
Example script:
String str = "--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n";
System.out.println(str.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--"));
Edit: changed from JavaScript to Java, must have read too fast.(Thanks for pointing it out)
