Java replaceAll regex With Similar Result - java

Alright folks, my brain is fried. I'm trying to fix up some EMLs with bad boundaries by replacing the incorrect
--Boundary_([ArbitraryName])
lines with more proper
--Boundary_([ArbitraryName])--
lines, while leaving already correct
--Boundary_([ThisOneWasFine])--
lines alone. I've got the whole message in-memory as a String (yes, it's ugly, but JavaMail dies if it tries to parse these), and I'm trying to do a replaceAll on it. Here's the closest I can get.
// Identify boundary lines that do not end in --
String regex = "^--Boundary_\\([^\\)]*\\)$";
Pattern pattern = Pattern.compile(regex,
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(targetString);
// Store all of our unique results.
HashSet<String> boundaries = new HashSet<String>();
while (matcher.find())
    boundaries.add(matcher.group());
// Add "--" at the end of the Strings we found.
for (String boundary : boundaries)
    targetString = targetString.replaceAll(Pattern.quote(boundary),
            boundary + "--");
This has the obvious problem of replacing all of the valid
--Boundary_([WasValid])--
lines with
--Boundary_([WasValid])----
However, this is the only setup I've gotten to even perform the replacement. If I try changing Pattern.quote(boundary) to Pattern.quote(boundary) + "$", nothing is replaced. If I try just using matcher.replaceAll("$0--") instead of the two loops, nothing is replaced. What's an elegant way to achieve my aim and why does it work?

There's no need to iterate through the matches with find(); that's part of what replaceAll() does.
s = s.replaceAll("(?im)^--Boundary_\\([^\\)]*\\)$", "$0--");
The $0 in the replacement string is a placeholder for whatever the regex matched in this iteration.
The (?im) at the beginning of the regex turns on CASE_INSENSITIVE and MULTILINE modes.

You can try something like this:
String regex = "^--Boundary_\\([^\\)]*\\)(--)?$";
then see if the string ends with -- and replace only ones that don't.
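A minimal sketch of that idea, assuming the whole message is held in a String as in the question. The class and method names (BoundaryFixer, fixBoundaries) are made up for illustration. The optional group records whether the trailing -- was present, and only lines without it are rewritten:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryFixer {
    // Group 1 is the boundary itself; group 2 is the optional trailing "--".
    private static final Pattern BOUNDARY = Pattern.compile(
            "^(--Boundary_\\([^)]*\\))(--)?$",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    static String fixBoundaries(String message) {
        Matcher m = BOUNDARY.matcher(message);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // group(2) is null when the line did not end in "--".
            if (m.group(2) == null) {
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + "--"));
            }
            // Lines that were already correct are skipped; their text is
            // copied over unchanged by the next append call.
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String eml = "--Boundary_([ArbitraryName])\n--Boundary_([ThisOneWasFine])--\n";
        System.out.print(fixBoundaries(eml));
    }
}
```

Matcher.quoteReplacement is used so that any $ or \ inside a boundary name is not interpreted as a group reference.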

Assuming all the strings are on their own line, this works:
"(?im)^--Boundary_\\([^)]*\\)$"
Example script:
String str = "--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n";
System.out.println(str.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--"));
Edit: changed from JavaScript to Java, must have read too fast.(Thanks for pointing it out)

Related

How can I push regex matches to array in java?

I've currently got a string, of which I want to use certain parts. With these parts I want to do various things, like pushing them to an array or showing them in a text area.
First I tried the split method. It deletes my regex matches and prints the other parts of the string. I want the opposite: to drop the other parts and print the regex matches.
How can I do this?
For example:
There are lot of youtube links like this
https://www.youtube.com/watch?v=qJuoXM7G322&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7
I want to take only simple video link with this expression
"https:\\/\\/www.youtube.com\\/watch\\?v=.{11}"
when I use this code :
String ytLink = linkArea.getText();
String regexp = "https:\\/\\/www.youtube.com\\/watch\\?v=.{11}";
String[] tokenVal;
tokenVal = ytLink.split(regexp);
System.out.println("Count of Links : "+tokenVal.length);
for (String t : tokenVal) {
System.out.println(t);
}
It prints
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7"
I want the output to be like this:
"https://www.youtube.com/watch?v=SATL2mTfZO0"
You are splitting the string with that regular expression, which is not the correct tool for the job.
It is dividing your example string into:
"" // The bit before the separator.
"https://www.youtube.com/watch?v=qJuoXM7G322" // The separator
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
but then discarding the separator, so you'd get back a 2-element array containing:
"" // The bit before the separator.
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
If you want to get the thing that matches the regex, you'd need to use Pattern and Matcher:
Pattern pattern = Pattern.compile("https:\\/\\/www.youtube.com\\/watch\\?v=.{11}");
Matcher matcher = pattern.matcher(ytLink);
if (matcher.find()) {
System.out.println(matcher.group());
}
(I don't entirely trust your escaped backslashes in your regular expression; however the pattern is not really important to the principle)
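Since the question asks about pushing every match into an array, here is a rough sketch that loops with find() and collects the results in a list. The class and method names (LinkExtractor, findAll) are hypothetical, and the pattern is the asker's with the dots escaped and the unnecessary slash escapes removed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // "v=" followed by exactly 11 characters, as in the asker's pattern.
    private static final Pattern LINK =
            Pattern.compile("https://www\\.youtube\\.com/watch\\?v=.{11}");

    static List<String> findAll(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(text);
        // Each call to find() moves on to the next match, if there is one.
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String input = "https://www.youtube.com/watch?v=qJuoXM7G322"
                + "&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7";
        List<String> links = findAll(input);
        System.out.println("Count of Links : " + links.size());
        for (String link : links) {
            System.out.println(link);
        }
    }
}
```

If a real array is needed, links.toArray(new String[0]) converts the list.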
You can negate your regex using a negative lookahead: (?!pattern)
See also : How to negate the whole regex?

Split string between words and quotation marks

I currently have this string:
"display_name":"test","game":"test123"
and I want to split the string so I can get the value test. I have looked all over the internet and tried some things, but I couldn't get it to work.
I found that splitting using quotation marks could be done using this regex: \"([^\"]*)\". So I tried this regex: display_name:\":\"([^\"]*)\"game\", but this returned null. I hope that someone can explain to me why my regex didn't work and how it should be done.
You forgot to include the comma , before "game", and you also need to remove the extra colon after display_name:
display_name\":\"([^\"]*)\",\"game\"
or
\"display_name\":\"([^\"]*)\",\"game\"
Now, print the group index 1.
Matcher m = Pattern.compile("\"display_name\":\"([^\"]*)\",\"game\"").matcher(str);
while (m.find())
{
    System.out.println(m.group(1));
}
I think you could do it easier, like this:
/(\w)+/g
This little regex will take all your strings.
Your java code should be something like:
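// "\\w+" matches each run of word characters; the backslash must be doubled in a Java string.
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(yourText);
while (matcher.find()) {
    // group() returns the whole match; this pattern has no group 2.
    System.out.println("Result: " + matcher.group());
}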
I also want to note, as @AbishekManoharan pointed out, that it looks like JSON.

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failed one: Not able to remove #news4buffalo:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Success one:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish
From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");
You can try in this way as well.
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
System.out.println(s.replaceAll("#[^\\s]*\\s+", ""));
// Look till space is not found----^^^^ ^^^^---------remove extra spaces as well
The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";
You don't need to escape # so don't add \ before it like "\\#" (it confuses people).
Don't use the matcher to check whether the string contains a part that should be replaced and then call replaceAll, because that scans the string a second time. Just call replaceAll from the start; if there is nothing to replace, it leaves the string unchanged. By the way, use replaceAll from a Matcher instance to avoid recompiling the Pattern.
Regex in form foo||bar doesn't seem right. Regex uses only one pipe | to represent OR so such regex represents foo OR emptyString OR bar. Since empty String is kind of special (every string contains empty string at start, and at end, and even in between characters) it can cause some problems like "foo".replaceAll("|foo", "x") returns xfxoxox, instead of for instance "xxx" because consumption of empty string before f prevented it from being used as potential first character of foo :/
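The effect of the empty alternative can be checked with a small sketch like the one below. The class and method names are made up for illustration; the regex is copied from the question. Because the empty alternative succeeds wherever the first one fails, the third alternative is never reached:

```java
class EmptyAlternativeDemo {
    // The asker's regex: "#\w+ " OR the empty string OR " #\w*".
    static final String DOUBLE_PIPE = "\\#\\w+ || #\\w*";

    static String removeWithDoublePipe(String s) {
        return s.replaceAll(DOUBLE_PIPE, "");
    }

    public static void main(String[] args) {
        // The empty alternative wins at every position, so "foo" itself is never matched.
        System.out.println("foo".replaceAll("|foo", "x")); // prints xfxoxox

        // "#news4buffalo" is followed by ':' rather than a space, so the first
        // alternative fails and only empty matches are "removed".
        System.out.println(removeWithDoublePipe("RT #news4buffalo: Police say"));

        // The leading "#ZaslowShow " matches the first alternative, but the
        // trailing " #LetsGoFish" is shadowed by the empty alternative.
        System.out.println(removeWithDoublePipe("#ZaslowShow couldn't agree more. #LetsGoFish"));
    }
}
```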
Anyway it seems that you would like to accept any #xxxx words so consider maybe something like "#\\w+" if you want to make sure that there will be at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)#, which checks that # is preceded by whitespace or placed at the start of the string.
You can also remove the space after the word you wanted to remove (if there is any).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove characters that \\w does not cover, such as :, you can simply use \\S, which matches any non-whitespace character, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
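Putting these points together, a sketch might look like the following. It compiles the pattern once and calls replaceAll on the Matcher, as suggested above; the class and method names (HashWordStripper, strip) are assumptions for illustration only:

```java
import java.util.regex.Pattern;

class HashWordStripper {
    // '#' at the start of the input or after whitespace, then the rest of
    // the non-whitespace run, then at most one trailing whitespace character.
    private static final Pattern HASH_WORD = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    static String strip(String s) {
        // Reusing the compiled Pattern avoids recompiling it on every call.
        return HASH_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("RT #news4buffalo: Police say a shooter fired into a crowd"));
        System.out.println(strip("#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish"));
        // '#' in the middle of a word is left alone because of the look-behind.
        System.out.println(strip("contact joe#example.com"));
    }
}
```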

Java Match string with optional hyphen

I am trying to match a series of strings that look like this:
item1 = "some value"
item2 = "some value"
I have some strings, though, that look like this:
item-one = "some new value"
item-two = "some new value"
I am trying to parse it using regular expressions, but I can't get it to match the optional hyphen.
Here is my regex string:
Pattern p = Pattern.compile("^(\\w+[-]?)\\w+?\\s+=\\s+\"(.*)\"");
Matcher m = p.matcher(line);
m.find();
String option = m.group(1);
String value = m.group(2);
Could someone please tell me what I am doing wrong?
Thank you
I suspect the main reason for your problem is that you expect \\w+? to make \\w+ optional, whereas in reality the ? makes the + quantifier reluctant, so the regex will still try to match at least one \\w there, taking the last character away from ^(\\w+.
Maybe try this way
Pattern.compile("^(\\w+(?:-\\w+)?)\\s+=\\s+\"(.*?)\"");
In (\\w+(?:-\\w+)?), the (?:-\\w+) part is a non-capturing group (the regex won't count it as a group, so (.*?) will still be group(2) even when this part is present), and the ? after it makes this part optional.
In \"(.*?)\", *? is a reluctant quantifier, which makes the regex look for the minimal match between quotation marks.
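As a rough check of that pattern against both kinds of line from the question, something like this could be used. OptionParser and parse are hypothetical names, and returning null for a non-matching line is just one possible choice:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionParser {
    // Group 1: name with an optional "-suffix"; group 2: the quoted value.
    private static final Pattern LINE =
            Pattern.compile("^(\\w+(?:-\\w+)?)\\s+=\\s+\"(.*?)\"");

    // Returns {option, value}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "item1 = \"some value\"",
            "item-one = \"some new value\""
        };
        for (String line : lines) {
            String[] parsed = parse(line);
            System.out.println(parsed[0] + " -> " + parsed[1]);
        }
    }
}
```

Checking the result of find() before calling group() avoids the IllegalStateException that the original snippet would throw on a line that does not match.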
Your problem is that you have the ? in the wrong place:
Try this regex:
^((\\w+-)?\\w+)\\s*=\\s*\"([^\"]+)\"
But use groups 1 and 3.
I've cleaned up the regex a bit too
This regex should work for you:
^\w[\w-]*(?<=\w)\s*=\s*\"([^"]*)\"
In Java:
Pattern p = Pattern.compile("^\\w[\\w-]*(?<=\\w)\\s*=\\s*\"([^\"]*)\"");
Live Demo: http://www.rubular.com/r/0CvByDnj5H
You want something like this:
([\w\-]+)\s*=\s*"([^"]*)"
With extra backslashes for Java:
([\\w\\-]+)\\s*=\\s*\"([^\"]*)\"
If you expect other symbols to start appearing in the variable name, you could make it a character class like [^=\s] to accept any characters not = or whitespace, for example.

Replace a word that is not on a string

I'm trying to replace a word in a file whenever it appears except when it is contained in a string:
So I should replace this in
The test in this line consists in ...
But should not match in :
The test "in this line" consist in ...
This is what I'm trying:
line.replaceAll( "\\s+this\\s+", " that ")
But it fails with this scenario so I tried using:
line.replaceAll( "[^\"]\\s+this\\s+", " that ")
But doesn't work either.
Any help would be appreciated
This seems to work (in so far as I understand your requirements from the examples provided):
(?!.*\s+this\s+.*\")\s+this\s+
http://rubular.com/r/jZvR4XEbRf
You may need to adjust the escaping for Java.
This is a bit better actually:
(?!\".*\s+this\s+)(?!\s+this\s+.*\")\s+this\s+
The only reliable way to do this is to search for EITHER a complete, quoted sequence OR the search term. You do this with one regex, and after each match you determine which one you matched. If it's the search term, you replace it; otherwise you leave it alone.
That means you can't use replaceAll(). Instead you have to use the appendReplacement() and appendTail() methods like replaceAll() itself does. Here's an example:
String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
System.out.println(s);
Pattern p = Pattern.compile("\"[^\"]*\"|(\\bexample\\b)");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
    // Group 1 only participates when the bare search term matched,
    // not when a complete quoted sequence was consumed.
    if (m.start(1) != -1)
    {
        m.appendReplacement(sb, "REPLACE");
    }
}
m.appendTail(sb);
System.out.println(sb.toString());
output:
Replace this example. Don't replace "this example." Replace this example.
Replace this REPLACE. Don't replace "this example." Replace this REPLACE.
I'm assuming every quotation mark is significant and they can't be escaped--in other words, that you're working with prose, not source code. Escaped quotes can be dealt with, but it greatly complicates the regex.
If you really must use replaceAll(), there is a trick where you use a lookahead to assert that the match is followed by an even number of quotes. But it's really ugly, and for large texts you might find it prohibitively expensive, performance-wise.
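For reference, the lookahead trick might be sketched as follows, under the same assumption that every quotation mark is significant and quotes are balanced. The class and method names are made up for illustration. The lookahead only succeeds when the rest of the input contains an even number of quotes, which means the match itself sits outside any quoted sequence:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareReplace {
    // Lookahead: zero or more complete pairs of quotes, then no more quotes
    // until the end of the input, i.e. an even number of quotes follow.
    private static final String EVEN_QUOTES_AHEAD =
            "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String replaceOutsideQuotes(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b" + EVEN_QUOTES_AHEAD;
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
        System.out.println(replaceOutsideQuotes(s, "example", "REPLACE"));
    }
}
```

The lookahead rescans the remainder of the input for every candidate match, which is where the performance cost on large texts comes from.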
