Replace all characters before and after specific characters - java

I need to replace all characters in a string which come before an open parenthesis but come after an asterisk:
Input:
1.2.3 (1.234*xY)
Needed Output:
1.234
I tried the following:
string.replaceAll(".*\\(|\\*.*", "");
but I ran into an issue here where Matcher.matches() is false even though there are two matches... What is the most elegant way to solve this?

You could try matching the whole string, and replace with capture group 1
^[^(]*\(([^*]+)\*.*
The pattern matches:
^ start of string
[^(]*\( Match any char except ( and then match (
([^*]+) Capture in group 1 matching any char except *
\*.* Match an asterisk and the rest of the line
String string = "1.2.3 (1.234*xY)";
System.out.println(string.replaceFirst("^[^(]*\\(([^*]+)\\*.*", "$1"));
Output
1.234

You may use this regex to match:
[^(]*\(|\*.*
and replace with an empty string.
RegEx Details:
[^(]*\(: Match 0 or more characters that are not ( followed by a (
|: OR
\*.*: Match * and everything after that
Java Code:
String s = "1.2.3 (1.234*xY)";
String r = s.replaceAll("[^(]*\\(|\\*.*", "");
//=> "1.234"
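On the Matcher.matches() point raised in the question: matches() succeeds only when the pattern covers the entire input, whereas find() and replaceAll() look for matching substrings. A minimal sketch of the difference, using the alternation from this answer (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class MatchesVsFind {
    // Same alternation as in the answer: prefix up to "(" OR "*" and the rest.
    static final Pattern STRIP = Pattern.compile("[^(]*\\(|\\*.*");

    // matches() asks whether the WHOLE input is a single match.
    static boolean wholeInputMatches(String input) {
        return STRIP.matcher(input).matches();
    }

    // find() scans for successive matching substrings.
    static int countMatches(String input) {
        Matcher m = STRIP.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // replaceAll() removes every substring that find() would locate.
    static String strip(String input) {
        return STRIP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "1.2.3 (1.234*xY)";
        System.out.println(wholeInputMatches(input)); // false
        System.out.println(countMatches(input));      // 2
        System.out.println(strip(input));             // 1.234
    }
}
```

So a false result from matches() does not contradict the two substring matches; it only says that neither alternative spans the full string.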

With your shown samples and attempts, please try the following regex:
^.*?\(([^*]*)\*\S+\)$
Explanation of the regex above:
^ ## Match the start of the value.
.*?\( ## Lazily match everything up to and including the first (.
( ## Open the one and only capturing group.
[^*]* ## Match everything up to the *.
) ## Close the capturing group.
\* ## Match a literal *.
\S+ ## Match one or more non-whitespace characters.
\)$ ## Match a literal ) at the end of the value.
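As a rough Java sketch of this capture-group approach: the pattern is the one from this answer, while the class name, the method name, and the use of Optional for the no-match case are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GroupExtract {
    // Pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern P = Pattern.compile("^.*?\\(([^*]*)\\*\\S+\\)$");

    // Returns the text between "(" and "*", or empty if the input has a different shape.
    static Optional<String> betweenParenAndStar(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(betweenParenAndStar("1.2.3 (1.234*xY)").orElse("<no match>")); // 1.234
        System.out.println(betweenParenAndStar("no parens").orElse("<no match>"));        // <no match>
    }
}
```

Reading group 1 directly avoids having to think about what a replacement leaves behind when the input does not match.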

Related

Java Regex : Replace characters between two specific characters with equal number of another character

Replacing all characters between starting character "+" and end character "+" with the equal number of "-" characters.
My specific situation is as follows:
Input: +-+-+-+
Output: +-----+
String s = "+-+-+-+";
s = s.replaceAll("-\\+-","---");
This is not working. How can I achieve the output in Java? Thanks in advance.
You can use this replacement using look around assertions:
String repl = s.replaceAll("(?<=-)\\+(?=-)", "-");
//=> +-----+
(?<=-)\\+(?=-) will match a + if it is surrounded by - on both sides. Because lookbehind and lookahead are only assertions, they do not consume any characters.
The matches you have are overlapping. Look:
+-+-+-+
 ^^^
 Match found and replaced by "---"
+---+-+
    ^
    +-- Next match search continues from here
WARNING: No more matches found!
To make sure there is a hyphen free for the next match, you need to wrap the trailing - in a positive lookahead and use -- as the replacement:
String s = "+-+-+-+";
s = s.replaceAll("-\\+(?=-)","--");
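The three patterns discussed in this thread can be compared side by side. This is a small sketch; the class and method names are invented for the example, and the patterns are taken verbatim from the question and the two answers.

```java
// Hypothetical demo class, for illustration only.
final class OverlapDemo {
    // Pattern from the question: each match consumes the trailing "-",
    // so the next "+" has no leading "-" left to match against.
    static String consuming(String s) {
        return s.replaceAll("-\\+-", "---");
    }

    // Lookbehind + lookahead: only the "+" itself is consumed.
    static String lookaround(String s) {
        return s.replaceAll("(?<=-)\\+(?=-)", "-");
    }

    // Lookahead only: the leading "-" is consumed, the trailing one is not.
    static String lookaheadOnly(String s) {
        return s.replaceAll("-\\+(?=-)", "--");
    }

    public static void main(String[] args) {
        String s = "+-+-+-+";
        System.out.println(consuming(s));     // +---+-+
        System.out.println(lookaround(s));    // +-----+
        System.out.println(lookaheadOnly(s)); // +-----+
    }
}
```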

Erase any string that doesn't match a pattern using replaceall()

I need to replace ALL characters that don't follow a pattern with "".
I have strings like:
MCC-QX-1081
TEF-CO-QX-4949
SPARE-QX-4500
So far, the closest I have come is the following regex:
String regex = "[^QX,-,\\d]";
Using the String replaceAll method I get QX1081, but the expected result is QX-1081.
You're using a character class which matches single characters, not patterns.
You want something like
String resultString = subjectString.replaceAll("^.*?(QX-\\d+)?$", "$1");
which works as long as nothing follows the QX-digits part in your strings.
Put the dash at the end of the character class: [^QX,\d-]
Then you just need substring to drop the leading dash.
I don't know exactly what you expect for all strings, but a literal dash in a character class must be escaped or placed first or last.
You are using a character class where you have to either escape the hyphen or put it at the start or at the end like [^QX,\d-] or else you are matching a range from a comma to a comma. But changing that will give you -QX-1081 which is not the desired result.
You could match your pattern and then replace with the first capturing group $1:
^(?:[A-Z]+-)+(QX-\d+)$
In a Java string literal the backslash must be doubled, so the digit class is written \\d
That will match:
^ Start of the string
(?:[A-Z]+-)+ Repeat 1+ times one or more uppercase characters followed by a hyphen
(QX-\d+) Capture in a group QX- followed by 1+ digits
$ End of the string
For example:
String result = "MCC-QX-1081".replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
System.out.println(result); // QX-1081
Note that if you are doing just 1 replacement, you could also use replaceFirst
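To see how the character-class attempts and the capture-group pattern behave on the sample strings from the question, here is a short sketch. The class and method names are hypothetical; the patterns are the ones quoted in this thread.

```java
// Hypothetical demo class, for illustration only.
final class QxExtract {
    // Capture-group pattern from the answer; replaceFirst is enough for one replacement.
    static String keepQxPart(String s) {
        return s.replaceFirst("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
    }

    // Character class from the question: ",-," is a range from comma to comma,
    // so the hyphen is not in the kept set and gets removed.
    static String questionClass(String s) {
        return s.replaceAll("[^QX,-,\\d]", "");
    }

    // Hyphen moved to the end of the class: now every hyphen is kept.
    static String hyphenLast(String s) {
        return s.replaceAll("[^QX,\\d-]", "");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"MCC-QX-1081", "TEF-CO-QX-4949", "SPARE-QX-4500"}) {
            System.out.println(keepQxPart(s)); // QX-1081, QX-4949, QX-4500
        }
        System.out.println(questionClass("MCC-QX-1081")); // QX1081
        System.out.println(hyphenLast("MCC-QX-1081"));    // -QX-1081
    }
}
```

A character class decides character by character, so it cannot tell the hyphen inside QX-1081 from the one after MCC; the capture group can.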

Regex to find time durations

I have read many of the regex questions on stackoverflow, but they didn't help me to develop my own code.
What I need is like the following. I am parsing texts which have already been parsed using Stanford Tagger. Now, I am trying to remove the time durations in some parts of the texts: 1) The phrase starts with the date (e.g. 1999_CARD Tom_NN was_VP) 2) when the time duration follows this format: 2/1999_CARD -_- 01/01/2000_CARD (or similar ones).
I have written some code, but it wrongly removes other parts of the text, and I don't know why. My regex is as follows:
String regex = "(\\s|\\b.*?_(CARD|CD)\\s([^A-Za-z0-9])+_([^A-Za-z0-9])+(.*?)+_(CARD|CD))|(\\b.*?_(CARD|CD))";
Pattern pattern2 = Pattern.compile(regex);
Matcher m2 = pattern2.matcher(chunkPhrase);
if (m2.find()) {
chunkPhrase = chunkPhrase.replace(m2.group(0), "");
}
For example, in the following phrase, it finds something (but it shouldn't)
·_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS
After removing the time duration in the above phrase, I'm left with · s_NNS which is not what I want.
To make clearer what I expect from the code, here are some examples:
1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
2000_CARD I_NN was_VP working_NP here_ADV
after applying the code, I expect:
I_NN was_VP working_NP here_ADV
For this one:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
after applying the code, I expect:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
I am using Java.
Update: To clarify: if a number occurs AT THE BEGINNING, it must be removed; otherwise, it must be kept. If it follows the second format (e.g. 1999_CD -_- 2000_CARD), it must be removed regardless of whether it occurs at the beginning, middle, or end of the phrase.
Can anyone help me find what is wrong with my code?
You can use this regex:
final String regex = "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(input);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll("");
System.out.println("Substitution result: " + result);
RegEx Breakup:
\b - Word boundary
(?: - Start non-capturing group
\d{1,2}/*\d{1,2}/ - Match the mm/dd/ part of a date (note that /* allows zero or more slashes between the two numbers)
)? - End non-capturing group (optional)
\d{4} - Match 4 digits of year
_ - Match a literal _
(?:CARD|CD) - Match CARD or CD
(?: - Start non-capturing group
\h*[-_]+ - Match horizontal whitespace followed by 1 or more - or _
)? - End non-capturing group (optional)
\h* - Match 0 or more horizontal whitespaces
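A quick way to check this pattern against the examples from the question is sketched below. The class and method names are made up for illustration. Note that the pattern requires a four-digit year and is not anchored to the start of the phrase, so it is an approximation of the "only at the beginning" rule rather than an exact implementation of it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DurationStrip {
    // Pattern from the answer; \h (horizontal whitespace) needs Java 8 or later.
    private static final Pattern DURATION = Pattern.compile(
            "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*");

    static String strip(String phrase) {
        return DURATION.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        // Date range: both dates and the "-_-" token are removed.
        System.out.println(strip("1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN"));
        // Leading year: removed.
        System.out.println(strip("2000_CARD I_NN was_VP working_NP here_ADV"));
        // "3_CARD" is not a four-digit year, so the phrase is left unchanged.
        System.out.println(strip("I_NN have_VP worked_VP in_PP 3_CARD companies_NP"));
    }
}
```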
Based on the examples you have provided, the following regex will capture the required time durations
((?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})_(?:CARD|CD) (?:-_- )?)
Details
(?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4}) // match minimum of 2 digits or a date in xx/xx/xx[xx] format
_(?:CARD|CD) // match _CARD or _CD
(?:-_- )? // match -_- , if it exists
The ?: at the beginning of a group means it is non-capturing. The parentheses around the whole expression form the capturing group.
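In Java, this second pattern could be used to list what it captures before removing anything. This is only a sketch with invented class and method names. Be aware that the pattern expects a space after _CARD or _CD, so a duration at the very end of a phrase would need an adjustment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DurationCapture {
    // Pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern DURATION = Pattern.compile(
            "((?:\\d{2,}|\\d{1,2}\\/\\d{1,2}\\/\\d{2,4})_(?:CARD|CD) (?:-_- )?)");

    // Collects every captured duration token (group 1) in order of appearance.
    static List<String> captured(String phrase) {
        List<String> out = new ArrayList<>();
        Matcher m = DURATION.matcher(phrase);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    static String strip(String phrase) {
        return DURATION.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        String phrase = "1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN";
        System.out.println(captured(phrase)); // two captured tokens
        System.out.println(strip(phrase));    // Test_NN Company_NN
    }
}
```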

what is missing in my java regex?

I want to fetch
http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png
from
url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
I have tried this code:
String a = "";
Pattern pattern = Pattern.compile("url(.*)");
Matcher matcher = pattern.matcher(imgpath);
if (matcher.find()) {
a = (matcher.group(1));
}
return a;
but a == (http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_639_o_4746_precious_image_1419867529.png)
how can I fine tune it?
Why use a regular expression to begin with?
Given
final String s = "url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)";
If the string is always the same format a simple substring(4,s.length()-1) would be better.
That said, if you insist on a regular expression:
You have to escape the ( as \(. In a Java string literal the backslash itself must also be escaped, so it becomes \\(. The same applies to the ).
Then you can get the grouping with url\\((.+)\\).
Learn to use RegEx101.com before coming here, it will point out errors like this immediately.
As you already seem to know, ( and ) delimit groups, which means that in the regex
url(.*)
(.*) will place everything after url in group 1, which in case of
url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
will be
(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
If you want to exclude ( and ) from the match, you need to add their literals to the regex, which means you need to escape them. There are several ways to do that, such as adding \ before each of them or surrounding them with [ ].
Another problem with your regex is that .* finds the longest possible match, and since . represents any character (except line separators), it can also include ( and ). To solve this, you can make the * quantifier reluctant by adding ? after it, so your final regex can be written as the string
"url\\((.*?)\\)"
---------------
url
\\( - ( literal
(.*?) - group 1
\\) - ) literal
or, instead of ., you can use a character class that accepts every character except ), like
"url\\(([^)]*)\\)"
Try this regex:
url\((.*?)\)
The outermost parentheses are escaped so they will be matched literally. The inner parentheses are for capturing a group. The question mark after the .* is to make the match lazy, so the first closing parenthesis found will end the group.
Note that to use this regex in Java, you'll have to additionally escape the backslashes in order to express the above regex as a string literal:
String regex = "url\\((.*?)\\)";
You need to escape the () to match the parentheses in the string, and then add another set of () around the part you want to pull out in group 1, the actual url. I also changed the part inside the parentheses to [^)]*, which will match everything until it finds a ). See below:
url\(([^)]*)\)
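The difference between the original pattern, a greedy escaped pattern, and the negated character class can be sketched in Java as follows. The class name, method names, and the two-URL test string are assumptions added for illustration; the patterns are the ones discussed in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CssUrl {
    // Returns group 1 of the first match, or "" if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : "";
    }

    // Negated character class: stops at the first ")".
    static String extract(String input) {
        return firstGroup("url\\(([^)]*)\\)", input);
    }

    public static void main(String[] args) {
        String one = "url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)";
        // Unescaped parentheses form a group, so the literal ( and ) end up inside it.
        System.out.println(firstGroup("url(.*)", one));
        // Escaped parentheses are matched literally and stay outside group 1.
        System.out.println(extract(one));

        String two = "url(a.png) url(b.png)";
        System.out.println(firstGroup("url\\((.*)\\)", two));  // a.png) url(b.png
        System.out.println(firstGroup("url\\((.*?)\\)", two)); // a.png
        System.out.println(extract(two));                      // a.png
    }
}
```

With a single url(...) per string the greedy and lazy forms give the same result; the difference only shows up when a later ) exists in the input.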

Java Regex lookahead takes too much time

I'm trying to create a proper regex for my problem and apparently ran into weird issue.
Let me describe what I'm trying to do..
My goal is to remove commas from both ends of the string. E.g., the string , ,, ,,, , , Hello, my lovely, world, ,, , should become just Hello, my lovely, world.
I have prepared following regex to accomplish this:
(\w+,*? *?)+(?=(,?\W+$))
It works like a charm in regex validators, but when I'm trying to run it on Android device, matcher.find() function hangs for ~1min to find a proper match...
I assume, the problem is in positive lookahead I'm using, but I couldn't find any better solution than just trim commas separately from the beginning and at the end:
output = input.replaceAll("^(,?\\W?)+", ""); //replace commas at the beginning
output = output.replaceAll("(,?\\W?)+$", ""); //replace commas at the end
Is there something I am missing in positive lookahead in Java regex? How can I retrieve string section between commas at the beginning and at the end?
You don't have to use a lookahead if you use matching groups. Try the regex ^[\s,]*(.+?)[\s,]*$
EDIT: To break it apart, ^ matches the beginning of the line, which is technically redundant if using matches() but may be useful elsewhere. [\s,]* matches zero or more whitespace characters or commas, but greedily--it will accept as many characters as possible. (.+?) matches any string of characters, but the trailing question mark instructs it to match as few characters as possible (non-greedy), and also capture the contents to "group 1" as it forms the first set of parentheses. The non-greedy match allows the final group to contain the same zero-or-more commas or whitespaces ([\s,]*). Like the ^, the final $ matches the end of the line--useful for find() but redundant for matches().
If you need it to match spaces only, replace [\s,] with [ ,].
This should work:
Pattern pattern = Pattern.compile("^[\\s,]*(.+?)[\\s,]*$");
Matcher matcher = pattern.matcher(", ,, ,,, , , Hello, my lovely, world, ,, ,");
if (!matcher.matches())
return null;
return matcher.group(1); // "Hello, my lovely, world"
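Wrapped into a complete class, the snippet above might look like the sketch below. The class and method names are hypothetical. The second method is not from the answers: it is an alternative that trims both ends with a single replaceAll using the same [\s,] class, shown only for comparison. The slowness reported in the question is most likely due to the nested quantifiers in (\w+,*? *?)+, which neither variant here uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class TrimCommas {
    private static final Pattern CORE = Pattern.compile("^[\\s,]*(.+?)[\\s,]*$");
    private static final Pattern EDGES = Pattern.compile("^[\\s,]+|[\\s,]+$");

    // Capture-group approach from the answer; returns null when nothing matches.
    static String viaGroup(String input) {
        Matcher m = CORE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Alternative (not from the answers): delete runs of whitespace/commas at either end.
    static String viaReplace(String input) {
        return EDGES.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = ", ,, ,,, , , Hello, my lovely, world, ,, ,";
        System.out.println(viaGroup(input));   // Hello, my lovely, world
        System.out.println(viaReplace(input)); // Hello, my lovely, world
    }
}
```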
