Regex: Select first specific word occurrences inside enclosed elements - java

I have a String containing url paths:
...
/test/section/1.png
"/test/section/test/2.png" "/test/section/test/2.png"
(/test/section/test/3.png)
...
I want to get the first "test" occurrence in each URL that is enclosed in quotes or parentheses.
So far I have managed to match the first occurrence in each such string, together with the preceding '"' or '(':
(\(|\")(\/orbeon\/)
Matches are presented with bold.
Current output:
/test/section/1.png
"/test/ section/test/2.png" "/test/ section/test/2.png"
(/test/ section/test/3.png)
Desired output:
/test/section/1.png
" /test/ section/test/2.png" " /test/ section/test/2.png"
( /test/ section/test/3.png)
How can I exclude the character before the matched word?
Caution! I want only the first occurrence of the word in each enclosed URL path:
Corner case: /test/ section/test/2.png
I am using this regex with Java.

Your current (\(|\")(\/orbeon\/) regex matches ( or " into Group 1 and /orbeon/ into Group 2.
Thus, when you execute matcher.find(), you will need to access Group 2 using matcher.group(2).
Alternatively, use a lookbehind: Pattern.compile("(?<=[(\"])/orbeon/"). You can then read the matched text with matcher.group() or matcher.group(0). The (?<=[(\"]) positive lookbehind asserts that ( or " comes immediately before /orbeon/ without making it part of the match; if neither is present, there is no match.
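As a rough illustration of the two options, the sketch below runs both patterns over a sample like the one in the question. It assumes the word of interest is /test/ (the answer's pattern uses /orbeon/), and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstEnclosedWord {
    // Option 1: the delimiter lands in group 1 and the word in group 2.
    static List<Integer> withGroup(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("(\\(|\")(/test/)").matcher(input);
        while (m.find()) {
            starts.add(m.start(2)); // offset of /test/ only, delimiter excluded
        }
        return starts;
    }

    // Option 2: the lookbehind checks for the delimiter without consuming it.
    static List<Integer> withLookbehind(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("(?<=[(\"])/test/").matcher(input);
        while (m.find()) {
            starts.add(m.start()); // the whole match is already just /test/
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one bare path, one quoted, one in parentheses.
        String input = "/test/section/1.png\n"
                + "\"/test/section/test/2.png\"\n"
                + "(/test/section/test/3.png)";
        System.out.println(withGroup(input));
        System.out.println(withLookbehind(input));
    }
}
```

Both methods should report the same offsets. The bare path on the first line and the second /test/ inside each enclosed path are skipped, because neither is directly preceded by ( or ".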

Related

Regex for matching strings between '=' and '/' or "=" and end of string

I am looking for regex which can help me replace strings like
source=abc/task=cde/env=it --> source='abc'/task='cde'/env='it'
To be more precise, I want to replace a string which starts with = and ends with either / or end of the string with ''
Tried code like this
"source=abc/task=cde/env=it".replaceAll("=(.*?)/","'$1'")
But that results in
source'abc'task'cde'env=it
Using lookahead and look behind:
(?<==)([^/]*)((?=/)|$)
Lookbehind allows you to specify what comes before your match. In this case an equals: (?<==).
The main match in my regex looks for any non-slash character, zero or more times: ([^/]*)
Lookahead allows you to specify what comes after your match. In this case, a slash: (?=/).
The $ matches the end of the line, so that the last item in your test data is quoted too. ((?=/)|$) combines this with the lookahead, meaning "either a slash comes after the match or this is the end of the line".
Here it is in action in a test.
@Test
public void test_quote_items() {
    String regex = "(?<==)([^/]*)((?=/)|$)";
    String actual = "source=abc/task=cde/env=it".replaceAll(regex,"'$1'");
    String expected = "source='abc'/task='cde'/env='it'";
    assertEquals(expected, actual);
}
Try
String input = "source=abc/task=cde/env=it".replaceAll("=(.*?)(/|$)","='$1'/");
The problems I found are that your replacement does not put the = back,
and that your pattern requires a /, which is missing at the end of the String. The / also has to be put back when it is matched.
Output:
source='abc'/task='cde'/env='it'/
If you don't want the trailing '/', it is trivial to remove.
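A possible variant that avoids the trailing slash altogether is sketched below. It is not taken from either answer: it assumes values never contain a slash, and the class and method names are made up for the example.

```java
public class QuoteValues {
    static String quoteValues(String s) {
        // [^/]* stops at the next slash or at the end of the input,
        // so the slash is never consumed and never has to be put back.
        return s.replaceAll("=([^/]*)", "='$1'");
    }

    public static void main(String[] args) {
        System.out.println(quoteValues("source=abc/task=cde/env=it"));
    }
}
```

Because the = is matched and re-emitted in the replacement, no lookbehind is needed either.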

Regex pattern matching is getting timed out

I want to split an input string based on a regex pattern using the Pattern.split(String) API. The regex uses both lookaheads and lookbehinds. It is supposed to split on a delimiter (,) and to ignore the delimiter if it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
With the above string, the pattern also does not detect the first delimiter at [STIFFENER]","QH20426AD3. But if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints about how I can optimize this regex and avoid timeouts?
Any pointers, article links are appreciated!
If you want to split on a comma, and the strings that follow are from an opening till closing double quote after it:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ followed by any char (the .), then again match any chars other than " and \, and finally match the closing "
) Close lookahead
Regex demo | Java demo
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
    System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

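For the comma-splitting question above, another option is to collect the quoted fields directly instead of splitting on the commas between them. The sketch below reuses the quoted-string part of the pattern from that answer, with no lookaround at all. It assumes every field is wrapped in double quotes, as in the sample input; the class name, method name, and the shortened sample string are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFields {
    // A double-quoted field in which \ escapes the following character.
    private static final Pattern FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group()); // commas between fields are simply skipped
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample in the same shape as the real input.
        String line = "\"\",\"a,b [X]\",\"c [\\\"D,E\\\"]\"";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```
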
Regular expression: Replace everything before first occurence

I have the following regular expression that I'm using to remove the dev. part of my URL.
String domain = "dev.mydomain.com";
System.out.println(domain.replaceAll(".*\\.(?=.*\\.)", ""));
This outputs mydomain.com, but it gives me issues when the domains look like dev.mydomain.com.pe or dev.mydomain.com.uk: in those cases I get only the com.pe and com.uk parts.
Is there a modifier I can use on my regex to make sure it only removes what is before the first . (dot included)?
Desired output:
dev.mydomain.com -> mydomain.com
stage.mydomain.com.pe -> mydomain.com.pe
test.mydomain.com.uk -> mydomain.com.uk
You may use
^[^.]+\.(?=.*\.)
See the regex demo.
Details
^ - start of string
[^.]+ - 1 or more chars other than dots
\. - a dot
(?=.*\.) - a positive lookahead that requires, immediately to the right, any 0 or more chars other than line break chars, as many as possible, and then a dot.
Java usage example:
String result = domain.replaceFirst("^[^.]+\\.(?=.*\\.)", "");
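A small sketch of that replaceFirst call applied to the domains from the question follows. The class and method names are made up for the example.

```java
public class StripFirstLabel {
    static String stripFirstLabel(String domain) {
        // Removes the leading label and its dot only when
        // at least one more dot follows later in the string.
        return domain.replaceFirst("^[^.]+\\.(?=.*\\.)", "");
    }

    public static void main(String[] args) {
        String[] domains = {
            "dev.mydomain.com", "stage.mydomain.com.pe",
            "test.mydomain.com.uk", "mydomain.com"
        };
        for (String d : domains) {
            System.out.println(d + " -> " + stripFirstLabel(d));
        }
    }
}
```

Note that a two-label input such as mydomain.com is left unchanged, because the lookahead finds no further dot after the first one.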
The following regex will work for you. It finds the first part (if it exists), captures the rest of the string as the 2nd group, and replaces the whole string with that 2nd group. .*? is a non-greedy match that stops at the first dot character.
(.*?\.)?(.*\..*)
Regex Demo
sample code:
String domain = "dev.mydomain.com";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "stage.mydomain.com.pe";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "test.mydomain.com.uk";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "mydomain.com";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
output:
mydomain.com
mydomain.com.pe
mydomain.com.uk
mydomain.com

how to check regex starts and ends with regex

I have a regex that captures strings enclosed in double quotes, but it does not yet do everything I need.
The regex should satisfy these conditions:
Condition 1. Capture text between two double or single quotes.
Condition 2. But it shouldn't capture the text if it starts with [ and ends with ].
Condition 3. But a quote preceded by / (that is, /" or /') shouldn't start or end a capture.
Example:
REGEX: \"(\/?.)*?\"
Input: Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson("test"), "m2m:cin.as"),"payloads_ul.test"),"[/"Dimming Value/"]",input["test"]["in"])
output:
captured output:
1. "test"
2. "m2m:cin.as"
3. "payloads_ul.test"
4. [/"Dimming Value/"]
5. "test"
6. "in"
Expected result:
1. "test"
2. "m2m:cin.as"
3. "payloads_ul.test"
4. [/"Dimming Value/"]
Condition 1 explanation:
Capture the text between double or single quotes.
example:
input : "test","m2m:cin.as"
output: "test","m2m:cin.as"
Condition 2 explanation:
If the text starts with [ and ends with ], it should not be captured, even though it contains double or single quotes.
example:
input: ["test"]
output: it should not capture
Condition 3 explanation:
In the expected result above, the input "[/"Dimming Value/"]" contains two more double quotes inside it, but it is captured as a single string, because a quote preceded by / does not start or end a capture. So the output is [/"Dimming Value/"]. I want the same behaviour for /' (a single quote preceded by /).
Note:
For the input "[/"Dimming Value/"]" or '[/'Dimming Value/']', the text is between quotes and also contains [ and ], but the string should not be ignored. The output should be [/"Dimming Value/"].
As I understand it, you want to capture text between double quotes, except that:
a match should not begin with a double quote prefixed by [ or end with a double quote suffixed by ]
a double quote prefixed by / should not begin or end the matched text
I don't know whether you also want to capture text between single quotes, because your question is not completely clear.
To reject a match based on the characters before it, you need a negative lookbehind, with the syntax (?<!prefix that you don't want). Java's java.util.regex supports it; JavaScript supports it only in engines that implement ES2018 or later.
The best regex I could build that returns what you want for your example (you can check it on regex101.com or a similar site) is:
(?<![\[/])\"(?!\])(\/?.)*?\"(?![\]/])
I added the restriction that the opening double quote must not be followed by ] to prevent matching "][" in the text ["test"]["in"].
If your engine does not support lookbehind, this will not solve your problem.
In that case, do you have a way to post-process the results and exclude the bad matches?
If so, you can also match the bad prefix and bad suffix and then exclude those matches from the results:
[\[]?\"(\/?.)*?\"[\]]?
this will return:
"test"
"m2m:cin.as"
"payloads_ul.test"
"[/"Dimming Value/"]"
["test"]
["in"]
Full JavaScript code, including post-processing:
'Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson("test"), "m2m:cin.as"),"payloads_ul.test"),"[/"Dimming Value/"]",input["test"]["in"])'
.match(/[\[]?\"(\/?.)*?\"[\]]?/g).filter(s => !s.startsWith('[') && !s.endsWith(']'))
this will return:
"test"
"m2m:cin.as"
"payloads_ul.test"
"[/"Dimming Value/"]"
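EDIT:
Equivalent Java code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

CharSequence yourStringHere = "Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson(\"test\"), \"m2m:cin.as\"),\"payloads_ul.test\"),\"[/\"Dimming Value/\"]\",input[\"test\"][\"in\"])";
// Collects the matches that survive the filter below.
List<String> allMatches = new ArrayList<>();
Matcher m = Pattern.compile("[\\[]?\\\"(\\/?.)*?\\\"[\\]]?")
        .matcher(yourStringHere);
while (m.find()) {
    String s = m.group();
    // Drop matches that carry the bad [ prefix or ] suffix.
    if (!s.startsWith("[") && !s.endsWith("]")) {
        allMatches.add(s);
    }
}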
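Since java.util.regex does support negative lookbehind, the lookaround pattern from this answer can also be tried directly in Java, without the filtering step. The sketch below is only checked against the example input from the question; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedNotBracketed {
    // Opening quote: not preceded by [ or /, and not followed by ].
    // Closing quote: not followed by ] or /.
    private static final Pattern QUOTED =
            Pattern.compile("(?<![\\[/])\"(?!\\])(/?.)*?\"(?![\\]/])");

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath("
                + "Functions.unescapeJson(\"test\"), \"m2m:cin.as\"),\"payloads_ul.test\"),"
                + "\"[/\"Dimming Value/\"]\",input[\"test\"][\"in\"])";
        for (String s : extract(input)) {
            System.out.println(s);
        }
    }
}
```

On this input the bracketed ["test"] and ["in"] are skipped, while "[/"Dimming Value/"]" is kept as a single match.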

How to remove comma after a word pattern in java

Please help me find a regex that removes the comma after a word pattern in Java.
Assume I would like to delete the comma after each pattern, where the input is <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG> and I want my output to be <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG>. If I use .replaceAll(), it will replace all commas, but the Word in my <Word$TAG> may itself contain a comma (,).
For example, Input.txt is as follows
mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E
and Output.txt
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E
You could search for ", " and replace it with " " (a space), as below. Strings are immutable, so assign the result:
one = one.replace(", ", " ");
If you think you may have "myString, ,,," or multiple spaces in between, then you could use replaceAll with a regex like
one = one.replaceAll(",\\s+", " ");
(?<=[^,\s]),
Try this. Replace with an empty string. See the demo:
http://regex101.com/r/lZ5mN8/5
Match the data you want, not the one you don't want.
You probably want ([^ ]+), and then keep only the captured (parenthesized) data, separated by whitespace.
You might even want to narrow it down to ([^ ]+§[^ ]+),. Usually, stricter is better.
You could use a positive lookahead assertion to match all the commas which are followed by a space or end of the line anchor.
String s = "mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E";
System.out.println(s.replaceAll(",(?=\\s|$)",""));
Output:
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E
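The lookbehind and lookahead suggestions above agree on the sample input but differ on a token whose word contains a comma followed by a non-space character. The sketch below compares them on a made-up input; the class name, method names, and the sample string are assumptions for illustration only.

```java
public class TrailingCommaDemo {
    // Removes any comma that follows a character other than a comma or whitespace.
    static String viaLookbehind(String s) {
        return s.replaceAll("(?<=[^,\\s]),", "");
    }

    // Removes any comma that is followed by whitespace or the end of the input.
    static String viaLookahead(String s) {
        return s.replaceAll(",(?=\\s|$)", "");
    }

    public static void main(String[] args) {
        // \u00a7 is the section sign used as the word/tag separator.
        String s = "a,b\u00a7TAG, c\u00a7TAG";
        System.out.println(viaLookbehind(s)); // also strips the comma inside a,b
        System.out.println(viaLookahead(s));  // keeps the comma inside a,b
    }
}
```

If words may contain commas in the middle, the lookahead form looks like the safer choice, since it only touches commas that act as separators.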
