Java Regex, Split String at punctuation marks except in brackets - java

I have this String: "Hello, my Name is [[Peter.java]]."
The desired split is: [Hello, my, Name, is, [[Peter.java]]]
I split at punktuation marks but completly ignore things in these brackets.
I tried:
string.split("(?!\\[\\[.*\\]\\])\\s*(\\,|\\.|\\s)\\s*")
but this doesnt work because the output is [Hello, my, Name, is, [[Peter, java]]]. Can you help me?
Other examples:
"Hello. My name is [[Peter.java]]" --> [Hello, My, name, is, [[Peter.java]]]
"Hi. How, [[are,you]]" --> [Hi, How, [[are,you]]]

You can use this regex to split:
[.,\s]+(?!\w+])
Working demo
The code:
public void testRegex() {
String str = "Hello. my Name is [[Peter.java]].";
String[] arr = str.split("[.,\\s]+(?!\\w+])");
System.out.println(Arrays.toString(arr));
}
// Output: [Hello, my, Name, is, [[Peter.java]]]
Edit: as HamZa pointed in his comment, the regex above fails is the string is something, like this]. So, to leverage the usage of SKIP & FAIL pcre feature, this regex can be improved by using:
\[\[.*?\]\] # Match our brackets
(*SKIP)(*FAIL) # Skip that match and proceed further
| # or
[\s.,]+ # any character of: whitespace (\n, \r, \t,
\f, and " "), '.', ',' (1 or more times)
Working demo

Instead of using String.split, you'll probably want to use a different sort of regex.
/\[\[(.*?)\]\]|(\w+)\W/g
Online demo
Then use a matcher to iterate through the matches.

Related

Regex pattern matching is getting timed out

I want to split an input string based on the regex pattern using Pattern.split(String) api. The regex uses both positive and negative lookaheads. The regex is supposed to split on a delimiter (,) and needs to ignore the delimiter if it is enclosed in double inverted quotes("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that the lookup technics are heavy and can cause the timeouts if the string is too long. And if I remove the backward slashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, then the regex is able to detect and split.
The pattern also does not detect the first delimiter at place [STIFFENER]","QH20426AD3 with the above string. But if I remove the backward slashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, then the regex is able to detect it.
I am not very experienced with the lookup in regex, can some one please give hints about how can I optimize this regex and avoid time outs?
Any pointers, article links are appreciated!
If you want to split on a comma, and the strings that follow are from an opening till closing double quote after it:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahad
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ to escape any char using the . and again match any chars other than " and /
) Close lookahead
Regex demo | Java demo
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

ReplaceAll String java regex

I have a string that contains the following:
"Hello my name is (0.9%) and (15%) bye (10.5%) also (C9, B6)"
I want to replaceAll so I get rid of the brackets containing percentages but not the other numbers like so:
"Hello my name is and bye also C9, B6"
Currently I have this but it removes all my numbers, Any idea how I could fix it:
.replaceAll("[\\([0-9]\\%)]","");
Something like this maybe?
"Hello my name is (0.9%) and (15%) bye (10.5%) also (C9, B6)".replaceAll("\\((?:\\d|\\.)+%\\)", "")
Demo
This here also deletes a single whitespace after each parenthesized percentage:
"Hello my name is (0.9%) and (15%) bye (10.5%) also (C9, B6)".replaceAll("\\((?:\\d|\\.)+%\\) ", "")
Gives:
Hello my name is and bye also (C9, B6)
Remove the outer brackets and add matching to digits before and after an optional dot.
.replaceAll("\\([0-9]+?\\.?[0-9]+?\\%\\)", "");
Note: You can change the "+" to "*" if you want to allow a leading or trailing dot.

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms "name#domain.com" and "Joe Sixpack ", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contains delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
since you are not validating, i assume that the email addresses are valid.
Based on this assumption, i will look up an email address followed by ; or , this way i know its valid.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a javascript example, but Java's repalceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Replacing quotes in a Java String only on specific places

We have a String as below.
\config\test\[name="sample"]\identifier["2"]\age["3"]
I need to remove the quotes surrounding the numbers. For example, the above string after replacement should look like below.
\config\test\[name="sample"]\identifier[2]\age[3]
Currently I'm trying with the regex as below
String.replaceAll("\"\\\\d\"", "");
This is replacing the numbers also. Please help to find out a regex for this.
You can use replaceAll with this regex \"(\d+)\" so you can replace the matching of \"(\d+)\" with the capturing group (\d+) :
String str = "\\config\\test\\[name=\"sample\"]\\identifier[\"2\"]\\age[\"3\"]";
str = str.replaceAll("\"(\\d+)\"", "$1");
//----------------------^____^------^^
Output
\config\test\[name="sample"]\identifier[2]\age[3]
regex demo
Take a look about Capturing Groups
We can try doing a blanket replacement of the following pattern:
\["(\d+)"\]
And replacing it with this:
\[$1\]
Note that we specifically target quoted numbers only appearing in square brackets. This minimizes the risk of accidentally doing an unintended replacement.
Code:
String input = "\\config\\test\\[name=\"sample\"]\\identifier[\"2\"]\\age[\"3\"]";
input = input.replaceAll("\\[\"(\\d+)\"\\]", "[$1]");
System.out.println(input);
Output:
\config\test\[name="sample"]\identifier[2]\age[3]
Demo here:
Rextester
You can use:
(?:"(?=\d)|(?<=\d)")
and replace it with nothing == ( "" )
fast test:
echo '\config\test\[name="sample"]\identifier["2"]\age["3"]' | perl -lpe 's/(?:"(?=\d)|(?<=\d)")//g'
the output:
\config\test\[name="sample"]\identifier[2]\age[3]
test2:
echo 'identifier["123"]\age["456"]' | perl -lpe 's/(?:"(?=\d)|(?<=\d)")//g'
the output:
identifier[123]\age[456]
NOTE
if you have only a single double quote " it works fine; otherwise you should add quantifier + for both beginning and end "
test3:
echo '"""""1234234"""""' | perl -lpe 's/(?:"+(?=\d)|(?<=\d)"+)//g'
the output:
1234234

How to remove comma after a word pattern in java

Please help me out to get the specific regex to remove comma after a word pattern in java.
Assume, I would like to delete comma after each pattern where the pattern is <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG> now I want my output to be <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG> . if I used .replaceAll(), it will replace all commas, but in my <Word$TAG> Word may have a comma(,).
For example, Input.txt is as follows
mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E
and Output.txt
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E
You could use ", " as search and replace it with " " (space) as below:
one.replace(", ", " ");
If you think, you have "myString, ,,," or multiple spaces in between, then you could use replace all with regex like
one.replaceAll(",\\s+", " ");
(?<=[^,\s]),
Try this.Replace by empty string.See demo.
http://regex101.com/r/lZ5mN8/5
Match the data you want, not the one you don't want.
You probably want ([^ ]+), and keep the bracketed data, separated by whitespace.
You might even want to narrow it down to ([^ ]+§[^ ]+),. Usually, stricter is better.
You could use a positive lookahead assertion to match all the commas which are followed by a space or end of the line anchor.
String s = "mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E";
System.out.println(s.replaceAll(",(?=\\s|$)",""));
Output:
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E

Categories