regular expression using [:punct:] function in java

I am using the [:punct:] character class to replace special characters in a
String, e.g. ' REPLACE (REGEXP_REPLACE (colum1, '[[:punct:]]' ), ' ', '')) AS OUPUT ', as part of an SQL string in Java. However, I want one particular special character, '-', not to be replaced. Can you suggest the best way to do this?

According to Character Classes and Bracket Expressions:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Hence, use
[][!"#$%&'()*+,./:;<=>?@\\^_`{|}~]
Make sure you escape the ' correctly in the string literal.
A shortened expression with ranges will look like
[!-,.-/:-@[-`{-~]
See a regex test here (the - sits between , and . in ASCII, so you need two ranges, !-, and .-/, in the above expression to exclude the hyphen).
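If the replacement can be done on the Java side instead of inside the SQL string, the same exclusion can be sketched with java.util.regex. This is only an illustration and assumes Java's regex flavor, not the POSIX bracket syntax that REGEXP_REPLACE uses; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class PunctExceptHyphen {
    // Explicit ASCII ranges that skip '-' (it sits between ',' and '.').
    // Unlike POSIX, Java needs '[' escaped inside a character class.
    static final Pattern RANGES = Pattern.compile("[!-,.-/:-@\\[-`{-~]");

    // Java-only alternative: \p{Punct} intersected with "anything but a hyphen".
    static final Pattern INTERSECTION = Pattern.compile("[\\p{Punct}&&[^-]]");

    static String stripWithRanges(String s) {
        return RANGES.matcher(s).replaceAll("");
    }

    static String stripWithIntersection(String s) {
        return INTERSECTION.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "foo-bar!@#_baz";
        System.out.println(stripWithRanges(input));       // foo-barbaz
        System.out.println(stripWithIntersection(input)); // foo-barbaz
    }
}
```

Both forms remove every ASCII punctuation character except the hyphen; the intersection form is shorter but only works in engines that support the && operator.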

Related

match repeated blocks separated by ~

To match the following text:
text : SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES
It has this format:<code1>~<description1>\<code2>~<description2>\<code3>~<description3>....<codeN>~<descriptionN>
I used this regex: [A-Z0-9 ]+~[A-Z0-9 ]+(?:\\[A-Z0-9 ]+~[A-Z0-9 ]+)+
So:
case 1. SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES (Match: OK)
case 2. SS~B66\88~PRELIMINARY PAGES~HEADING PAGES (No Match: OK because I removed the code 'M01')
case 3. SS~B66~PRELIMINARY PAGES\M01~HEADING PAGES (No Match: OK because I removed the code '88')
More examples:
SS~B66\88~MEKLKE\M01~MOIIE
B~A310\0~PRELIM#INARY\00-00~HEADING
My problem is that <code> and <description> can contain any type of character, so I replaced my regex with:
.+~.+(?:\\.+~.+)+
but this new regex also matches case 2 and case 3.
Thank you for your help.
Instead of using [A-Z0-9 ] which would not match all the allowed chars, or .+ which would match too much, you can use a negated character class [^~\\] matching any char except \ and ~ to set the boundaries for the matched parts.
^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
^ Start of string
[^~]+~ Match any char other than ~, then match ~
[^~\\]+ Match 1+ times any char other than ~ and \
(?: Non capture group
\\[^~]+~[^~\\]+ Match \, then 1+ chars other than ~, then ~, then 1+ chars other than ~ and \
)+ Close the group and repeat 1 or more times to match at least one \
$ End of string
Regex demo (The demo contains \n to not cross the newlines in the example data)
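As a rough check of the pattern above against the three cases from the question, here is a minimal Java sketch. The class and method names are hypothetical; note that every backslash in the regex has to be doubled again inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from the answer.
class CodeDescriptionBlocks {
    // Regex: ^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
    static final Pattern BLOCKS =
            Pattern.compile("^[^~]+~[^~\\\\]+(?:\\\\[^~]+~[^~\\\\]+)+$");

    // True when the whole text is code~description blocks separated by '\'.
    static boolean isValid(String text) {
        return BLOCKS.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "SS~B66\\88~PRELIMINARY PAGES\\M01~HEADING PAGES", // case 1
            "SS~B66\\88~PRELIMINARY PAGES~HEADING PAGES",      // case 2
            "SS~B66~PRELIMINARY PAGES\\M01~HEADING PAGES",     // case 3
            "B~A310\\0~PRELIM#INARY\\00-00~HEADING"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```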

high-level regular expression with not

Hi regular expression experts,
I have the following text
<[~UNKNOWN:a-z\.]> <[~UNKNOWN:A-Z\-0-9]> <[~UNKNOWN:A-Z\]a-z]
And the following reg expr
\[\~[^\[\~\]]*\]
It works fine for the 1st and 2nd group in the text but not for the 3rd one.
The 1st group is
[~UNKNOWN:a-z\.]
The 2nd is
[~UNKNOWN:A-Z\-0-9]
and the 3rd one is
[~UNKNOWN:A-Z\]a-z]
However the reg exp finds the following text
[~UNKNOWN:A-Z\]
I understand why and I know that I have to add the following rule to the reg exp:
starting with '[' and '~' characters and ending with ']' UNLESS there is a '\' in front of ']'. So I should add a NOT expression but not sure how.
Could anybody please help?
Thanks,
V.
Why not simply:
<([^>]+)>?
Regex Demo
This should work (first line pattern, second line your pattern (ignore whitespace), third line my changes):
\[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
\[\~   [^\[\~\]]           *       \]
    (?:         |(?<=\\)\]) (?<!\\)
Your regex:
\[\~ # Literal characters [~
[^ # Character group, NONE of the following:
\[\~\] # [ or ~ or ]
]* # 0 or more of this character group
\] # Followed by ]
Your pattern in words: [~, everything in between, up to the next ], as long as there is no [ or ~ or ] in there.
My pattern, with only the relevant changes explained:
\[\~
(?: # Non capturing group
[^\[\~\]]
| # OR
(?<=\\)\] # ], preceded by \
)*
(?<!\\)\] # ], not preceded by \
In words: Same as yours, plus ] may be contained if it is preceded by \, and the closing ] may not be preceded by \
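To see the lookbehind version at work on the text from the question, here is a small Java sketch. It assumes java.util.regex as the engine (the question does not say which one is used), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the pattern with lookbehinds.
class EscapedBracketGroups {
    // Regex: \[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
    static final Pattern GROUP =
            Pattern.compile("\\[\\~(?:[^\\[\\~\\]]|(?<=\\\\)\\])*(?<!\\\\)\\]");

    // Collects every [~...] group, allowing an escaped \] inside a group.
    static List<String> findGroups(String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(text);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "<[~UNKNOWN:a-z\\.]> <[~UNKNOWN:A-Z\\-0-9]> <[~UNKNOWN:A-Z\\]a-z]";
        for (String g : findGroups(text)) {
            System.out.println(g);
        }
    }
}
```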

Get equations from string

I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/RE for this?
Failed attempts:
I can first split it by AND/OR and then split the resulting strings on : and so on, but I am looking for a single RE/pattern that can do this.
You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
RegEx Demo
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
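A minimal Java sketch of how the pattern above could be applied to the sample string follows. The class and method names are hypothetical, and the input is the literal example from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookahead-based pattern.
class KeyValueExtractor {
    static final Pattern PAIR =
            Pattern.compile(":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    // Returns each :key:'value' chunk; a value ends at the quote that is
    // followed by AND, AND/OR or the closing parenthesis.
    static List<String> extract(String expression) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(expression);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String s = "( var1=:key1:'any_value including space and 'quotes'' "
                 + "AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )";
        for (String pair : extract(s)) {
            System.out.println(pair);
        }
    }
}
```

Because the value is matched lazily and closed by a lookahead, a value that itself contains a quote followed by AND would be cut short; the sketch does not try to handle that case.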
I think this would work for your circumstances.
Unless you know parsing of quotes, there is not much else you could do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+

Regex - trying to match / \ | , or newline, error says invalid escape character

I'm actually trying to split a string on any of the following :
/
\
|
,
\n
Here's the regex I'm using, which gives the 'invalid escape character' error :
String delims = "[\\\\\|\\/\\n,]+";
String[] list1 = str1.split(delims);
I've tried a few more versions of this, trying to get the number of \'s right. What's the right way to do this?
"[/\\|\n,\\\\]+"
Some of these you need to double escape
/ matches /
\\| matches |
\n matches new line
, matches ,
\\\\ matches \
To create a \ literal in the regex engine you need to write it with four \ in the string, so you have one extra \:
"[\\\\\|\\/\\n,]+";
  1234^
      here
Also, you don't need to escape / in the Java regex engine, and you don't need to pass \n as \\n (a literal \n is also accepted), so you can try:
String delims = "[\\\\|/\n,]+";
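As a quick sanity check of the corrected delimiter string, here is a short Java sketch. The class name, method name and sample input are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class for the corrected delimiter regex.
class DelimiterSplit {
    // Character class containing \ | / newline and comma, repeated 1+ times.
    static final String DELIMS = "[\\\\|/\n,]+";

    static String[] splitOnDelims(String s) {
        return s.split(DELIMS);
    }

    public static void main(String[] args) {
        // The Java literal below is the text a/b\c|d,e<newline>f
        String input = "a/b\\c|d,e\nf";
        System.out.println(Arrays.toString(splitOnDelims(input))); // [a, b, c, d, e, f]
    }
}
```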

Split a string on commas not contained within double-quotes with a twist

I asked this question earlier and it was closed because it was a duplicate, which I accept and actually found the answer in the question Java: splitting a comma-separated string but ignoring commas in quotes, so thanks to whoever posted it.
But I've since run into another issue. Apparently what I need to do is use "," as my delimiter when there are zero or an even number of double-quotes, but also ignore any "," contained in brackets.
So the following:
"Thanks,", "in advance,", "for("the", "help")"
Would tokenize as:
Thanks,
in advance,
for("the", "help")
I'm not sure if there's any way to modify the current regex I'm using to allow for this, but any guidance would be appreciated.
line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
Sometimes it is easier to match what you want instead of what you don't want:
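import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
// A token is a quoted run; a parenthesized group inside it may contain quotes.
String regex = "\"(\\([^)]*\\)|[^\"])*\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group()); // same as s.substring(m.start(), m.end())
}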
Output:
"Thanks,"
"in advance,"
"for("the", "help")"
If you also need it to ignore closing brackets inside the quotes sections that are inside the brackets, then you need this:
String regex = "\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"";
An example of a string which needs this second, more complex version is:
"foo","bar","baz(":-)",":-o")"
Output:
"foo"
"bar"
"baz(":-)",":-o")"
However, I'd advise you to change your data format if at all possible. This would be a lot easier if you used a standard format like XML to store your tokens.
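For reference, the two patterns from this answer can be wrapped in one small helper and run against both sample lines. This is only a sketch; the class name, constant names and method name are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that matches tokens instead of splitting on commas.
class QuotedTokenizer {
    // Quoted token; a (...) group inside it may contain quotes but no ')'.
    static final Pattern SIMPLE =
            Pattern.compile("\"(\\([^)]*\\)|[^\"])*\"");

    // As above, but quoted sections inside the (...) group may contain ')'.
    static final Pattern NESTED =
            Pattern.compile("\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"");

    static List<String> tokens(Pattern pattern, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String first = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
        String second = "\"foo\",\"bar\",\"baz(\":-)\",\":-o\")\"";
        tokens(SIMPLE, first).forEach(System.out::println);
        tokens(NESTED, second).forEach(System.out::println);
    }
}
```

The tokens keep their surrounding double quotes, as in the outputs shown above; stripping them would be a separate step.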
A home-grown parser is easily written.
For example, this ANTLR grammar takes care of your example input without much trouble:
parse
  :  line*
  ;

line
  :  Quoted ( ',' Quoted )* ( '\r'? '\n' | EOF )
  ;

Quoted
  :  '"' ( Atom )* '"'
  ;

fragment
Atom
  :  Parentheses
  |  ~( '"' | '\r' | '\n' | '(' | ')' )
  ;

fragment
Parentheses
  :  '(' ~( '(' | ')' | '\r' | '\n' )* ')'
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;
and it would be easy to extend this to take escaped quotes or parentheses into account.
When the following two lines of input are fed to the parser generated from that grammar:
"Thanks,", "in advance,", "for("the", "help")"
"and(,some,more)","data , here"
it gets parsed like this:
If you are considering using ANTLR for this, I can post a little HOW-TO on getting a parser from the grammar I posted, if you want.
