I am currently writing a java program with regular expression but I am struggling as I am pretty new in regex.
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!##\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]/ ]*";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)";
The target syntax is key(value)+key(value)+key(value). Key is alphanumeric and value is allowed to be any combination.
This has been okay so far. However, I have a problem with '(', ')' in value. If I place '(' or ')' in the value, value includes all the rest.
e.g. number(abc(kk)123)+status(open) returns key:number, value:abc(kk)123)+status(open
It is supposed to be two pairs of key-value.
Can you guys suggest to improve the expression above?
Not possible with regular expressions at all, sorry. If you want to count opening and closing parantheses, regular expressions are, in general, not good enough. The language you are trying to parse is not a regular language.
Of course, there may be ways around that limitation. We cannot know that if you give us as little context as you did.
Get the matched group from index 1 and 2
([a-zA-Z0-9]+)\((.*?)\)(?=\+|$)
Here is online demo
The above regex pattern looks of for )+ as delimiter between keys and values.
Note: The above regex pattern will not work if value contains )+ for example number(abc(kk)+123+4+4)+status(open)
Sample code:
String str = "number(abc(kk)123)+status(open)";
Pattern p = Pattern.compile("([a-zA-Z0-9]+)\\((.*?)\\)(?=\\+|$)");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group(1) + ":" + m.group(2));
}
output:
number:abc(kk)123
status:open
Someone posted an answer with a working solution regex: ([a-zA-z0-9]+)\((.*?)\)(?=\+|$) - This works great. When I tested on online regex tester site and came back, the post had gone. Is it right solution? I am wondering why the answer has been deleted.
See this golfed regex:
([^\W_]+)\((.*?)\)(?![^+])
You can use a shorthanded character class [^\W_] instead of [a-zA-Z0-9].
You can use a negative lookahead assertion (?![^+]) to match without backtracking.
However, this is not a practical solution as )+ within inner elements will break: number(abc(kk)+5+123+4+4)+status(open)
This is the case where Java, which has the regex implementation that doesn't support recursion, is disadvantaged. As I mentioned in this thread, the practical approach would be to use a workaround (copy-paste regex), or build your own finite state machine to parse it.
Also, you have a typographical error in your original regex. [a-zA-z0-9]+ has a range "A-z". You meant to type "A-Z".
I'll do a little assumption that you're able to add a + at the end of your chunk
i.e. number(abc(kk)123)+status(open)+
If it is possible you'll have it work with:
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!##\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]\\(\\)/ ]*?";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)+";
The changes are on line 2 adding the ( ) with escaping and replacing * by *?
The ? turn off the greedy matching and try to keep the shortest match (reluctant operator).
On line 3 adding a + at the end of the mask to help separate the key(value) fields.
Related
I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words of length only greater than 4 and remove them. IDs must comprise of both letters and numbers at the same time, in any order. They can have optional special symbols like :.- and are generally all uppercase except several cases when there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.
My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""
I need to cut certain strings for an algorithm I am making. I am using substring() but it gets too complicated with it and actually doesn't work correctly. I found this topic how to cut string with two regular expression "_" and "."
and decided to try with split() but it always gives me
java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 0
+
^
So this is the code I have:
String[] result = "234*(4-5)+56".split("+");
/*for(int i=0; i<result.length; i++)
{
System.out.println(result[i]);
}*/
Arrays.toString(result);
Any ideas why I get this irritating exception ?
P.S. If I fix this I will post you the algorithm for cutting and then the algorithm for the whole calculator (because I am building a calculator). It is gonna be a really badass calculator, I promise :P
+ in regex has a special meaning. to be treated as a normal character, you should escape it with backslash.
String[] result = "234*(4-5)+56".split("\\+");
Below are the metacharaters in regex. to treat any of them as normal characters you should escape them with backslash
<([{\^-=$!|]})?*+.>
refer here about how characters work in regex.
The plus + symbol has meaning in regular expression, which is how split parses it's parameter. You'll need to regex-escape the plus character.
.split("\\+");
You should split your string like this: -
String[] result = "234*(4-5)+56".split("[+]");
Since, String.split takes a regex as delimiter, and + is a meta-character in regex, which means match 1 or more repetition, so it's an error to use it bare in regex.
You can use it in character class to match + literal. Because in character class, meta-characters and all other characters loose their special meaning. Only hiephen(-) has a special meaning in it, which means a range.
+ is a regex quantifier (meaning one or more of) so needs to be escaped in the split method:
String[] result = "234*(4-5)+56".split("\\+");
I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example - I wish to get all the patterns where I have matches for "faces" not "facebook". I can accept a space character behind the face in the string so if in a string I have a match like " face" or " faces" or "face " or " faces" i can accept that too. However I can not accept "duckface" or "duckface " etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is my list of keywords, but I am not getting the desired results. Can you read my description and please suggest what might be issue and how I can fix it?
Also if a pointer to a really good regex for Java page is shared I would appreciate that as well.
Thank you Contributers ..
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\b"+keyword+"s?\b");
, where keyword is not plural. \\b means that keyword must be as a complete word in searched string. s? means that keyword's value may end with s.
If you are not familar enough with regular expressions I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because there are examples and explanations.
I have the following pattern:
(COMPANY) -277.9887 (ASP,) -277.9887 (INC.)
I want the final output to be:
COMPANY ASP, INC.
Currently I have the following code and it keeps returning the original pattern ( I assume because the group all falls between the first '(' and last ')'
Pattern p = Pattern.compile("((.*))",Pattern.DOTALL);
Matcher matcher = p.matcher(eName);
while(matcher.find())
{
System.out.println("found match:"+matcher.group(1));
}
I am struggling to get the results I need and appreciate any help. I am not worried about concatenating the results after I get each group, just need to get each group.
Pattern p = Pattern.compile("\\((.*?)\\)",Pattern.DOTALL);
Your .* quantifier is 'greedy', so yes, it's grabbing everything between the first and last available parenthesis. As chaos says, tersely :), using the .*? is a non-greedy quantifier, so it will grab as little as possible while still maintaining the match.
And you need to escape the parenthesis within the regex, otherwise it becomes another group. That's assuming there are literal parenthesis in your string. I suspect what you referred to in the initial question as your pattern is in fact your string.
Query: are "COMPANY", "ASP," and "INC." required?
If you must have values for them, then you want to use + instead of *, the + is 1-or-more, the * is zero-or-more, so a * would match the literal string "()"
eg: "((.+?))"
Tested with Java 8:
/** * Below Pattern returns the string inside Parenthesis.
* Description about casting regular expression: \(+\s*([^\s)]+)\s*\)+
* \(+ : Exactly matches character "(" at least once
* \s* : matches zero to any number white character.
* ( : Start of Capturing group
* [^\s)]+: match any number of character except ^, ) and spaces.
* ) : Closing of capturing group.
* \s*: matches any white character(0 to any number of character)
* \)*: Exactly matches character ")" at least once.
private static Pattern REGULAR_EXPRESSION = Pattern.compile("\\(+\\s*([^\\s)]+)\\s*\\)+");
Not a direct answer to your question but I recommend you use RegxTester to get to the answer and any future question quickly. It allows you to test in realtime.
If your strings are always going to look like that, you could get away with just using a couple calls to replaceAll instead. This seems to work for me:
String eName = "(COMPANY) -277.9887 (ASP,) -277.9887 (INC.)";
String eNameEdited = eName.replaceAll("\\).*?\\("," ").replaceAll("\\(|\\)","");
System.out.println(eNameEdited);
Probably not the most efficient thing in the world, but fairly simple.