Regex for circle and polygon string with decimal/integer values - java

I'm trying to create regex patterns to be used in Java for the following two strings:
CIRCLE ( (187.8562 ,-88.562 ) , 0.774 )
and
POLYGON ( (17.766 55.76676,77.97666 -32.866888,54.97799 54.2131,67.666777 24.9771,17.766 55.76676) )
Please note that:
One or more whitespace characters may appear anywhere, except between the letters of a word or between the digits of a number. [UPDATED]
The words CIRCLE and POLYGON are fixed but are not case sensitive. [UPDATED]
For the 2nd string, the number of point sets is not fixed. Here I've given 5 sets of points for simplicity.
Points are sets of decimal/integer numbers. [UPDATED]
A positive decimal number can have a + sign. [UPDATED]
A leading zero is not mandatory for a decimal number. [UPDATED]
For a polygon, at least 3 point sets are required. And also first & last point set will be the same (enclosed polygon) [UPDATED]
Any help or suggestion will be appreciated.
I've tried as:
(CIRCLE)(\\s+)(\\()(\\s+)(\\()(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(,)(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(\\))(\\s+)(,)(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(\\))
Could you please provide me with a working regex pattern for those two strings?

I suggest removing the whitespace from your string before applying the regex to it.
Circle:
CIRCLE\(\(-?\d+\.\d+,-?\d+\.\d+\),[-]?\d+\.\d+\)
Polygon:
POLYGON\(\((-?\d+\.\d+\s+-?\d+\.\d+,)+-?\d+\.\d+\s+-?\d+\.\d+\)\)
Circle including spaces:
CIRCLE\s*\(\s*\(\s*-?\d+\.\d+\s*,\s*-?\d+\.\d+\s*\)\s*,\s*-?\d+\.\d+\s*\)
Polygon including spaces:
POLYGON\s*\(\s*\(\s*(-?\d+\.\d+\s+-?\d+\.\d+\s*,\s*)+\s*-?\d+\.\d+\s+-?\d+\.\d+\s*\)\s*\)
Circle including spaces updated:
/CIRCLE\s*\(\s*\(\s*[+-]?\d*\.\d+\s*,\s*[+-]?\d*\.\d+\s*\)\s*,\s*[+-]?\d*\.\d+\s*\)/i
Polygon including spaces updated:
/POLYGON\s*\(\s*\(\s*([+-]?\d*\.\d+)\s+([+-]?\d*\.\d+)\s*(,\s*[+-]?\d*\.\d+\s+[+-]?\d*\.\d+)+\s*,\s*\1\s+\2\s*\)\s*\)/i
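The /.../i notation above is not Java syntax; in Java the case-insensitive flag is passed as Pattern.CASE_INSENSITIVE. Below is a minimal sketch of how the whitespace-tolerant patterns might be used from Java. The class name ShapePatterns, the helper names, and the widened number sub-pattern (which also accepts integers, as the question requires) are assumptions for illustration, not part of the original answer. The closing-point rule is not checked here, and the minimum of four pairs (three points plus the repeated closing point) is one reading of the requirement.

```java
import java.util.regex.Pattern;

public class ShapePatterns {
    // Assumption: a number is an optionally signed integer or decimal,
    // with the leading zero optional (e.g. 5, -3.25, +.5).
    static final String NUM = "[+-]?(?:\\d+(?:\\.\\d+)?|\\.\\d+)";
    // A point: two numbers separated by at least one whitespace character.
    static final String PAIR = NUM + "\\s+" + NUM;

    static final Pattern CIRCLE = Pattern.compile(
        "\\s*CIRCLE\\s*\\(\\s*\\(\\s*" + NUM + "\\s*,\\s*" + NUM
            + "\\s*\\)\\s*,\\s*" + NUM + "\\s*\\)\\s*",
        Pattern.CASE_INSENSITIVE);

    // First pair plus at least three more pairs; closure is NOT verified.
    static final Pattern POLYGON = Pattern.compile(
        "\\s*POLYGON\\s*\\(\\s*\\(\\s*" + PAIR
            + "(?:\\s*,\\s*" + PAIR + "){3,}"
            + "\\s*\\)\\s*\\)\\s*",
        Pattern.CASE_INSENSITIVE);

    static boolean isCircle(String s) {
        return CIRCLE.matcher(s).matches();
    }

    static boolean isPolygon(String s) {
        return POLYGON.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCircle("CIRCLE ( (187.8562 ,-88.562 ) , 0.774 )"));
        System.out.println(isPolygon(
            "POLYGON ( (17.766 55.76676,77.97666 -32.866888,54.97799 54.2131,"
                + "67.666777 24.9771,17.766 55.76676) )"));
    }
}
```

Using matches() rather than find() makes the whole input subject to the pattern, so trailing garbage is rejected.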

UPDATED ANSWER:
This matches the examples from the question and comments:
(CIRCLE|POLYGON)([( ]+)([+ \-\.]?(\d+)?([ \.]\d+[ ,)]+))+

Any help or suggestion will be appreciated.
My suggestion is to break it up into pieces. Just as you'd want to break up a large, complex function into smaller functions so that each part is easy to see and understand, you want to break up a large, complex regex pattern into smaller patterns for the same reason. For example:
// Requires: import java.util.regex.Pattern;
private interface Patterns {
    String UNSIGNED_INTEGER = "(?:0|[1-9]\\d*+)";
    String DECIMAL_PART = "(?:[.]\\d++)";
    String UNSIGNED_NUMBER_WITH_INTEGER_PART =
        "(?:" + UNSIGNED_INTEGER + DECIMAL_PART + "?+)";
    String UNSIGNED_NUMBER =
        "(?:" + UNSIGNED_NUMBER_WITH_INTEGER_PART + "|" + DECIMAL_PART + ")";
    String NUMBER = "(?:[+-]?+" + UNSIGNED_NUMBER + ")";
    // Two numbers separated by whitespace, i.e. one polygon point
    String NUMBER_PAIR = "(?:" + NUMBER + "\\s++" + NUMBER + ")";
    String OPTIONAL_SPACE = "(?:\\s*+)";
    String LPAREN = "(?:" + OPTIONAL_SPACE + "[(]" + OPTIONAL_SPACE + ")";
    String RPAREN = "(?:" + OPTIONAL_SPACE + "[)]" + OPTIONAL_SPACE + ")";
    String COMMA = "(?:" + OPTIONAL_SPACE + "," + OPTIONAL_SPACE + ")";
    Pattern CIRCLE = Pattern.compile(
        OPTIONAL_SPACE + "CIRCLE" + OPTIONAL_SPACE + LPAREN +
            LPAREN +
                NUMBER + COMMA + NUMBER +
            RPAREN + COMMA +
            NUMBER +
        RPAREN + OPTIONAL_SPACE,
        Pattern.CASE_INSENSITIVE);
    // One point followed by at least three more (the last repeats the first)
    Pattern POLYGON = Pattern.compile(
        OPTIONAL_SPACE + "POLYGON" + OPTIONAL_SPACE + LPAREN +
            LPAREN +
                NUMBER_PAIR + "(?:" + COMMA + NUMBER_PAIR + "){3,}+" +
            RPAREN +
        RPAREN + OPTIONAL_SPACE,
        Pattern.CASE_INSENSITIVE);
}
Notes:
The above is not tested. My goal was to show you how to do this maintainably, rather than to simply do it for you. (It should work as-is, though, unless I have typos or whatnot.)
Note the pervasive use of non-capture groups (?:...). This allows each subpattern to be a separate module; for example, something like COMMA + "+" is well-defined as meaning "one or more commas, plus optional spaces".
Also note the pervasive use of possessive quantifiers like ?+ and *+ and ++. It's easier to tell what is matched by a given occurrence of NUMBER when you know that NUMBER will never "stop short" before a trailing digit. (Imagine having a function whose behavior depended on the code that runs after it. That would be confusing, right? Well, the non-possessive quantifiers can change their meaning depending on what follows, which can have similarly confusing results for large, complex regexes.) This also has considerable performance benefits in the event of a near-match.
I made no attempt to detect the "And also first & last point set will be the same (enclosed polygon)" case. Regexes are not suited to this, since regexes are a string-description language, and "same" in this case is not a string concept but a mathematical one. (It's easy to tell that 1 +0.3 is equivalent to +1.0 .30 if you use something like BigDecimal to store the actual values; but to try to express that using a regex would be pure folly.)
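The closing-point check mentioned in the last note can be done numerically once the numbers have been extracted. Here is a minimal sketch, assuming the input has already been validated as a polygon string; the class name PolygonClosure, the method isClosed, and the simplified number pattern are hypothetical and not part of the answer's code.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolygonClosure {
    // Assumption: simplified number pattern (optional sign, optional leading zero).
    private static final Pattern NUMBER =
        Pattern.compile("[+-]?(?:\\d+(?:\\.\\d+)?|\\.\\d+)");

    // Returns true when the first and last coordinate pairs are numerically equal.
    static boolean isClosed(String polygon) {
        List<BigDecimal> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(polygon);
        while (m.find()) {
            values.add(new BigDecimal(m.group()));
        }
        int n = values.size();
        if (n < 4 || n % 2 != 0) {
            return false; // need at least two complete pairs to compare
        }
        // compareTo ignores scale, so 1 equals 1.0 and 0.3 equals .30
        return values.get(0).compareTo(values.get(n - 2)) == 0
            && values.get(1).compareTo(values.get(n - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isClosed("POLYGON ( (1 +0.3, 2 2, 3 3, +1.0 .30) )")); // true
        System.out.println(isClosed("POLYGON ( (1 1, 2 2, 3 3, 1 2) )"));         // false
    }
}
```

BigDecimal.compareTo is used instead of equals because equals also compares scale, which would treat 1 and 1.0 as different.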


Regular Expression That Contains All Of The Specific Letters In Java

I have a regular expression that selects all the words containing all (not just any) of the specified letters. It works fine in Notepad++.
Regular expression pattern:
^(?=.*B)(?=.*T)(?=.*L).+$
Input Text File;
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And output of the regular expression in notepad++;
LABAT
BALAT
LATAB
Since it works in Notepad++, I tried the same regular expression in Java, but it simply fails.
Here is my test code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
public static void main(String[] args) {
String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
String dictionary =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(dictionary);
while(m.find())
{
System.out.println("Match: " + m.group());
}
}
}
The output is erroneous, as shown below:
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply: what is the Java-compatible version of this regular expression?
Java-specific answer
In real life, we rarely need to validate lines this way, and I see that in fact you just use the input as an array of test data. The most common scenario is reading input line by line and performing checks on each line. I agree that the solution would look a bit different in Notepad++, but in Java each line should be checked separately.
That said, you should not copy the same approach across different platforms. What is good in Notepad++ does not have to be good in Java.
I suggest this almost regex-free approach (String#split() still takes a regex):
String dictionary_str =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
if(dictionary[i].indexOf("B") > -1 && // There must be B
dictionary[i].indexOf("T") > -1 && // There must be T
dictionary[i].indexOf("L") > -1) // There must be L
{
System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
}
}
See IDEONE demo
Original regex-based answer
You should not rely on .* ever. This construct causes backtracking issues all the time. In this case, you can easily optimize it with a negated character class and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B presence that may be preceded with 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T presence that may be preceded with 0 or more characters other than T
(?=[^L]*+L)- still right at the start of the string, check for at least one L presence that may be preceded with 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
for (int i=0; i<dictionary.length; i++)
{
Pattern p = Pattern.compile(patternString);
Matcher m = p.matcher(dictionary[i]);
if(m.find())
{
System.out.println("Match: " + dictionary[i]);
}
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB
Change your Pattern to:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output
Match: LABAT
Match: BALAT
Match: LATAB
I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.
You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.
Please have a look at this answer. You need to use Pattern.MULTILINE.
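As a minimal sketch of the Pattern.MULTILINE suggestion: with that flag, ^ and $ match at line boundaries, and without Pattern.DOTALL the dot stops at each newline, so the original Notepad++ expression can be applied to the whole dictionary string unchanged. The class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineLookahead {
    // MULTILINE: ^ and $ anchor to each line; no DOTALL, so . never crosses a newline.
    private static final Pattern ALL_LETTERS =
        Pattern.compile("^(?=.*B)(?=.*T)(?=.*L).+$", Pattern.MULTILINE);

    static List<String> wordsWithAllLetters(String dictionary) {
        List<String> result = new ArrayList<>();
        Matcher m = ALL_LETTERS.matcher(dictionary);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : wordsWithAllLetters(dictionary)) {
            System.out.println("Match: " + word);
        }
        // Prints:
        // Match: LABAT
        // Match: BALAT
        // Match: LATAB
    }
}
```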

How is "STAR" not considered a quantifier in a regular expression?

The following form of IP address (for example) works without problems:
255.3.3.6
with this regex (from: http://www.mkyong.com/regular-expressions/how-to-validate-ip-address-with-regular-expression/):
"^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])$";
But I want an IP pattern that also handles addresses like the following:
255.*.3.100
OR
*.*.3.100
OR
*.*.*.*
(a star can appear in any position of the IP address)
I use this pattern:
"^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.|(\\*)\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.|(\\*)\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.|(\\*)\\." +
"([01]?\\d\\d?|2[0-4]\\d|25[0-5])|(\\*)\\.$";
but it does not work.
I think the star in my pattern is being treated as a quantifier.
What should I do? Please help me.
The asterisk is an additional alternative. Compose without repetitions.
String group = "(?:[01]?\\d\\d?|2[0-4]\\d|25[0-5]|\\*)";
String patstr = "^" + group + "(\\." + group + "){3}$";
Pattern pat = Pattern.compile( patstr );
Matcher mat = pat.matcher( args[0] );
System.out.println( mat.matches() );
The grammar represented by OP's regular expression can be written as
IP ::= DP
|APDP
|APDP
|APD
|AP
D ::= Number
P ::= '.'
A ::= '*'
Note that the operator | separates whole alternatives; thus the pattern as a whole matches no valid address, and no address in which a number is replaced by an asterisk.
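The composed pattern from this answer can be exercised with a small self-contained sketch. The class name WildcardIp and the method isValid are hypothetical; the octet group is the one given above, with the asterisk added as one more alternative inside the group.

```java
import java.util.regex.Pattern;

public class WildcardIp {
    // One octet: a number from 0 to 255, or a literal asterisk.
    private static final String OCTET = "(?:[01]?\\d\\d?|2[0-4]\\d|25[0-5]|\\*)";
    // Four octets separated by literal dots.
    private static final Pattern IP =
        Pattern.compile("^" + OCTET + "(?:\\." + OCTET + "){3}$");

    static boolean isValid(String address) {
        return IP.matcher(address).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"255.3.3.6", "255.*.3.100", "*.*.3.100", "*.*.*.*", "256.1.1.1"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because the alternation sits inside a group, each | only chooses between forms of a single octet rather than splitting the whole expression.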

What does regex "\\p{Z}" mean?

I am working with some code in Java that has a statement like
String tempAttribute = ((String) attributes.get(i)).replaceAll("\\p{Z}","")
I am not used to regex, so what does it mean? (If you could point me to a website for learning the basics of regex, that would be wonderful.) I've seen that a string like
ept as y gets transformed into eptasy, but this doesn't seem right. I believe the person who wrote this may have wanted to trim leading and trailing spaces.
It removes all the whitespace (replaces all whitespace matches with empty strings).
A wonderful regex tutorial is available at regular-expressions.info.
A citation from this site:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
The OP stated that the code fragment was in Java. To comment on the statement:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
the sample code below shows that this does not apply in Java.
public static void main(String[] args) {
// some normal white space characters
String str = "word1 \t \n \f \r " + '\u000B' + " word2";
// various regex patterns meant to remove ALL white spaces
String s = str.replaceAll("\\s", "");
String p = str.replaceAll("\\p{Space}", "");
String b = str.replaceAll("\\p{Blank}", "");
String z = str.replaceAll("\\p{Z}", "");
// \\s removed all white spaces
System.out.println("s [" + s + "]\n");
// \\p{Space} removed all white spaces
System.out.println("p [" + p + "]\n");
// \\p{Blank} removed only \t and spaces not \n\f\r
System.out.println("b [" + b + "]\n");
// \\p{Z} removed only spaces not \t\n\f\r
System.out.println("z [" + z + "]\n");
// NOTE: \p{Separator} throws a PatternSyntaxException
try {
String t = str.replaceAll("\\p{Separator}","");
System.out.println("t [" + t + "]\n"); // N/A
} catch ( Exception e ) {
System.out.println("throws " + e.getClass().getName() +
" with message\n" + e.getMessage());
}
} // public static void main
The output for this is:
s [word1word2]
p [word1word2]
b [word1
word2]
z [word1
word2]
throws java.util.regex.PatternSyntaxException with message
Unknown character property name {Separator} near index 12
\p{Separator}
^
This shows that in Java \\p{Z} removes the space characters but leaves \t, \n, \f and \r in place, so it does not match "any kind of whitespace or invisible separator".
These results also show that in Java \\p{Separator} throws a PatternSyntaxException.
First of all, \p means you are going to match a class, that is, a collection of characters rather than a single one. For reference, this is the Javadoc of the Pattern class: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
And then Z is the name of a class (collection, set) of characters. In this case, it's the abbreviation of Separator. Separator contains 3 subclasses: Space_Separator (Zs), Line_Separator (Zl) and Paragraph_Separator (Zp).
To see which characters those classes contain, refer to the Unicode Character Database or
Unicode Character Categories
More document: http://www.unicode.org/reports/tr18/#General_Category_Property
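To make the category membership concrete, here is a small sketch showing which characters \p{Z} removes in Java, plus a variant that only strips leading and trailing separators, which is what the question suspects was intended. The class and method names are invented for illustration.

```java
public class SeparatorDemo {
    // Removes every character in Unicode category Z (Zs, Zl, Zp).
    static String stripSeparators(String s) {
        return s.replaceAll("\\p{Z}", "");
    }

    // Removes separators only at the start and end of the string.
    static String trimSeparators(String s) {
        return s.replaceAll("^\\p{Z}+|\\p{Z}+$", "");
    }

    public static void main(String[] args) {
        // U+0020 and U+00A0 are Zs, U+2028 is Zl, U+2029 is Zp;
        // tab and newline are control characters (category Cc), not separators.
        String mixed = "a b\u00A0c\u2028d\u2029e\tf\ng";
        System.out.println(stripSeparators(mixed).equals("abcde\tf\ng")); // true
        System.out.println("[" + trimSeparators("\u00A0 ept as y \u00A0") + "]"); // [ept as y]
    }
}
```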

REGEX to format phone number in java

Given a phone number with spaces and + allowed, how would you write a regular expression to format it so that non-digits and extra spaces are removed?
I have this so far
String num = " Ken's Phone is + 123 2213 123 (night time)";
System.out.println(num.replaceAll("[^\\d|+|\\s]", "").replaceAll("\\s\\s+", " ").replaceAll("\\+ ", "\\+").trim());
Would you simplify it so that the same result is obtained?
Thank you
I would put trim() first, or at least before you collapse the runs of multiple spaces.
Also keep in mind that \s means any whitespace character: [ \t\n\x0B\f\r]. If you only mean ' ', then use that instead.
A nicer way to express that you only want runs of at least two spaces to be replaced would be
replaceAll("\\s{2,}", " ")
First extract the number-with-spaces part, then compress multiple spaces to single spaces, then remove all spaces that follow a plus sign, and finally trim the trailing space that the extraction can leave behind:
String numberWithSpaces = str.replaceAll("^[^\\d+]*([+\\d\\s]+)[^\\d]*$", "$1").replaceAll("\\s+", " ").replaceAll("\\+\\s*", "+").trim();
I tested this code and it works.
You can simplify it as:
num.replaceAll("[^\\d+\\s]", "") // [^\\d|+|\\s] => [^\\d+\\s]
.replaceAll("\\s{2,}", " ") // \\s\\s+ => \\s{2,}
.replaceAll("\\+\\s", "+") // \\+ => +
.trim()
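Wrapped into a runnable sketch (the class name PhoneFormatter and the method format are made up for illustration), the simplified chain behaves as follows on the question's input:

```java
public class PhoneFormatter {
    static String format(String raw) {
        return raw.replaceAll("[^\\d+\\s]", "") // keep only digits, '+' and whitespace
                  .replaceAll("\\s{2,}", " ")   // collapse runs of whitespace
                  .replaceAll("\\+\\s", "+")    // glue '+' to the digits that follow
                  .trim();                      // drop leading/trailing whitespace
    }

    public static void main(String[] args) {
        String num = " Ken's Phone is + 123 2213 123 (night time)";
        System.out.println(format(num)); // prints "+123 2213 123"
    }
}
```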

How do I find a group of words using Reg-ex?

Here is the code:
String Str ="Animals \n" +
"Dog \n" +
"Cat \n" +
"Fruits \n" +
"Apple \n" +
"Banana \n" +
"Watermelon \n" +
"Sports \n" +
"Soccer \n" +
"Volleyball \n";
The Str basically has 3 categories (Animals, Fruits, Sports). Each of them in separate line. Using Regular Expression, how do I find the Fruits' contents, which will give me the output like this:
Apple
Banana
Watermelon
I would like an explanation that goes with your answer as well, so that I will have a better understand about this problem.
Thanks. :)
Assuming that you want to extract the text between the word "Fruits" and the word "Sports" you could use a regular expression with a capturing group. This way, if a string matches then you still have to extract the group that contains the text that you want.
For example:
Pattern p = Pattern.compile("Fruits(.*?)Sports", Pattern.DOTALL);
// Fruits          : the literal string "Fruits"
// (.*?)           : capture everything in between
// Sports          : the literal string "Sports"
// Pattern.DOTALL  : tells the regex to treat newlines like normal characters
Alternatively, you can use a more advanced regular expression with a positive lookbehind and a positive lookahead. This means that your regular expression still looks for text between the words "Fruits" and "Sports" but does not consider those strings themselves as part of the match.
Pattern p = Pattern.compile("(?<=Fruits).*?(?=Sports)", Pattern.DOTALL);
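As a runnable sketch of the capturing-group approach, the following extracts the group and splits it into trimmed lines. The class name FruitsExtractor and the method fruits are hypothetical, and the sketch assumes "Fruits" and "Sports" each appear once as headings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FruitsExtractor {
    private static final Pattern BETWEEN =
        Pattern.compile("Fruits(.*?)Sports", Pattern.DOTALL);

    static List<String> fruits(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = BETWEEN.matcher(text);
        if (m.find()) {
            // group(1) is the text between the two headings, newlines included
            for (String line : m.group(1).split("\n")) {
                String item = line.trim();
                if (!item.isEmpty()) {
                    result.add(item);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "Animals \nDog \nCat \nFruits \nApple \nBanana \nWatermelon \n"
            + "Sports \nSoccer \nVolleyball \n";
        for (String fruit : fruits(str)) {
            System.out.println(fruit);
        }
        // Prints:
        // Apple
        // Banana
        // Watermelon
    }
}
```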
I would start by splitting the string into an array of words (String[] words = Str.split("\n");), then loop through the words array, adding elements to their proper categories as you go along, switching between the categories as you see headings.
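A minimal sketch of that split-and-loop idea follows, assuming the set of heading names is known in advance. The class name CategoryGrouper, the method group, and the headings parameter are assumptions introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CategoryGrouper {
    // Groups each non-heading line under the most recently seen heading.
    static Map<String, List<String>> group(String text, Set<String> headings) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : text.split("\n")) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue;
            }
            if (headings.contains(word)) {
                current = new ArrayList<>(); // switch to a new category
                result.put(word, current);
            } else if (current != null) {
                current.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "Animals \nDog \nCat \nFruits \nApple \nBanana \nWatermelon \n"
            + "Sports \nSoccer \nVolleyball \n";
        Set<String> headings = new HashSet<>(Arrays.asList("Animals", "Fruits", "Sports"));
        System.out.println(group(str, headings).get("Fruits")); // [Apple, Banana, Watermelon]
    }
}
```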
