I have a regular expression that I want to match a latitude/longitude pair in a variety of fashions, e.g.
123 34 42
-123* 34' 42"
123* 34' 42"
+123* 34' 42"
45* 12' 22"N
45 12' 22"S
90:00:00.0N
I want to be able to match these in a pair such that
90:00:00.0N 180:00:00.0E is a latitude/longitude pair.
or
45* 12' 22"N 46* 12' 22"E is a latitude/longitude pair (1 degree by 1 degree cell).
or
123* 34' 42" 124* 34' 42" is a latitude/longitude pair
etc
Using the below regular expression, when I type in 123, it matches. I suppose this is true since 123 00 00 is a valid coordinate. However, I want to use this regular expression to match pairs in the same format above
"([-|\\+]?\\d{1,3}[d|D|\u00B0|\\s](\\s*\\d{1,2}['|\u2019|\\s])?"
+ "(\\s*\\d{1,2}[\"|\u201d|\\s])?\\s*([N|n|S|s|E|e|W|w])?\\s?)"
I am using Java.
* denotes a degree.
What am I doing wrong in my regular expression?
Well, for one thing, you're filling your character classes with a bunch of unnecessary pipe characters: alternation is implied inside a [] pair, so a | in there is matched as a literal pipe. Additional cleanup: + doesn't need to be escaped in a character class. Your regular expression seems to be addressing a bigger problem than the one you described; you make no mention of d or D as matchable characters. And you've made pretty much the entire back half of your regex optional. Going off of what I think your original problem statement is, I built the following regular expression:
^\s*([+-]?\d{1,3}\*?\s+\d{1,2}'?\s+\d{1,2}+"?[NSEW]?|\d{1,3}(:\d{2}){2}\.\d[NSEW]\s*){1,2}$
It's a bit of a doozy, but I'll break it down for you, or anyone who happens across this in the future (Hello, future!).
^
Start of string, simple.
\s*
Any amount of whitespace - even none.
(
Denotes the beginning of a group - we'll get back to that.
[-+]?
An optional sign
\d{1,3}
One to three digits.
\*?
An optional asterisk. The escape is essential for an asterisk, but if you replace it with the Unicode code point for an actual degree sign, you won't need it.
\s+
At least one character of whitespace
\d{1,2}
1 or two digits.
'?
Optional apostrophe
\s+\d{1,2}+
You've seen these before, but there's a new curveball - there's a plus after the {1,2} quantifier! This makes it a possessive quantifier, meaning that the matcher won't give up its matches for this group to make another one possible. This is almost exclusively here to prevent 1 1 11 1 1 from matching, but can be used to increase speed anywhere you're 100% sure you don't need to be able to backtrack.
"?
Optional double quote. You'll have to escape this in Java.
[NSEW]?
An optional cardinal direction, designated by letter
|
OR - you can match everything in the group before this, or everything in the group after this.
\d{1,3}
Old news.
(:\d{2})
A colon, followed by two digits...
{2}
twice!
\.\d
Decimal point, followed by a single digit.
[NSEW]
Same as before, but this time it's mandatory.
\s*)
Some space, and finally the end of the group. Now, the first group has matched an entire longitude/latitude denotation, with an arbitrary amount of space at the end. Followed closely by:
{1,2}
Do that one, or two times - to match a single or a pair, then finally:
$
The end of the string.
This isn't perfect, but it's pretty close, and I think it answers the original problem statement. Plus, I feel my explanation has demystified it enough that you can edit it to further suit your needs. The one thing it doesn't (and won't) do, is enforce that the first coordinate matches the second in style. That's just too much to ask of Regular Expressions.
Doubters: Here it is in action. Please, enjoy.
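One thing to watch for: as far as I can tell, the pattern above only allows trailing whitespace after the colon form, so a pair such as 123* 34' 42" 124* 34' 42" would not get past the space between the two coordinates. Below is a minimal Java sketch of one way around that: build the single-coordinate pattern once and reuse it for the optional second coordinate. The class and method names are my own (hypothetical), and the accepted formats are only the ones listed in the question.

```java
import java.util.regex.Pattern;

class CoordinatePattern {
    // One coordinate: either "deg min sec" with optional *, ' and " marks and an
    // optional hemisphere letter, or the colon form such as 90:00:00.0N.
    private static final String COORD =
        "(?:[+-]?\\d{1,3}\\*?\\s+\\d{1,2}'?\\s+\\d{1,2}+\"?[NSEW]?"
        + "|\\d{1,3}(?::\\d{2}){2}\\.\\d[NSEW])";

    // One coordinate, optionally followed by whitespace and a second coordinate.
    private static final Pattern ONE_OR_TWO =
        Pattern.compile("^\\s*" + COORD + "(?:\\s+" + COORD + ")?\\s*$");

    static boolean matches(String input) {
        return ONE_OR_TWO.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("123"));                              // false
        System.out.println(matches("90:00:00.0N 180:00:00.0E"));         // true
        System.out.println(matches("123* 34' 42\" 124* 34' 42\""));      // true
    }
}
```

Like the original, this does not force both coordinates to use the same style.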
Generally, I don't think that this is a good approach.
In your interface, try to accept DMS coordinates in one specific format.
The user should enter them in 3 separate text fields.
Further, this regex is not very maintainable.
There are many more ways to write a DMS coordinate
than you can imagine. Humans are creative.
Eg:
Put N,S in front
or: North, 157 deg 50 min 55.796 sec
or: from wiki: The NGS now says in 1993 that point was 21-18-02.54891 N 157-50-45.90280 W
I'm not a RE wizard but with your formats you'd need to have some kind of convention for which pair comes first (probably latitude) if you're doing parsing from a single text box.
From there, you have six numeric fields (deg, min, sec for each, possibly with a decimal point), two signs (+ or - for each) and up to two hemispheres (one for each).
As far as I can see, parsing these 8-10 fields from your input would occur in the same order each time if you demanded only that the latitude is first, and the longitude second. The rest of the symbols (save the decimal point(s)) can be treated essentially as separators.
Does that make it easier?
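To make that concrete, here is a rough Java sketch of the idea: pull the six numbers out in order (latitude first, by convention), pick up hemisphere letters if present, and treat everything else as separators. The class name, method names, and the conversion to decimal degrees are my own assumptions, not something from the question. It also assumes a leading - on the degrees is a sign, so dash-separated input like 21-18-02.54891 would need different handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DmsPairSketch {
    // A number with an optional sign and an optional fractional part.
    private static final Pattern NUMBER = Pattern.compile("[-+]?\\d+(?:\\.\\d+)?");
    private static final Pattern HEMISPHERE = Pattern.compile("[NSEW]");

    private static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Converts one deg/min/sec triple to signed decimal degrees.
    private static double toDegrees(String d, String m, String s, String hemisphere) {
        double value = Math.abs(Double.parseDouble(d))
                + Double.parseDouble(m) / 60.0
                + Double.parseDouble(s) / 3600.0;
        boolean negative = d.startsWith("-");
        if ("S".equals(hemisphere) || "W".equals(hemisphere)) {
            negative = !negative;
        }
        return negative ? -value : value;
    }

    // Returns {latitude, longitude}; latitude is assumed to come first.
    static double[] parse(String input) {
        List<String> n = findAll(NUMBER, input);
        List<String> h = findAll(HEMISPHERE, input);
        if (n.size() != 6 || (h.size() != 0 && h.size() != 2)) {
            throw new IllegalArgumentException("Expected 6 numbers and 0 or 2 hemispheres: " + input);
        }
        String latH = h.isEmpty() ? null : h.get(0);
        String lonH = h.isEmpty() ? null : h.get(1);
        return new double[] {
            toDegrees(n.get(0), n.get(1), n.get(2), latH),
            toDegrees(n.get(3), n.get(4), n.get(5), lonH)
        };
    }
}
```

For example, DmsPairSketch.parse("90:00:00.0N 180:00:00.0E") returns 90.0 and 180.0.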
I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?
The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.
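For what it's worth, a minimal Java sketch of using that last pattern and reading the two capture groups might look like this. The class and method names are hypothetical, chosen just for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PaCodeExample {
    // PA-, six digits, a dash, three digits, then _TY; the digits are captured.
    private static final Pattern PA_CODE = Pattern.compile("^PA-(\\d{6})-(\\d{3})_TY$");

    // Returns {sixDigits, threeDigits}, or null if the input has another shape.
    static String[] parts(String input) {
        Matcher m = PA_CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] p = parts("PA-123456-067_TY");
        System.out.println(p[0] + " / " + p[1]); // 123456 / 067
    }
}
```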
According to your comment, this would be appropriate: PA-\d{6}-\d{3}_TY
EDIT: if you want to match a line use it with anchors: ^PA-\d{6}-\d{3}_TY$
I am parsing documents which contain large amounts of formatted numbers, an example being:
Frc consts -- 1.4362 1.4362 5.4100
IR Inten -- 0.0000 0.0000 0.0000
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 1 0.40 -0.20 0.23 -0.30 -0.18 0.36 0.06 0.42 0.26
These are separate lines, all with a significant leading space (there may or may not be significant trailing whitespace). They consist of 72, 72, 78, 78, and 78 characters. I can deduce the boundaries between fields. These are describable (using Fortran format: nx = n spaces, an = n alphanumeric characters, in = integer in n columns, fm.n = float of m characters with n places after the decimal point) by:
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a4,a4,3(2x,3a7))
(1x,2i4,3(2x,3f7.2))
(1x,2i4,3(2x,3f7.2))
I have potentially several thousand different formats (which I can autogenerate or farm out) and am describing them by regular expressions describing the components. Thus if regf10_4 represents a regex for any string satisfying the f10.4 constraint I can create a regex of the form:
COMMENTS
(\s
.{14}
\s
regf10_4,
\s{13}
regf10_4,
\s{13}
regf10_4,
)
I would like to know whether there are regexes that satisfy re-use in this way. There is a wide variety in the way computers and humans create numbers that are compatible with, say f10.4. I believe the following are all legal input and/or output for fortran (I do not require suffixes of the form f or d as in 12.4f) [the formatting in SO should be read as no leading spaces for the first, one for the second, etc.]
-1234.5678
1234.5678
// missing number
12345678.
1.
1.0000000
1.0000
1.
0.
0.
.1234
-.1234
1E2
1.E2
1.E02
-1.0E-02
********** // number over/underflow
They have to be robust against the content of the neighbouring fields (e.g. only examine precisely 10 characters in a precise position). Thus the following are legal for (a1,f5.2,a1):
a-1.23b // -1.23
- 1.23. // 1.23
3 1.23- // 1.23
I am using Java so need regex constructs compatible with Java 1.6 (e.g. not perl extensions)
As I understand it, each line comprises one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes can't make that job any easier--in fact, all they can do is make it a lot harder.
Scanner doesn't really help, either. True, it has predefined regexes for various value types, and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter it can recognize. Fixed-width data doesn't require delimiters, by definition. You might be able to fake the delimiters by doing a lookahead for however many characters should be left in the line, but that's just another way of making the job harder than it needs to be.
It sounds like performance is going to be a major concern; even if you could make a regex solution work, it would probably be too slow. Not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.
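As a rough illustration of the substring()/trim()/parse approach, here is a Java sketch for the first line format from the question, (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4). The class name, the helper, and the choice to return null for blank or asterisk-filled fields are my own assumptions. The sample line is generated with String.format so the column widths are exact.

```java
import java.util.Locale;

class FixedWidthSketch {
    // A 72-character sample line: 1 space, 14-char label, 1 space, then three
    // 10-character floats separated by 13 spaces each.
    static final String SAMPLE = String.format(Locale.ROOT,
            " %-14s %10.4f%13s%10.4f%13s%10.4f",
            "Frc consts  --", 1.4362, "", 1.4362, "", 5.41);

    // Reads the field at [start, start + width) and parses it as a double.
    // Returns null for a blank field, a field of asterisks, or a short line.
    static Double parseFloatField(String line, int start, int width) {
        int end = Math.min(start + width, line.length());
        if (start >= end) {
            return null;
        }
        String field = line.substring(start, end).trim();
        if (field.isEmpty() || field.matches("\\*+")) {
            return null;
        }
        return Double.valueOf(field);
    }

    public static void main(String[] args) {
        String label = SAMPLE.substring(1, 15).trim();
        System.out.println(label);                           // Frc consts  --
        System.out.println(parseFloatField(SAMPLE, 16, 10)); // 1.4362
        System.out.println(parseFloatField(SAMPLE, 39, 10)); // 1.4362
        System.out.println(parseFloatField(SAMPLE, 62, 10)); // 5.41
    }
}
```

Because each field is cut out by position before it is parsed, the contents of neighbouring fields cannot leak into it.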
You could start with this and go from there.
This regex matches all the numbers you've provided.
Unfortunately, it also matches the 3 in 3 1.23-
// [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
//
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:\.[0-9]*)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character present in the list “eE” «[eE]»
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
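import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "document" is assumed to be a String holding the text to scan.
Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
Matcher matcher = regex.matcher(document);
while (matcher.find()) {
    // matched text: matcher.group()
    // match start: matcher.start()
    // match end: matcher.end()
}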
This is only a partial answer, but I was alerted to Scanner (added in Java 1.5), which can scan text and interpret numbers. Its documentation gives a BNF for the numbers that this utility can scan and interpret. In principle, I imagine the BNF could be used to construct a regex.
I've written a regular expression that matches any number of letters with any number of single spaces between the letters. I would like that regular expression to also enforce a minimum and maximum number of characters, but I'm not sure how to do that (or if it's possible).
My regular expression is:
[A-Za-z](\s?[A-Za-z])+
I realized it was only matching two sets of letters surrounding a single space, so I modified it slightly to fix that. The original question is still the same though.
Is there a way to enforce a minimum of three characters and a maximum of 30?
Yes
Just like + means one or more you can use {3,30} to match between 3 and 30
For example [a-z]{3,30} matches between 3 and 30 lowercase alphabet letters
From the documentation of the Pattern class
X{n,m} X, at least n but not more than m times
In your case, matching 3-30 letters followed by spaces could be accomplished with:
([a-zA-Z]\s){3,30}
That version requires trailing whitespace. If you don't want that, you can use the following (letter+space 2-29 times, then a letter):
([a-zA-Z]\s){2,29}[a-zA-Z]
If you'd like the whitespace to count toward the character limit, you need to divide those numbers by 2 to get
([a-zA-Z]\s){1,14}[a-zA-Z]
You can add \s? to that last one if the trailing whitespace is optional. These were all tested on RegexPlanet
If you'd like the entire string altogether to be between 3 and 30 characters you can use lookaheads adding (?=^.{3,30}$) at the beginning of the RegExp and removing the other size limitations
All that said, in all honesty I'd probably just check the String's length(). It's more readable.
This is what you are looking for
^[a-zA-Z](\s?[a-zA-Z]){2,29}$
^ is the start of string
$ is the end of string
(\s?[a-zA-Z]){2,29} would match (\s?[a-zA-Z]) 2 to 29 times..
Actually Benjamin's answer will lead to the complete solution to the OP's question.
Using lookaheads it is possible to restrict the total number of characters AND restrict the match to a set combination of letters and (optional) single spaces.
The regex that solves the entire problem would become
(?=^.{3,30}$)^([A-Za-z][\s]?)+$
This will match AAA and A A, and it will fail to match AA<space><space>A, since there are two consecutive spaces.
I tested this at http://regexpal.com/ and it does the trick.
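Since the question is about Java, here is a small sketch of the lookahead version used with java.util.regex. The class and method names are made up for the example. Note that, as written, the pattern also accepts a single trailing space after the last letter.

```java
import java.util.regex.Pattern;

class LengthLimitedLetters {
    // Lookahead limits the whole string to 3-30 characters; the rest allows
    // letters, each optionally followed by one whitespace character.
    private static final Pattern LETTERS =
        Pattern.compile("(?=^.{3,30}$)^([A-Za-z][\\s]?)+$");

    static boolean isValid(String input) {
        return LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("AAA"));   // true
        System.out.println(isValid("A A"));   // true
        System.out.println(isValid("AA  A")); // false: two consecutive spaces
        System.out.println(isValid("AA"));    // false: shorter than 3
    }
}
```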
You should use
[a-zA-Z ]{20}
[For allowed characters]{for limiting of the number of characters}
I want to create a regular expression in java using standard libraries that will accommodate the following sentence:
12 of 128
Obviously the numbers can be anything though... From 1 digit to many
Also, I'm not sure how to accommodate the word "of" but I thought maybe something along the lines of:
[\d\sof\s\d]
This should work for you:
(\d+\s+of\s+\d+)
This will assume that you want to capture the full block of text as "one group", and there can be one-or-more whitespace characters in between each (if only one space, you can change \s+ to just \s).
If you want to capture the numbers separately, you can try:
(\d+)\s+of\s+(\d+)
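In Java, a minimal sketch of using that second pattern to pull out both numbers could look like the following. The class and method names are hypothetical, and the groups are kept as strings because the question says the numbers can have many digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OfPatternExample {
    // Two runs of digits separated by the word "of" with whitespace around it.
    private static final Pattern X_OF_Y = Pattern.compile("(\\d+)\\s+of\\s+(\\d+)");

    // Returns {first, second} as strings, or null if the text has another shape.
    static String[] parse(String text) {
        Matcher m = X_OF_Y.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("12 of 128");
        System.out.println(parts[0] + " and " + parts[1]); // 12 and 128
    }
}
```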
You want this:
\d+\sof\s\d+
The relevant change from what you already had is the addition of the two plus signs. Each one means the preceding \d matches one or more digits.
Sample: http://regexr.com?32cao
This regexp
"\\d+ of \\d+"
will match at least one to any number of digits, followed by string " of " followed by one to any number of digits.
I'm new to regular expressions and I need to find a regular expression that matches one or more digits [1-9], only ONE '|' sign, one or more '*' signs and zero or more ',' signs.
The string should not contain any other characters.
This is what I have:
if(this.ruleString.matches("^[1-9|*,]*$"))
{
return true;
}
Is it correct?
Thanks,
Vinay
I think you should test separately for every type of symbols rather than write complex expression.
First, test that you don't have invalid symbols - "^[1-9|*,]+$"
Then test for digits "[1-9]", it should match at least one.
Then test for "\\|", "\\*" and "\\," and check the number of matches.
If all tests pass, then your string is valid.
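A rough Java sketch of that separate-checks idea follows. It counts characters in a loop rather than counting regex matches, which is my own simplification, and the class and method names are hypothetical. It does not enforce any particular order of the symbols.

```java
class RuleStringCheck {
    static boolean isValid(String rule) {
        // Step 1: only the allowed symbols, and at least one character.
        if (!rule.matches("[1-9|*,]+")) {
            return false;
        }
        // Step 2: count each kind of symbol.
        int digits = 0;
        int bars = 0;
        int stars = 0;
        for (char c : rule.toCharArray()) {
            if (c >= '1' && c <= '9') {
                digits++;
            } else if (c == '|') {
                bars++;
            } else if (c == '*') {
                stars++;
            }
        }
        // Step 3: one or more digits, exactly one bar, one or more stars;
        // commas are optional, so they need no check.
        return digits >= 1 && bars == 1 && stars >= 1;
    }

    public static void main(String[] args) {
        System.out.println(isValid("1|*"));  // true
        System.out.println(isValid("12"));   // false: no bar and no star
        System.out.println(isValid("1||*")); // false: two bars
    }
}
```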
Nope, try this:
"^[1-9]+\\|\\*+,*$"
Please give us at least 10 strings that you want to accept and 10 that you want to reject, and tell us whether the symbols have to appear in a particular sequence or whether their order doesn't matter. Then we can write a reliable regex.
For now, all I can offer is:
^[1-9]+\|{1}\*+,*$
This RegEx was tested against these sample strings, accepting them:
56421|*****,,,
2|*********,,,
1|*
7|*,
18|****
123456789|*
12|********,,
1516332|**,,,
111111|*
6|*****,,,,
And it was tested against these sample strings, rejecting them:
10|*,
2***525*|*****,,,
123456,15,22*66*****4|,,,*167
1|2*3,4,5,6*
,*|173,
|*,
||12211
12
1|,*
1233|54|***,,,,
I assume your given order is strict and all conditions apply at the same time.
It looks like the pattern you need is
n-n, one or more times separated by commas
then a bar (|)
then n*n, one or more times separated by commas.
Here is a regular expression for that.
([1-9]{1}[0-9]*\-[0-9]+){1}
(,[1-9]{1}[0-9]*\-[0-9]+)*
\|
([1-9]{1}[0-9]*\*[0-9]+){1}
(,[1-9]{1}[0-9]*\*[0-9]+)*
But it is so complex, and does not take into account the details, such as
for the case of n-m, you want
n less than m
(I guess).
And you likely want the same number of n-m before the bar, and x*y after the bar.
Depends whether you want to check the syntax completely or not.
(I hope you do want to.)
Since this is so complex, it should be done with a set of code instead of a single regular expression.
this regex should work
"^[1-9\\|\\*,-]*$"
Assert position at the beginning of the string «^»
Match a single character present in the list below «[1-9\|\*,-]»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A character in the range between “1” and “9” «1-9»
A | character «\|»
A * character «\*»
The character “,” «,»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»