I am parsing documents which contain large amounts of formatted numbers, an example being:
Frc consts -- 1.4362 1.4362 5.4100
IR Inten -- 0.0000 0.0000 0.0000
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 1 0.40 -0.20 0.23 -0.30 -0.18 0.36 0.06 0.42 0.26
These are separated lines all with a significant leading space and there may or may not be significant trailing whitespace). They consist of 72,72, 78, 78, and 78 characters. I can deduce the boundaries between fields. These are describable (using fortran format (nx = nspaces, an = n alphanum, in = integer in n columns, fm.n = float of m characters with n places after the decimal point) by:
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a4,a4,3(2x,3a7))
(1x,2i4,3(2x,3f7.2))
(1x,2i4,3(2x,3f7.2))
I have potentially several thousand different formats (which I can autogenerate or farm out) and am describing them by regular expressions describing the components. Thus if regf10_4 represents a regex for any string satisfying the f10.4 constraint I can create a regex of the form:
COMMENTS
(\s
.{14}
\s
regf10_4,
\s{13}
regf10_4,
\s{13}
regf10_4,
)
I would like to know whether there are regexes that satisfy re-use in this way. There is a wide variety in the way computers and humans create numbers that are compatible with, say f10.4. I believe the following are all legal input and/or output for fortran (I do not require suffixes of the form f or d as in 12.4f) [the formatting in SO should be read as no leading spaces for the first, one for the second, etc.]
-1234.5678
1234.5678
// missing number
12345678.
1.
1.0000000
1.0000
1.
0.
0.
.1234
-.1234
1E2
1.E2
1.E02
-1.0E-02
********** // number over/underflow
They have to be robust against the content of the neighbouring fields (e.g. only examine precisely 10 characters in a precise position. Thus the following are legal for (a1,f5.2,a1):
a-1.23b // -1.23
- 1.23. // 1.23
3 1.23- // 1.23
I am using Java so need regex constructs compatible with Java 1.6 (e.g. not perl extensions)
As I understand it, each line comprises one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes can't make that job any easier--in fact, all they can do is make it a lot harder.
Scanner doesn't really help, either. True, it has predefined regexes for various value types, and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter it can recognize. Fixed-width data doesn't require delimiters, by definition. You might be able to fake the delimiters by doing a lookahead for however many characters should be left in the line, but that's just another way of making the job harder than it needs to be.
It sounds like performance is going to be a major concern; even if you could make a regex solution work, it would probably be too slow. Not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.
You could start with this and go from there.
This regex matches all the numbers you've provided.
Unfortunatly, it also matches the 3 in 3 1.23-
// [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
//
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:\.[0-9]*)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character present in the list “eE” «[eE]»
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
Matcher matcher = regex.matcher(document);
while (matcher.find()) {
// matched text: matcher.group()
// match start: matcher.start()
// match end: matcher.end()
}
It is only a partial answer but I was alerted to Scanner in Java 1.5 which can scan text and interpret numbers which gives a BNF for the numbers that can be scanned and interpreted by this Java utility. In principle I imagine the BNF could be used to construct a regex.
Related
Let's say that you have a string:
String string = "ab #1?AZa$ab #1?AZa$"
You're trying to verify that the tenth is a non-whitespace character, and that the twentieth character is the same as the tenth. Furthermore, there is corresponding verification with the 1st and 11th, the 2nd and 12th, the 3rd and 13th, etc. each with their own separate requirements (the full list is here) so you have to use 10 capturing groups. I found that the following regex still works to validate the aforementioned string:
string.matches("^([a-z])(\\w)(\\s)(\\W)(\\d)(\\D)([A-Z])([a-zA-Z])([aeiouAEIOU])(\\S)\\1\\2\\3\\4\\5\\6\\7\\8\\9\\10$") //returns true
My question regards the last backreference:
\\10
Shouldn't this be interpreted as "match with the first character" and then "match with 0" (the digit)? I don't see how this is interpreted as "match with the tenth character" without somehow grouping the 1 and 0 together into 10. Puzzlingly, surrounding the 1 and 0 with parentheses does not work.
The behavior for Java is documented in Pattern:
In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
Pattern p = Pattern.compile("(?=[1-9][0-9]{2})[0-9]*[05]");
Matcher m = p.matcher("101");
while(m.find()){
System.out.println(m.start()+":"+ m.end()+ m.group());
}
Output------ >> 0:210
Please let me know why I am getting output of m.group() as 10 here.
As far as I understand m.group() should return nothing because [05] matches to nothing.
Your Pattern, (?=[1-9][0-9]{2})[0-9]*[05] consists of 2 parts:
(?=[1-9][0-9]{2})
and
[0-9]*[05]
The first part is a zero-width positive lookahead which searches for a number of length 3, and the first can not be 0. This matches your 101.
The second part searches for any amount of numbers and then a 0 or a 5. This matches the first 2 characters of 101, thus the result is 10.
See Java - Pattern for more information.
What your Regex is looking for is:
[1-9]:
match a single character present in the list below
1-9 a single character in the range between 1 and 9
[0-9]{2}:
match a single character present in the list below
Quantifier: {2} Exactly 2 times
0-9 a single character in the range between 0 and 9
[0-9]*:
match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
[05]:
match a single character present in the list below
05 a single character in the list 05 literally
for the String "101" this nacht the first 2 chars 101,
so you are printing.out:
System.out.println(**m.start()**+":"+ **m.end()**+ m.group());
where m.start() returns the start index of the previous match(char at
0). where m.end() returns the offset after the last character matched.
and where m.group() returns the input subsequence matched by the previous
match.
That regex was meant to match a number that's a multiple of 5 and greater than or equal to 100, but it's useless without anchors. It should be:
^(?=[1-9][0-9]{2}$)[0-9]*[05]$
The anchors ensure that both the lookahead and the main part are examining the whole of the string. But the task doesn't require a lookahead anyway; this works just fine:
^[1-9][0-9][05]$
As #AlanMoore states, there has to be an alignment.
Assertions are a self contained entity, all they have to do is Pass
to advance to the next construct.
Lets see what (?=[1-9][0-9]{2}) matches;
1111111110666
2222222222222222225666
33333333333333333333333330666
So far so good, on to the next construct.
Lets see what [0-9]*[05] matches.
What ever this matches is the final answer.
1111111110666
2222222222222222225666
33333333333333333333333330666
What to learn is that to get a cohesive answer, assertions have to be crafted to
coincide with constructs that come after them.
Here is an example of a constraint that could be applied
after the assertion.
The assertion state's it needs three digits and the first digit must be >= 1.
The constructs after the assertion state it can be any number of digit's,
as long as it ends with a 0 or 5.
This last part is distressing since it will match only the 500000
So for sure, you need at least three digits.
That can be done like this:
[0-9]{2,}[05]
This says two things
There must be at least three digits, but can be more
It must end with a 0 or 5.
That's it, put it all together, its:
(?=[1-9][0-9]{2})[0-9]{2,}[05]
Of course, this can be condensed to;
[1-9][0-9]+[05]
I've written a regular expression that matches any number of letters with any number of single spaces between the letters. I would like that regular expression to also enforce a minimum and maximum number of characters, but I'm not sure how to do that (or if it's possible).
My regular expression is:
[A-Za-z](\s?[A-Za-z])+
I realized it was only matching two sets of letters surrounding a single space, so I modified it slightly to fix that. The original question is still the same though.
Is there a way to enforce a minimum of three characters and a maximum of 30?
Yes
Just like + means one or more you can use {3,30} to match between 3 and 30
For example [a-z]{3,30} matches between 3 and 30 lowercase alphabet letters
From the documentation of the Pattern class
X{n,m} X, at least n but not more than m times
In your case, matching 3-30 letters followed by spaces could be accomplished with:
([a-zA-Z]\s){3,30}
If you require trailing whitespace, if you don't you can use: (2-29 times letter+space, then letter)
([a-zA-Z]\s){2,29}[a-zA-Z]
If you'd like whitespaces to count as characters you need to divide that number by 2 to get
([a-zA-Z]\s){1,14}[a-zA-Z]
You can add \s? to that last one if the trailing whitespace is optional. These were all tested on RegexPlanet
If you'd like the entire string altogether to be between 3 and 30 characters you can use lookaheads adding (?=^.{3,30}$) at the beginning of the RegExp and removing the other size limitations
All that said, in all honestly I'd probably just test the String's .length property. It's more readable.
This is what you are looking for
^[a-zA-Z](\s?[a-zA-Z]){2,29}$
^ is the start of string
$ is the end of string
(\s?[a-zA-Z]){2,29} would match (\s?[a-zA-Z]) 2 to 29 times..
Actually Benjamin's answer will lead to the complete solution to the OP's question.
Using lookaheads it is possible to restrict the total number of characters AND restrict the match to a set combination of letters and (optional) single spaces.
The regex that solves the entire problem would become
(?=^.{3,30}$)^([A-Za-z][\s]?)+$
This will match AAA, A A and also fail to match AA A since there are two consecutive spaces.
I tested this at http://regexpal.com/ and it does the trick.
You should use
[a-zA-Z ]{20}
[For allowed characters]{for limiting of the number of characters}
I have a regular expression that I want to match a latitude/longitude pair in a variety of fashions, e.g.
123 34 42
-123* 34' 42"
123* 34' 42"
+123* 34' 42"
45* 12' 22"N
45 12' 22"S
90:00:00.0N
I want to be able to match these in a pair such that
90:00:00.0N 180:00:00.0E is a latitude/longitude pair.
or
45* 12' 22"N 46* 12' 22"E is a latitude/longitude pair (1 degree by 1 degree cell).
or
123* 34' 42" 124* 34' 42" is a latitude/longitude pair
etc
Using the below regular expression, when I type in 123, it matches. I suppose this is true since 123 00 00 is a valid coordinate. However, I want to use this regular expression to match pairs in the same format above
"([-|\\+]?\\d{1,3}[d|D|\u00B0|\\s](\\s*\\d{1,2}['|\u2019|\\s])?"
+ "(\\s*\\d{1,2}[\"|\u201d|\\s])?\\s*([N|n|S|s|E|e|W|w])?\\s?)"
I am using Java.
* denotes a degree.
What am I doing wrong in my regular expression?
Well, for one thing, you're filling your character sets with a bunch of unnecessary pipe characters - alternation is implied in a [] pair. Additional cleanup: + doesn't need to be escaped in a character class. Your regular expression seems to be addressing a bigger problem statement than you gave us - you make no mention of d or D as matchable character. And you've made pretty much the entire back half of your RegEx optional. Going off of what I think your original problem statement is, I built the following regular expression:
^\s*([+-]?\d{1,3}\*?\s+\d{1,2}'?\s+\d{1,2}"?[NSEW]?|\d{1,3}(:\d{2}){2}\.\d[NSEW]\s*){1,2}$
It's a bit of a doozy, but I'll break it down for you, or anyone who happens across this in the future (Hello, future!).
^
Start of string, simple.
\s*
Any amount of whitespace - even none.
(
Denotes the beginning of a group - we'll get back to that.
[-+]?
An optional sign
\d{1,3}
1 to three digits
\*?
An optional Asterisk - the escape here is key for an asterisk, but if you want to replace this with the unicode codepoint for an actual degree, you won't need it.
\s+
At least one character of whitespace
\d{1,2}
1 or two digits.
'?
Optional apostrophe
\s+\d{1,2}+
You've seen these before, but there's a new curveball - there's a plus after the {1,2} quantifier! This makes it a possessive quantifier, meaning that the matcher won't give up its matches for this group to make another one possible. This is almost exclusively here to prevent 1 1 11 1 1 from matching, but can be used to increase speed anywhere you're 100% sure you don't need to be able to backtrack.
"?
Optional double quote. You'll have to escape this in Java.
[NSEW]?
An optional cardinal direction, designated by letter
|
OR - you can match everything in the group before this, or everything in the group after this.
\d{1,3}
Old news.
(:\d{2})
A colon, followed by two characters...
{2}
twice!
\.\d
Decimal point, followed by a single digit.
[NSEW]
Same as before, but this time it's mandatory.
\s*)
Some space, and finally the end of the group. Now, the first group has matched an entire longitude/latitude denotation, with an arbitrary amount of space at the end. Followed closely by:
{1,2}
Do that one, or two times - to match a single or a pair, then finally:
$
The end of the string.
This isn't perfect, but it's pretty close, and I think it answers the original problem statement. Plus, I feel my explanation has demystified it enough that you can edit it to further suit your needs. The one thing it doesn't (and won't) do, is enforce that the first coordinate matches the second in style. That's just too much to ask of Regular Expressions.
Doubters: Here it is in action. Please, enjoy.
Generally, I dont think that this is a good approach.
In your interface try to have DMS coordinates in one specific format.
The User should enter this in 3 separate text fields.
Further this regex is not very maintainable.
There are much more possibilities to notate a DMS coordinate,
you even cannot imagine. Humans are creative.
Eg:
Put N,S in front
or: North, 157 deg 50 min 55.796 sec
or: from wiki: The NGS now says in 1993 that point was 21-18-02.54891 N 157-50-45.90280 W
I'm not a RE wizard but with your formats you'd need to have some kind of convention for which pair comes first (probably latitude) if you're doing parsing from a single text box.
From there, you have six numeric fields (deg, min, sec for each, possibly with a decimal point), two signs (+ or - for each) and up to two hemispheres (one for each).
As far as I can see, parsing these 8-10 fields from your input would occur in the same order each time if you demanded only that the latitude is first, and the longitude second. The rest of the symbols (save the decimal point(s)) can be treated essentially as separators.
Does that make it easier?
I'm new to regular expressions and I need to find a regular expression that matches one or more digits [1-9] only ONE '|' sign, one or more '*' sign and zero or more ',' sign.
The string should not contain any other characters.
This is what I have:
if(this.ruleString.matches("^[1-9|*,]*$"))
{
return true;
}
Is it correct?
Thanks,
Vinay
I think you should test separately for every type of symbols rather than write complex expression.
First, test that you don't have invalid symbols - "^[0-9|*,]$"
Then test for digits "[1-9]", it should match at least one.
Then test for "\\|", "\\*" and "\\," and check the number of matches.
If all test are passed then your string is valid.
Nope, try this:
"^[1-9]+\\|\\*+,*$"
Please give us at least 10 possible matching strings of what you are looking to accept, and 10 of what you want to reject, and tell us if either this have to keep some sequence or its order doesn't matter. So we can make a reliable regex.
By now, all I can offer is:
^[1-9]+\|{1}\*+,*$
This RegEx was tested against these sample strings, accepting them:
56421|*****,,,
2|*********,,,
1|*
7|*,
18|****
123456789|*
12|********,,
1516332|**,,,
111111|*
6|*****,,,,
And it was tested against these sample strings, rejecting them:
10|*,
2***525*|*****,,,
123456,15,22*66*****4|,,,*167
1|2*3,4,5,6*
,*|173,
|*,
||12211
12
1|,*
1233|54|***,,,,
I assume your given order is strict and all conditions apply at the same time.
It looks like the pattern you need is
n-n, one or more times seperated by commas
then a bar (|)
then n*n, one or more times seperated by commas.
Here is a regular expression for that.
([1-9]{1}[0-9]*\-[0-9]+){1}
(,[1-9]{1}[0-9]*\-[0-9]+)*
\|
([1-9]{1}[0-9]*\*[0-9]+){1}
(,[1-9]{1}[0-9]*\*[0-9]+)*
But it is so complex, and does not take into account the details, such as
for the case of n-m, you want
n less than m
(I guess).
And you likely want the same number of n-m before the bar, and x*y after the bar.
Depends whether you want to check the syntax completely or not.
(I hope you do want to.)
Since this is so complex, it should be done with a set of code instead of a single regular expression.
this regex should work
"^[1-9\\|\\*,-]*$"
Assert position at the beginning of the string «^»
Match a single character present in the list below «[1-9\|*,-]»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «»
A character in the range between “1” and “9” «1-9»
A | character «\|»
A * character «*»
The character “,” «,»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»