Regular Expression (Java) anomaly - explanation sought

Using Java (1.6) I want to split an input string that consists of a header followed by a number of tokens. Each token has this format: a ! character, a space, a two-character token name (from a constrained list, e.g. C0 or 04) and then 5 digits. I have built a pattern for this, but it fails for one token (CE) unless I remove the requirement for the 5 digits after the token name. The unit test below explains this better than I could.
Can anyone help with what's going on with my failing pattern? The input CE token looks OK to me...
Cheers!
@Test
public void testInputSplitAnomaly() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]\\d{5}))");
    splitByRegExp(pattern);
}

@Test
public void testInputSplitWorks() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]))");
    splitByRegExp(pattern);
}

public void splitByRegExp(Pattern pattern) {
    String input = "& 0000800429! C600080 123456789-! C000026 213 00300! 0400020 A1Y1! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";
    String[] tokens = pattern.split(input);
    Arrays.sort(tokens);
    System.out.println("-----------------------------");
    for (String token : tokens) {
        System.out.println(token.substring(0, 11));
    }
    assertThat(tokens, Matchers.hasItemInArray(startsWith("! CE")));
    assertThat(tokens.length, is(8));
}

I think that your mistake here is your use of square brackets. Don't forget that these indicate a character class, so [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE] doesn't do what you expect it to.
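What it does do in Java, where character classes can be nested, is the following:
The whole bracketed expression is a single character class (the inner [2-6] and [8-9] are nested classes that are merged into it), so it matches exactly one character out of: |, 0, 2, 3, 4, 5, 6, 8, 9, B, C, E or Q.
With \\d{5} appended, the pattern therefore requires "! ", one such character, and then five digits. That happens to work for names such as C6 or 04, because their second character is itself a digit, but it fails for CE: after the C comes an E, which is not a digit. Without \\d{5}, "! C" alone is enough, which is why the second test passes.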
What you are probably after is (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)

This doesn't make any sense:
[04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]
I believe you want:
(?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)
Square brackets are only used for character classes, not general grouping. Use (?:...) or (...) for general grouping (the latter also captures).
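As a minimal sketch of the corrected pattern applied to the input from the question (the class name TokenSplitDemo and the helper splitTokens are made up for this illustration, and the capturing group around the lookahead body is dropped because split does not need it):

```java
import java.util.regex.Pattern;

public class TokenSplitDemo {
    // Zero-width lookahead: split *before* each "! <name><5 digits>" without consuming it.
    // (?:...) groups the alternatives; square brackets are only used for real character classes.
    static final Pattern TOKEN_START =
            Pattern.compile("(?=! (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)\\d{5})");

    // Hypothetical helper, not part of the question's code.
    static String[] splitTokens(String input) {
        return TOKEN_START.split(input);
    }

    public static void main(String[] args) {
        String input = "& 0000800429! C600080 123456789-! C000026 213 00300! 0400020 A1Y1"
                + "! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";
        // Prints the header first, then one line per token, each starting with "! ".
        for (String token : splitTokens(input)) {
            System.out.println(token);
        }
    }
}
```

With this pattern the input should split into the header plus seven tokens, including the one that starts with "! CE".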

Related

How can I split a string without knowing the split characters a-priori?

For my project I have to read various input graphs. Unfortunately, the input edges do not all have the same format. Some of them are comma-separated, others are tab-separated, etc. For example:
File 1:
123,45
67,89
...
File 2
123 45
67 89
...
Rather than handling each case separately, I would like to automatically detect the split characters. Currently I have developed the following solution:
String str = "123,45";
String splitChars = "";
for (int i = 0; i < str.length(); i++) {
    if (!Character.isDigit(str.charAt(i))) {
        splitChars += str.charAt(i);
    }
}
String[] endpoints = str.split(splitChars);
Basically I pick the first row and select all the non-numeric characters, then I use the generated substring as split characters. Is there a cleaner way to perform this?

Split requires a regexp, so your code would fail for many reasons. If the separator has meaning in a regexp (say, +), it'll fail. If there is more than one non-digit character, your code will also fail. If your input contains more than exactly two numbers, it will also fail. Imagine the separator is a comma surrounded by spaces: then your splitChars string becomes " , ", which only splits on that exact three-character sequence (it would split the string "test , abc" into two, nothing else).
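To illustrate the first point, here is a small sketch (the class and method names are made up for this example) showing that a raw "+" is rejected as a regex, while Pattern.quote turns the same separator into a literal:

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class QuoteSeparatorDemo {
    // Splits on the separator taken literally, whatever characters it contains.
    static String[] splitLiteral(String line, String separator) {
        return line.split(Pattern.quote(separator));
    }

    // Returns true if using the separator directly as a regex is rejected.
    static boolean failsAsRegex(String line, String separator) {
        try {
            line.split(separator);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // "+" on its own is a quantifier with nothing to repeat, so it does not compile.
        System.out.println(failsAsRegex("123+45", "+"));                   // true
        System.out.println(Arrays.toString(splitLiteral("123+45", "+"))); // [123, 45]
    }
}
```

Quoting only addresses the metacharacter problem; it does not help when the separator varies in length or when a line holds more than two numbers.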
Why not make a regexp to fetch digits, and then find all sequences of digits, instead of focussing on the separators?
You're using regexps whether you want to or not, so let's make it official and use Pattern, while we are at it.
private static final Pattern ALL_DIGITS = Pattern.compile("\\d+");

// then in your split method..
Matcher m = ALL_DIGITS.matcher(str);
// Don't use arrays, generally. List is better.
List<Integer> numbers = new ArrayList<Integer>();
while (m.find()) {
    numbers.add(Integer.parseInt(m.group(0)));
}

\\d+ is: one or more digits.
m.find() finds the next match (so, the next block of digits), returning false if there aren't any more.
m.group(0) retrieves the entire matched string.
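Put together as a self-contained sketch (the class name DigitRuns and the method extractNumbers are hypothetical, and every run of digits is assumed to fit in an int):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitRuns {
    private static final Pattern ALL_DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits in the line, whatever separates the runs.
    static List<Integer> extractNumbers(String line) {
        List<Integer> numbers = new ArrayList<Integer>();
        Matcher m = ALL_DIGITS.matcher(line);
        while (m.find()) {
            // Assumption: each run is small enough for Integer.parseInt.
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extractNumbers("123,45"));    // [123, 45]
        System.out.println(extractNumbers("67\t89"));    // [67, 89]
        System.out.println(extractNumbers("1 ; 2 ; 3")); // [1, 2, 3]
    }
}
```

Because the pattern describes the numbers rather than the separators, the same method handles comma-separated, tab-separated and mixed lines.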

Split the string on \\D+ which means one or more non-digit characters.
Demo:
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "123,45", "67,89", "125 89", "678 129" };
        for (String s : arr) {
            System.out.println(Arrays.toString(s.split("\\D+")));
        }
    }
}
Output:
[123, 45]
[67, 89]
[125, 89]
[678, 129]

Why not split with [^\d]+ (every group of non-digits):
for (String n : "123,456 789".split("[^\\d]+")) {
    System.out.println(n);
}
Result:
123
456
789

How to make a regular expression match based on a condition?

I'm trying to make a conditional regex. I know that there are other posts on Stack Overflow, but they're too specific to their own problems.
The Question
How can I create a regular expression that only looks to match something given a certain condition?
An example
An example of this would be if we had a list of a string(this is in java):
String nums = "42 36 23827";
and we only want to match if there are the same amount of x's at the end of the string as there are at the beginning
What we want in this example
In this example, we would want a regex that checks if there are the same amount of regex's at the end as there are in the beginning. The conditional part: If there are x's at the beginning, then check if there are that many at the end, if there are then it is a match.
Another example
An example of this would be if we had a list of numbers (this is in java) in string format:
String nums = "42 36 23827";
and we want to separate each number into a list
String splitSpace = "Regex goes here";
Pattern splitSpaceRegex = Pattern.compile(splitSpace);
Matcher splitSpaceMatcher = splitSpaceRegex.matcher(nums);
ArrayList<String> splitEquation = new ArrayList<String>();
while (splitSpaceMatcher.find()) {
    if (splitSpaceMatcher.group().length() != 0) {
        System.out.println(splitSpaceMatcher.group().trim());
        splitEquation.add(splitSpaceMatcher.group().trim());
    }
}
How can I make this into an array that looks like this:
["42", "36", "23827"]
You could try making a simple regex like this:
String splitSpace = "\\d+\\s+";
But that excludes the "23827" because there is no space after it.
What we want in this example
In this example, we would want a regex that checks if it is the end of the string; if it is, then we don't need the space, otherwise we do. As @YCF_L mentioned, we could just make a regex that is \\b\\d\\b, but I am aiming for something conditional.
Conclusion
So, as a result, the question is, how do we make conditional regular expressions? Thanks for reading and cheers!

There are no conditionals in Java regexes.
I want a regex that checks if there are the same amount of regex's at the end as there are in the beginning. The conditional part: If there are x's at the beginning, then check if there are that many at the end, if there are then it is a match.
This may or may not be solvable. If you want to know if a specific string (or pattern) repeats, that can be done using a back reference; e.g.
^(\d+).+\1$
will match a line that starts with one or more digits, continues with one or more arbitrary characters, and ends with the same digits that were matched at the start. The back reference \1 matches the string matched by group 1.
However if you want the same number of digits at the end as at the start (and that number isn't a constant) then you cannot implement this using a single (Java) regex.
Note that some regex languages / engines do support conditionals; see the Wikipedia Comparison of regular-expression engines page.
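A small sketch of the back-reference pattern in Java (the class and method names are invented for this illustration):

```java
import java.util.regex.Pattern;

public class BackReferenceDemo {
    // Group 1 captures the leading digits; \1 requires the same text again at the end.
    private static final Pattern SAME_DIGITS = Pattern.compile("^(\\d+).+\\1$");

    static boolean startsAndEndsWithSameDigits(String s) {
        return SAME_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsAndEndsWithSameDigits("12abc12")); // true
        System.out.println(startsAndEndsWithSameDigits("12abc34")); // false
        System.out.println(startsAndEndsWithSameDigits("7x7"));     // true
    }
}
```

The back reference compares the captured text itself, so it checks for the same digits, not merely for the same number of digits.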

I would use split, which accepts a regex, like so:
String[] split = nums.split("\\s+"); // ["42", "36", "23827"]
If you want to use Pattern with Matcher, then you can use \b\d+\b, with word boundaries:
String regex = "\\b\\d+\\b";
By using word boundaries, you avoid cases where the number is part of a word. For example, for "123 a4 5678 9b" you will get just ["123", "5678"].

I do not see the "conditional" in the question. The problem is solvable with a straightforward regular expression: \b\d+\b.
A fully fledged Java example would look something like this:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ideone {
    public static void main(String args[]) {
        final String sample = "123 45 678 90";
        final Pattern pattern = Pattern.compile("\\b\\d+\\b");
        final Matcher matcher = pattern.matcher(sample);
        final ArrayList<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group());
        }
        System.out.println(results);
    }
}
Output: [123, 45, 678, 90]

Regular Expressions (regex) Pattern Matching

Can someone please help me to understand how does this program calculate output given below?
import java.util.regex.*;

class Demo {
    public static void main(String args[]) {
        String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
        Pattern p = Pattern.compile("[a-c][abc][bca]");
        Matcher m = p.matcher(name);
        while (m.find()) {
            System.out.println(m.start() + "\t" + m.group());
        }
    }
}
OUTPUT :
0 abc
18 abc
30 cba
38 aaa
43 bcc

It simply searches the string for matches according to the rules specified by "[a-c][abc][bca]":
0 abc --> At position 0, there is abc.
18 abc --> Exact same thing but at position 18.
30 cba --> At position 30, there is a group of three characters that are each a, b or c.
38 aaa --> same as 30
43 bcc --> same as 30
Notice that counting starts at 0, so the first letter is at position 0, the second is at position 1, and so on.
For further information about regex and its use, see the Oracle tutorial for regex.

Let's analyze:
"[a-c][abc][bca]"
This pattern looks for groups of 3 letters each.
[a-c] means that the first letter has to be between a and c, so it can be a, b or c.
[abc] means that the second letter has to be one of the letters a, b or c, so it is basically the same as [a-c].
[bca] means that the third letter has to be b, c or a; the order inside the brackets doesn't matter.
Everything you need to know is in the official Java regex tutorial:
http://docs.oracle.com/javase/tutorial/essential/regex/

This pattern basically matches 3-character words where each letter is either a,b, or c.
It then prints out each matching 3-char sequence along with the index at which it was found.
Hope that helps.

It is printing out the position in the string, counting from 0 instead of 1, at which each match starts. That is, the first match, "abc", is found at position 0, and the second match, "abc", is found at position 18.
Essentially it is matching any 3-character sequence in which every character is an 'a', 'b' or 'c' (repeats are allowed, as "aaa" shows).
The pattern could be written as "[a-c]{3}" and you should get the same result.
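A quick way to check that claim is to run both patterns over the string from the question and compare the hits. This is only a sketch; the class EquivalentClasses and the helper findAll are names made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EquivalentClasses {
    // Returns every match as "start:text", in the order find() reports them.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
        // Both lines print [0:abc, 18:abc, 30:cba, 38:aaa, 43:bcc]
        System.out.println(findAll("[a-c][abc][bca]", name));
        System.out.println(findAll("[a-c]{3}", name));
    }
}
```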

Let's look at your source code, because the regexp itself was already well explained in the other answers.
// Compiles the regexp pattern, i.e. makes it usable
Pattern p = Pattern.compile("[a-c][abc][bca]");
// Creates a matcher for your regexp pattern. This one is used to find the pattern
// in your actual string name.
Matcher m = p.matcher(name);
// Loop while the matcher finds a next substring in name that matches your pattern
while (m.find()) {
    // Print out the index of the found substring and the substring itself
    System.out.println(m.start() + "\t" + m.group());
}

How do I know if a regexp has more than one possible match?

I am writing Java code that has to distinguish regular expressions with more than one possible match from regular expressions that have only one possible match.
For example:
"abc." can have several matches ("abc1", "abcf", ...),
while "abcd" can only match "abcd".
Right now my best idea was to look for all unescaped regexp special characters.
I am convinced that there is a better way to do it in Java. Ideas?
(Late addition):
To make things clearer - there is NO specific input to test against. A good solution for this problem will have to test the regex itself.
In other words, I need a method whose signature may look something like this:
boolean isSingleResult(String regex)
This method should return true if and only if there is exactly one possible String s1 for which the expression s1.matches(regex) returns true. (See examples above.)

This sounds dirty, but it might be worth having a look at the Pattern class in the Java source code.
Taking a quick peek, it seems like it 'normalize()'s the given regex (Line 1441), which could turn the expression into something a little more predictable. I think reflection can be used to tap into some private resources of the class (use caution!). It could be possible that, while tokenizing the regex pattern, there are specific indications of whether it has reached some kind of "multi-matching" element in the pattern.
Update
After having a closer look, there is some data within package scope that you can use to leverage the work of the Pattern tokenizer to walk through the nodes of the regex and check for multiple-character nodes.
After compiling the regular expression, iterate through the compiled "Node"s starting at Pattern.root. Starting at line 3034 of the class, there are the generalized types of nodes. For example class Pattern.All is multi-matching, while Pattern.SingleI or Pattern.SliceI are single-matching, and so on.
All these token classes appear to be in package scope, so it should be possible to do this without using reflection, but instead creating a java.util.regex.PatternHelper class to do the work.
Hope this helps.

If it can only have one possible match it isn't reeeeeally an expression, now, is it? I suspect your best option is to use a different tool altogether, because this does not at all sound like a job for regular expressions, but if you insist, well, no, I'd say your best option is to look for unescaped special characters.

The only regular expression that can ONLY match one input string is one that specifies the string exactly. So you need to match expressions with no wildcard characters or character groups AND that specify a start "^" and end "$" anchor.
"the quick" matches:
"the quick brownfox"
"the quick brown dog"
"catch the quick brown fox"
"^the quick brown fox$" matches ONLY:
"the quick brown fox"
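The difference between the two patterns can be tried out with Matcher.find(), which looks for the pattern anywhere in the input. A short sketch, with a hypothetical helper named found:

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // True if the regex is found anywhere in the input (substring search).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unanchored: found inside longer strings as well.
        System.out.println(found("the quick", "catch the quick brown fox"));             // true
        // Anchored: only the exact string is accepted.
        System.out.println(found("^the quick brown fox$", "the quick brown fox"));       // true
        System.out.println(found("^the quick brown fox$", "catch the quick brown fox")); // false
    }
}
```

Note that String.matches and Matcher.matches() already require the whole input to match, so the anchors mainly matter when find() is used.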

Now I understand what you mean. I live in Belgium...
So this is something that works on most expressions. I wrote it myself, so maybe I forgot some rules.
public static final boolean isSingleResult(String regexp) {
    // Check the exceptions to the exceptions: escaped shorthand classes
    // (digit, word, whitespace and their negations) can match many characters.
    String[] exconexc = "\\d \\D \\w \\W \\s \\S".split(" ");
    for (String s : exconexc) {
        int index = regexp.indexOf(s);
        if (index != -1) { // Forbidden class found
            return false;
        }
    }
    // Then remove all remaining escaped characters, which are literals:
    String regex = regexp.replaceAll("\\\\.", "");
    // Now, all the strings that can mean more than one match
    String[] mtom = "+ . ? | * { [:alnum:] [:word:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]".split(" ");
    // Iterate over all mtom strings
    for (String s : mtom) {
        int index = regex.indexOf(s);
        if (index != -1) { // Forbidden char found
            return false;
        }
    }
    return true;
}
Martijn

As I see it, the only way is to check whether the regexp matches multiple times for a particular input.
package com;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AAA {
    public static void main(String[] args) throws Exception {
        String input = "123 321 443 52134 432";
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(input);
        int i = 0;
        while (matcher.find()) {
            ++i;
        }
        System.out.printf("Matched %d times%n", i);
    }
}

Regex to find an integer within a string

I'd like to use regex with Java.
What I want to do is find the first integer in a string.
Example:
String = "the 14 dogs ate 12 bones"
Would return 14.
String = "djakld;asjl14ajdka;sdj"
Would also return 14.
This is what I have so far.
Pattern intsOnly = Pattern.compile("\\d*");
Matcher makeMatch = intsOnly.matcher("dadsad14 dssaf jfdkasl;fj");
makeMatch.find();
String inputInt = makeMatch.group();
System.out.println(inputInt);
What am I doing wrong?

You're asking for 0 or more digits. You need to ask for 1 or more:
"\\d+"
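To see why, here is a short sketch (the helper firstMatch is made up for this example) that runs both patterns against the input from the question. With \\d*, the first find() succeeds immediately at index 0 with an empty match, because zero digits is an acceptable match there:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstIntegerDemo {
    // Returns the text of the first match, or null if the pattern never matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "dadsad14 dssaf jfdkasl;fj";
        // \d* is satisfied by the empty string in front of the first 'd'.
        System.out.println("[" + firstMatch("\\d*", input) + "]"); // []
        // \d+ needs at least one digit, so it skips ahead to "14".
        System.out.println("[" + firstMatch("\\d+", input) + "]"); // [14]
    }
}
```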

It looks like the other solutions fail to handle +/- signs and exponent forms like 2e3, so I'll take my go at the problem. (Note that java.lang.Integer.parseInt(String) accepts a leading minus sign, and since Java 7 a plus sign, but it rejects 2e3; exponent notation is only understood by parsers such as Double.parseDouble.) I'm somewhat inexperienced at regex, so I may have made a few mistakes, used something that Java's regex parser doesn't support, or made it overly complicated, but the statements seemed to work in Kiki 0.5.6.
All regular expressions are provided in both an unescaped format for reading, and an escaped format that you can use as a string literal in Java.
To get a byte, short, int, or long from a string:
unescaped: ([\+-]?\d+)([eE][\+-]?\d+)?
escaped: ([\\+-]?\\d+)([eE][\\+-]?\\d+)?
...and for bonus points...
To get a double or float from a string:
unescaped: ([\+-]?\d(\.\d*)?|\.\d+)([eE][\+-]?(\d(\.\d*)?|\.\d+))?
escaped: ([\\+-]?\\d(\\.\\d*)?|\\.\\d+)([eE][\\+-]?(\\d(\\.\\d*)?|\\.\\d+))?

Use one of them:
Pattern intsOnly = Pattern.compile("[0-9]+");
or
Pattern intsOnly = Pattern.compile("\\d+");

Here's a handy one I made for C# with generics. It will match based on your regular expression and return the types you need:
public T[] GetMatches<T>(string Input, string MatchPattern) where T : IConvertible
{
    List<T> MatchedValues = new List<T>();
    Regex MatchInt = new Regex(MatchPattern);
    MatchCollection Matches = MatchInt.Matches(Input);
    foreach (Match m in Matches)
        MatchedValues.Add((T)Convert.ChangeType(m.Value, typeof(T)));
    return MatchedValues.ToArray<T>();
}
Then, if you wanted to grab only the numbers and return them in a string[] array:
string Test = "22$data44abc";
string[] Matches = this.GetMatches<string>(Test, "\\d+");
Hopefully this is useful to someone...

In addition to what PiPeep said, if you are trying to match integers within an expression, so that 1 + 2 - 3 will only match 1, 2, and 3, rather than 1, + 2 and - 3, you actually need to use a conditional on a preceding group, and the part you want will actually be returned by Matcher.group(2) rather than just Matcher.group(). Note that the (?(1)...) conditional syntax used below is not supported by java.util.regex; it needs an engine such as PCRE or .NET.
unescaped: ([0-9])?((?(1)(?:[\+-]?\d+)|)(?:[eE][\+-]?\d+)?)
escaped: ([0-9])?((?(1)(?:[\\+-]?\\d+)|)(?:[eE][\\+-]?\\d+)?)
Also, for things like someNumber - 3, where someNumber is a variable name or something like that, you can use
unescaped: (\w)?((?(1)(?:[\+-]?\d+)|)(?:[eE][\+-]?\d+)?)
escaped: (\\w)?((?(1)(?:[\\+-]?\\d+)|)(?:[eE][\\+-]?\\d+)?)
Although of course that won't work if you are parsing a string like The net change to blahblah was +4

The Java spec actually gives this monster of a regex for parsing doubles.
However, relying on it is considered bad practice; simply trying to parse with the intended type and catching the error tends to be more readable.
static final Pattern DOUBLE_PATTERN = Pattern
        .compile("[\\x00-\\x20]*[+-]?(NaN|Infinity|((((\\p{Digit}+)(\\.)?((\\p{Digit}+)?)"
                + "([eE][+-]?(\\p{Digit}+))?)|(\\.((\\p{Digit}+))([eE][+-]?(\\p{Digit}+))?)|"
                + "(((0[xX](\\p{XDigit}+)(\\.)?)|(0[xX](\\p{XDigit}+)?(\\.)(\\p{XDigit}+)))"
                + "[pP][+-]?(\\p{Digit}+)))[fFdD]?))[\\x00-\\x20]*");
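The parse-and-catch alternative mentioned above could look roughly like this. It is a sketch only: the class ParseOrNull and the method tryParseDouble are hypothetical names, and the input is assumed to be non-null:

```java
public class ParseOrNull {
    // Returns the parsed value, or null when the text is not a valid double.
    // Assumption: text is never null (Double.valueOf(null) would throw NullPointerException).
    static Double tryParseDouble(String text) {
        try {
            return Double.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseDouble("3.14")); // 3.14
        System.out.println(tryParseDouble("2e3"));  // 2000.0
        System.out.println(tryParseDouble("abc"));  // null
    }
}
```

This validates a whole string; to pull a number out of surrounding text, a regex such as \\d+ is still needed to locate the candidate first.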
