I am trying to split a string by a delimiter only in certain situations.
To be more specific, I want to split the conditions of a split statement.
I want to be able to split
"disorder == 1 or ( x < 100)"
into
"disorder == 1"
"(x < 100)"
If I use split("or") I would get a split inside disorder too :
"dis"
"der == 1"
"( x < 100)"
And if I try to use regex like split("[ )]or[( ]") I would lose the parentheses from ( x < 100) :
"disorder == 1"
"x < 100)"
I am looking for a way to split the string only if the delimiter is surrounded by space or parentheses, but I want to keep the surroundings.
You want to use lookaheads and lookbehinds for the spaces/parentheses, so something like this:
String input = "disorder == 1 or( x < 100)";
String[] split = input.split("(?<=[ )])or(?=[ (])");
The [ )] and [ (] are character classes that match a space or a parenthesis. They can of course be replaced with any other boundary characters, or even with the regex word boundary \\b.
The (?<=...) is a positive lookbehind. So it only matches "or" when it has a space or ) in front of it, but doesn't remove that character in the split.
The (?=...) is a positive lookahead. So it only matches "or" when it is followed by a space or (, but doesn't remove that character in the split.
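As a minimal sketch of the lookaround split described above (the class and method names are made up for illustration), the following shows that the pieces keep their surrounding space or parenthesis:

```java
public class LookaroundSplit {
    // Splits on "or" only when it is preceded by a space or ')' and
    // followed by a space or '('. Lookarounds are zero-width, so those
    // neighbouring characters remain in the resulting pieces.
    static String[] splitOnOr(String input) {
        return input.split("(?<=[ )])or(?=[ (])");
    }

    public static void main(String[] args) {
        for (String part : splitOnOr("disorder == 1 or( x < 100)")) {
            // trim() only removes the space left in front of the delimiter
            System.out.println("[" + part.trim() + "]");
        }
        // prints:
        // [disorder == 1]
        // [( x < 100)]
    }
}
```

Note that the "or" inside "disorder" is left alone because it is preceded by "s", not by a space or ).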
As flakes pointed out in the comments, you can use the word boundary character.
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
String x = "disorder == 1 or( x < 100)";
for(String s : x.split("\\bor\\b"))
System.out.println(s);
Result:
disorder == 1
( x < 100)
For a solution using lookahead/lookbehind, see Kevin's excellent answer.
I'm not entirely sure what you are doing this for: the example you presented gives only a very narrow view of what you want to do and why. Correct me if I'm wrong, but it seems that you want to parse arbitrary expressions of some kind of programming language.
In general you can't approach things like this in such a simple way. This is an expression, and it has a hierarchical structure. No simple splitting, not even with regex, will work here in general, because a regex cannot honor this hierarchical structure.
To do this properly you need to parse the expression to some extent. This is done by splitting the expression into simple tokens and rebuilding the hierarchy in a (simple) tree data model; then you can analyze it in any way you want. You can use regex to identify the individual tokens, but you need to build a tree-like data structure before you can work with the expression as a whole.
Building this tree-like structure is not trivial, as you have to consider the precedence of the various operators within your expression. But if, and only if, you have a very specific field of application (for example, a list of expressions with some very limited structure), you might be able to use the token list directly.
Here's an example for this tokenization process. Your character sequence disorder == 1 or( x < 100) might parse into some token sequence such as this:
W:"disorder"
OP:"=="
NUM:"1"
W:"or"
B:"("
W:"x"
OP:"<"
NUM:"100"
B:")"
Now you can identify the word "or" and deal with the expression the way you want.
The trick then is to perform reasonable tokenization. For this I recommend defining a set of regular expressions, each one recognizing a number, a word, an operator or a bracket. Process each string by trying to match the characters at the current position against each of these regular expressions in turn. If one matches, emit a token, then advance to the position just after the match and continue with the rest of the character sequence.
If you manage to pass through the whole character sequence this way (emitting tokens as you go), tokenization has completed successfully. If all of the individual regexes fail at some position, there is a syntactical problem in the input data. After tokenization you can do whatever you want with your tokens.
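The tokenization loop described above could be sketched roughly like this. The token labels follow the example list (W, OP, NUM, B); the concrete patterns, the class name and the decision to skip whitespace silently are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // One regex per token kind, tried in insertion order (assumed patterns).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NUM", Pattern.compile("\\d+"));
        RULES.put("W", Pattern.compile("[A-Za-z_]\\w*"));
        RULES.put("OP", Pattern.compile("==|!=|<=|>=|<|>"));
        RULES.put("B", Pattern.compile("[()]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace only separates tokens
                continue;
            }
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // must match exactly at pos
                    tokens.add(rule.getKey() + ":\"" + m.group() + "\"");
                    pos = m.end(); // advance past the token
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // no rule applies: syntactical problem in the input
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("disorder == 1 or( x < 100)").forEach(System.out::println);
    }
}
```

For the sample input this yields the nine tokens listed earlier, with W:"or" as the fourth one, so the keyword can be found without ever touching the "or" inside "disorder".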
Looks like you need to have a more complex regular expression where the word "or" plus a single preceding and succeeding character are non alphabetic. For example:
((.+)+(\Wor\W)+)+
Something like this, where you identify the pattern of characters, a separating non-word character, the literal word "or", and another separating non-word character. This may not be the exact form you need, but something similar to this that captures the pattern would probably work for you.
You can just replace the "or" with something else that does not occur in the string and then split by that character.
For example:
String [] n = input.replace("or(",":(").split(":");
Related
I have a string with spaces, some non-informative characters and substrings that need to be excluded, keeping only some important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared after the split as substrings when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and not able to find an answer in the related Stack Overflow pages. The link regex-Pattern made me more confused.
Please help, I am new with java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify string with mixed white spaces and separation marks?
Many thanks.
Inside square brackets, [\s+] is a character class containing the whitespace characters and the literal plus sign. It matches only one character, so a sequence of spaces produces many empty strings in the split, as Todd noted, and + is also treated as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
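A small sketch of the difference (the sample string and the names are made up for illustration):

```java
import java.util.Arrays;

public class BracketVsPlus {
    // Character class: splits on every single whitespace character or '+'
    static String[] splitCharClass(String s) {
        return s.split("[\\s+]");
    }

    // Quantifier: splits on each run of one or more whitespace characters
    static String[] splitRuns(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "a  b+c"; // two spaces between a and b
        System.out.println(Arrays.toString(splitCharClass(text))); // [a, , b, c]
        System.out.println(Arrays.toString(splitRuns(text)));      // [a, b+c]
    }
}
```

The empty element in the first result comes from the two adjacent spaces being treated as two separate delimiters.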
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
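Applied to the string from the question, this split can be checked with a short sketch (class and method names are hypothetical):

```java
public class SplitDroppingWith {
    // Delimiter: a run of whitespace, commas or ampersands, optionally
    // followed by the word "with" and another such run.
    static String[] splitRule(String s) {
        return s.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        for (String part : splitRule(s1)) {
            System.out.println(part);
        }
        // prints nine lines: 01:IF, rd.h, dq.L, o.LL, v.L, THEN,
        // la.VHB, av.VHR and 0.4610;
    }
}
```

Because "with" is only consumed directly after a separator run, the "h" in rd.h stays intact.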
You can try it easily here Online Regex and get useful comments.
I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can be either 'D' or 'E' in capital letters, and only one of these letters, once.
The two numbers you see must always be 2-digit decimal numbers, not 1 or 3 digits.
The string must start and end with '<' and '>' and contain '-' to separate the parts within.
The message in the middle 'hello' can be any character but must not be more than 99 characters in length. It can contain white spaces.
Also this pattern will be repeated, so the expression needs to recognise the different individual patterns within a long string of these patterns and ensure they follow this pattern structure. E.g.
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*) which sees anything after it as part of any character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making it non-greedy, or use a negated character class:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
You should not wrap -, < and > in [ ]
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.
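To pull the individual records out of a longer string in Java, the non-greedy pattern above can be combined with Matcher.find(). This is only a sketch: the class name and the sample input are invented, and the sample records are assumed to end directly with the two digits and '>'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordFinder {
    // Group 1: D or E, groups 2 and 4: two digits,
    // group 3: the message (1 to 99 characters, matched lazily)
    private static final Pattern RECORD =
            Pattern.compile("<([DE])-(\\d{2})-(.{1,99}?)-(\\d{2})>");

    // Returns the message part of every record found in the input
    static List<String> messages(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical input with two records back to back
        System.out.println(messages("<D-05-hello-87><E-11-some text-02>"));
        // prints: [hello, some text]
    }
}
```

Note that find() skips over anything that does not match, so this extracts records but does not by itself prove that the whole string consists only of valid records.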
I am trying to isolate the words, brackets and => and <=> from the following input:
(<=>A B) OR (C AND D) AND(A AND C)
So far I've managed to isolate just the words (see Scanner#useDelimiter()):
sc.useDelimiter("[^a-zA-Z]");
Upon using:
sc.useDelimiter("[\\s+a-zA-Z]");
the output is just the brackets,
which is not what I want: I want the words such as AND as well as the ).
How do I do that? Doing \\s+ gives the same result.
Also, how is a delimiter different from regex? I'm familiar with regex in PHP. Is the notation used the same?
Output I want:
(
<=>
A
(and so on)
You need a delimiting regex that can be zero-width (because you have adjacent terms), so look-arounds are the only option. Try this:
sc.useDelimiter("((?<=[()>])\\s*)|(\\s*\\b\\s*)");
This regex says "after a bracket or greater-than or at a word boundary, discarding spaces"
Also note that the character class [\\s+a-zA-Z] includes the + character - most characters lose any special regex meaning when inside a character class. It seems you were trying to say "one or more spaces", but that's not how you do that.
Inside [] the ^ means 'not', so the first regex, [^a-zA-Z], says 'give me everything that's not a-z or A-Z'
The second regex, [\\s+a-zA-Z], says 'give me everything that is space, +, a-z or A-Z'. Note that "+" is a literal plus sign when in a character class.
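A different approach from the delimiter-based one discussed above is to describe the tokens themselves and collect them with Matcher.find(). The sketch below does that; the token pattern and the names are assumptions based on the sample input, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogicTokens {
    // Longest operators first so that "<=>" is not read as "=>"
    private static final Pattern TOKEN =
            Pattern.compile("<=>|=>|[()]|[A-Za-z]+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group()); // anything between tokens is skipped
        }
        return result;
    }

    public static void main(String[] args) {
        tokens("(<=>A B) OR (C AND D) AND(A AND C)").forEach(System.out::println);
        // first lines printed: (  <=>  A  B  )  OR
    }
}
```

This sidesteps zero-width delimiters entirely, at the cost of silently ignoring characters that are not part of any token.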
I need to cut certain strings for an algorithm I am making. I am using substring() but it gets too complicated with it and actually doesn't work correctly. I found this topic how to cut string with two regular expression "_" and "."
and decided to try with split() but it always gives me
java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 0
+
^
So this is the code I have:
String[] result = "234*(4-5)+56".split("+");
/*for(int i=0; i<result.length; i++)
{
System.out.println(result[i]);
}*/
Arrays.toString(result);
Any ideas why I get this irritating exception ?
P.S. If I fix this I will post you the algorithm for cutting and then the algorithm for the whole calculator (because I am building a calculator). It is gonna be a really badass calculator, I promise :P
+ has a special meaning in regex. To have it treated as a normal character, you should escape it with a backslash.
String[] result = "234*(4-5)+56".split("\\+");
Below are the metacharacters in regex. To treat any of them as a normal character, you should escape it with a backslash.
<([{\^-=$!|]})?*+.>
refer here about how characters work in regex.
The plus + symbol has a meaning in regular expressions, which is how split parses its parameter. You'll need to regex-escape the plus character.
.split("\\+");
You should split your string like this:
String[] result = "234*(4-5)+56".split("[+]");
Since String.split takes a regex as its delimiter, and + is a meta-character in regex (it means one or more repetitions of the preceding item), it is an error to use it bare at the start of a regex.
You can put it in a character class to match a literal +, because most meta-characters lose their special meaning inside a character class. The exceptions include the hyphen (-), which denotes a range, a leading ^, which negates the class, and ] and \.
+ is a regex quantifier (meaning one or more of) so needs to be escaped in the split method:
String[] result = "234*(4-5)+56".split("\\+");
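If the delimiter is meant to be taken literally, another option is to let Pattern.quote do the escaping instead of writing the backslashes by hand. A minimal sketch (the helper name is made up):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralSplit {
    // Pattern.quote wraps the delimiter so that every character in it
    // is matched literally, whatever regex meaning it normally has.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String[] result = splitLiteral("234*(4-5)+56", "+");
        System.out.println(Arrays.toString(result)); // [234*(4-5), 56]
    }
}
```

This is handy when the delimiter comes from a variable and may contain any of the metacharacters listed above.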
I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$
However, unless you need to do some special checking of the centre characters, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
which I suspect would be faster than a regex.
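For comparison, here is a sketch of the regex variant in Java using String.matches, which always tests the whole string, so the ^ and $ anchors can be left out. The method names are hypothetical.

```java
public class DelimiterCheck {
    // true if the whole string is wrapped in [ and ]
    static boolean isBracketed(String s) {
        return s.matches("\\[.*\\]");
    }

    // true if the whole string is wrapped in double quotes
    static boolean isQuoted(String s) {
        return s.matches("\".*\"");
    }

    public static void main(String[] args) {
        System.out.println(isBracketed("[word]"));    // true
        System.out.println(isQuoted("\"word\""));     // true
        System.out.println(isBracketed("word]"));     // false
        // A single quote character is not "quoted" for the regex ...
        System.out.println(isQuoted("\""));           // false
        // ... but it passes the startsWith/endsWith check:
        System.out.println("\"".startsWith("\"") && "\"".endsWith("\"")); // true
    }
}
```

The last two lines show one behavioural difference between the two approaches: a one-character string can satisfy startsWith and endsWith at the same time, while the regex requires two separate delimiters.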
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
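In Java, the non-greedy alternative from the last point could look roughly like this (a sketch only; the class name and sample text are invented, and nesting or escaped delimiters are deliberately not handled):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedFinder {
    // ".*?" and \[.*?\] stop at the first closing delimiter they meet
    private static final Pattern DELIMITED =
            Pattern.compile("\".*?\"|\\[.*?\\]");

    static List<String> findAll(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DELIMITED.matcher(input);
        while (m.find()) { // repeated find() plays the role of the /g flag
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("x \"one\" y [two] z \"three four\""));
        // prints: ["one", [two], "three four"]
    }
}
```

With a greedy .* the first quoted match would instead run from the first quote to the last quote in the string.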
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
\[[A-Za-z]+\]
\"[A-Za-z]+\"
If they are in a single expression:
[\[\"]+[a-zA-Z]+[\]\"]+
Remember that in a .NET verbatim string (@"...") you'll need to escape each double quote " as ""