Using Scanner.useDelimiter() in Java to isolate tokens in an expression

I am trying to isolate the words, brackets and => and <=> from the following input:
(<=>A B) OR (C AND D) AND(A AND C)
So far I have managed to isolate just the words (see Scanner#useDelimiter()):
sc.useDelimiter("[^a-zA-Z]");
Upon using:
sc.useDelimiter("[\\s+a-zA-Z]");
I get just the brackets as output, which is not what I want; I also want tokens such as AND and ).
How do I do that? Using \\s+ gives the same result.
Also, how is a delimiter different from regex? I'm familiar with regex in PHP. Is the notation used the same?
Output I want:
(
<=>
A
(and so on)

You need a delimiting regex that can be zero width (because you have adjacent terms), so look-arounds are the only option. Try this:
sc.useDelimiter("((?<=[()>])\\s*)|(\\s*\\b\\s*)");
This regex says "after a bracket or greater-than or at a word boundary, discarding spaces"
Also note that the character class [\\s+a-zA-Z] includes the + character - most characters lose any special regex meaning when inside a character class. It seems you were trying to say "one or more spaces", but that's not how you do that.
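As an alternative to a delimiter, the tokens themselves can be described with a regex and pulled out with Matcher.find(). This is a different technique from the useDelimiter() call above, shown only as a minimal sketch; the class name ExpressionTokenizer and the token pattern are assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for illustration.
class ExpressionTokenizer {
    // A token is "<=>", "=>", a single bracket, or a run of letters.
    private static final Pattern TOKEN = Pattern.compile("<=>|=>|[()]|[A-Za-z]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() skips anything that is not a token (here: whitespace).
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line: (, <=>, A, B, ), OR, ...
        for (String token : tokenize("(<=>A B) OR (C AND D) AND(A AND C)")) {
            System.out.println(token);
        }
    }
}
```

Matching tokens instead of delimiters avoids reasoning about zero-width delimiters, at the cost of silently skipping any character the pattern does not describe.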

Inside [] the ^ means 'not', so the first regex, [^a-zA-Z], says 'give me everything that's not a-z or A-Z'
The second regex, [\\s+a-zA-Z], says 'give me everything that is space, +, a-z or A-Z'. Note that "+" is a literal plus sign when in a character class.
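The behaviour of the two character classes can be checked one character at a time with String.matches(). This is only a small sketch; the class and constant names are made up for the example.

```java
// Hypothetical demo class; the names are assumptions for illustration.
class CharClassDemo {
    static final String NOT_A_LETTER = "[^a-zA-Z]";
    static final String SPACE_PLUS_OR_LETTER = "[\\s+a-zA-Z]";

    public static void main(String[] args) {
        System.out.println("(".matches(NOT_A_LETTER));          // true: "(" is not a letter
        System.out.println("A".matches(NOT_A_LETTER));          // false: "A" is a letter
        System.out.println("+".matches(SPACE_PLUS_OR_LETTER));  // true: "+" is a literal plus here
        System.out.println(" ".matches(SPACE_PLUS_OR_LETTER));  // true: whitespace
        System.out.println("(".matches(SPACE_PLUS_OR_LETTER));  // false: brackets are not in the class
        System.out.println("   ".matches("\\s+"));              // true: outside a class, + means "one or more"
    }
}
```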

Related

Can we replace a specific part of a string literal using any predefined function and regex

I want to replace "&" with a random word "$d" in a given sentence.
Can we replace only those words which start with & and are followed by a single character and a space?
Example:-
Input:-
Two literals are &a and &b and also check &abc and &bac here.
Output:-
Two literals are $da and $db and also check &abc and &bac here.
In the above example in input, the only words that should be replaced are &a and &b(not the complete word should be replaced, only just the '&' in both the words) because these two random words start with & and are followed by a single character and a space.
In the case of the replaceAll() function, it replaces the entire word when I used regex:-
String str="Two literals are &a and &b and also check &abc and &bac here.";
str = str.replaceAll("\\&[a-zA-Z]{1}\\s", "\\$d");
System.out.println(str);
//output for this:-Two literals are $d and $d and also check &abc and &bac here.
//expected output:-Two literals are $da and $db and also check &abc and &bac here.
The correct code for this would be
str.replaceAll("&([a-zA-Z]\\s)", "\\$d$1")
This is an example of backreferencing captured groups in regex.
Essentially, the match inside the parentheses ([a-zA-Z]\\s) matches a single letter and a space. The value of this match can be referenced with $1 since it is of capturing group 1.
So we replace &(a ) with $d(a ) (brackets here to demonstrate what is captured). Credit to u/rzwitserloot for reminding me that OP wants $ not &.
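Put together as a runnable sketch, using the sentence from the question (the class and method names are assumptions for illustration):

```java
// Hypothetical demo class, introduced only for illustration.
class BackreferenceDemo {
    static String replaceAmpersand(String s) {
        // Group 1 captures the letter and the whitespace after it; $1 puts them back.
        // "\\$" is a literal dollar sign in the replacement string.
        return s.replaceAll("&([a-zA-Z]\\s)", "\\$d$1");
    }

    public static void main(String[] args) {
        String str = "Two literals are &a and &b and also check &abc and &bac here.";
        System.out.println(replaceAmpersand(str));
        // Two literals are $da and $db and also check &abc and &bac here.
    }
}
```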
You presumably want a concept called look-ahead: You can match on things being there without 'consuming' it. You can even match on things NOT being there. That's what you want here: Match &[a-z], but only if looking ahead past that, we do NOT see another letter:
for (String test : List.of("Two literals are &a and &bcd", "A literal is &a", "How about &a?")) {
    System.out.println(test.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d"));
}
Perhaps instead you want the single letter to be followed by any word break (i.e. &z00 should NOT turn into $dz00, even though there is no letter after the z). Then I suggest:
"&(?=[a-zA-Z]\\b)"
That's a lot simpler to read!
A few notes:
(?=x) is 'positive lookahead'. It doesn't itself match anything but makes the match fail if x is not immediately following the match.
(?!x) is 'negative lookahead'. It doesn't itself match anything but makes the match fail if x is immediately following the match.
$ has special meaning in the replacement part so we need to escape it.
\\b is regexpese for 'word break': Doesn't match any characters, but fails if we aren't on a 'word break'. Spaces, dots, end-of-input, end-of-line, a dash, an ampersand - many things are word breaks.
We don't want to match those letters because if we do, they would be replaced.
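The difference between the two lookahead variants shows up on an input like &z00. A minimal sketch, with hypothetical class and method names:

```java
// Hypothetical demo class comparing the two lookahead patterns from the answer.
class LookaheadDemo {
    // "&" followed by one letter that is not followed by another letter.
    static String beforeSingleLetter(String s) {
        return s.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d");
    }

    // "&" followed by one letter that is followed by a word boundary.
    static String beforeSingleLetterWord(String s) {
        return s.replaceAll("&(?=[a-zA-Z]\\b)", "\\$d");
    }

    public static void main(String[] args) {
        System.out.println(beforeSingleLetter("Two literals are &a and &bcd")); // Two literals are $da and &bcd
        System.out.println(beforeSingleLetter("&z00"));                          // $dz00
        System.out.println(beforeSingleLetterWord("&z00"));                      // &z00
        System.out.println(beforeSingleLetterWord("How about &a?"));             // How about $da?
    }
}
```

Because the lookaheads consume nothing, only the & itself is replaced and the letter after it is left in place.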

Tokenizing an infix string in Java

I'm implementing the Shunting Yard Algorithm in Java, as a side project to my AP Computer Science class. I've implemented a simple one in Javascript, with only basic arithmetic expressions (addition, subtraction, multiplication, division, exponentiation). To split that into an array, what I did was find each of the operators (+-*/^), as well as numbers and parentheses, and I put a space around them, and then I split it into an array. For example, the infix string 4+(3+2) would be made into 4 + ( 3 + 2 ), and then split on whitespace.
However, I feel that this method is very slow, and it gets increasingly harder and inefficient to implement as you start to add mathematical functions, such as sine, cosine, tangent, absolute value, and others.
What would be the best way to split a string like sin(4+3)-8 into an array ["sin","(",4,"+",3,")","-",8]?
I could use regex for this, but I don't really understand them well, and I'm trying to learn them, so if that would be the best solution to them, could the answerer please explain what it does?
Try splitting on the regex
(?<=[^\.a-zA-Z\d])|(?=[^\.a-zA-Z\d])
It will split the string at every position that is either preceded or followed by a character that is neither a letter, a digit, nor a period.
(?<=[^\.a-zA-Z\d]) is a positive lookbehind. It matches the place between two characters, if the preceding string matches the sub-regex contained within (?<=...).
[^\.a-zA-Z\d] is a negated character class. It matches a single character that is not contained within [^...].
\. matches a literal period (.).
a-z matches any lowercase character between a and z.
A-Z is the same, but for uppercase.
\d is the equivalent of [0-9], so it matches any digit.
| is the equivalent of an "or". It makes the regex match either the preceding half of the regex or the following half.
(?=[^\.a-zA-Z\d]) is the same as the first half of the regex, except that it is a positive lookahead. It matches the place between two characters, if the following string matches the sub-regex contained within (?=...).
You can implement this regex in java like this:
String str = "sin(4+3)-8";
String[] parts = str.split("(?<=[^\\.a-zA-Z\\d])|(?=[^\\.a-zA-Z\\d])");
Result:
["sin", "(", "4", "+", "3", ")", "-", "8"]
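As a self-contained sketch of the split above (the class and method names are assumptions; note that every element comes back as a String, numbers included):

```java
import java.util.Arrays;

// Hypothetical demo class wrapping the split from the answer.
class InfixSplitDemo {
    static String[] tokenize(String infix) {
        // Split before and after every character that is not a letter, digit or period.
        return infix.split("(?<=[^\\.a-zA-Z\\d])|(?=[^\\.a-zA-Z\\d])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("sin(4+3)-8"))); // [sin, (, 4, +, 3, ), -, 8]
        System.out.println(Arrays.toString(tokenize("3.14*2")));     // [3.14, *, 2]
    }
}
```

Whitespace is not a letter, digit or period either, so any spaces in the input would come back as tokens of their own; stripping them first is probably the simplest fix.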

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
[] are used to match any one of the characters you specify inside the brackets, and may be followed by a quantifier. So ^[\(\d+\)] matches just a single character: a (, a digit, a + or a ).
The second regexp will work: ^\(\d+\), so check your code.
Check also so there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, java can be tricky with escapes depending on if the regexp you type is directly translated to a regexp or is first a string literal. You may need to double escape your escapes.
In Java you have to escape parentheses, so "\\(\\d+\\)" should match (1) in both case 1 and case 2. Adding ^ as you did, "^\\(\\d+\\)", will match only case 1.
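A minimal sketch of the two patterns against both cases from the question (class and method names are assumptions for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the anchored and unanchored patterns.
class LeadingNumberDemo {
    private static final Pattern ANYWHERE = Pattern.compile("\\(\\d+\\)");
    private static final Pattern AT_START = Pattern.compile("^\\(\\d+\\)");

    static boolean containsMarker(String s) {
        return ANYWHERE.matcher(s).find();
    }

    static boolean startsWithMarker(String s) {
        return AT_START.matcher(s).find();
    }

    public static void main(String[] args) {
        String case1 = "(1) Hello, how are you?";
        String case2 = "Hi. (1) How're you doing?";
        System.out.println(containsMarker(case1));   // true
        System.out.println(containsMarker(case2));   // true
        System.out.println(startsWithMarker(case1)); // true
        System.out.println(startsWithMarker(case2)); // false
    }
}
```

find() is used rather than String.matches() because matches() requires the whole input to match the pattern.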
You have to use double backslashes within a Java string literal. Consider this:
"\n" gives you [line break]
"\\n" gives you [backslash][n]
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
Java's regex engine supports positive lookbehind as long as the lookbehind has a bounded length, so you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $
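In Java source the same lookbehind pattern needs its backslashes doubled. A hedged sketch that returns the text following the marker (class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookbehind pattern from the answer.
class LookbehindDemo {
    private static final Pattern AFTER_MARKER =
            Pattern.compile("(?<=[(][0-9]{1,9999}[)]\\s?)\\b.*$");

    // Returns the text after the first "(digits)" marker, or null if there is none.
    static String textAfterMarker(String s) {
        Matcher m = AFTER_MARKER.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterMarker("(1) Hello, how are you?"));   // Hello, how are you?
        System.out.println(textAfterMarker("Hi. (1) How're you doing?")); // How're you doing?
    }
}
```

Note that this pattern matches the text after the marker rather than the marker itself, and it is not anchored to the start of the input.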

How to make a regular expression that matches tokens with delimiters and separators?

I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can be either 'D' or 'E' in capital letters, and only one of these letters, once.
The two numbers you see must always be 2-digit decimal numbers, not 1 or 3 digits.
The string must start and end with '<' and '>' and contain '-' to separate the parts within.
The message in the middle 'hello' can be any character but must not be more than 99 characters in length. It can contain white spaces.
Also this pattern will be repeated, so the expression needs to recognise the different individual patterns within a long string of these patterns and ensure they follow this pattern structure. E.g
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*) which sees anything after it as part of any character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making the quantifier non-greedy, or use a negated character class:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
You don't need to wrap -, < and > in [ ]; none of them is a special character outside a character class.
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.
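A sketch of how the non-greedy pattern might be used in Java to pull the messages out of a longer string and to check that a string consists only of such blocks. The class name, method names, and sample inputs are assumptions, not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class built around the pattern from the answer.
class FrameDemo {
    private static final String FRAME_REGEX = "<([DE])-(\\d{2})-(.{1,99}?)-(\\d{2})>";
    private static final Pattern FRAME = Pattern.compile(FRAME_REGEX);
    private static final Pattern SEQUENCE = Pattern.compile("(?:" + FRAME_REGEX + ")+");

    // Collects the message part (group 3) of every block found in the input.
    static List<String> messages(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FRAME.matcher(input);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    // True if the whole input is one or more blocks back to back.
    static boolean isValidSequence(String input) {
        return SEQUENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "<D-05-hello-87><E-12-hi there-34>";
        System.out.println(messages(input));                   // [hello, hi there]
        System.out.println(isValidSequence(input));            // true
        System.out.println(isValidSequence("<D-5-hello-87>")); // false: first number has one digit
    }
}
```

Since . also accepts - and >, a message can still swallow those characters when the engine needs them to complete a match; the [^-] variant is stricter if the message may never contain a dash.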

Regular expression to match strings enclosed in square brackets or double quotes

I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
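In Java the two expressions need their backslashes and quotes escaped inside the string literal. A minimal sketch with hypothetical class and method names; String.matches() already requires the whole string to match, so no ^ and $ are needed here:

```java
// Hypothetical demo class for the two expressions above.
class DelimitedWordDemo {
    static boolean isBracketedWord(String s) {
        return s.matches("\\[\\w+\\]");
    }

    static boolean isQuotedWord(String s) {
        return s.matches("\"\\w+\"");
    }

    public static void main(String[] args) {
        System.out.println(isBracketedWord("[word]"));      // true
        System.out.println(isBracketedWord("[two words]")); // false: \w does not match a space
        System.out.println(isQuotedWord("\"word\""));       // true
        System.out.println(isQuotedWord("word"));           // false
    }
}
```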
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$
However, unless you need to do some special checking with the centre characters, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
I suspect this would be faster than a regex.
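One edge case with the plain string checks: a string consisting of a single " both starts and ends with ", while the regex ^".*"$ needs two quote characters. A small sketch with a length guard (the helper name isEnclosed is an assumption):

```java
// Hypothetical helper class, introduced only for illustration.
class EnclosureCheck {
    // True if s starts with open and ends with close, without the two overlapping.
    static boolean isEnclosed(String s, String open, String close) {
        return s.length() >= open.length() + close.length()
                && s.startsWith(open)
                && s.endsWith(close);
    }

    public static void main(String[] args) {
        System.out.println(isEnclosed("[a b]", "[", "]"));     // true
        System.out.println(isEnclosed("\"x\"", "\"", "\""));   // true
        System.out.println(isEnclosed("\"", "\"", "\""));      // false: only one quote character
    }
}
```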
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
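The non-greedy variant translates to Java roughly as follows, with Matcher.find() in a loop taking the place of the /g flag. The class name and sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for finding every quoted or bracketed run.
class DelimitedFinder {
    private static final Pattern DELIMITED = Pattern.compile("\".*?\"|\\[.*?\\]");

    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = DELIMITED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints each match on its own line: "a", [b], "c d"
        for (String s : findAll("say \"a\" and [b] then \"c d\"")) {
            System.out.println(s);
        }
    }
}
```

With the greedy .* instead of .*? the first alternative would run from the first quote to the last one in the input.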
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
\[[A-Za-z]+\]
\"[A-Za-z]+\"
If they are in a single expression:
[[\"]+[a-zA-Z]+[]\"]+
Remember that in .NET you'll need to escape each double quote " as "".
