How can I tokenize this string in Java?

How can I split these simple mathematical expressions into separate strings?
I know that I basically want to use the regular expression: "[0-9]+|[*+-^()]" but it appears String.split() won't work because it consumes the delimiter tokens as well.
I want it to split all integers: 0-9, and all operators *+-^().
So, 578+223-5^2
Will be split into:
578
+
223
-
5
^
2
What is the best approach to do that?

You could use StringTokenizer(String str, String delim, boolean returnDelims), with the operators as delimiters. This way, you at least get each token individually (including the delimiters). You could then determine what kind of token you're looking at.
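As a minimal sketch of the StringTokenizer suggestion (the class and method names here are made up for illustration), passing true as the third argument makes each delimiter character come back as its own token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Splits an expression into numbers and operators, keeping the operators.
    static List<String> tokenize(String expression) {
        // returnDelims = true: every delimiter character is returned as a
        // one-character token instead of being discarded.
        StringTokenizer tokenizer = new StringTokenizer(expression, "*+-^()", true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("578+223-5^2")); // prints [578, +, 223, -, 5, ^, 2]
    }
}
```

Note that the delimiter string is a plain list of characters, not a regex, so nothing needs escaping here.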

Going at this laterally, and assuming your intention is ultimately to evaluate the String mathematically, you might be better off using the ScriptEngine
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
public class Evaluator {
private ScriptEngineManager sm = new ScriptEngineManager();
private ScriptEngine sEngine = sm.getEngineByName("js");
public double stringEval(String expr)
{
Object res = "";
try {
res = sEngine.eval(expr);
}
catch(ScriptException se) {
se.printStackTrace();
}
return Double.parseDouble( res.toString());
}
}
Which you can then call as follows:
Evaluator evr = new Evaluator();
String sTest = "+1+9*(2 * 5)";
double dd = evr.stringEval(sTest);
System.out.println(dd);
I went down this road when working on evaluating Strings mathematically and it's not so much the operators that will kill you in regexps but complex nested bracketed expressions. Not reinventing the wheel is a) safer b) faster and c) means less complex and nested code to maintain.
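One caveat with the approach above: getEngineByName returns null when no engine is registered under that name, and the Nashorn JavaScript engine was removed from the JDK in version 15, so on newer runtimes the "js" lookup may fail unless a script engine is added to the classpath. A defensive variant might look like this sketch (SafeEvaluator and its OptionalDouble return type are illustrative choices, not part of the original answer):

```java
import java.util.OptionalDouble;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class SafeEvaluator {
    // Evaluates the expression, or returns empty if no "js" engine is
    // available or the expression does not evaluate to a number.
    static OptionalDouble eval(String expr) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
        if (engine == null) {
            return OptionalDouble.empty(); // no JavaScript engine installed
        }
        try {
            Object result = engine.eval(expr);
            if (result instanceof Number) {
                return OptionalDouble.of(((Number) result).doubleValue());
            }
            return OptionalDouble.empty();
        } catch (ScriptException e) {
            return OptionalDouble.empty(); // syntax or runtime error in the script
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("+1+9*(2 * 5)"));
    }
}
```

Keep in mind that eval runs arbitrary script code, so it should only be given input you trust.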

This works for the sample string you posted:
String s = "578+223-5^2";
String[] tokens = s.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
The regex is made up entirely of lookaheads and lookbehinds; it matches a position (not a character, but a "gap" between characters), that is either preceded by a digit and followed by a non-digit, or preceded by a non-digit and followed by a digit.
Be aware that regexes are not well suited to the task of parsing math expressions. In particular, regexes can't easily handle balanced delimiters like parentheses, especially if they can be nested. (Some regex flavors have extensions which make that sort of thing easier, but not Java's.)
Beyond this point, you'll want to process the string using more mundane methods like charAt() and substring() and Integer.parseInt(). Or, if this isn't a learning exercise, use an existing math expression parsing library.
EDIT: ...or eval() it as @Syzygy recommends.
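If parentheses matter, note that the lookaround split keeps adjacent non-digit characters together: "2*(3+4)" comes back as 2, *(, 3, +, 4, ). An alternative is to match the tokens rather than split on them, using Matcher.find(). This is only a sketch; the pattern and the class name are my own assumptions, and characters that match neither alternative are silently skipped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // Either a run of digits or a single operator/parenthesis. The '-' is
    // listed first in the character class so it is a literal, not a range.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[-+*/^()]");

    static List<String> tokenize(String expression) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(expression);
        while (matcher.find()) {
            tokens.add(matcher.group()); // the text of the current match
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("578+223-5^2")); // prints [578, +, 223, -, 5, ^, 2]
        System.out.println(tokenize("2*(3+4)"));     // prints [2, *, (, 3, +, 4, )]
    }
}
```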

You can't use String.split() for that, since whatever characters match the specified pattern are removed from the output.
If you're willing to require spaces between the tokens, you can do...
"578 + 223 - 5 ^ 2 ".split(" ");
which yields...
578
+
223
-
5
^
2

Here's a short Java program that tokenizes such strings. If you're looking to evaluate expressions, I can (shamelessly) point you at this post: An Arithmetic Expressions Solver in 64 Lines
import java.util.ArrayList;
import java.util.List;
public class Tokenizer {
private String input;
public Tokenizer(String input_) { input = input_.trim(); }
private char peek(int i) {
return i >= input.length() ? '\0' : input.charAt(i);
}
private String consume(String... arr) {
for(String s : arr)
if(input.startsWith(s))
return consume(s.length());
return null;
}
private String consume(int numChars) {
String result = input.substring(0, numChars);
input = input.substring(numChars).trim();
return result;
}
private String literal() {
for (int i = 0; true; ++i)
if (!Character.isDigit(peek(i)))
return consume(i);
}
public List<String> tokenize() {
List<String> res = new ArrayList<String>();
if(input.isEmpty())
return res;
while(true) {
res.add(literal());
if(input.isEmpty())
return res;
String s = consume("+", "-", "/", "*", "^");
if(s == null)
throw new RuntimeException("Syntax error " + input);
res.add(s);
}
}
public static void main(String[] args) {
Tokenizer t = new Tokenizer("578+223-5^2");
System.out.println(t.tokenize());
}
}

You only put the delimiters in the split statement. Also, the - means a range inside a character class and has to be escaped.
"578+223-5^2".split("[*+\\-^()]")

You need to escape the -. I believe the quantifiers (+ and *) lose their special meaning, as do parentheses in a character class. If it doesn't work, try escaping those as well.
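To see why the unescaped - matters, here is a small sketch (class and method names are illustrative). In "[*+-^()]" the sequence +-^ is read as the range from '+' (0x2B) to '^' (0x5E), which covers the digits and the uppercase letters as well:

```java
class CharClassRangeDemo {
    // Character class from the question: "+-^" is interpreted as a range.
    static boolean matchesUnescaped(String s) {
        return s.matches("[*+-^()]");
    }

    // Same class with the '-' escaped, so it is a literal minus sign.
    static boolean matchesEscaped(String s) {
        return s.matches("[*+\\-^()]");
    }

    public static void main(String[] args) {
        System.out.println(matchesUnescaped("5")); // prints true
        System.out.println(matchesUnescaped("A")); // prints true
        System.out.println(matchesEscaped("5"));   // prints false
        System.out.println(matchesEscaped("-"));   // prints true
    }
}
```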

Here is my tokenizer solution that allows for negative numbers (unary).
So far it has been doing everything I needed it to:
private static List<String> tokenize(String expression)
{
char c;
List<String> tokens = new ArrayList<String>();
String previousToken = null;
int i = 0;
while(i < expression.length())
{
c = expression.charAt(i);
StringBuilder currentToken = new StringBuilder();
if (c == ' ' || c == '\t') // Matched Whitespace - Skip Whitespace
{
i++;
}
else if (c == '-' && (previousToken == null || isOperator(previousToken)) &&
((i+1) < expression.length() && Character.isDigit(expression.charAt((i+1))))) // Matched Negative Number - Add token to list
{
currentToken.append(expression.charAt(i));
i++;
while(i < expression.length() && Character.isDigit(expression.charAt(i)))
{
currentToken.append(expression.charAt(i));
i++;
}
}
else if (Character.isDigit(c)) // Matched Number - Add to token list
{
while(i < expression.length() && Character.isDigit(expression.charAt(i)))
{
currentToken.append(expression.charAt(i));
i++;
}
}
else if (c == '+' || c == '*' || c == '/' || c == '^' || c == '-') // Matched Operator - Add to token list
{
currentToken.append(c);
i++;
}
else // No Match - Invalid Token!
{
i++;
}
if (currentToken.length() > 0)
{
tokens.add(currentToken.toString());
previousToken = currentToken.toString();
}
}
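return tokens;
}
// Helper assumed by tokenize(); it was not shown in the original answer.
private static boolean isOperator(String token)
{
return token.equals("+") || token.equals("-") || token.equals("*") || token.equals("/") || token.equals("^");
}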

You have to escape the "()" in Java, and the '-'
myString.split("[0-9]+|[\\*\\+\\-^\\(\\)]");

Related

How to find first character after second dot java

Do you have any ideas how I could get the first character after the second dot of a string?
String str1 = "test.1231.asdasd.cccc.2.a.2";
String str2 = "aaa.1.22224.sadsada";
In first case I should get a and in second 2.
I thought about splitting the string on dots and extracting the first character of the third element. But that seems too complicated, and I think there is a better way.

How about a regex for this?
Pattern p = Pattern.compile(".+?\\..+?\\.(\\w)");
Matcher m = p.matcher(str1);
if (m.find()) {
System.out.println(m.group(1));
}
The regex says: find anything one or more times in a non-greedy fashion (.+?), which must be followed by a dot (\\.), then again anything one or more times in a non-greedy fashion (.+?) followed by a dot (\\.). After this has matched, capture the first word character in the first group ((\\w)).

Usually a regex will do an excellent job here. Still, if you are looking for something more customizable, consider the following implementation:
private static int positionOf(String source, String target, int match) {
if (match < 1) {
return -1;
}
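// Start so that the first search begins at index 0, whatever the target length.
int result = -target.length();
do {
result = source.indexOf(target, result + target.length());
} while (--match > 0 && result >= 0); // >= 0 so that a match at index 0 does not stop the search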
return result;
}
and then the test is done with:
String str1 = "test..1231.asdasd.cccc..2.a.2.";
System.out.println(positionOf(str1, ".", 3)); // prints 10
System.out.println(positionOf(str1, "c", 4)); // prints 21
System.out.println(positionOf(str1, "c", 5)); // prints -1
System.out.println(positionOf(str1, "..", 2)); // prints 22
Keep in mind that the first character after the match is at position result + target.length() (here 22 + 2), and that the string may have no character at that index.

Without using a Pattern, you can use the substring and charAt methods of the String class to achieve this:
// You can return String instead of char
public static char returnSecondChar(String strParam) {
String tmpSubString = "";
// First check if . exists in the string.
if (strParam.indexOf('.') != -1) {
// If yes, then extract substring starting from .+1
tmpSubString = strParam.substring(strParam.indexOf('.') + 1);
System.out.println(tmpSubString);
// Check if second '.' exists
if (tmpSubString.indexOf('.') != -1) {
// If it exists, get the char at index of . + 1
return tmpSubString.charAt(tmpSubString.indexOf('.') + 1);
}
}
// If the string does not contain two '.' characters, return '-'. You can return anything else here.
return '-';
}

You could do it by splitting the String like this:
public static void main(String[] args) {
String str1 = "test.1231.asdasd.cccc.2.a.2";
String str2 = "aaa.1.22224.sadsada";
System.out.println(getCharAfterSecondDot(str1));
System.out.println(getCharAfterSecondDot(str2));
}
public static char getCharAfterSecondDot(String s) {
String[] split = s.split("\\.");
// TODO check if there are values in the array!
return split[2].charAt(0);
}
I don't think it is too complicated, but using a directly matching regex is a very good (maybe better) solution anyway.
Please note that the input String might contain fewer than two dots; that case has to be handled (see the TODO comment in the code).
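One way to fill in that TODO, sketched under my own assumptions (the Optional return type and the names are not from the answer above), is to split with a limit of 3 so that the third element holds everything after the second dot, and then check both the array length and that the element is non-empty:

```java
import java.util.Optional;

class SecondDot {
    // Returns the first character after the second dot, or empty if the
    // string has fewer than two dots or nothing follows the second dot.
    static Optional<Character> charAfterSecondDot(String s) {
        // With limit 3 the pattern is applied at most twice; parts[2] is the
        // remainder of the input, and trailing empty strings are kept.
        String[] parts = s.split("\\.", 3);
        if (parts.length < 3 || parts[2].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(parts[2].charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(charAfterSecondDot("test.1231.asdasd.cccc.2.a.2")); // prints Optional[a]
        System.out.println(charAfterSecondDot("aaa.1.22224.sadsada"));         // prints Optional[2]
        System.out.println(charAfterSecondDot("test"));                        // prints Optional.empty
    }
}
```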

You can use Java Stream API since Java 8:
String string = "test.1231.asdasd.cccc.2.a.2";
Arrays.stream(string.split("\\.")) // Split by dot
.skip(2).limit(1) // Skip 2 initial parts and limit to one
.map(i -> i.substring(0, 1)) // Map to the first character
.findFirst().ifPresent(System.out::println); // Get first and print if exists
However, I recommend sticking with a regex, which is the safer approach.
Here is the Regex you need (demo available at Regex101):
.*?\..*?\.(.).*
Don't forget to escape the special characters with a double backslash (\\) in the Java string literal.
String[] array = new String[3];
array[0] = "test.1231.asdasd.cccc.2.a.2";
array[1] = "aaa.1.22224.sadsada";
array[2] = "test";
Pattern p = Pattern.compile(".*?\\..*?\\.(.).*");
for (int i=0; i<array.length; i++) {
Matcher m = p.matcher(array[i]);
if (m.find()) {
System.out.println(m.group(1));
}
}
This code prints a and 2, each on its own line. Nothing is printed for the third String, because it has no match.

A plain solution using String.indexOf:
public static Character getCharAfterSecondDot(String s) {
int indexOfFirstDot = s.indexOf('.');
if (!isValidIndex(indexOfFirstDot, s)) {
return null;
}
int indexOfSecondDot = s.indexOf('.', indexOfFirstDot + 1);
return isValidIndex(indexOfSecondDot, s) ?
s.charAt(indexOfSecondDot + 1) :
null;
}
protected static boolean isValidIndex(int index, String s) {
return index != -1 && index < s.length() - 1;
}
Using indexOf(int ch) and indexOf(int ch, int fromIndex), each character is examined at most once, even in the worst case.
And a second version implementing the same logic using indexOf with Optional:
public static Character getCharAfterSecondDot(String s) {
return Optional.of(s.indexOf('.'))
.filter(i -> isValidIndex(i, s))
.map(i -> s.indexOf('.', i + 1))
.filter(i -> isValidIndex(i, s))
.map(i -> s.charAt(i + 1))
.orElse(null);
}

Just another approach, not a one-liner code but simple.
public class Test{
public static void main (String[] args){
for(String str:new String[]{"test.1231.asdasd.cccc.2.a.2","aaa.1.22224.sadsada"}){
int n = 0;
for(char c : str.toCharArray()){
if(2 == n){
System.out.printf("found char: %c%n",c);
break;
}
if('.' == c){
n ++;
}
}
}
}
}
found char: a
found char: 2

Returning a string minus a specific character between specific characters

I am going through the Java CodingBat exercises. Here is the one I am stuck on:
Look for patterns like "zip" and "zap" in the string -- length-3, starting with 'z' and ending with 'p'. Return a string where for all such words, the middle letter is gone, so "zipXzap" yields "zpXzp".
Here is my code:
public String zipZap(String str){
String s = ""; //Initialising return string
String diff = " " + str + " "; //Ensuring no out of bounds exceptions occur
for (int i = 1; i < diff.length()-1; i++) {
if (diff.charAt(i-1) != 'z' &&
diff.charAt(i+1) != 'p') {
s += diff.charAt(i);
}
}
return s;
}
This is successful for a few of them but not for others. It seems like the && operator is acting like a || for some of the example strings; that is to say, many of the characters I want to keep are not being kept. I'm not sure how I would go about fixing it.
A nudge in the right direction if you please! I just need a hint!

Actually it is the other way around. You should do:
if (diff.charAt(i-1) != 'z' || diff.charAt(i+1) != 'p') {
s += diff.charAt(i);
}
Which is equivalent to:
if (!(diff.charAt(i-1) == 'z' && diff.charAt(i+1) == 'p')) {
s += diff.charAt(i);
}
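Put together, the asker's method with the corrected condition might look like the sketch below. By De Morgan's law, "not (previous is z and next is p)" is the same as "previous is not z or next is not p". The wrapper class and the StringBuilder are my additions:

```java
class ZipZap {
    static String zipZap(String str) {
        StringBuilder result = new StringBuilder();
        // Pad both ends so that charAt(i - 1) and charAt(i + 1) stay in bounds.
        String padded = " " + str + " ";
        for (int i = 1; i < padded.length() - 1; i++) {
            // A character is dropped only when it sits between a 'z' and a 'p'.
            boolean middleOfPattern =
                    padded.charAt(i - 1) == 'z' && padded.charAt(i + 1) == 'p';
            if (!middleOfPattern) {
                result.append(padded.charAt(i));
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(zipZap("zipXzap")); // prints zpXzp
    }
}
```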

This sounds like the perfect use of a regular expression.
The regex "z.p" will match any three letter token starting with a z, having any character in the middle, and ending in p. If you require it to be a letter you could use "z[a-zA-Z]p" instead.
So you end up with
public String zipZap(String str) {
return str.replaceAll("z[a-zA-Z]p", "zp");
}
This passes all the tests, by the way.
You could make the argument that this question is about raw string manipulation, but I would argue that that makes this an even better lesson: applying regexes appropriately is a massively useful skill to have!

public String zipZap(String str) {
// Strings shorter than 3 characters cannot contain a z-p pattern.
if (str.length() < 3)
{
return str;
}
// Build the return value incrementally.
StringBuilder ret = new StringBuilder();
int i = 0;
while (i < str.length())
{
// Check the bounds first so that charAt(i + 2) can never run past the end.
if (i + 2 < str.length() && str.charAt(i) == 'z' && str.charAt(i + 2) == 'p')
{
// Keep "zp", drop the middle character, and skip over the whole pattern.
ret.append("zp");
i += 3;
}
else
{
// Not a z-p pattern: copy the character unchanged.
ret.append(str.charAt(i));
i++;
}
}
return ret.toString();
}

Add separator in string using regex in Java

I have a string (for example: "foo12"), and I want to add a delimiting character in between the letters and numbers (e.g. "foo|12"). However, I can't seem to figure out what the appropriate code is for doing this in Java. Should I use a regex + replace or do I need to use a matcher?

A regex replace would be just fine:
String result = subject.replaceAll("(?<=\\p{L})(?=\\p{N})", "|");
This looks for a position right after a letter and right before a digit (by using lookaround assertions). If you only want to look for ASCII letters/digits, use
String result = subject.replaceAll("(?i)(?<=[a-z])(?=[0-9])", "|");
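A quick way to try the lookaround version is the following sketch (the wrapper class and method name are illustrative). Because the regex matches an empty position rather than any characters, nothing from the input is consumed:

```java
class SeparatorDemo {
    // Inserts "|" at every position that follows a letter and precedes a digit.
    static String addSeparator(String subject) {
        return subject.replaceAll("(?<=\\p{L})(?=\\p{N})", "|");
    }

    public static void main(String[] args) {
        System.out.println(addSeparator("foo12")); // prints foo|12
        System.out.println(addSeparator("a1b22")); // prints a|1b|22
        System.out.println(addSeparator("12foo")); // prints 12foo
    }
}
```

The last call shows that a digit followed by a letter is left alone; only the letter-to-digit transition gets a separator.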

Split letters and numbers and concatenate with "|". Here is a one-liner:
String x = "foo12";
String result = x.replaceAll("[0-9]", "") + "|" + x.replaceAll("[a-zA-Z]", "");
Printing result will output: foo|12

Why even use regex? This isn't too hard to implement on your own:
public static String addDelimiter(String str, char delimiter) {
StringBuilder string = new StringBuilder(str);
boolean isLetter = false;
boolean isNumber = false;
for (int index = 0; index < string.length(); index++) {
isNumber = isNumber(string.charAt(index));
if (isLetter && isNumber) {
//the last char was a letter, and now we have a number
//so here we adjust the stringbuilder
string.insert(index, delimiter);
index++; //We just inserted the delimiter, get past the delimiter
}
isLetter = isLetter(string.charAt(index));
}
return string.toString();
}
public static boolean isLetter(char c) {
return 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z';
}
public static boolean isNumber(char c) {
return '0' <= c && c <= '9';
}
The advantage of this over regex is that regex can easily be slower. Additionally, it is easy to change the isLetter and isNumber methods to allow for inserting the delimiter in different places.

Split a Java String

I'm having a little trouble with Java regexes. I have a string like this
a + 4 * log(3/abs(1 - x)) + sen(-b/4 + PI)
and I need to split it into the following tokens:
{"a", "+", "4", "*", "log", "(3/abs(1 - x))", "+", "sen", "(-b/4 + PI)"}
Any idea?
I tried this PHP regex, but for some reason it won't work in Java:
[a-z]+(\((?>[^()]+|(?1))*\))|[a-z]+|\d+|\/|\-|\*|\+

Match All vs Splitting
Matching and splitting are two sides of the same coin. This is quite tricky because Java does not support recursion and we have some nested parentheses. But this should do the trick:
Java
\(.*?\)(?![^(]*\))|[^\s(]+
To iterate over all the matches:
Pattern regex = Pattern.compile("\\(.*?\\)(?![^(]*\\))|[^\\s(]+");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// the match: regexMatcher.group()
}
Explanation
\(.*?\)(?![^(]*\)) matches an opening parenthesis and everything up to a closing parenthesis that is not followed by another closing parenthesis before the next opening one. This works for the (simple(nesting)) in your expression, but would not work for (this(kind)of(nesting)) (see the PHP solution)
| OR...
[^\s(]+ any chars that are not spaces or an opening par
PHP Option with Recursion
In PHP, we can use recursion to match the nested constructs more precisely (this overcomes the Java problem with (this(kind)of(nesting))):
(\((?:[^()]++|(?1))*\))|[^\s(]+
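For reference, here is the Java pattern from this answer wrapped in a small runnable sketch that collects the matches into a list (the class and method names are illustrative, and the input uses a plain ASCII minus sign):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionSplitter {
    // Either a parenthesized group (simple nesting only) or a run of
    // characters that are neither whitespace nor an opening parenthesis.
    private static final Pattern TOKEN =
            Pattern.compile("\\(.*?\\)(?![^(]*\\))|[^\\s(]+");

    static List<String> split(String expression) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(expression);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a, +, 4, *, log, (3/abs(1 - x)), +, sen, (-b/4 + PI)]
        System.out.println(split("a + 4 * log(3/abs(1 - x)) + sen(-b/4 + PI)"));
    }
}
```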

I have written a small Java program that splits the string without using a regular expression split; see if this helps:
import java.util.ArrayList;
public class Test2 {
public static void main(String args[]) {
System.out.println(splitExp("a + 4 * log(3/abs(1 - x)) + sen(-b/4 + PI)"));
}
private static ArrayList<String> splitExp(String exp) {
StringBuilder chString = new StringBuilder();
ArrayList<String> arrL = new ArrayList<String>();
for (int i = 0 ; i < exp.length() ; i++ ) {
char ch = exp.charAt(i);
if(ch == ' ')
continue;
if(( ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')) {
chString = chString.append(String.valueOf(ch));
}
else {
if (chString.length() > 0) {
arrL.add(chString.toString());
chString = new StringBuilder();
}
arrL.add(String.valueOf(ch));
}
}
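// Flush a trailing identifier that is not followed by another character.
if (chString.length() > 0) {
arrL.add(chString.toString());
}
return arrL;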
}
}

Java Regex help extracting with negative lookahead

I have the regex \\(.*?\\) to match whatever is inside the parentheses in my text
e.g. ((a=2 and age IN (15,18,56)) and (b=3 and c=4))
my output should only contain:
a=2 and age IN (15,18,56)
b=3 and c=4
I have tried using a negative lookahead to exclude the IN (...) list, .*(?!IN)\\(.*?\\), but it does not return what I expect. Can anybody help me see where I am going wrong?

You will need to parse nested expressions, and regular expressions alone cannot do that for you. A regular expression will only catch the innermost expressions with \\(([^(]*?)\\)
You can use the Pattern and Matcher classes to code a more complex solution.
Or you can use a parser. For Java, there's ANTLR.
I just coded something that might help you:
import java.util.ArrayList;
import java.util.List;
public class NestedParser {
private final char opening;
private final char closing;
private String str;
private List<String> matches;
private int matchFrom(int beginIndex, boolean matchClosing) {
int i = beginIndex;
while (i < str.length()) {
if (str.charAt(i) == opening) {
i = matchFrom(i + 1, true);
if (i < 0) {
return i;
}
} else if (matchClosing && str.charAt(i) == closing) {
matches.add(str.substring(beginIndex, i));
return i + 1;
} else {
i++;
}
}
return -1;
}
public NestedParser(char opening, char closing) {
this.opening = opening;
this.closing = closing;
}
public List<String> match(String str) {
matches = new ArrayList<>();
if (str != null) {
this.str = str;
matchFrom(0, false);
}
return matches;
}
public static void main(String[] args) {
NestedParser parser = new NestedParser('(', ')');
System.out.println(parser.match(
"((a=2 and age IN (15,18,56)) and (b=3 and c=4))"));
}
}
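The parser above collects the contents of every nesting level. If only one level is wanted, as in the question's expected output, a simpler depth counter may be enough. The following is a sketch of that different technique, not the author's code; DepthExtractor and groupsAtDepth are hypothetical names, and unbalanced input is tolerated rather than reported:

```java
import java.util.ArrayList;
import java.util.List;

class DepthExtractor {
    // Returns the contents of every parenthesized group that opens at the
    // given nesting depth (1 = outermost), including any deeper groups inside.
    static List<String> groupsAtDepth(String text, int targetDepth) {
        List<String> groups = new ArrayList<>();
        int depth = 0;
        int start = -1; // index just after the '(' that opened the current group
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
                if (depth == targetDepth) {
                    start = i + 1;
                }
            } else if (c == ')') {
                if (depth == targetDepth && start >= 0) {
                    groups.add(text.substring(start, i));
                    start = -1;
                }
                if (depth > 0) {
                    depth--; // ignore stray closing parentheses
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [a=2 and age IN (15,18,56), b=3 and c=4]
        System.out.println(groupsAtDepth(
                "((a=2 and age IN (15,18,56)) and (b=3 and c=4))", 2));
    }
}
```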

It's not clear what you want in terms of nested brackets (eg. ((a = 2 and b = 3)): is this valid or not?)
This regex gets you most of the way there:
(\(.*?\)+)
On the input you specified, it matches two groups:
((a=2 and age IN (15,18,56))
(b=3 and c=4)) (notice the double-bracket at the end).
It will return everything, including nested brackets. Another variation will return only singly-bracketed expressions:
(\([^(]*?\))
The easiest way to test this is through Rubular.
