Tokenizing a String in Java

I am currently working on a project to parse strings into individual tokens. This is all well and good, but I am stuck (obviously). I would like to test whether a string matches a keyword (e.g. true, false, class, inherits, etc.) and return that string as a token (please note I am not using the StringTokenizer that the Java API provides, IIRC).
Here is my code that pertains to the keywords:
if (isAlpha(ch)){
System.out.println("Current character:" + ch);
letters += ch;
charIndex--;
while(isDigit(nextChar()) || isAlpha(nextChar())){
Character c = nextChar();
System.out.println("C:" + c);
letters += c;
charIndex++;
System.out.println("After another character:" + letters);
}
if(letters == "class"){
token = new Token(token.CLASS, letters);
}
if(letters == "inherits"){
token = new Token(token.INHERIT, letters);
}
if(letters == "if"){
token = new Token(token.IF, letters);
} //etc etc
I have isolated the problem in my testing to strings that end in an alphabetic character (i.e. "funny"; this token/string would be an identifier). It keeps looping on the "y" and prints "Current character: y." It works with strings like "funny6" though. I may be missing the obvious (having to do with whitespace, perhaps), but any advice would be appreciated.
Thank you!
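A minimal sketch of the word-scanning step, under stated assumptions: the question does not show nextChar() or the Token class, so the names KeywordScanner, readWord and classify are hypothetical, and token types are plain strings here. It illustrates two things worth checking in the snippet above: the scan loop stops at the end of the input (a word such as "funny" has nothing after the final letter), and keywords are compared by content, since == on String objects compares references rather than characters.

```java
import java.util.HashMap;
import java.util.Map;

final class KeywordScanner {
    // Hypothetical keyword table; the real project would map to its own Token constants.
    private static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("class", "CLASS");
        KEYWORDS.put("inherits", "INHERIT");
        KEYWORDS.put("if", "IF");
    }

    // Reads letters and digits starting at `start`, stopping at the first
    // other character or at the end of the input, whichever comes first.
    static String readWord(String src, int start) {
        int end = start;
        while (end < src.length() && Character.isLetterOrDigit(src.charAt(end))) {
            end++;
        }
        return src.substring(start, end);
    }

    // Looks the word up by content (equals/hashCode), not by reference.
    static String classify(String word) {
        return KEYWORDS.getOrDefault(word, "IDENTIFIER");
    }
}
```

With this shape, readWord("funny", 0) ends cleanly at the end of the string, and classify decides afterwards whether the word is a keyword or an identifier.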

Related

How to process percentage with rhino script?

I am making a calculator app that builds a string and uses Rhino to evaluate it as an equation. The problem is that I can't find a way to handle percentages. It does not accept the % symbol.
I need to replace every instance of it with "?/100*?" inside the string, where the first ? is the number preceding the percentage and the second ? is the number used with the percentage.
Example: "5 + 3% + 2" --> "5 + 5/100*3 + 2".
The issue is that I can't know what kind of number to expect; it could even be something like (5+4)-(3+1)% or a long decimal. Since the percentage is inside a string, I do not use variables.
Here is the example below (the method that is invoked when the equals button is pressed):
btnEquals.setOnClickListener(v -> {
process = tvInput.getText().toString();
process = process.replaceAll("%", "?/100*?"); // this is the problem line
rhinoAndroidHelper = new RhinoAndroidHelper(this);
context = rhinoAndroidHelper.enterContext();
context.setOptimizationLevel(-1);
String finalResult = "";
try {
Scriptable scriptable = context.initStandardObjects();
finalResult = context.evaluateString(scriptable, process, "javascript", 1, null).toString();
} catch (Exception e) {
finalResult = "0";
}
tvOutput.setText(finalResult);
tvInput.setText("");
});
I am using this helper library: https://github.com/F43nd1r/rhino-android
I do not have experience working with Rhino, so I do not know if there is a simple solution. I think the only solution would be to build a complex string-parsing method that checks what precedes the percentage and reformats it. Is there any other way with Rhino?
Here is a string-formatting method I wrote. It can properly handle any equation that contains a single %:
private void format(String s){
int newIndex = s.indexOf("%");
int nextIndex = newIndex;
StringBuilder percentage = new StringBuilder();
StringBuilder value = new StringBuilder();
char character;
String result = "";
StringBuilder newProcess = new StringBuilder(s);
boolean done = true;
boolean firstPartDone = false;
boolean firstSymbol = false;
while(done){
while (!firstPartDone){
if(nextIndex == 0){
done = false;
}
nextIndex--;
character = s.charAt(nextIndex);
if(character == '+' | character == '/' | character == '*' | character == '-'){
firstPartDone = true;
}else{
percentage.insert(0, character);
}
}
if(nextIndex == 0){
done = false;
}
character = s.charAt(nextIndex);
if(character == '+' | character == '/' | character == '*' | character == '-'){
if(!firstSymbol){
// value.insert(0, character);
nextIndex--;
firstSymbol = true;
}else{
done = false;
}
}else{
value.insert(0, character);
nextIndex--;
}
}
// percentage.append("%");
System.out.println(percentage);
System.out.println(value);
result = value + "/100*" + percentage;
String percent = percentage.toString();
String percentToReplace = percent.concat("%");
String finalString = newProcess.toString();
finalString = finalString.replace(percentToReplace, result);
process = finalString;
}
The problem with the method above is that it still lacks support for detecting and handling brackets, as in (5+4)-(4+2)%. That I can write. But it makes errors when two % signs are present.
For instance: 5+10%-4+50% will become 5+ 5/100*10 - 4 + 4/100*50.
The above equation will end up producing bad results for some reason. What would be necessary is the inclusion of brackets, etc.
I just hoped that there was an easier way.
Rhino is a javascript engine without DOM support. If vanilla javascript can do it, Rhino can do it.
Since the meaning of % is context dependent, I don't think there's any easy way to use Rhino (vanilla javascript) to solve it. You need to parse it or find a library that will parse it. You can use javascript libraries with Rhino https://groups.google.com/g/mozilla.dev.tech.js-engine.rhino/c/fS8KQelY0bs
FYI - Baeldung says Rhino is obsolete and Nashorn is faster.
Original response:
Try using process = process.replaceAll("%", "/100");
replaceAll accepts a regex, and every time the regex is found it replaces the match with the replacement value. In your first example the string will be changed to 5+3?/100*?+2, which is not valid JavaScript.
/ has priority over + so it will adjust the values to percentages before adding the numbers.
Depending on how you want the calculator to work, this may have issues when multiple % are used. For example, 5+3%% or (1-3)+(1+2%)%. You need to decide if you want 3%% to mean 3/100/100 or not. If you want to reject %%, you could handle this in the input stage, by not letting %% be entered or after the input stage by parsing it to turn %% into %.
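As a rough illustration of the "/100" idea, here is a sketch that rewrites only a plain number followed by %, wrapping it in parentheses so the division cannot interact with neighbouring operators. The class and method names (PercentRewrite, rewrite) are made up for this example. Note the assumption: it implements "x% means x/100", not the "percentage of the preceding value" behaviour the question asks for, and it deliberately leaves bracketed operands such as (3+1)% untouched, because those need a real parser.

```java
import java.util.regex.Pattern;

final class PercentRewrite {
    // An integer or decimal literal immediately followed by '%'.
    private static final Pattern SIMPLE_PERCENT =
            Pattern.compile("(\\d+(?:\\.\\d+)?)%");

    // Rewrites e.g. "3%" to "(3/100)"; anything else is left as-is.
    static String rewrite(String expr) {
        return SIMPLE_PERCENT.matcher(expr).replaceAll("($1/100)");
    }
}
```

The rewritten string can then be handed to the script engine as before; an input that still contains % after rewriting is one this sketch could not handle.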

Why I cannot get the string without tokens with the program I have written?

Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
Queue q=new LinkedList();
for(int i=0;i<s.length();i++){
int x=(int)s.charAt(i);
if(x<65 || (x>90 && x<97) || x>122) {
q.add(s.charAt(i));
}
}
System.out.println(q.peek());
String redex="";
while(!q.isEmpty()) {
redex+=q.remove();
}
String[] x=s.split(redex,-1);
for(String y:x) {
if(y!=null)
System.out.println(y);
}
scan.close();
I am trying to print the string "my name is NLP and I, so, works:fine;"yes"." without tokens such as {[]}+-_)*&%$ but it just prints the whole string as it is, and I don't understand why.
This is three answers in one:
- For your initial problem
- For a solution without regex
- For a correct use of Scanner (this is up to you)
First
When you build a regex from arbitrary characters, you should quote it:
String[] x=s.split(Pattern.quote(redex),-1);
That would be the usual problem, but the second problem is that you are building a regex character class while omitting the [] that delimit it, so it can't work as is:
String[] x=s.split("[" + Pattern.quote(redex) + "]",-1);
This one may work, but may fail if Pattern.quote doesn't quote - and a - falls between two characters, forming a range such as $-!.
That would mean: the characters in the range from $ to !. It may fail if the range is invalid, and my example may well be invalid ($ may come after !).
Finally, you may use:
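// Assumes q is declared as Queue<Character>; Pattern.quote takes a String.
String redex = q.stream()
.map(c -> Pattern.quote(String.valueOf(c)))
.collect(Collectors.joining("|"));
This regexp should match any one of the unwanted characters.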
Second:
For the rest, the other answer points out another problem: you are not using the Character.isXXX methods to check for valid characters.
Firstly, be wary that some of these methods take code points rather than char. For example, isAlphabetic takes a code point. A code point is simply the numeric value of a Unicode character; some Unicode characters need two Java char values.
Secondly, I think your problem lies in the fact that you are not using the right tool to split your words.
In pseudocode, this should be:
List<String> words = new ArrayList<>();
int offset = 0;
for (int i = 0, n = line.length(); i < n; ++i) {
// if the character fails to match, we switched from word to non-word
if (!Character.isLetterOrDigit(line.charAt(i))) {
if (offset != i) {
words.add(line.substring(offset, i));
}
offset = i + 1; // next char
}
}
if (offset != line.length()) {
words.add(line.substring(offset));
}
This would:
- Find transition from word to non word and change offset (where we started)
- Add word to the list
- Add the last token as ending word.
Last
Alternatively, you may also play with Scanner class since it allows you to input a custom delimiter for its hasNext(): https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
I quote the class javadoc:
The scanner can also use delimiters other than whitespace. This
example reads several items in from a string:
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
As you guessed, you may pass on any delimiter and then use hasNext() and next() to get only valid words.
For example, using [^a-zA-Z0-9] would split on each non alpha/digit transition.
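A short sketch of that last suggestion, assuming the goal is simply to list the alphanumeric words of one line. The class and method names (DelimiterScan, words) are illustrative only; the delimiter adds a + to the pattern mentioned above so that a run of several unwanted characters counts as a single separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class DelimiterScan {
    // Splits a line into words, treating any run of non-alphanumeric
    // characters as one delimiter.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line).useDelimiter("[^a-zA-Z0-9]+")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }
}
```

For the sentence in the question, this yields the words my, name, is, NLP, and, I, so, works, fine, yes, with all punctuation and quotes dropped.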
As noted in the comment, the condition x<65 will catch all sorts of special characters you're not interested in. Using Character's built-in methods will help you write this condition in a clearer, bug-free way:
x = s.charAt(i);
if (Character.isLetter(x) || Character.isWhitespace(x)) {
q.add(x);
}

Match exactly the same words between 2 strings

I would like to compare and match exactly one word (characters and length) between two strings.
This is what I have:
String wordCompare = "eagle:1,3:7;6\nBasils,45673:ewwsk\nlola:flower:1:2:b";
String lolo = scanner.nextLine();
if ( wordCompare.toLowerCase().indexOf(lolo.toLowerCase()) != -1 ) {
System.out.println("Bingo !!!");
} else {
System.out.println("not found !!!");
}
If I type eagle:1,3:7;6 it should display Bingo !!!
If I type eagle:1,3 it still displays Bingo !!! which is wrong, it should display Not found.
If I type eagle:1,3:7;6 Basils,45673:ewwsk or eagle:1,3:7;6\nBasils,45673:ewwsk it should also display Not Found. Length of the typed word should be acknowledged between \n.
If I type Basils,45673:ewwsk, it displays bingo !!!
It looks like what you're wanting is an exact match, with the words being split by the newline character. With that assumption in mind, I would recommend splitting the string out into an array and then loading that into a HashSet like so:
boolean search(String wordDictionary, String search){
String[] options = wordDictionary.split("\n");
HashSet<String> searchSet = new HashSet<String>(Arrays.asList(options));
return searchSet.contains(search);
}
If the search function returns true, it has found the word you're searching for; if not, it hasn't.
Using it in your code will look something like this:
String wordCompare = "eagle:1,3:7;6\nBasils,45673:ewwsk\nlola:flower:1:2:b";
String lolo = scanner.nextLine();
if(search(wordCompare, lolo))
System.out.println("Bingo!!!");
else
System.out.println("Not found.");
(For the record, you'd probably be better off with clearer variable names.)
As @Grey has already mentioned in his answer, since you have a newline character (\n) between your phrases, you can split the String into a String array using the String.split() method and then compare the elements of that array for equality with what the user supplies.
The code below is just another example of how this can be done. It also provides the option to ignore letter case:
boolean ignoreCase = false;
String userString = "Basils,45673:ewwsk";
String isInString = "'" + userString + "' Was Not Found !!!";
String wordCompare = "eagle:1,3:7;6\nBasils,45673:ewwsk\nlola:flower:1:2:b";
String[] tmp = wordCompare.split("\n");
for (int i = 0; i < tmp.length; i++) {
// Ternary used for whether or not to ignore letter case.
if (!ignoreCase ? tmp[i].trim().equals(userString) :
tmp[i].trim().equalsIgnoreCase(userString)) {
isInString = "Bingo !!!";
break;
}
}
System.out.println(isInString);
Thank you,
The thing is, I am not allowed to use regular expressions or arrays.
So, based on your suggestions, I wrote this code:
motCompare.toLowerCase().indexOf(lolo.toLowerCase(), ' ' ) != -1 ||
motCompare.toLowerCase().lastIndexOf(lolo.toLowerCase(),' ' ) != -1)
as the condition for a do-while loop.
Could you please confirm whether it is correct?
Thank you.
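One point about the condition above: the second argument of indexOf(String, int) and lastIndexOf(String, int) is a start index, so passing ' ' is read as the number 32 rather than as a separator. Below is a sketch of one possible way to get an exact, whole-line match with neither regular expressions nor arrays. The class and method names (ExactLineMatch, containsLine) are made up for illustration, and the case-insensitive comparison is an assumption carried over from the toLowerCase() calls in the question.

```java
final class ExactLineMatch {
    // True only if `needle` equals one complete line of `haystack`
    // (lines are separated by '\n'), ignoring letter case.
    static boolean containsLine(String haystack, String needle) {
        if (needle.isEmpty() || needle.indexOf('\n') >= 0) {
            return false; // a typed entry must be exactly one non-empty line
        }
        String h = haystack.toLowerCase();
        String n = needle.toLowerCase();
        int from = 0;
        while (true) {
            int at = h.indexOf(n, from);
            if (at < 0) {
                return false;
            }
            int end = at + n.length();
            // The match must be bounded by the string edges or by newlines.
            boolean startsLine = at == 0 || h.charAt(at - 1) == '\n';
            boolean endsLine = end == h.length() || h.charAt(end) == '\n';
            if (startsLine && endsLine) {
                return true;
            }
            from = at + 1; // keep looking for a later occurrence
        }
    }
}
```

With the sample data, eagle:1,3:7;6 and Basils,45673:ewwsk are found, while eagle:1,3 and any input spanning two entries are rejected.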

Java efficiently replace unless matches complex regular expression

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?
I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
// Process the unprotected text between the previous match and this one.
appendInBetween(resultStr, s, currIndex, protectedM.start());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
// The text after the last protected match is unprotected too.
appendInBetween(resultStr, s, currIndex, s.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!
At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
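For reference, here is a self-contained sketch of the same appendReplacement pattern. The class and method names (PlaceholderRestore, restore) are hypothetical. Wrapping the placeholder in Pattern.quote is an addition not present in the snippet above; it only matters if the placeholder contains regex metacharacters. Appending the original text with a plain append, rather than passing it as the replacement string, means characters such as $ and \ in the protected text are copied literally.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderRestore {
    // Replaces the i-th occurrence of `placeholder` in `text` with originals.get(i).
    static String restore(String text, String placeholder, List<String> originals) {
        Matcher m = Pattern.compile(Pattern.quote(placeholder)).matcher(text);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (index < originals.size() && m.find()) {
            // Copies the text before the match and drops the placeholder itself.
            m.appendReplacement(out, "");
            // Literal append: no special handling of '$' or '\\'.
            out.append(originals.get(index));
            index++;
        }
        // Copies everything after the last replaced placeholder.
        m.appendTail(out);
        return out.toString();
    }
}
```

Each piece of the input is copied once into the buffer, which avoids the repeated string concatenation of the earlier placeholder-restoring loop.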
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.

Tokenizing an algebraic expression in string format

I'm trying to take a string that represents a full algebraic expression, such as x = 15 * 6 / 3, and tokenize it into its individual components. So the first would be x, then =, then 15, then *, 6, / and finally 3.
The problem I am having is actually parsing through the string and looking at the individual characters. I can't think of a way to do this without a massive number of if statements. Surely there has to be a better way than specifically defining each individual case and testing for it.
For each type of token, you'll want to figure out how to identify:
- when you're starting to read a particular token
- if you're continuing to read the same token, or if you've started a different one
Let's take your example: x=15*6/3. Let's assume that you cannot rely on there being spaces between the tokens. (If you could, it would be trivial: a new token starts whenever you reach a space.)
You can break down the character types into letters, digits, and symbols. Let's call the token types Variable, Operator, and Number.
A letter indicates a Variable token has started. It continues until you read a non-letter.
A symbol indicates the start of an Operator token. I only see single symbols, but you can have groups of symbols correspond to different Operator tokens.
A digit indicates the start of a Number token. (Let's assume integers for now.) The Number token continues until you read a non-digit.
Basically, that's how a simple symbolic parser works. Now, if you add in negative numbers (where the '-' symbol can have multiple meanings), or parentheses, or function names (like sin(x)) then things get more complicated, but it amounts to the same set of rules, now just with more choices.
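The rules above can be sketched in a few lines. This is only an illustration under the same simplifying assumptions as the answer (integers only, single-character operators, letters-only variable names); the class and method names (SimpleTokenizer, tokenize) are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleTokenizer {
    // The first character of a token decides its kind; the token then
    // continues for as long as the following characters are of that kind.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but is not one itself
                continue;
            }
            int start = i;
            if (Character.isLetter(c)) {
                // Variable: runs until the first non-letter.
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
            } else if (Character.isDigit(c)) {
                // Number: runs until the first non-digit.
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
            } else {
                // Operator: a single symbol.
                i++;
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }
}
```

Both "x=15*6/3" and "x = 15 * 6 / 3" produce the same seven tokens: x, =, 15, *, 6, /, 3.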
1. Create a regular expression for each possible element: integer, variable, operator, parentheses.
2. Combine them using the | regular expression operator into one big regular expression with capture groups to identify which one matched.
3. In a loop, match the head of the remaining string and break off the matched part as a token. The type of the token depends on which sub-expression matched, as described in 2.
or
use a lexer library, such as the one in ANTLR or JavaCC.
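A possible sketch of the regex approach, using Java's named capture groups to tell which alternative matched. The group names, the operator set, and the "TYPE:text" output format are all assumptions chosen for the example; the class name RegexTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTokenizer {
    // One alternative per token kind, each in its own named group.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<number>\\d+)|(?<variable>[A-Za-z]+)"
            + "|(?<operator>[-+*/=])|(?<paren>[()]))");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            // lookingAt() matches only at the head of the remaining input.
            if (!m.lookingAt()) {
                if (input.substring(pos).trim().isEmpty()) break; // trailing spaces
                throw new IllegalArgumentException("Unexpected input near index " + pos);
            }
            if (m.group("number") != null) out.add("NUMBER:" + m.group("number"));
            else if (m.group("variable") != null) out.add("VARIABLE:" + m.group("variable"));
            else if (m.group("operator") != null) out.add("OPERATOR:" + m.group("operator"));
            else out.add("PAREN:" + m.group("paren"));
            pos = m.end(); // break off the matched part
        }
        return out;
    }
}
```

Adding a new kind of token then means adding one more named alternative to the pattern and one more branch in the loop.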
This is from my early expression evaluator that takes an infix expression like yours and turns it into postfix to evaluate. There are methods that help the parser, but I think they're pretty self-documenting. Mine uses symbol tables to check tokens against. It also allows for user-defined symbols, nested assignments and other things you may not need or want. But it shows how I handled your issue without using niceties like regex, which would simplify this task tremendously. In addition, everything shown is my own implementation, stack and queue included. So if anything looks abnormal (unlike the standard Java implementations), that's because it is.
This section of code is important not to answer your immediate question but to show the necessary work to determine the type of token you're dealing with. In my case I had three different types of operators and two different types of operands. Based on either the known rules or rules I chose to enforce (when appropriate) it was easy to know when something was a number (starts with a number), variable/user symbol/math function (starts with a letter), or math operator (is: /,*,-,+) . Note that it only takes seeing the first char to know the correct extraction rules. From your example, if all your cases are as simple, you'd only have to handle two types, operator or operand. Nonetheless the same logic will apply.
protected Queue<Token> inToPostParse(String exp) {
// local vars
inputExp = exp;
offset = 0;
strLength = exp.length();
String tempHolder = "";
char c;
// the program runs in a loop so make sure you're dealing
// with an empty queue
q1.reset();
for (int i = offset; tempHolder != null && i < strLength; ++i) {
c = exp.charAt(i);
// Spaces are useless so skip them
if (c == ' ') { continue; }
// If c is a letter
if ((c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')) {
// Here we know it must be a user symbol possibly undefined
// at this point or an function like SIN, ABS, etc
// We extract, based on obvious rules, the op
tempHolder = extractPhrase(i); // Used to be append sequence
if (ut.isTrigOp(tempHolder) || ut.isAdditionalOp(tempHolder)) {
s1.push(new Operator(tempHolder, "Function"));
} else {
// If not some math function it is a user defined symbol
q1.insert(new Token(tempHolder, "User"));
}
i += tempHolder.length() - 1;
tempHolder = "";
// if c begins with a number
} else if (c >= '0' && c <= '9') {
try {
// Here we know that it must be a number
// so we extract until we reach a non number
tempHolder = extractNumber(i);
q1.insert(new Token(tempHolder, "Number"));
i += tempHolder.length() - 1;
tempHolder = "";
}
catch (NumberFormatException nfe) {
return null;
}
// if c is in the math symbol table
} else if (ut.isMathOp(String.valueOf(c))) {
String C = String.valueOf(c);
try {
// This is where the magic happens
// Here we determine the "intersection" of the
// current C and the top of the stack
// Based on the intersection we take action
// i.e., in math do you want to * or + first?
// Depending on the state you may have to move
// some tokens to the queue before pushing onto the stack
takeParseAction(C, ut.findIntersection
(C, s1.showTop().getSymbol()));
}
catch (NullPointerException npe) {
s1(C);
}
// it must be an invalid expression
} else {
return null;
}
}
u2();
s1.reset();
return q1;
}
Basically I have a stack (s1) and a queue (q1). All variables or numbers go into the queue. Any operators trig, math, parens, etc.. go on the stack. If the current token is to be put on the stack you have to check the state (top) to determine what parsing action to take (i.e., what to do based on math precedence). Sorry if this seems like useless information. I imagine if you're parsing a math expression it's because at some point you plan to evaluate it. IMHO, postfix is the easiest so I, regardless of input format, change it to post and evaluate with one method. If your O is different - do what you like.
Edit: Implementations
The extract phrase and number methods, which you may be most interested in, are as follows:
protected String extractPhrase(int it) {
String phrase = new String();
char c;
for ( ; it < inputExp.length(); ++it) {
c = inputExp.charAt(it);
if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9')) {
phrase += String.valueOf(c);
} else {
break;
}
}
return phrase;
}
protected String extractNumber(int it) throws NumberFormatException {
String number = new String();
int decimals = 0;
char c;
for ( ; it < strLength; ++it) {
c = inputExp.charAt(it);
if (c >= '0' && c <= '9') {
number += String.valueOf(c);
} else if (c == '.') {
++decimals;
if (decimals < 2) {
number += ".";
} else {
throw new NumberFormatException();
}
} else {
break;
}
}
return number;
}
Remember - By the time they enter these methods I've already been able to deduce what type it is. This allows you to avoid the seemingly endless while-if-else chain.
Are components always separated by a space character, as in your question? If so, use algebricExpression.split(" ") to get a String[] of components.
If no such restriction can be assumed, a possible solution is to iterate over the input and switch on the Character.getType() of the current character, something like this:
ArrayList<String> getExpressionComponents(String exp) {
ArrayList<String> components = new ArrayList<String>();
String current = "";
int currentSequenceType = Character.UNASSIGNED;
for (int i = 0 ; i < exp.length() ; i++) {
if (currentSequenceType != Character.getType(exp.charAt(i))) {
if (current.length() > 0) components.add(current);
current = "";
currentSequenceType = Character.getType(exp.charAt(i));
}
switch (Character.getType(exp.charAt(i))) {
case Character.DECIMAL_DIGIT_NUMBER:
case Character.MATH_SYMBOL:
case Character.START_PUNCTUATION:
case Character.END_PUNCTUATION:
case Character.LOWERCASE_LETTER:
case Character.UPPERCASE_LETTER:
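case Character.OTHER_PUNCTUATION: // '*' and '/' fall in this category
case Character.DASH_PUNCTUATION: // '-' falls in this category
// add other required types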
current = current.concat(new String(new char[] {exp.charAt(i)}));
currentSequenceType = Character.getType(exp.charAt(i));
break;
default:
current = "";
currentSequenceType = Character.UNASSIGNED;
break;
}
}
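// Flush the last pending component, which the loop itself never adds.
if (current.length() > 0) components.add(current);
return components;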
}
You can easily change the cases to meet other requirements, such as splitting non-digit characters into separate components.
