Converting a String representing a mathematical expression into an array - java

I want to convert a String such as 1+40.2+(2) into a String array [1, +, 40.2, +, (, 2, )] in order to use it as a parameter for a Shunting Yard algorithm in my Calculator class.
The input will be entered without spaces, so I can't just use input.split("\\s+"). I have come up with a long process involving ArrayLists, StringBuilders, and stacks, but I was wondering if there was an easier way to do this.
input.split("") won't work, since it would return [1, +, 4, 0, ., 2, +, (, 2, )]. This is actually the starting point of my current process, and I can post the pseudocode for it, if anyone is interested (although I'm having problems actually implementing my pseudocode).
Any advice or help is appreciated. Thanks!

I really like the first answer, but if you want to try using a regex as suggested in the second comment, here's one that matches each element of your equation one by one so you can append it to your list. Note that it assumes the string consists only of decimal numbers, operators, and parentheses.
[0-9\.]+|[+\-*/]|[()]
Note that in character classes, any character except ^-]\ is a literal so that's why the character classes look a bit funny. To construct the corresponding Java pattern, use
Pattern.compile("[0-9\\.]+|[+\\-*/]|[()]")
Example:
String s = "1+40.2+(2)";
Pattern p = Pattern.compile("[0-9\\.]+|[+\\-*/]|[()]");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Output:
1
+
40.2
+
(
2
)
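Since the question asks for a String array rather than printed output, the same loop can collect the matches. This is a minimal sketch; the class and method names (ExpressionTokenizer, tokenize) are made up for illustration. Note that find() silently skips any character the pattern does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the regex matches from the answer into an array.
final class ExpressionTokenizer {
    // Numbers (digits and dots), single operators, or single parentheses.
    private static final Pattern TOKEN = Pattern.compile("[0-9\\.]+|[+\\-*/]|[()]");

    // Returns the tokens in input order; unmatched characters are skipped.
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints "1 + 40.2 + ( 2 )"
        System.out.println(String.join(" ", tokenize("1+40.2+(2)")));
    }
}
```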

The replaceAll string method should be able to help you. Use this to surround the tokens you want to pull out with a special dividing character (I arbitrarily chose ':', but any character/string you're confident won't actually be in the input will work). Then you can split on that character.
String s = "1+40.2+(2)";
String dividingToken = ":";
String[] sSplit = s.replaceAll("\\+", dividingToken + "+" + dividingToken)
.replaceAll("\\(", dividingToken + "(" + dividingToken)
.replaceAll("\\)", dividingToken + ")" + dividingToken)
.split(dividingToken);
for(String str: sSplit){
System.out.println(str);
}
Output:
1
+
40.2
+
(
2
)
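You could easily loop .replaceAll over an array of tokens (["+", "-", "*", ...]) that you want to split up. Just remember to add "\\" before each one in replaceAll, because many of them have a special regex meaning, whereas you actually want to match a literal "+". Also note that two adjacent tokens, such as the +( here, leave an empty string between them in the split result, so skip empty entries.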
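The loop over tokens might look like the sketch below. It swaps the manual escaping for Pattern.quote and Matcher.quoteReplacement, which treat each token literally, and it drops the empty strings that appear when two tokens are adjacent. The names DelimiterSplitter and split are hypothetical, and the divider is assumed never to occur in the input.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: surrounds each token with a divider, then splits on it.
final class DelimiterSplitter {
    static String[] split(String input, String divider, String... tokens) {
        String marked = input;
        for (String token : tokens) {
            // Pattern.quote makes the token literal, so "+" or "(" need no manual escaping.
            marked = marked.replaceAll(Pattern.quote(token),
                    Matcher.quoteReplacement(divider + token + divider));
        }
        // Adjacent tokens such as "+(" leave an empty piece between them; filter those out.
        return Arrays.stream(marked.split(Pattern.quote(divider)))
                .filter(piece -> !piece.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] parts = split("1+40.2+(2)", ":", "+", "-", "*", "/", "(", ")");
        // prints "[1, +, 40.2, +, (, 2, )]"
        System.out.println(Arrays.toString(parts));
    }
}
```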


How to remove spaces from string only if it occurs once between two words but not if it occurs thrice?

I am a beginner working on a diff and regenerate algorithm but for Strings. I store the patch in a file. To regenerate the new string from old I use that file. Although the code works, I face a problem when using space.
I use replaceAll(" ", ""); for removing spaces. This is fine when the string is [char][space][char], but creates a problem when it is like [space][space][space]. Here, I want the space to be retained (only one).
I thought of doing replaceAll("   ", " "); (three spaces replaced by one). But this would leave the spaces in the [char][space][char] case. I am using a Scanner to scan through the string.
Is there a way to achieve this?
Input => Output
c => c
cc => cc
c[space]c => cc
c[space][space]c => This is not possible, since there will be padding of one space for each character
c[space][space][space]c => c c
We can also split the string wherever there is more than one whitespace character, then join the resulting array into a string using the Stream and Collectors API.
We would also remove the single spaces by using replaceAll() in a Stream#map operation:
String test = " this is a test of space in string ";
//using the pattern \\s{n,} for splitting at multi spaces
String[] arr = test.split("\\s{2,}");
String s = Arrays.stream(arr)
.map(str -> str.replaceAll(" ", ""))
.collect(Collectors.joining(" "));
System.out.println(s);
Output:
this isatestof spaceinstring
You could use lookarounds to do your replacement:
String newText = text
.replaceAll("(?<! ) (?! )", "")
.replaceAll(" +", " ");
The first replaceAll removes any space not surrounded by spaces; the second one replaces the remaining sequences of spaces by a single one.
Ideone example. Sequences of two or more spaces become a single space, and single spaces are removed.
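A small self-contained version of the two replacements, checked against the cases from the question. The class name SpaceCollapser and method name collapse are hypothetical.

```java
// Hypothetical wrapper around the two replaceAll calls from the answer.
final class SpaceCollapser {
    static String collapse(String text) {
        return text
                // Remove a space that has no space directly before or after it.
                .replaceAll("(?<! ) (?! )", "")
                // Shrink every remaining run of spaces to a single space.
                .replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("c c"));     // prints "cc"
        System.out.println(collapse("c   c"));   // prints "c c"
        System.out.println(collapse("a b   c")); // prints "ab c"
    }
}
```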
Lookarounds
A lookaround in the context of regular expressions is a collective term for lookbehinds and lookaheads. These are so-called zero-width assertions, that means they match a certain pattern, but do not actually consume characters. There are positive and negative lookarounds.
A short example: the pattern Ira(?!q) matches the substring Ira, but only if it's not followed by a q. So if the input string is Iraq, it won't match, but if the input string is Iran, then the match is Ira.
More info:
https://www.regular-expressions.info/lookaround.html
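The Ira(?!q) example can be checked directly with java.util.regex. This is only an illustrative sketch; LookaheadDemo and firstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of a negative lookahead: "Ira" not followed by "q".
final class LookaheadDemo {
    private static final Pattern IRA_NOT_Q = Pattern.compile("Ira(?!q)");

    // Returns the matched text, or null when the pattern does not occur.
    static String firstMatch(String input) {
        Matcher m = IRA_NOT_Q.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("Iran")); // prints "Ira" (the n is not consumed)
        System.out.println(firstMatch("Iraq")); // prints "null"
    }
}
```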
If you want to replace any group of whitespace characters by a single space you could use:
value.replaceAll("\\s+", " ")
I had to use two replacements:
String e = "a b   c"; // one space between a and b, three spaces between b and c
e = e.replaceAll("([A-Za-z])\\s([A-Za-z])", "$1$2");
e = e.replaceAll("   ", " "); // three spaces become one
System.out.println(e);
Which prints
ab c
The first one replaces any letter-space-letter combo with just the two letters, and then the second replaces any triple-space with a single space.
The first replacement uses backreferences. $1 refers to the part inside the first set of parentheses, which matches the first letter, and $2 refers to the part inside the second set of parentheses.
If you have leading/trailing spaces on the input, you can call trim() before doing the replacements.
e = e.trim();

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers with the same issue. Most likely I haven't found anything to solve my issue because I can't phrase it properly. I will edit the title, or close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that in the second part, we only have numbers and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and whitespaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the whitespace (which should now only be in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
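A runnable sketch of the trim-and-split approach, with a guard for inputs that do not have exactly two parts. The class name VersionSplitter, the method name parse, and the sample strings' exact whitespace are assumptions for illustration.

```java
// Hypothetical helper: trims the input and splits it on any run of whitespace.
final class VersionSplitter {
    // Returns {version, buildId}; rejects input that does not have exactly two parts.
    static String[] parse(String data) {
        String[] parts = data.trim().split("\\s+"); // \s also covers \t, \n and \r
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two whitespace-separated parts");
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] result = parse("v3.1.build.dev.12345.team  \n12345-12345-cici-12345  ");
        System.out.println("Version: " + result[0]);  // prints "Version: v3.1.build.dev.12345.team"
        System.out.println("Build-id: " + result[1]); // prints "Build-id: 12345-12345-cici-12345"
    }
}
```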
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character, then .+ greedily matches the rest of the line (any character except a line terminator) and backtracks until the remainder of the pattern can match, so group 1 can keep some trailing whitespace that you may want to trim. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified formatting. Note that this will still read in the dashes, making it easier for you to split the string afterwards to get the information you need. If you want, you could even make these their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
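In Java the pattern above could be applied as in the following sketch. Because .+ is greedy, group 1 may carry trailing whitespace when several whitespace characters separate the two parts, so this sketch trims it. GroupedVersionParser and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the two-group pattern from the answer.
final class GroupedVersionParser {
    private static final Pattern UNIT =
            Pattern.compile("(^v\\w.+)\\s+(\\d+-\\d+-\\w+-\\d+)\\s*");

    // Returns {version, buildId}, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = UNIT.matcher(input);
        if (!m.find()) {
            return null;
        }
        // trim() removes whitespace that the greedy .+ kept at the end of group 1.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] r = parse("v3.1.build.dev.12345.team\t\t 12345-12345-cici-12345");
        System.out.println(r[0] + " | " + r[1]);
    }
}
```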
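import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Other {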
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + means one or more occurrences (the same as {1,}), * means zero or more occurrences, and ? means 0 or 1 occurrences.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order from left to right represents group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can have both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
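The quantifier meanings in java.util.regex can be checked with String.matches, which requires the whole input to match. A minimal sketch; QuantifierDemo and whole are made-up names.

```java
// Hypothetical demo of the regex quantifiers *, +, ?, {1,} and {1,2}.
final class QuantifierDemo {
    // True only when the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("a*", ""));        // true:  * allows zero occurrences
        System.out.println(whole("a+", ""));        // false: + needs at least one
        System.out.println(whole("a{1,}", "aaa"));  // true:  {1,} behaves like +
        System.out.println(whole("a?", "aa"));      // false: ? allows at most one
        System.out.println(whole("a{1,2}", "aaa")); // false: {1,2} allows one or two
    }
}
```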
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
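Without that, your regex says: match a single whitespace character followed by one or more characters other than a line terminator. In samples 3 and 4 the version is followed by several spaces and then a \n, so the first match consists only of those spaces (the . cannot cross the \n), which becomes "" once the whitespace is stripped.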
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the Matcher class interprets the . character: the . DOES NOT match newlines, so it stops matching just before the \n. To change that you need to add the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
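The effect of Pattern.DOTALL on the dot can be seen in isolation with a tiny pattern. This sketch uses the made-up pattern a.b and hypothetical names DotAllDemo and matchesWhole.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the dot skips line terminators unless DOTALL is set.
final class DotAllDemo {
    // Compiles "a.b" with the given flags and tests it against the whole input.
    static boolean matchesWhole(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a-b", 0));               // true
        System.out.println(matchesWhole("a\nb", 0));              // false: . does not match \n
        System.out.println(matchesWhole("a\nb", Pattern.DOTALL)); // true
    }
}
```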
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Java split returns white spaces in result

I'm using the function "split" on this string:
p(80,2)
I would like to obtain just the two numbers, so this is what I do:
String[] split = msg.msgContent().split("[p(,)]")
The regex is correct (or at least, I think so) since it splits the two numbers and puts them in the vector "split", but it turns out that this vector has a length of 4, and the first two positions are occupied by white spaces.
In fact, if I print each vector position, this is the result:
Split:
80
2
I've tried adding \\s to the regex to match with white spaces, but since there are none in my string, it didn't work.
You don't need split here, just use a simple regex to extract the digits from your string:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(msg.msgContent());
while (m.find()) {
String number = m.group();
// add to array
}
Note that String#split takes a regex, and the character class you passed treats every p, (, comma, and ) as its own delimiter, so adjacent delimiters produce empty strings.
You might want to read the documentation of Pattern and Matcher for more information about the solution above.
split accepts a regular expression as parameter, and this is a character class: [p(,)].
Given that your code is splitting on all characters in the class:
"p(80,2)" will return an array {"", "", "80", "2"}
I know it is not very beautiful:
List<String> collect = Pattern.compile("[^\\d]+")
.splitAsStream(s)
.filter(part -> part.length() > 0)
.collect(Collectors.toList());
Since you're splitting on p and (, the first two characters of your string are resulting in splits. I would split on the comma after replacing the p, (, and ). Like this:
String x = "p(80,2)";
String [] y = x.replaceAll("[p()]", "").split(",");
Split is not really what you need here, but if you want to use it you can do something like this:
"p(80,2)".replace("p(", "").replace(")", "").split(",")
Results with
[80, 2]
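To see where the length of 4 in the question comes from, it helps to print the raw split result: the leading p and ( each end an empty piece, while the empty piece after the final ) is dropped because split discards trailing empty strings. A short sketch with hypothetical names (SplitEmptiesDemo, raw, numbers):

```java
import java.util.Arrays;

// Hypothetical demo of the empty strings produced by adjacent delimiters.
final class SplitEmptiesDemo {
    // The split from the question, unchanged.
    static String[] raw(String s) {
        return s.split("[p(,)]");
    }

    // Same split, with the empty pieces filtered out.
    static String[] numbers(String s) {
        return Arrays.stream(raw(s))
                .filter(part -> !part.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(raw("p(80,2)")));     // prints "[, , 80, 2]"
        System.out.println(Arrays.toString(numbers("p(80,2)"))); // prints "[80, 2]"
    }
}
```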

Java regexp: splitting on "/" that is not at the beginning of a string

I want to split a/bc/de/f as [a, bc, de, f] but /a/bc/de/f as [/a, bc, de, f].
Is there a way to split on / which is not at the beginning of the string? (I'm having a bad Regexp day.)
(?!^)/ seems to work:
public class Funclass{
public static void main(String [] args) {
String s = "/firstWithSlash/second/third/forth/fifth/";
String[] ss = s.split("(?!^)/");
for (String s_ : ss)
System.out.println(s_);
}
}
output:
/firstWithSlash
second
third
forth
fifth
As @user unknown commented, this seems to be a wrong expression, it should be (?<!^)/ to indicate negative lookbehind.
The simplest solution is probably just to split s.substring(1) on /, and then prepend s.charAt(0) to the first result.
Other than that, since the split regex is not anchored, it would be challenging to do. You'd want to split on "something that isn't the start of the line, followed by a slash" - i.e. [^^ ]/ - but this would mean that the character preceding the slash was stripped out too. In order to do this you'd need negative look-behind, but I don't think that syntax is supported in the String.split regexes.
Edit: According to the Pattern javadocs it seems that Java does support negative lookbehind, and the following regex may do the job:
s.split("(?<!^)/");
A quick test indicates that this does indeed do what you want.
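The quick test could look like this sketch, which runs the lookbehind split on both inputs from the question. SlashSplitDemo and splitKeepingLeadingSlash are made-up names.

```java
import java.util.Arrays;

// Hypothetical demo: split on "/" except when it is the first character.
final class SlashSplitDemo {
    static String[] splitKeepingLeadingSlash(String path) {
        // (?<!^) fails at index 0, so a leading slash is not treated as a delimiter.
        return path.split("(?<!^)/");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingLeadingSlash("a/bc/de/f")));  // prints "[a, bc, de, f]"
        System.out.println(Arrays.toString(splitKeepingLeadingSlash("/a/bc/de/f"))); // prints "[/a, bc, de, f]"
    }
}
```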
Couldn't you just add a check at the beginning to see if there's a slash in the beginning?
if( str.charAt(0) == '/' ) {
arr = str.substring(1).split( "/" );
arr[0] = "/"+arr[0];
} else
arr = str.split( "/" );
Or a little simpler:
arr = str.substring(1).split( "/" );
arr[0] = str.charAt(0) + arr[0];
If the first is a slash, it'll just slap on a slash at the beginning of the first token. If there's only one character in the first token (that doesn't begin with a slash), then the first array element is the empty string and it'll still work.

Need to split a string into two parts in java

I have a string which contains a contiguous chunk of digits and then a contiguous chunk of characters. I need to split them into two parts (one integer part, and one string).
I tried using String.split("\\D", 1), but it is eating up first character.
I checked all the String API and didn't find a suitable method.
Is there any method for doing this thing?
Use lookarounds: str.split("(?<=\\d)(?=\\D)")
String[] parts = "123XYZ".split("(?<=\\d)(?=\\D)");
System.out.println(parts[0] + "-" + parts[1]);
// prints "123-XYZ"
\d is the character class for digits; \D is its negation. So this zero-width assertion matches the position where the preceding character is a digit (?<=\d), and the following character is a non-digit (?=\D).
References
regular-expressions.info/Lookarounds and Character Class
Related questions
Java split is eating my characters.
Is there a way to split strings with String.split() and include the delimiters?
Alternate solution using limited split
The following also works:
String[] parts = "123XYZ".split("(?=\\D)", 2);
System.out.println(parts[0] + "-" + parts[1]);
This splits just before we see a non-digit. This is much closer to your original solution, except that since it doesn't actually match the non-digit character, it doesn't "eat it up". Also, it uses limit of 2, which is really what you want here.
API links
String.split(String regex, int limit)
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
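The effect of the limit can be seen by running the same lookahead split with and without it. A minimal sketch; SplitLimitDemo and splitBeforeNonDigits are hypothetical names, and a limit of 0 behaves like the one-argument split.

```java
import java.util.Arrays;

// Hypothetical demo of String.split(regex, limit) with a zero-width lookahead.
final class SplitLimitDemo {
    // Splits before every non-digit, applying the pattern at most limit - 1 times
    // when limit is positive.
    static String[] splitBeforeNonDigits(String input, int limit) {
        return input.split("(?=\\D)", limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBeforeNonDigits("123XYZ", 2))); // prints "[123, XYZ]"
        System.out.println(Arrays.toString(splitBeforeNonDigits("123XYZ", 0))); // prints "[123, X, Y, Z]"
    }
}
```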
There's always an old-fashioned way:
private String[] split(String in) {
int indexOfFirstChar = 0;
for (char c : in.toCharArray()) {
if (Character.isDigit(c)) {
indexOfFirstChar++;
} else {
break;
}
}
return new String[]{in.substring(0,indexOfFirstChar), in.substring(indexOfFirstChar)};
}
(hope it works with digit-only or char-only Strings too - can't test it here - if not, take it as a general idea)
