Want to string split specifically in Java - java

I want to split the following String
String ToSplit = "(2*(sqrt 9))/5";
into the following array of String:
String[] Splitted = {"(", "2", "*", "(", "sqrt", "9", ")", ")", "/", "5"};
As you can see, the tokens in ToSplit are not delimited by spaces, and I am having a hard time separating the word "sqrt" from the rest of the elements because it is a whole word. I am doing:
String[] Splitted = ToSplit.split("");
and the word "sqrt" is split into {"s", "q", "r", "t"} (obviously), but I want it kept as a whole word so that the result is the array shown above.
How can I separate the word "sqrt" (as one element) from the others?
Thanks in advance.

Here is a working solution which splits on lookarounds. See below the code for an explanation.
String input = "(2*(sqrt 9))/5";
String[] parts = input.split("(?<=[^\\w\\s])(?=\\w)|(?<=\\w)(?=[^\\w\\s])|(?<=[^\\w\\s])(?=[^\\w\\s])|\\s+");
for (String part : parts) {
System.out.println(part);
}
(
2
*
(
sqrt
9
)
)
/
5
There are four terms in the regex alternation, and here is what each one does:
(?<=[^\\w\\s])(?=\\w)
split if what precedes is neither a word character nor whitespace, AND
what follows is a word character
e.g. split (2 into ( and 2
(?<=\\w)(?=[^\\w\\s])
split if what precedes is a word character AND
what follows is neither a word character nor whitespace
e.g. split 9) into 9 and )
(?<=[^\\w\\s])(?=[^\\w\\s])
split between two non word/whitespace characters
e.g. split )/ into ) and /
\\s+
finally, also split and consume any amount of whitespace
as a separator between terms
e.g. sqrt 9 becomes sqrt and 9
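As an alternative to splitting on lookarounds, the same tokens can be collected by matching them directly. This is only a minimal sketch and not part of the answer above: the class name ExpressionTokenizer and the method tokenize are made up for illustration, and it assumes every token is either a run of word characters or a single character that is neither a word character nor whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class ExpressionTokenizer {
    // A token is a run of word characters (e.g. "sqrt", "9")
    // or one character that is neither word nor whitespace (e.g. "(").
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(2*(sqrt 9))/5"));
        // prints [(, 2, *, (, sqrt, 9, ), ), /, 5]
    }
}
```

Matching tokens instead of split positions keeps the pattern short, at the cost of silently skipping whitespace rather than treating it as an explicit separator.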

Related

How do I skip splitting when white space occurs?

I want to split using ";" as delimiter and put outcome into the list of strings, for example
Input:
sentence;sentence;sentence
should produce:
[sentence, sentence, sentence]
Problem is some strings are like this:
"sentence; continuation;new sentence", and for such I'd like the outcome to be: [sentence; continuation, new sentence].
I'd like to skip splitting when there is whitespace after (or before) semicolon.
Example string I'd like to split:
String sentence = "Ogłoszenie o zamówieniu;2022/BZP 00065216/01;\"Dostawa pojemników na odpady segregowane (900 sztuk o pojemności 240 l – kolor żółty; 30 sztuk o pojemności 1100 l – kolor żółty).\";Zakład Wodociągów i Usług Komunalnych EKOWOD Spółka z ograniczoną odpowiedzialnością";
I tried:
String[] splitted = sentence.split(";\\S");
But this cuts off the first character of each sentence.
You can use a regex negative lookahead/lookbehind for this.
String testString = "hello;world; test1 ;test2";
String[] splitString = testString.split("(?<! );(?! )"); // Negative lookahead and lookbehind
for (String s : splitString) System.out.println(s);
Output:
hello
world; test1 ;test2
Here, the negative lookbehind (?<! ) at the start of the regex and the negative lookahead (?! ) at the end are saying "only split on the semicolon if there is no space immediately before or after it".
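Applied to the shape of input from the question, the same regex can be wrapped in a small helper that returns a list. This is a sketch under assumptions: the class name SemicolonSplitter and the method splitOutsideSpaces are hypothetical, and only the plain space character counts as "whitespace", exactly as in the regex above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class SemicolonSplitter {
    // Split on ';' only when there is no space directly before or after it.
    private static final Pattern DELIMITER = Pattern.compile("(?<! );(?! )");

    static List<String> splitOutsideSpaces(String input) {
        return Arrays.asList(DELIMITER.split(input));
    }

    public static void main(String[] args) {
        System.out.println(splitOutsideSpaces("sentence; continuation;new sentence"));
        // prints [sentence; continuation, new sentence]
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which String.split would otherwise do for a regex like this one.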

Java: Substring to arrray of Strings

I have this String:
Java 2 5 22 8
I want an Array of Strings with these values:
["2"; "5"; "22"; "8"]
Is there a way to use subString() method in order to do so, or should I take another approach?
Regardless of how many spaces there are:
String str = "Java 5 22 8";
String[] arr =
str
.replaceAll("( )+"," ")
.replaceFirst("Java ", "")
.split(" ");
for (String a : arr) {
System.out.println(a);
}
After searching a bit on split and regex, I found the answer: line.split("[ ]{2,}")
You seem to be uninterested in the "Java" prefix, so you can call substring and trim (to remove the spaces around the 4 numbers):
line.substring(4).trim() // equals "2 5 22 8"
You can then split the line by one or more whitespace characters. The Java regex to do this is "\\s+", since \\s matches a whitespace character, and + means "one or more".
line.substring(4).trim().split("\\s+") // equals ["2", "5", "22", "8"]
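Put together as a runnable program, the approach might look like the sketch below. The class name PrefixSplit and the method numbersAfterPrefix are made up for illustration, and the code assumes the line always starts with the four-character prefix "Java".

```java
import java.util.Arrays;

// Hypothetical helper, not from the original answer.
class PrefixSplit {
    private static final String PREFIX = "Java"; // assumed to always be present

    static String[] numbersAfterPrefix(String line) {
        // Drop the prefix, trim surrounding spaces, then split on runs of whitespace.
        return line.substring(PREFIX.length()).trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "Java   2  5 22   8";
        System.out.println(Arrays.toString(numbersAfterPrefix(line)));
        // prints [2, 5, 22, 8]
    }
}
```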
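If you want to use substring, here is a (probably non-optimal) solution. Start from the beginning of the string and look for the first digit, then advance to the next character that is not a digit. Take the substring between the two positions, then continue looking for the next number.
String[] strings = new String[s.length()]; // upper bound; the first p entries hold the numbers
int i = 0, p = 0; // i: scan position, p: next free slot in strings
while (i < s.length()) {
    char c = s.charAt(i);
    if (c >= '0' && c <= '9') {
        // advance j to the first non-digit (or the end of the string)
        int j = i;
        while (j < s.length() && s.charAt(j) >= '0' && s.charAt(j) <= '9') {
            j++;
        }
        strings[p++] = s.substring(i, j);
        i = j;
    } else {
        i++;
    }
}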
I can't tell if these are separated by a space or not, but if so you can use the .split() method on your String.
Example:
String sampleString = "2 5 22 8";
String[] stringArray = sampleString.split(" ");

How to split a String sentence into words using split method in Java? [duplicate]

This question already has answers here:
How to split a string with any whitespace chars as delimiters
(13 answers)
Closed 5 years ago.
I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output:
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked the following code block and confirmed that it works with newlines too.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
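One caveat with split("\\s+"): if the text begins with whitespace, the resulting array starts with an empty string. A minimal sketch that trims first is shown here; the class name WhitespaceSplit and the method words are hypothetical and not part of the answer above.

```java
import java.util.Arrays;

// Hypothetical helper, not from the original answer.
class WhitespaceSplit {
    static String[] words(String text) {
        // trim() removes leading/trailing whitespace so no empty first element appears
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "  Upper sentence.\nLower sentence. And some text.";
        // Without trim(), the first element is the empty string.
        System.out.println(text.split("\\s+")[0].isEmpty()); // prints true
        System.out.println(Arrays.toString(words(text)));
        // prints [Upper, sentence., Lower, sentence., And, some, text.]
    }
}
```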
Replace dots, commas, etc. with a space, then split the result on whitespace:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters".
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// one or more spaces OR one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006

Splitting String by multiple symbols,regex

Splitting with a regex on multiple symbols:
String s = "He is a very very good boy, isn't he?";
String[] sa = s.split("[!, ?._'#]");
System.out.println(sa.length);
for (String string : sa) {
System.out.println(string);
}
11
He
is
a
very
very
good
boy
isn
t
he
while using
String[] sa = s.split("[!, ?._'#]+");
10
He
is
a
very
very
good
boy
isn
t
he
The + in a regex means "one or more", but where does the extra element in the first result come from?
This happens because the split function is creating an array element containing an empty string between the comma , and the space after boy.
arr = ['He', 'is', 'a', 'very', 'very', 'good', 'boy', '', 'isn', 't', 'he']
The function sees two consecutive delimiters (the comma and the space) and treats the nothing between them as a piece of text, which generates that empty string.
When you use the + symbol, you split by "groups" of characters, and it takes the comma and space as the splitting regular expression, not generating that empty string between those characters.
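The difference is easier to see when both arrays are printed with Arrays.toString. The sketch below uses the string from the question; the class name SplitPlusDemo and the two method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original post.
class SplitPlusDemo {
    // Every single delimiter character is a separate split point.
    static String[] splitEach(String s) {
        return s.split("[!, ?._'#]");
    }

    // A whole run of delimiter characters counts as one split point.
    static String[] splitRuns(String s) {
        return s.split("[!, ?._'#]+");
    }

    public static void main(String[] args) {
        String s = "He is a very very good boy, isn't he?";
        System.out.println(Arrays.toString(splitEach(s)));
        // prints [He, is, a, very, very, good, boy, , isn, t, he]
        System.out.println(Arrays.toString(splitRuns(s)));
        // prints [He, is, a, very, very, good, boy, isn, t, he]
    }
}
```

The trailing "?" produces no extra element in either case because split drops trailing empty strings by default.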

Splitting a string using RegEx matches instead of delimiters

I want to split a string like this: "1.2 5" to be tokenized to {"1", ".", "2", "5"} (order matters), I was trying to do this with String.split() using the following regex: ([0-9])\w*|\. but this is what I want to match, not the delimiters.
Is there maybe another method that does this? Is it even possible to split two words that are connected while keeping both intact? (e.g split "1.2" like the above example)
More examples:
"1 2 8" => {"1", "2", "8"}
"1 122 .8" => {"1", "122", ".", "8"}
"1 2.800" => {"1", "2", ".", "800"}
This regex should work:
s.split("(?=\\.)(?<! )|(?<=\\.)| +")
It works by splitting at places in the string where:
the next character is a literal . (lookahead) and the preceding character is not a space (negative lookbehind)
the preceding character is a literal . (lookbehind)
there are one or more space characters
The Java split function removes any matching part of the string. The lookahead/lookbehind matches are zero-width, so split doesn't actually consume any of the string when splitting on them. A zero-width match basically just marks a position in the string to split at.
This solution works for all your given examples, and it also works for multiple spaces.
In response to your comment about the (?<! ) part of the regex: without that part, the pattern matches every space character, the position before every ., and the position after every .. One of your examples had a space followed by a . (e.g. "2 .8"), which would split like this:
["2", "", ".", "8"]
Note the empty string in the 2nd position. This is because it has split on the space, then found a position before a . and split there too. The (?<! ) prevents this by saying "only split before a . if it's not preceded by a space character".
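To check the regex against the examples from the question, a small test program along these lines can be used. It is only a sketch: the class name DotSplitDemo and the method tokenize are hypothetical, and the regex is the one given above, unchanged.

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original answer.
class DotSplitDemo {
    static String[] tokenize(String s) {
        // Split before a '.' not preceded by a space, after a '.', or on spaces.
        return s.split("(?=\\.)(?<! )|(?<=\\.)| +");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("1.2 5")));    // prints [1, ., 2, 5]
        System.out.println(Arrays.toString(tokenize("1 122 .8"))); // prints [1, 122, ., 8]
        System.out.println(Arrays.toString(tokenize("1 2.800")));  // prints [1, 2, ., 800]
    }
}
```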
You don't need regex matching; Java has a built-in StringTokenizer that is meant for exactly this.
Try this:
StringTokenizer st = new StringTokenizer("1.2 5", ". ");
while(st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
Output:
1
2
5
EDIT: if you want to include the delimiters, use new StringTokenizer(string, delimiters, true), where the third argument is returnDelims. In that case, the output is:
1
.
2
5
If you just want to return the dot, but not the space, skip it in the loop.
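Skipping the space in the loop could look like the sketch below. The class name DotTokenizer and the method tokens are made up for illustration; the three-argument StringTokenizer constructor with returnDelims set to true is the one described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not from the original answer.
class DotTokenizer {
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        // returnDelims = true: "." and " " are returned as tokens too
        StringTokenizer st = new StringTokenizer(input, ". ", true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.equals(" ")) { // keep the dot, drop the space
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("1.2 5")); // prints [1, ., 2, 5]
    }
}
```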
I'd rather collect all the non-digit and non-whitespace symbols with [^\d\s] and digits with a \d:
String s = "1.2 5";
Pattern pattern = Pattern.compile("\\d+|[^\\d\\s]+");
Matcher matcher = pattern.matcher(s);
List<String> lst = new ArrayList<>();
while (matcher.find()){
lst.add(matcher.group(0));
}
System.out.println(lst); // => [1, ., 2, 5]
Pattern details:
\d+ - 1 or more digits
| - or
[^\d\s]+ - one or more chars other than a whitespace or digit
