How do I skip splitting when white space occurs? - java

I want to split using ";" as delimiter and put outcome into the list of strings, for example
Input:
sentence;sentence;sentence
should produce:
[sentence, sentence, sentence]
The problem is that some strings look like this:
"sentence; continuation;new sentence", and for these I'd like the outcome to be: [sentence; continuation, new sentence].
I'd like to skip splitting when there is whitespace after (or before) the semicolon.
Example string I'd like to split:
String sentence = "Ogłoszenie o zamówieniu;2022/BZP 00065216/01;\"Dostawa pojemników na odpady segregowane (900 sztuk o pojemności 240 l – kolor żółty; 30 sztuk o pojemności 1100 l – kolor żółty).\";Zakład Wodociągów i Usług Komunalnych EKOWOD Spółka z ograniczoną odpowiedzialnością";
I tried:
String[] splitted = sentence.split(";\\S");
But this cuts off the first character of each sentence.

You can use a regex negative lookahead/lookbehind for this.
String testString = "hello;world; test1 ;test2";
String[] splitString = testString.split("(?<! );(?! )"); // Negative lookahead and lookbehind
for (String s : splitString) System.out.println(s);
Output:
hello
world; test1 ;test2
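Here, the negative lookbehind (?<! ) at the start of the regex and the negative lookahead (?! ) at the end say "only split on the semicolon if there is no space directly before or after it". Lookarounds are zero-width, so only the semicolon itself is consumed and no characters of the neighboring sentences are cut off.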
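As a minimal sketch of the same idea applied to the inputs from the question, the version below uses \\s instead of a literal space so that tabs and other whitespace also block the split. The class and method names (SemicolonSplit, splitOutsideWhitespace) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class SemicolonSplit {
    // Split on ';' only when no whitespace character touches it on either side.
    // (?<!\s) is a negative lookbehind, (?!\s) a negative lookahead.
    static List<String> splitOutsideWhitespace(String input) {
        return Arrays.asList(input.split("(?<!\\s);(?!\\s)"));
    }

    public static void main(String[] args) {
        // prints [sentence, sentence, sentence]
        System.out.println(splitOutsideWhitespace("sentence;sentence;sentence"));
        // prints [sentence; continuation, new sentence]
        System.out.println(splitOutsideWhitespace("sentence; continuation;new sentence"));
    }
}
```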

Related

How to split a sentence into words and punctuations in java

I want to split a given sentence of type string into words and I also want punctuation to be added to the list.
For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbor, .]
With string.split(" ") I can split the sentence into words by space, but I want the punctuation to be in the result list as well.
String text = "Sara's dog 'bit' the neighbor.";
String[] list = text.split(" ");
The printed result is [Sara's, dog, 'bit', the, neighbor.]
I don't know how to combine another regex with the above split method so that punctuation is separated as well.
Some of the references I have already tried, but which didn't work out:
1.Splitting strings through regular expressions by punctuation and whitespace etc in java
2.How to split sentence to words and punctuation using split or matcher?
Example input and outputs
String input1="Holy cow! screamed Jane."
String[] output1 = [Holy,cow,!,screamed,Jane,.]
String input2="Select your 'pizza' topping {pepper and tomato} follow me."
String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]
Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.
Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
In Java 8 or earlier, you would write it like this:
List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
parts.add(m.group());
}
Explanation
\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".
\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.
\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.
That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
Test
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }
    private static void test(String s) {
        // A "word" with optional embedded punctuation, or a single punctuation/symbol character
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}
Output
[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
Arrays.stream( s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))") )
.filter(ss -> !ss.trim().isEmpty())
.collect(Collectors.toList())
Reference:
How to split a string, but also keep the delimiters?
Regular Expressions on Punctuation
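// Requires java.util.ArrayList and java.util.List
List<String> chars = new ArrayList<>();
String str = "Hello my name is bob";
StringBuilder tempStr = new StringBuilder();
for (char cha : str.toCharArray()) {
    if (cha == ' ') {
        // A space ends the current word
        if (tempStr.length() > 0) {
            chars.add(tempStr.toString());
            tempStr.setLength(0);
        }
    }
    // Add whatever punctuation you need to this check
    else if (cha == '!' || cha == '.') {
        // Punctuation also ends the current word and becomes its own entry
        if (tempStr.length() > 0) {
            chars.add(tempStr.toString());
            tempStr.setLength(0);
        }
        chars.add(String.valueOf(cha));
    }
    else {
        tempStr.append(cha);
    }
}
// The last word has no trailing space, so add it after the loop
if (tempStr.length() > 0) {
    chars.add(tempStr.toString());
}
Something like that?
It adds every single word, assuming the words in the sentence are separated by spaces. For !'s and .'s you have to add a check as well, as shown. Quite simple.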

How can I avoid splitting on a comma in brackets?

I have a string below which I want to split in String array with multiple delimiters.
The delimiters are comma (,), semicolon (;), "OR" and "AND".
But I do not want to split on a comma if it's in brackets.
Example input:
device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)
I am able to split the String with regex, but it doesn't handle commas in brackets correctly.
How can I fix this?
Pattern ptn = Pattern.compile("(,|;|OR|AND)");
String[] parts = ptn.split(query);
for(String p:parts){
System.out.println(p);
queryParams.add(p.trim());
}
You could use a negative look-ahead:
String[] parts = input.split(",(?![^()]*\\))|;| OR | AND ");
Or an uglier (but perhaps conceptually simpler) way you could do it would be to replace any commas within brackets with a temporary placeholder, then do the split and replace the placeholders with real commas in the results.
String input = "X,Y=((A,B),C) OR Z";
Pattern pattern = Pattern.compile("\\(.*\\)");
Matcher matcher = pattern.matcher(input);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, matcher.group().replaceAll(",", "_COMMA_"));
}
matcher.appendTail(sb);
String[] parts = sb.toString().split("(,|;| OR | AND )");
for (String part : parts) {
System.out.println(part.replace("_COMMA_", ","));
}
Prints:
X
Y=((A,B),C)
Z
Alternatively, you could write your own little tokenizer that reads the input character-by-character using charAt(index) or define a grammar for an off-the-shelf parser.
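The character-by-character tokenizer mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name BracketAwareSplitter is made up, it handles only the "," and ";" delimiters (not "OR" and "AND"), and it assumes the brackets in the input are balanced. Unlike the look-ahead regex, it copes with nested brackets because it tracks the nesting depth.

```java
import java.util.ArrayList;
import java.util.List;

class BracketAwareSplitter {
    // Splits on ',' and ';' only when they are outside all parentheses.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open '('
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            }
            if ((c == ',' || c == ';') && depth == 0) {
                // Delimiter at top level: close the current part
                parts.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Whatever remains after the last delimiter is the final part
        parts.add(current.toString().trim());
        return parts;
    }

    public static void main(String[] args) {
        // prints [X, Y=((A,B),C), Z]
        System.out.println(split("X,Y=((A,B),C);Z"));
    }
}
```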
You can use negative look-ahead (?!...), which looks at the following characters, and if those characters match the pattern in brackets, the overall match will fail.
String query = "device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)";
String[] parts = query.split("\\s*(,(?![^()]*\\))|;|OR|AND)\\s*");
for(String part: parts)
System.out.println(part);
Output:
device_name==device503
device_type!=GATEWAY
site_name<site3434
country==India
location==BLR
new_name=in=(Rajesh,Suresh)
So in this case we check whether the characters following the , are 0 or more characters which aren't either ( or ), followed by a ), and if this is true, the , match fails.
This won't work if you can have nested brackets.
Note:
String also has a split method (as used above), which is useful for simplicity's sake (but would be slower than reusing the same Pattern over and over again for multiple Strings).
You can add \\s* (0 or more whitespace characters) to your regex to remove any spaces before or after a delimiter.
If you're using | without anything before or after (e.g. "a|b|c"), you don't need to put it in brackets.
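To illustrate the first note, here is a small sketch that compiles the delimiter pattern once and reuses it for several strings. The class name QuerySplitter is hypothetical, and the \\b word boundaries around OR and AND are an addition that is not in the answer above; they are there so that a value such as ORDER is not split in the middle.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuerySplitter {
    // Compiled once and reused for every query string.
    // \bOR\b and \bAND\b (an assumption, see above) match only whole words.
    private static final Pattern DELIMITER =
            Pattern.compile("\\s*(?:,(?![^()]*\\))|;|\\bOR\\b|\\bAND\\b)\\s*");

    static String[] split(String query) {
        return DELIMITER.split(query);
    }

    public static void main(String[] args) {
        String[] queries = {
            "device_name==device503,device_type!=GATEWAY;site_name<site3434",
            "a==1 OR b=in=(x,y),c==ORDER"
        };
        for (String query : queries) {
            System.out.println(Arrays.toString(split(query)));
        }
    }
}
```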

How to split a String sentence into words using split method in Java? [duplicate]

I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output.
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked the following code block and confirmed that it works with newlines too.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
Replace dots, commas, etc. with a space, then split the result on whitespace:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
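The expression \\s+ means "1 or more whitespace characters", whereas \\s* means "zero or more whitespace characters". Be careful with the latter: \\s* also matches the empty string, so splitting on it chops the text into individual characters rather than words. For splitting into words, \\s+ is the one to use.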
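To see how the two quantifiers behave with split, the following sketch runs both on the same text. The class and method names (QuantifierDemo, splitOn) are made up for illustration; the point is only that \\s+ yields the words, while \\s* can match between any two characters and therefore produces many more pieces.

```java
import java.util.Arrays;

class QuantifierDemo {
    // Thin wrapper so both regexes go through the same call.
    static String[] splitOn(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "Upper sentence.\nLower sentence.";
        // One or more whitespace characters: splits into words
        System.out.println(Arrays.toString(splitOn(text, "\\s+")));
        // Zero or more whitespace characters: also matches between letters
        System.out.println(Arrays.toString(splitOn(text, "\\s*")));
    }
}
```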
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// [ ]+ matches one or more spaces, \n+ matches one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006

Splitting a string on whitespaces

I'm currently trying to split a string into a multi-line string.
The regex should select whitespace that has at least 13 characters before it.
The problem is that the 13-character count does not reset after the previously selected whitespace. So, after the first 13 characters, the regex selects every whitespace.
I'm using the following regex with a positive look-behind of 13 characters:
(?<=.{13})
(there is a whitespace at the end)
You can test the regex with the following code:
import java.util.ArrayList;
public class HelloWorld{
public static void main(String []args){
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
for (String string : str.split("(?<=.{13}) ")) {
System.out.println(string);
}
}
}
The output of this code is as follows:
This is a test.
The
app
should
break
this
string
in
substring
on
whitespaces
after
13
characters
But it should be:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You may actually use a lazy limiting quantifier to match the lines and then replace with $0\n:
.{13,}?[ ]
IDEONE demo:
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
System.out.println(str.replaceAll(".{13,}?[ ]", "$0\n"));
Note that the pattern matches:
.{13,}? - any character that is not a newline (if you need to match any character, use DOTALL modifier, though I doubt you need it in the current scenario), 13 times at least, and it can match more characters but up to the first space encountered
[ ] - a literal space (a character class is redundant, but it helps visualize the pattern).
The replacement pattern - "$0\n" - is re-inserting the whole matched value (it is stored in Group 0) and appends a newline after it.
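The replaceAll approach can be wrapped in a small helper that returns the lines as a list. This is only a sketch: the names LineWrapper, wrap and minWidth are invented here, and the trailing space that the regex leaves at the end of each line is trimmed off.

```java
import java.util.ArrayList;
import java.util.List;

class LineWrapper {
    // Inserts a newline after the first space that follows at least
    // minWidth characters, then returns the resulting lines.
    static List<String> wrap(String text, int minWidth) {
        String marked = text.replaceAll(".{" + minWidth + ",}?[ ]", "$0\n");
        List<String> lines = new ArrayList<>();
        for (String line : marked.split("\n")) {
            lines.add(line.trim()); // drop the space kept before each newline
        }
        return lines;
    }

    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        for (String line : wrap(str, 13)) {
            System.out.println(line);
        }
    }
}
```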
You can match and capture at least 13 characters before the whitespace rather than splitting.
Java code:
// Lazily capture 13 or more characters, up to the next run of spaces or the end of the input
Pattern p = Pattern.compile( "(.{13,}?)(?: +|$)" );
Matcher m = p.matcher( text );
List<String> matches = new ArrayList<>();
while(m.find()) {
matches.add(m.group(1));
}
It will produce:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You can do this with split and a regular expression, like this:
line.split("\\s+");
This splits the line into words wherever there is a run of one or more whitespace characters.

Regex add space between all punctuation

I need to add spaces between all punctuation in a string.
// "Hello: World." -> "Hello : World ."
// "It's 9:00?" -> "It ' s 9 : 00 ?"
// "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-zA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However, this doesn't feel like the right way to go... I think I'm overcomplicating things...
You can use a lookaround-based regex with the punctuation property \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) asserts that the previous char is not whitespace
(?<=\\p{Punct}) asserts that the previous char is a punctuation char
(?=\\p{Punct}) asserts that the next char is a punctuation char
(?=\\S) asserts that the next char is not whitespace
When you see a punctuation mark, you have four possibilities:
1. Punctuation is surrounded by spaces
2. Punctuation is preceded by a space
3. Punctuation is followed by a space
4. Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions, one handling case 2 and one handling case 3. Since the expressions are applied one after the other, together they take care of case 4 as well. Case 1 requires no change.
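As a quick check, the sketch below runs both approaches from this thread (the single lookaround regex and the two chained replacements) on the three examples from the question. The class and method names are hypothetical and only serve the illustration.

```java
class PunctuationSpacer {
    // Single pass: insert a space at every position that touches punctuation
    // and has non-whitespace on both sides.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
    }

    // Two passes: first add a missing space before each punctuation mark,
    // then add a missing space after it.
    static String withTwoPasses(String s) {
        return s
                .replaceAll("(?<=\\S)\\p{Punct}", " $0")
                .replaceAll("\\p{Punct}(?=\\S)", "$0 ");
    }

    public static void main(String[] args) {
        String[] inputs = {"Hello: World.", "It's 9:00?", "1.B,3.D!"};
        for (String input : inputs) {
            System.out.println(withLookarounds(input) + " | " + withTwoPasses(input));
        }
    }
}
```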
