How to split a sentence into words and punctuations in java - java

I want to split a given sentence of type string into words and I also want punctuation to be added to the list.
For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbour, .]
With string.split(" ") I can split the sentence in words by space, but I want the punctuation also to be in the result list.
String text="Sara's dog 'bit' the neighbor."
String list = text.split(" ")
the printed result is [Sara's, dog,'bit', the, neighbour.]
I don't know how to combine another regex with the above split method to separate punctuations also.
Some of the reference I have already tried but didn't work out
1.Splitting strings through regular expressions by punctuation and whitespace etc in java
2.How to split sentence to words and punctuation using split or matcher?
Example input and outputs
String input1="Holy cow! screamed Jane."
String[] output1 = [Holy,cow,!,screamed,Jane,.]
String input2="Select your 'pizza' topping {pepper and tomato} follow me."
String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]

Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.
Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
In Java 8 or earlier, you would write it like this:
List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
parts.add(m.group());
}
Explanation
\p{L} is Unicode letters, \\p{N} is Unicode numbers, and \\p{M} is Unicode marks (e.g. accents). Combined, they are here treated as characters in a "word".
\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.
\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.
That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
Test
public class Test {
public static void main(String[] args) {
test("Sara's dog 'bit' the neighbor.");
test("Holy cow! screamed Jane.");
test("Select your 'pizza' topping {pepper and tomato} follow me.");
}
private static void test(String s) {
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
System.out.println(Arrays.toString(parts));
}
}
Output
[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]

Arrays.stream( s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))") )
.filter(ss -> !ss.trim().isEmpty())
.collect(Collectors.toList())
Reference:
How to split a string, but also keep the delimiters?
Regular Expressions on Punctuation

ArrayList<String> chars = new ArrayList<String>();
String str = "Hello my name is bob";
String tempStr = "";
for(String cha : str.toCharArray()){
if(cha.equals(" ")){
chars.add(tempStr);
tempStr = "";
}
//INPUT WHATEVER YOU WANT FOR PUNCTATION WISE
else if(cha.equals("!") || cha.equals(".")){
chars.add(cha);
}
else{
tempStr = tempStr + cha;
}
}
chars.add(str.substring(str.lastIndexOf(" "));
That?
It should add every single word, assuming there is spaces for each word in the sentence. for !'s, and .'s, you would have to do a check for that as well. Quite simple.

Related

How can I avoid splitting on a comma in brackets?

I have a string below which I want to split in String array with multiple delimiters.
The delimiters are comma (,), semicolon (;), "OR" and "AND".
But I do not want to split on a comma if it's in brackets.
Example input:
device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)
I am able to split the String with regex, but it doesn't handle commas in brackets correctly.
How can I fix this?
Pattern ptn = Pattern.compile("(,|;|OR|AND)");
String[] parts = ptn.split(query);
for(String p:parts){
System.out.println(p);
queryParams.add(p.trim());
}
You could use a negative look-ahead:.
String[] parts = input.split(",(?![^()]*\\))|;| OR | AND ")
Or an uglier (but perhaps conceptually simpler) way you could do it would be to replace any commas within brackets with a temporary placeholder, then do the split and replace the placeholders with real commas in the results.
String input = "X,Y=((A,B),C) OR Z";
Pattern pattern = Pattern.compile("\\(.*\\)");
Matcher matcher = pattern.matcher(input);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, matcher.group().replaceAll(",", "_COMMA_"));
}
matcher.appendTail(sb);
String[] parts = sb.toString().split("(,|;| OR | AND )");
for (String part : parts) {
System.out.println(part.replace("_COMMA_", ","));
}
Prints:
X
Y=((A,B),C)
Z
Alternatively, you could write your own little tokenizer that reads the input character-by-character using charAt(index) or define a grammar for an off-the-shelf parser.
You can use negative look-ahead (?!...), which looks at the following characters, and if those characters match the pattern in brackets, the overall match will fail.
String query = "device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)";
String[] parts = query.split("\\s*(,(?![^()]*\\))|;|OR|AND)\\s*");
for(String part: parts)
System.out.println(part);
Output:
device_name==device503
device_type!=GATEWAY
site_name<site3434
country==India
location==BLR
new_name=in=(Rajesh,Suresh)
So in this case we check whether the characters following the , are 0 or more characters which aren't either ( or ), followed by a ), and if this is true, the , match fails.
This won't work if you can have nested brackets.
Note:
String also has a split method (as used above), which is useful for simplicity's sake (but would be slower than reusing the same Pattern over and over again for multiple Strings).
You can add \\s* (0 or more whitespace characters) to your regex to remove any spaces before or after a delimiter.
If you're using | without anything before or after (e.g. "a|b|c"), you don't need to put it in brackets.

How to split a String sentence into words using split method in Java? [duplicate]

This question already has answers here:
How to split a string with any whitespace chars as delimiters
(13 answers)
Closed 5 years ago.
I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
in regular expressions \W+ match one or more non word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
this will give you following output.
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked following code block and confirmed that it is working with new lines too.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
Replace dots, commas, etc... for a white space and split that for whitespace
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If is only for dots this trick should work...
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters".
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
[just space] one or more OR new lines one or more
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006

Split String end with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\\\s+");
Now, I want to seperate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?
Following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is not a java regex, so you need to add a backspace.
It matches on whitespaces \s+ and on strings of characters consisting not of A-Za-z0-9. This is a workaround, since there isn't (or at least I do not know of) a regex for special characters.
You can test this regex here.
If you use this regex with the split function, it will return the words. Not the special characters and whitespaces it machted on.
UPDATE
According to this answer here on SO, java has\P{Alpha}+, which matches any non-alphabetic character. So you could try:
(\s|\P{Alpha})+
I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
regex to achieve above behavior
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
output:
[Hi, , , you, ???]
#mike is right...we need to split the sentence on special characters, leaving out the words. Here is the code:
`public static void main(String[] args) {
String match = "Hi, How are you???";
String[] words = match.split("\\P{Alpha}+");
for(String word: words) {
System.out.print(word + " ");
}
}`

Regex add space between all punctuation

I need to add spaces between all punctuation in a string.
\\ "Hello: World." -> "Hello : World ."
\\ "It's 9:00?" -> "It ' s 9 : 00 ?"
\\ "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-ZA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-ZA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However this doesn't feel like the right way to go... I think I'm over complicating things...
You can use lookarounds based regex using punctuation property \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) Asserts if prev char is not a white-space
(?<=\\p{Punct}) asserts a position if previous char is a punctuation char
(?=\\p{Punct}) asserts a position if next char is a punctuation char
(?=\\S) Asserts if next char is not a white-space
IdeOne Demo
When you see a punctuation mark, you have four possibilities:
Punctuation is surrounded by spaces
Punctuation is preceded by a space
Punctuation is followed by a space
Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions - one matching the number 2, and one matching the number 3. Since the expressions are applied on top of each other, they take care of the number 4 as well. The number 1 requires no change.
Demo.

can split string method of java return the array with the delimiters as well

When we use String.Split() method on a string to split in java, it works as following:
String s = "hello my dear";
String[] ss = s.split("[ ]");
The array ss contains [Hello, my, dear] but the spaces (which are the delimiters) in this case are not the part of array.
is there some way that the delimiters may be the part of the array generated using the split method of string class in Java.
You can do it like this: -
"hello my dear".split("(?<=[ ])");
It splits on a empty string just after a whitespace. This will give you array with elements like this: -
hello_
my_
dear
_ means space.
If you want your delimiter to be separate array element, you can do it like this: -
System.out.println(Arrays.toString("a+b=c".split("(?<=[+=])|(?=[+=])")));
This now splits on empty string, which is either followed by either + or =, or preceded by either + or =. So, the all the locations where the split is performed for the above case is like this: -
a + b = c
^ ^ ^ ^ <-- Empty strings before and after your pattern - `[+=]`
So, you have 5 elements in your array.
Output: -
[a, +, b, =, c]
But you are using the wrong tool for parsing mathematical expression. You should not use Regex for this.
use StringTokenizer's overloaded constructor:
String s = "hello my dear";
StringTokenizer st = new StringTokenizer(s, " ", true);
while(st.hasMoreTokens()){
System.out.println(st.nextToken());
}
Output:
hello
my
dear
You can split on word boundaries.
String[] ts = "hello my dear".split("\\b");
System.out.println(Arrays.toString(ts));
[, hello, , my, , dear]
Alternatively
public String[] getParts(String s) {
List<String> parts = new ArrayList<String>();
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher m = pattern.matcher(s);
while (m.find()) {
parts.add(m.group());
}
return parts.toArray(new String[parts.size()]);
}
This matches with every find either a word \\w+ (small w) or a non-word character \\W (capital W).

Categories