How to split text into words without spaces and interpunct symbols? - java

I have been given text from an InputStream, which I first read into a String via a StringBuilder.
Then I want to split the text (now a String, since it no longer comes from the InputStream) into words, but in some places the words are separated by more than one space and by punctuation marks. The programming language should be Java.

A simple solution would be using the String split method:
String[] words = myString.split(" ");
If you have a StringBuilder instead of a String:
String[] words = myStringBuilder.toString().split(" ");
If you would like to split on any whitespace characters, see "How do I split a string with any whitespace chars as delimiters?" and use:
myString.split("\\s+");
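For the case in the question, where words are separated by runs of spaces and punctuation, one option is to treat both as delimiters in a single character class. The following is only a minimal sketch and not part of the answer above: the class name SplitWords, the method name words, and the sample sentence are made up for illustration, and \p{Punct} covers ASCII punctuation only.

```java
import java.util.Arrays;

public class SplitWords {
    // Splits on runs of whitespace and/or ASCII punctuation.
    public static String[] words(String text) {
        // Strip leading delimiters first; otherwise split() would return
        // an empty first element for input such as "  Hello".
        String cleaned = text.replaceFirst("^[\\s\\p{Punct}]+", "");
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        // Trailing empty strings are dropped by split() itself.
        return cleaned.split("[\\s\\p{Punct}]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello,   world!  How are you?")));
        // prints [Hello, world, How, are, you]
    }
}
```

Note that this also splits at apostrophes and hyphens, so "don't" becomes "don" and "t"; adjust the character class if that is not wanted.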

Related

Get two different delimiters from same string

How can I use the split function in Java with two delimiters in the same string?
I want to get the words separated by commas and the words separated by spaces separately.
String I = "hello,hi hellow,bye";
I want the above string split as
String var1 = hello,bye
String var2 = hi hellow
Any suggestion is very much valued.
I would try to first split them with one of the delimiters, then for each resulting substring split with the other delimiter.
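A rough sketch of that two-step approach is shown here. The class name TwoDelimiters and the method name splitTwice are hypothetical, and the sketch only demonstrates the nested split; how the pieces are then grouped into var1 and var2 depends on what the asker actually needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoDelimiters {
    // First split on commas, then split each resulting piece on spaces.
    public static List<List<String>> splitTwice(String input) {
        List<List<String>> result = new ArrayList<>();
        for (String part : input.split(",")) {
            result.add(Arrays.asList(part.split(" ")));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitTwice("hello,hi hellow,bye"));
        // prints [[hello], [hi, hellow], [bye]]
    }
}
```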

How to split a String sentence into words using split method in Java? [duplicate]

I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences, you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output:
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked the following code block and confirmed that it works with newlines too.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
Replace dots, commas, etc. with a space, then split the result on whitespace:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
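The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters". Be aware, though, that \\s* also matches the empty string between any two characters, so split("\\s*") breaks the words themselves into single characters.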
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// [ ]+ matches one or more spaces, \n+ matches one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006
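The difference between \\s and \\s+ becomes visible as soon as the input contains consecutive whitespace. The sketch below is an illustration only; the class name SingleVsRun, the method names, and the sample string are made up for this example.

```java
import java.util.Arrays;

public class SingleVsRun {
    // Splits at every single whitespace character.
    public static String[] splitEach(String s) {
        return s.split("\\s");
    }

    // Splits at every run of one or more whitespace characters.
    public static String[] splitRuns(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String speech = "four  score"; // two spaces between the words
        System.out.println(Arrays.toString(splitEach(speech))); // prints [four, , score]
        System.out.println(Arrays.toString(splitRuns(speech))); // prints [four, score]
    }
}
```

With "\\s", the second space produces an empty string between the two words, which is usually not what you want when extracting words.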

Split String end with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\s+");
Now, I want to separate each word from the special characters.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?
Following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is not written as a Java string literal, so you need to escape each backslash (for example, \\s instead of \s).
It matches on whitespace (\s+) and on runs of characters other than A-Za-z0-9. This is a workaround, since there isn't (or at least I do not know of) a regex for special characters.
You can test this regex here.
If you use this regex with the split function, it will return the words, not the special characters and whitespace it matched on.
UPDATE
According to this answer here on SO, Java has \P{Alpha}+, which matches one or more non-alphabetic characters. So you could try:
(\s|\P{Alpha})+
I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
A regex to achieve the above behavior:
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
output:
[Hi, , , you, ???]
@mike is right: we need to split the sentence on the special characters, which leaves just the words. Here is the code:
public static void main(String[] args) {
    String match = "Hi, How are you???";
    // \P{Alpha}+ matches one or more non-alphabetic characters
    String[] words = match.split("\\P{Alpha}+");
    for (String word : words) {
        System.out.print(word + " ");
    }
}
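If the special characters should be kept as separate tokens, as the question asks ("Hi," to {"Hi", ","}), an alternative to split is to collect matches with a Matcher. This is a sketch under assumptions, not code from any of the answers: the class name WordAndSymbolTokens and the method name tokenize are hypothetical, and "special character" is taken to mean any non-whitespace character that is not an ASCII letter or digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordAndSymbolTokens {
    // A token is either a run of ASCII letters/digits
    // or a run of other non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{Alnum}+|[^\\p{Alnum}\\s]+");

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hi, How are you???"));
        // prints [Hi, ,, How, are, you, ???]
    }
}
```

Whitespace is skipped entirely because neither alternative of the pattern can match it.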

Splitting string based on delimiter

A string in the form 23,4,555,67 is taken as input via readline, and another input is the key to search for among the elements linearly. My question is: how can we recognize the elements of the string, which are separated by commas?
You can split the String using split :
String[] tokens = "23,4,555,67".split(",");
String s = "23,4,555,67";
String[] tokens = s.split(",");
This will give you a string array with the numbers.
Alternatively, you can use a StringTokenizer (java.util).
This can be used if your string is delimited by more than one character (it can be used with a single character as well). Your example using StringTokenizer:
StringTokenizer st = new StringTokenizer("23,4,555,67", ",");
while (st.hasMoreTokens())
    System.out.println(st.nextToken());
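To connect this with the linear search mentioned in the question, here is a small sketch that splits the input and compares each element with the key. The class name CommaSearch and the method name indexOf are invented for this example, and it assumes every element is a valid integer.

```java
public class CommaSearch {
    // Returns the position of key among the comma-separated numbers,
    // or -1 if the key does not occur.
    public static int indexOf(String line, int key) {
        String[] tokens = line.split(",");
        for (int i = 0; i < tokens.length; i++) {
            // trim() tolerates input such as "23, 4, 555"
            if (Integer.parseInt(tokens[i].trim()) == key) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("23,4,555,67", 555)); // prints 2
        System.out.println(indexOf("23,4,555,67", 8));   // prints -1
    }
}
```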

Split Strings separated by an arbitrary character

Say we would like to write a method that receives an entire book as a string, plus an arbitrary single-character delimiter, and returns the strings between the delimiters. I came up with the following implementation in Java (assume no consecutive delimiters, etc.):
ArrayList<String> separater(String book, char delimiter) {
    ArrayList<String> ret = new ArrayList<>();
    String word = "";
    for (int i = 0; i < book.length(); ++i) {
        if (book.charAt(i) != delimiter) {
            word += book.charAt(i);
        } else {
            ret.add(word);
            word = "";
        }
    }
    // add the last word, which is not followed by a delimiter
    if (!word.isEmpty()) {
        ret.add(word);
    }
    return ret;
}
Question: I wonder if there is any way to leverage String.split() for a shorter solution? I ask because I could not find a general way of defining a regex for an arbitrary delimiter character.
String.split("\\.") if the delimiter is '.'
String.split("\\s+"); if the delimiter is ' ' // space character
That means I could not find a general way of generating the input regex for split() from the input delimiter character. Any suggestions?
String[] array = string.split(Pattern.quote(String.valueOf(delimiter)));
That said, the Guava Splitter is much more versatile and better behaved than String.split().
And a note on your method: concatenating to a String in a loop is very inefficient. As Strings are immutable, it produces a lot of temporary Strings and StringBuilders. You should use a StringBuilder instead.
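Both suggestions can be sketched side by side. The class name DelimiterSplit and the method names withSplit and withBuilder are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterSplit {
    // Pattern.quote turns the delimiter into a literal pattern,
    // so regex metacharacters such as '.' or '|' need no manual escaping.
    public static String[] withSplit(String book, char delimiter) {
        return book.split(Pattern.quote(String.valueOf(delimiter)));
    }

    // Manual loop using a StringBuilder instead of String concatenation.
    public static List<String> withBuilder(String book, char delimiter) {
        List<String> parts = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < book.length(); i++) {
            char c = book.charAt(i);
            if (c != delimiter) {
                word.append(c);
            } else {
                parts.add(word.toString());
                word.setLength(0); // reuse the same builder
            }
        }
        parts.add(word.toString()); // the piece after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(withSplit("a.b|c.d", '.'))); // prints [a, b|c, d]
        System.out.println(withBuilder("a.b|c.d", '|'));                // prints [a.b, c.d]
    }
}
```

The two variants differ at the edges: String.split() drops trailing empty strings, while the loop above keeps an empty final piece when the input ends with the delimiter.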
