Split a String that ends with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\s+");
Now, I want to separate each word from its special characters.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?

Following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is the raw regex, not a Java string literal, so in Java source you need to escape each backslash with a second backslash (\\s).
It matches runs of whitespace (\s+) and runs of characters that are not in A-Za-z0-9. This is a workaround, since there isn't (or at least I do not know of) a single regex class for special characters.
You can test this regex here.
If you use this regex with the split function, it will return only the words, not the special characters and whitespace it matched on.
UPDATE
According to this answer here on SO, Java has \P{Alpha}+, which matches one or more non-alphabetic characters. So you could try:
(\s|\P{Alpha})+

I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
A regex to achieve the above behavior:
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
output:
[Hi, , , you, ???]

@mike is right: we need to split the sentence on the special characters, which leaves just the words. Here is the code:
`public static void main(String[] args) {
    String match = "Hi, How are you???";
    String[] words = match.split("\\P{Alpha}+");
    for (String word : words) {
        System.out.print(word + " ");
    }
}`
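Note that splitting on \P{Alpha}+ discards the special characters. To keep them as separate tokens, as the question asks, one option is to split on whitespace first and then peel the trailing run of special characters off each word. The following is only a minimal sketch: the class name TrailingSplit and the method splitTrailing are hypothetical, and "special character" is assumed to mean anything outside \p{Alnum} (ASCII letters and digits).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingSplit {
    // Group 1: everything before the trailing run (lazy, may be empty).
    // Group 2: a trailing run of one or more non-alphanumeric characters.
    private static final Pattern TRAILING = Pattern.compile("^(.*?)(\\P{Alnum}+)$");

    static List<String> splitTrailing(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // input was empty or only whitespace
            }
            Matcher m = TRAILING.matcher(word);
            if (m.matches()) {
                if (!m.group(1).isEmpty()) {
                    tokens.add(m.group(1)); // the word itself
                }
                tokens.add(m.group(2));     // its trailing special characters
            } else {
                tokens.add(word);           // no trailing special characters
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitTrailing("Hi, How are you???"));
    }
}
```

For the sample input this yields the tokens Hi, a comma, How, are, you and ???. Special characters in the middle of a word (for example an apostrophe) are left untouched, since only the trailing run is separated.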

Related

Regex to remove only special characters and not other language letters

I used a regex to remove special characters from a name. The expression removes every character except English letters and whitespace.
public static void main(String[] args) {
    String name = "Özcan Sevim.";
    name = name.replaceAll("[^a-zA-Z\\s]", " ").trim();
    System.out.println(name);
}
Output:
zcan Sevim
Expected Output:
Özcan Sevim
I get a bad result doing it this way. The right way would be to remove special characters based on their ASCII codes so that other letters are not removed. Can someone help me with a regex that removes only special characters?
You can use \p{IsLatin} or \p{IsAlphabetic}:
name = name.replaceAll("[^\\p{IsLatin}]", " ").trim();
Or, to remove just the punctuation, use \p{Punct} like this:
name = name.replaceAll("\\p{Punct}", " ").trim();
Outputs
Özcan Sevim
Take a look at the full list in the Summary of regular-expression constructs and use the one that helps you.
Use Guava's CharMatcher for that. It will be easier to read and maintain.
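As a minimal sketch of the \p{IsAlphabetic} suggestion, the snippet below removes every character that is neither a Unicode letter nor whitespace. The class name KeepLetters and the method stripSpecial are hypothetical, and removing the characters (rather than replacing them with a space) is an assumption made here for brevity.

```java
public class KeepLetters {
    // Removes every character that is neither alphabetic (in any script) nor whitespace.
    static String stripSpecial(String name) {
        return name.replaceAll("[^\\p{IsAlphabetic}\\s]", "").trim();
    }

    public static void main(String[] args) {
        // \u00D6 is the letter Ö, written as an escape to avoid source-encoding issues.
        System.out.println(stripSpecial("\u00D6zcan Sevim."));
    }
}
```

With this pattern the Ö survives because it is a letter, while the trailing dot is dropped.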
name = CharMatcher.ASCII.negate().removeFrom(name);
Use \W or "[^a-zA-Z0-9]" as the regex to match any special character, and use String.replaceAll(regex, String) to replace each special character with an empty string. Remember that the first argument of String.replaceAll is a regex, so you have to escape a special character with a backslash for it to be treated as a literal character.
String string = "hjdg$h&jk8^i0ssh6";
Pattern pt = Pattern.compile("[^a-zA-Z0-9]");
Matcher match = pt.matcher(string);
while (match.find()) {
    String s = match.group();
    string = string.replaceAll("\\" + s, "");
}
System.out.println(string);
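The loop above escapes and removes one matched character at a time. Since the character class already describes everything to remove, a single replaceAll call should give the same result. This is a sketch only; the class name StripNonAlnum and the method strip are hypothetical.

```java
public class StripNonAlnum {
    // Deletes every character that is not an ASCII letter or digit in one pass.
    static String strip(String s) {
        return s.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("hjdg$h&jk8^i0ssh6")); // prints hjdghjk8i0ssh6
    }
}
```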

How to split a String sentence into words using split method in Java? [duplicate]

I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output:
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked following code block and confirmed that it is working with new lines too.
String text = "Upper sentence.\n" +
        "Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words) {
    System.out.println(word);
}
Replace dots, commas, etc. with a space, and then split the result on whitespace:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters".
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// [ ]+ matches one or more spaces; \n+ matches one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006

How to write a regex to split a String in this format?

I want to use [,.!?;~] to split a string, but I want each [,.!?;~] to remain in its place. For example:
This is the example, but it is not enough
To
[This is the example,, but it is not enough] // length=2
[0]=This is the example,
[1]=but it is not enough
As you can see, the comma is still in its place. I did this with the regex (?<=([,.!?;~])+). But if some special word (e.g. but) comes after the [,.!?;~], I do not want that part of the string to be split. For example:
I want this sentence to be split into this form, but how to do. So if
anyone can help, that will be great
To
[0]=I want this sentence to be split into this form, but how to do.
[1]=So if anyone can help,
[2]=that will be great
As you can see, this part (form, but) is not split in the first sentence.
I've used:
Positive Lookbehind (?<=a)b to keep the delimiter.
Negative Lookahead a(?!b) to rule out stop words.
Notice how I've appended the regex (?!\\s*(but|and|if)) after the regex you provided. You can put all the stop words that you have to rule out (e.g. but, and, if) inside the parentheses, separated by the pipe symbol.
Also notice that the delimiter is still in its place.
Output
Count of tokens = 3
I want this sentence to be split into this form, but how to do.
So if anyone can help,
that will be great
Code
public class HelloWorld {
    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. So if anyone can help, that will be great";
        // Split after a delimiter (the lookbehind keeps it in the token),
        // unless a stop word follows (the negative lookahead).
        String[] tokensVal = str.split("(?<=([,.!?;~])+)(?!\\s*(but|and|if))");
        // prints the number of tokens
        System.out.println("Count of tokens = " + tokensVal.length);
        for (String token : tokensVal) {
            System.out.println(token);
        }
    }
}
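A possible variant of the same idea is sketched below. It is not the code from the answer: it assumes a single-character lookbehind (so the lookbehind has an obvious fixed length), adds a lookahead so that a run such as "!!!" stays together, requires the stop word to end at a word boundary, and consumes the whitespace after the delimiter so the tokens carry no leading spaces. The class name StopWordSplit and the method split are hypothetical.

```java
import java.util.Arrays;

public class StopWordSplit {
    // (?<=[,.!?;~])                 the previous character is a delimiter (kept in the token)
    // (?![,.!?;~])                  but the next one is not, so "!!!" is not broken up
    // (?!\s*(?:but|and|if)\b)       no stop word follows
    // \s*                           consume the whitespace between the two tokens
    private static final String SPLIT_REGEX =
            "(?<=[,.!?;~])(?![,.!?;~])(?!\\s*(?:but|and|if)\\b)\\s*";

    static String[] split(String text) {
        return text.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. "
                + "So if anyone can help, that will be great";
        System.out.println(Arrays.toString(split(str)));
    }
}
```

The word-boundary check means a word that merely starts with a stop word, such as "butter", no longer blocks the split.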

Specific Regex Pattern

I wish to take a string input from the user and extract words or numbers like so:
String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String[] solve = {"I'm", "looking", "to", "extract", "all", "6", "substrings"};
Basically, I want to extract numbers and words with complete disregard to punctuation except apostrophes. I know how to get words and strings but I can't seem to figure out this tricky part.
You could do it like below.
String s = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String parts[] = s.replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "").split("\\s+");
System.out.println(Arrays.toString(parts));
Output:
[I'm, looking, to, extract, all, 6, substrings]
Explanation:
[^\\s\\w'] matches any character that is not whitespace, a word character, or a single quote.
(?<!\\b)'|'(?!\\b) matches a ' symbol that is either not preceded or not followed by a word character, so only apostrophes inside a word (as in I'm) are kept.
The replaceAll function replaces all the matched characters with an empty string.
Finally, we split the resulting string on one or more whitespace characters.

How to split comma-separated string but exclude some words containing comma in Java

Assume that we have below string:
"test01,test02,test03,exceptional,case,test04"
What I want is to split the string into string array, like below:
["test01","test02","test03","exceptional,case","test04"]
How can I do that in Java?
This negative lookaround regex should work for you:
(?<!exceptional),|,(?!case)
Working Demo
Java Code:
String[] arr = str.split("(?<!exceptional),|,(?!case)");
Explanation:
This regex matches a comma if either of these two conditions is met:
comma is not preceded by word exceptional using negative lookbehind (?<!exceptional)
comma is not followed by word case using negative lookahead (?!case)
That effectively disallows splitting on comma when it is surrounded by exceptional and case on either side.
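To see the two lookarounds in action, here is a small self-contained sketch that applies the regex to the sample string from the question. The class name CommaSplit and the method split are hypothetical names used only for illustration.

```java
import java.util.Arrays;

public class CommaSplit {
    // Split on a comma unless it is both preceded by "exceptional" and followed by "case".
    static String[] split(String input) {
        return input.split("(?<!exceptional),|,(?!case)");
    }

    public static void main(String[] args) {
        String str = "test01,test02,test03,exceptional,case,test04";
        System.out.println(Arrays.toString(split(str)));
    }
}
```

Only the comma inside exceptional,case fails both alternatives, so it is the only one that is not used as a split point.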
@anubhava's answer is great, so use it. For completeness, here's a general solution that is applicable to many situations and uses a beautifully simple regex:
exceptional,case|(,)
The left side of the alternation | matches complete exceptional,case. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left. We then replace these commas by something distinctive, and split on that string.
This program shows how to use the regex (see the results at the bottom of the online demo):
String subject = "somethingelse,case,test02,test03,exceptional,case,test04,exceptional,notcase";
Pattern regex = Pattern.compile("exceptional,case|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    if (m.group(1) != null) m.appendReplacement(b, "##SplitHere##");
    else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("##SplitHere##");
for (String split : splits) System.out.println(split);
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
How can Java know that exceptional,case is a single word and should not be split?
If there were some other recurring character, such as quotation marks, you could split on that.
For example, if the input were
"test01","test02","test03","exceptional,case","test04"
you could split it on ","
So in your case it is not possible unless you use a regular expression.
Here's a dead-simple answer, don't know why I didn't think of it yesterday:
(?<!exceptional(?=,case)),
Explanation
The regex matches a comma (the last character of the regex) unless it is preceded by an exceptional that is itself followed by ,case
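A short sketch of how this single-lookbehind regex could be used with split follows. The class name SingleLookbehindSplit and the method split are hypothetical; the lookahead nested inside the lookbehind is zero-width, so the lookbehind still has a fixed length (the word exceptional).

```java
import java.util.Arrays;

public class SingleLookbehindSplit {
    // Split on every comma except the one sitting between "exceptional" and "case".
    static String[] split(String input) {
        return input.split("(?<!exceptional(?=,case)),");
    }

    public static void main(String[] args) {
        String str = "test01,test02,test03,exceptional,case,test04";
        System.out.println(Arrays.toString(split(str)));
    }
}
```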
String s1 = "test01.test02.test03.{i}.case.test04.test03.{i}.test03.{i}.test03.{i}";
String[] arr1 = s1.split("(?<!)\\.|\\.(?!\\{i})");
Output:
test01
test02
test03.{i}
case
test04
test03.{i}
test03.{i}
test03.{i}
You probably want to use split()
Like this:
String[] array = "test01,test02,test03,exceptional,case,test04".split(",");
