Regex split through multiple symbols
String s = "He is a very very good boy, isn't he?";
String[] sa = s.split("[!, ?._'#]");
System.out.println(sa.length);
for (String string : sa) {
System.out.println(string);
}
11
He
is
a
very
very
good
boy
isn
t
he
while using
String[] sa = s.split("[!, ?._'#]+");
10
He
is
a
very
very
good
boy
isn
t
he
In a regex, + means "one or more", but where does the extra empty element in the first output come from?
This happens because the split function is creating an array element containing an empty string between the comma , and the space after boy.
arr = ['He', 'is', 'a', 'very', 'very', 'good', 'boy', '', 'isn', 't', 'he']
Without the +, split treats the comma and the space as two separate delimiters, so the (empty) text between them becomes an array element of its own.
When you use the + symbol, you split by "groups" of characters, and it takes the comma and space as the splitting regular expression, not generating that empty string between those characters.
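To make the difference concrete, here is a minimal sketch that runs both patterns on the sentence from the question. The class and method names (SplitPlusDemo, splitEach, splitRuns) are made up for illustration.

```java
import java.util.Arrays;

public class SplitPlusDemo {
    // Every single delimiter character is its own split point.
    static String[] splitEach(String s) {
        return s.split("[!, ?._'#]");
    }

    // A run of adjacent delimiter characters counts as one split point.
    static String[] splitRuns(String s) {
        return s.split("[!, ?._'#]+");
    }

    public static void main(String[] args) {
        String s = "He is a very very good boy, isn't he?";
        // 11 elements: an empty string sits between "boy" and "isn"
        System.out.println(Arrays.toString(splitEach(s)));
        // 10 elements: ", " is consumed as a single delimiter
        System.out.println(Arrays.toString(splitRuns(s)));
    }
}
```

In both cases the trailing "?" does not produce an extra element, because split drops trailing empty strings by default.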
I want to split a given sentence of type string into words and I also want punctuation to be added to the list.
For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbour, .]
With string.split(" ") I can split the sentence in words by space, but I want the punctuation also to be in the result list.
String text = "Sara's dog 'bit' the neighbor.";
String[] list = text.split(" ");
the printed result is [Sara's, dog,'bit', the, neighbour.]
I don't know how to combine another regex with the above split method to separate punctuations also.
Some references I have already tried, which did not work for me:
1. Splitting strings through regular expressions by punctuation and whitespace etc in java
2. How to split sentence to words and punctuation using split or matcher?
Example input and outputs
String input1="Holy cow! screamed Jane."
String[] output1 = [Holy,cow,!,screamed,Jane,.]
String input2="Select your 'pizza' topping {pepper and tomato} follow me."
String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]
Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.
Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
In Java 8 or earlier, you would write it like this:
List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
parts.add(m.group());
}
Explanation
\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are here treated as characters in a "word".
\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.
\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.
That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
Test
public class Test {
public static void main(String[] args) {
test("Sara's dog 'bit' the neighbor.");
test("Holy cow! screamed Jane.");
test("Select your 'pizza' topping {pepper and tomato} follow me.");
}
private static void test(String s) {
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
System.out.println(Arrays.toString(parts));
}
}
Output
[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
Arrays.stream( s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))") )
.filter(ss -> !ss.trim().isEmpty())
.collect(Collectors.toList())
Reference:
How to split a string, but also keep the delimiters?
Regular Expressions on Punctuation
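// requires: import java.util.ArrayList; import java.util.List;
List<String> chars = new ArrayList<>();
String str = "Hello my name is bob";
StringBuilder tempStr = new StringBuilder();
for (char cha : str.toCharArray()) {
if (cha == ' ') {
// a space ends the current word
if (tempStr.length() > 0) chars.add(tempStr.toString());
tempStr.setLength(0);
}
// list whatever punctuation characters you want to keep here
else if (cha == '!' || cha == '.') {
// finish the word in progress, then add the punctuation as its own element
if (tempStr.length() > 0) chars.add(tempStr.toString());
tempStr.setLength(0);
chars.add(String.valueOf(cha));
}
else {
tempStr.append(cha);
}
}
// add the last word, if there is one
if (tempStr.length() > 0) chars.add(tempStr.toString());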
Something like that?
It adds every single word, assuming the words in the sentence are separated by spaces. Characters such as ! and . need their own check, as shown. Quite simple.
My character list is "!,;,%,#,**,**,(,)", which I get from XML. When I split it on ',', I lose the ',' element.
How can I avoid that?
I have already tried changing the comma to 'C', but that does not work.
The result I want is "!,;,%,#,,,(,)", not "!,;,%,#,,(,)"
String::split takes a regex, so you can split with the regex ((?<!,),|,(?!,)) like this:
String string = "!,;,%,#,,,(,)";
String[] split = string.split("((?<!,),|,(?!,))");
Details
(?<!,), match a comma if not preceded by a comma
| or
,(?!,) match a comma if not followed by a comma
Outputs
!
;
%
#
,
(
)
If you are trying to extract all characters from string, you can do so by using String.toCharArray()[1] :
String str = "sample string here";
char[] char_array = str.toCharArray();
If you just want to iterate over the characters in the string, you can use the character array obtained from above method or do so by using a for loop and str.charAt(i)[2] to access the character at position i.
[1] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toCharArray()
[2] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#charAt(int)
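As a rough sketch of the two iteration styles mentioned above (the class name CharIteration and the helper countChar are invented for this example):

```java
public class CharIteration {
    // Index-based iteration with charAt(i): counts occurrences of target.
    static int countChar(String str, char target) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "sample string here";

        // Iterating over the array returned by toCharArray()
        for (char c : str.toCharArray()) {
            System.out.print(c + " ");
        }
        System.out.println();

        // Iterating by index with charAt(i)
        System.out.println(countChar(str, 'e')); // 3
    }
}
```

toCharArray() copies the characters into a new array, whereas charAt(i) reads them in place, so the index loop avoids the copy when you only need to read.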
Try this, it might be helpful. First I replace the ',' element with a placeholder string and then split. After the split, I replace the placeholder with ',' again.
public static void main(String[] args) {
String str = "!,;,%,#,**,**,(,)";
System.out.println(str);
str = str.replace("**,**","**/!/**");
String[] array = str.split(",");
System.out.println(Arrays.stream(array).map(s -> s.replace("**/!/**", ",")).collect(Collectors.toList()));
}
Output
!,;,%,#,**,**,(,)
[!, ;, %, #, ,, (, )]
First, we need to define when the comma is an actual delimiter, and when it is part of a character sequence.
We need to assume that a sequence of commas surrounded by commas is an actual character sequence we want to capture. It can be done with lookarounds:
String s = "!,;,,,%,#,**,**,,,,(,)";
List<String> list = Arrays.asList(s.split(",(?!,)|(?<!,),"));
This regular expression splits by a comma that is either preceded by something that is not a comma, or followed by something that is not a comma.
Note that your formatting string, in which every character sequence is separated by a comma, is a bad design: you allow a comma to be used as a sequence, and you also allow sequences of more than one character. That means the two can be combined too!
What, for example, if I want to use these two character sequences:
,
,,,,
Then I construct the formatting string like this: ,,,,,,. It is now unclear whether , and ,,,, should be character sequences, or ,, and ,,,.
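The ambiguity can be checked directly. The sketch below joins both candidate lists with a comma and compares the results; AmbiguousFormat and encode are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

public class AmbiguousFormat {
    // Builds the formatting string by joining the sequences with commas.
    static String encode(List<String> sequences) {
        return String.join(",", sequences);
    }

    public static void main(String[] args) {
        String first = encode(Arrays.asList(",", ",,,,"));
        String second = encode(Arrays.asList(",,", ",,,"));

        System.out.println(first);                // ,,,,,,
        System.out.println(second);               // ,,,,,,
        System.out.println(first.equals(second)); // true
    }
}
```

Since two different lists produce the same string, no split expression can tell which one was meant.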
I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output.
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked the following code block and confirmed that it also works with newlines.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
Replace dots, commas, etc. with a space and then split the result on whitespace.
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
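The expression \\s+ means "one or more whitespace characters". I think what you need to do is replace it with \\s*, which means "zero or more whitespace characters". Be aware, though, that \\s* also matches the empty string, so split will then also break the text between individual non-whitespace characters.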
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// [ ]+ matches one or more spaces, \n+ matches one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into sub strings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006
I have the String value "Hello,World\\,,1,2,3" and I need to split it on the delimiter ,. However, the method also needs to recognize escape characters when splitting, so that the output is an array like
[Hello, World,, 1, 2, 3]
I have a method that does this, but it doesn't recognize the escape characters. It is provided below:
public static String[] tokenize1(String record, char delimiter) {
String delim = String.valueOf(delimiter);
String[] arr = record.split(delim);
return arr;
}
You can try splitting on comma but with a negative lookbehind which asserts that the comma has not been escaped by a backslash:
String input = "Hello,World\\,,1,2,3";
String[] parts = input.split("(?<!\\\\),");
for (String part : parts) {
// uncomment to also remove backslash
// part = part.replaceAll("\\\\,", ",");
System.out.println(part);
}
Output:
Hello
World\,
1
2
3
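Putting the lookbehind into the shape of the method from the question could look like the sketch below. The class name EscapedSplit and the method name tokenize are illustrative, and it assumes a backslash only ever serves to escape the delimiter (an escaped backslash directly before a delimiter is not handled).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class EscapedSplit {
    static String[] tokenize(String record, char delimiter) {
        // Quote the delimiter so that regex metacharacters such as '|' are taken literally.
        String delim = Pattern.quote(String.valueOf(delimiter));
        // Split on the delimiter unless a backslash comes directly before it.
        String[] parts = record.split("(?<!\\\\)" + delim);
        // Turn each escaped delimiter back into a plain one.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\\" + delimiter, String.valueOf(delimiter));
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = tokenize("Hello,World\\,,1,2,3", ',');
        System.out.println(Arrays.toString(parts)); // [Hello, World,, 1, 2, 3]
    }
}
```

Here the unescaping runs after the split, so the escaped comma survives as part of its token.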
My problem is I have a string like this
String text = "UWU/CST/13/0032 F";
I want to split this on / and whitespace and put the parts into an array, so that the array finally contains the following:
UWU,
CST,
13,
0032,
F
text.split("[/ ]"), or text.split("[/ ]", -1) if you want trailing empty tokens to be returned.
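A small sketch of both calls follows; the class name SlashSpaceSplit and the second input string "a/b/ " are made up to show what the -1 limit changes.

```java
import java.util.Arrays;

public class SlashSpaceSplit {
    // Default split: trailing empty tokens are dropped.
    static String[] splitDefault(String text) {
        return text.split("[/ ]");
    }

    // Negative limit: trailing empty tokens are kept.
    static String[] splitKeepTrailing(String text) {
        return text.split("[/ ]", -1);
    }

    public static void main(String[] args) {
        String text = "UWU/CST/13/0032 F";
        System.out.println(Arrays.toString(splitDefault(text))); // [UWU, CST, 13, 0032, F]

        // Hypothetical input that ends in delimiters
        String trailing = "a/b/ ";
        System.out.println(splitDefault(trailing).length);      // 2
        System.out.println(splitKeepTrailing(trailing).length); // 4
    }
}
```

For the string in the question both calls return the same five tokens, because it does not end in a delimiter.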
Use the string.split(separator) method that takes a String (regex expression) as an argument. Here is the documentation. http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
If you have:
String text = "UWU/CST/13/0032 F";
You can separate it by the white space first, splitting it into an array of two Strings, and then split the first String in the array by "/".
String text = "UWU/CST/13/0032 F";
String[] array = text.split(" ");
String[] other = array[0].split("/");
for (String e : array) System.out.println(e);
for (String e : other) System.out.println(e);
This code outputs:
UWU/CST/13/0032
F
UWU
CST
13
0032
Regular expressions can be used in Java to split Strings using the String.split(String) method.
For your particular situation, you should split the string on the regular expression "(\s|/)". \s matches a whitespace character, while / literally matches a forward slash.
The final code for this would be:
String[] splitString = text.split("(\\s|/)");