Specific Regex Pattern - java

I wish to take a string input from the user and extract words or numbers like so:
String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String[] solve = {"I'm", "looking", "to", "extract", "all", "6", "substrings"};
Basically, I want to extract numbers and words with complete disregard to punctuation except apostrophes. I know how to get words and strings but I can't seem to figure out this tricky part.

You could do it like this:
String s = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String parts[] = s.replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "").split("\\s+");
System.out.println(Arrays.toString(parts));
Output:
[I'm, looking, to, extract, all, 6, substrings]
Explanation:
[^\\s\\w'] matches any character that is not whitespace, a word character, or a single quote.
(?<!\\b)'|'(?!\\b) matches a ' that is not preceded, or not followed, by a word character (that is, any ' that is not inside a word).
The replaceAll function replaces all the matched characters with an empty string.
Finally, we split the resulting string on one or more whitespace characters.
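As a minimal sketch, the same replaceAll-then-split idea can be wrapped in a helper method. The class name ApostropheWords, the method name extract, and the extra trim() call (to avoid an empty first element when the input starts with whitespace) are assumptions added for illustration and are not part of the answer above.

```java
import java.util.Arrays;

class ApostropheWords {
    // Removes every character that is not whitespace, a word character or ',
    // then removes any ' that is not surrounded by word characters,
    // and finally splits the cleaned text on runs of whitespace.
    static String[] extract(String input) {
        String cleaned = input.replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "");
        return cleaned.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
        // prints [I'm, looking, to, extract, all, 6, substrings]
        System.out.println(Arrays.toString(extract(problem)));
        // Quotes around a word are dropped, the one inside "don't" is kept:
        // prints [don't, quote, me]
        System.out.println(Arrays.toString(extract("don't 'quote' me")));
    }
}
```

Note that \w also matches the underscore, so an input containing _ would keep it.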

Related

Splitting characters

My characters are "!,;,%,#,**,**,(,)", which I get from XML. When I split the string on ',', I lose the ',' entry itself.
How can I avoid this?
I have already tried changing the comma to '&#002C', but it does not work.
The result I want is "!,;,%,#,,,(,)", not "!,;,%,#,,(,)".
String::split uses a regex, so you can split with the regex ((?<!,),|,(?!,)) like this:
String string = "!,;,%,#,,,(,)";
String[] split = string.split("((?<!,),|,(?!,))");
Details
(?<!,), match a comma if not preceded by a comma
| or
,(?!,) match a comma if not followed by a comma
Outputs
!
;
%
#
,
(
)
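The lookaround split above can be tried out with a small self-contained sketch. The class name CommaSplit and the method name splitKeepingCommaEntry are made up for this example; the regex is the one from the answer, without the redundant outer group.

```java
import java.util.Arrays;
import java.util.List;

class CommaSplit {
    // Splits on a comma that is not preceded by a comma, or not followed by one.
    // A comma with commas on both sides is therefore kept as a value.
    static List<String> splitKeepingCommaEntry(String input) {
        return Arrays.asList(input.split("(?<!,),|,(?!,)"));
    }

    public static void main(String[] args) {
        // prints [!, ;, %, #, ,, (, )]
        System.out.println(splitKeepingCommaEntry("!,;,%,#,,,(,)"));
    }
}
```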
If you are trying to extract all characters from a string, you can do so by using String.toCharArray()[1]:
String str = "sample string here";
char[] char_array = str.toCharArray();
If you just want to iterate over the characters in the string, you can use the character array obtained from the above method, or use a for loop and str.charAt(i)[2] to access the character at position i.
[1] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toCharArray()
[2] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#charAt(int)
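Both ways of walking over the characters can be sketched as follows. The class CharIteration and its two counting methods are hypothetical helpers written only for this illustration.

```java
class CharIteration {
    // Iterates with an index and charAt(i).
    static int countWithCharAt(String str, char target) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Iterates over the copy returned by toCharArray().
    static int countWithCharArray(String str, char target) {
        int count = 0;
        for (char c : str.toCharArray()) {
            if (c == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "sample string here";
        System.out.println(countWithCharAt(str, 'e'));    // prints 3
        System.out.println(countWithCharArray(str, 'e')); // prints 3
    }
}
```

toCharArray() allocates a new array, so the charAt loop may be preferable when only read access is needed.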
Try this; it could be helpful. First I replace the ',' entry with a placeholder string and then split. After that, I replace the placeholder with ',' again.
public static void main(String[] args) {
String str = "!,;,%,#,**,**,(,)";
System.out.println(str);
str = str.replace("**,**","**/!/**");
String[] array = str.split(",");
System.out.println(Arrays.stream(array).map(s -> s.replace("**/!/**", ",")).collect(Collectors.toList()));
}
Output:
!,;,%,#,**,**,(,)
[!, ;, %, #, ,, (, )]
First, we need to define when the comma is an actual delimiter, and when it is part of a character sequence.
We need to assume that a sequence of commas surrounded by commas is an actual character sequence we want to capture. It can be done with lookarounds:
String s = "!,;,,,%,#,**,**,,,,(,)";
List<String> list = Arrays.asList(s.split(",(?!,)|(?<!,),"));
This regular expression splits by a comma that is either preceded by something that is not a comma, or followed by something that is not a comma.
Note that your format string, in which every character sequence is separated by a comma, is a poor design: you need to allow a comma as a sequence and also allow sequences of more than one character. That means the two can be combined too!
What, for example, if I want to use these two character sequences:
,
,,,,
Then I construct the formatting string like this: ,,,,,,. It is now unclear whether , and ,,,, should be character sequences, or ,, and ,,,.
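To see the ambiguity in practice, here is a small sketch that feeds the six-comma string to the same lookaround split. The class name AmbiguousCommas and the method name split are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;

class AmbiguousCommas {
    // Same lookaround regex as in the answer above.
    static List<String> split(String input) {
        return Arrays.asList(input.split(",(?!,)|(?<!,),"));
    }

    public static void main(String[] args) {
        List<String> parts = split(",,,,,,");
        // Only the first and the last comma are treated as delimiters, so the
        // result is an empty string followed by ",,,,": neither of the two
        // intended readings is recovered.
        System.out.println(parts.size());  // prints 2
        System.out.println(parts.get(1));  // prints ,,,,
    }
}
```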

How to split a String sentence into words using split method in Java? [duplicate]

I need to split some sentences into words.
For example:
Upper sentence.
Lower sentence. And some text.
I do it by:
String[] words = text.split("(\\s+|[^.]+$)");
But the output I get is:
Upper, sentence.Lower, sentence., And, some, text.
And it should be like:
Upper, sentence., Lower, sentence., And, some, text.
Notice that I need to preserve all the characters (.,-?! etc.)
In regular expressions, \W+ matches one or more non-word characters.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
So if you want to get the words in the sentences you can use \W+ as the splitter.
String[] words = text.split("\\W+");
This will give you the following output.
Upper
sentence
Lower
sentence
And
some
text
UPDATE :
Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.
String[] words = text.split("\\s+");
I have checked the following code block and confirmed that it works with newlines too.
String text = "Upper sentence.\n" +
"Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words){
System.out.println(word);
}
Replace dots, commas, etc. with a space and then split on whitespace:
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
Result: [hello, world, this, is, a, sentence]
Edit:
If it is only for dots, this trick should work...
String text = "hello.world this is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));
[hello., world, this, is., a, sentence.]
The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters".
Simple answer for updated question
String text = "Upper sentence.\n"+
"Lower sentence. And some text.";
// one or more spaces, OR one or more newlines
String[] arr1 = text.split("[ ]+|\n+");
System.out.println(Arrays.toString(arr1));
result:
[Upper, sentence., Lower, sentence., And, some, text.]
You can split the string into substrings using the following line of code:
String[] result = speech.split("\\s");
For reference: https://alvinalexander.com/java/edu/pj/pj010006

Split String end with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\s+");
Now, I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?
Following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is not written as a Java string literal, so you need to escape each backslash with another backslash.
It matches on whitespace (\s+) and on runs of characters that are not in A-Za-z0-9. This is a workaround, since there isn't (or at least I do not know of) a regex for special characters.
You can test this regex here.
If you use this regex with the split function, it will return the words, not the special characters and whitespace it matched on.
UPDATE
According to this answer here on SO, Java has \P{Alpha}+, which matches one or more non-alphabetic characters. So you could try:
(\s|\P{Alpha})+
I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
A regex to achieve the above behavior:
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
output:
[Hi, , , you, ???]
@mike is right... we need to split the sentence on special characters, which leaves only the words. Here is the code:
public static void main(String[] args) {
String match = "Hi, How are you???";
String[] words = match.split("\\P{Alpha}+");
for(String word: words) {
System.out.print(word + " ");
}
}
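The answers above either drop the special characters or keep the whitespace. As an alternative sketch (not taken from any of the answers), a Matcher can collect words and runs of special characters as separate tokens, which is what the question asks for. The class WordPunctTokenizer, the TOKEN pattern, and the assumption that "special character" means anything that is neither alphanumeric nor whitespace are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPunctTokenizer {
    // A token is either a run of letters/digits, or a run of characters
    // that are neither alphanumeric nor whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{Alnum}+|[^\\p{Alnum}\\s]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hi, ,, How, are, you, ???]
        System.out.println(tokenize("Hi, How are you???"));
    }
}
```

Words without trailing special characters simply produce a single token, so nothing needs to be ignored explicitly.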

Splitting a string in Java when two delimiters are next to one another

In Java, I have a line that was read in by a BufferedReader, stored in str. I also have a String[] called strSplit, which will contain the contents of the string split on anything that is not an alphanumeric character or the character '.
The code looks like this:
// Assume str contains a line
String[] strSplit = str.split("[^a-zA-z0-9']|\\s");
Given the string "Hello can't world, how are [you! (today)? " which has been assigned to str I would expect the following contents in my strSplit array:
strSplit = [ "Hello", "can't", "world", "how", "are", "you", "today" ]
However, I end up getting this in my strSplit array:
strSplit = [ "Hello", "can't", "world", "", "are", "[you", "", "today" ]
Essentially, when splitting the string "world, " it recognizes the world part and the delimiter , and then since there is no valid string before another delimiter, it gives me an empty string "". Also for some reason a string with brackets [] will end up in the split string.
I'm assuming this has to do with the way I set up my regex but I'm not sure what I did wrong. I'm pretty new to regex things so any help would be appreciated.
The regex has a wrong range selection:
[^a-zA-z0-9']|\\s
The z in A-z should be uppercase (A-Z); otherwise the range selects every character whose ASCII value is between A and z.
The range [A-z] therefore also matches the characters [, \, ], ^, _ and `, which sit between Z and a in ASCII.
Use the + quantifier on both the character class and the whitespace class.
str.split("[^a-zA-Z0-9']+|\\s+");
This will match as many consecutive delimiter characters as possible.
Regex101 Live Demo
str.split("[^a-zA-Z0-9']+")
That didn't work??
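A short sketch of the corrected split, run against the string from the question, together with a check of what the A-z range accepts. The class name DelimiterRuns and the method name words are assumptions for this illustration.

```java
import java.util.Arrays;

class DelimiterRuns {
    // Splits on runs of characters that are not letters, digits or '.
    // The + makes "world, " produce one delimiter instead of two.
    static String[] words(String line) {
        return line.split("[^a-zA-Z0-9']+");
    }

    public static void main(String[] args) {
        String str = "Hello can't world, how are [you! (today)? ";
        // prints [Hello, can't, world, how, are, you, today]
        System.out.println(Arrays.toString(words(str)));

        // The typo range A-z accepts '[', the correct range A-Z does not.
        System.out.println("[".matches("[A-z]")); // prints true
        System.out.println("[".matches("[A-Z]")); // prints false
    }
}
```

If a line starts with a delimiter, split still returns an empty first element, so such input may need an additional filter.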

How to parse a string by words

I need to parse a string by picking out all words. So far I have figured out how to split words with any symbols. But how do I rewrite the code to discard words with numbers or any other characters? Here is my code:
String s = "AaA bbd cDef d1s s/4 +xx_x asdgag 34545rtrtr.";
Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
System.out.println(Arrays.asList(s.split(p.pattern())));
Not valid words:
*“d1s”, “s/4”, “+xx_x”, “34545rtrtr.”*
Appropriate words:
“AaA”, “bbd”, “cDef”, “asdgag”
Try something like:
"\\b[A-Za-z]+\\b"
Where,
\b marks a word boundary.
[A-Za-z] means every letter, upper or lower case
+ means "one or more".
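One caveat with \b: in "s/4" the letter s has a word boundary on both sides, so \b[A-Za-z]+\b would still match it. A possible sketch, assuming that words are delimited by whitespace, replaces the boundaries with lookarounds for non-whitespace and collects matches with a Matcher instead of calling split. The class WholeWords and the WORD pattern are hypothetical and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWords {
    // Letters only, with no non-whitespace character directly before or after.
    private static final Pattern WORD =
            Pattern.compile("(?<!\\S)[A-Za-z]+(?!\\S)");

    static List<String> find(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "AaA bbd cDef d1s s/4 +xx_x asdgag 34545rtrtr.";
        // prints [AaA, bbd, cDef, asdgag]
        System.out.println(find(s));
    }
}
```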
