How to split a string based on punctuation marks and whitespace?

I have a String that I want to split based on punctuation marks and whitespace. What should be the regex argument to the split() method?

Here is code with some handling for odd input thrown in. Note that it skips empty tokens in the output loop, which is quick and dirty. You can add whatever characters you need to split on (and drop) to the regex character class. (tchrist is right: the \s thing is woefully implemented and only works in some very simple cases.)
public class SomeClass {
public static void main(String args[]) {
String input = "The\rquick!brown - fox\t\tjumped?over;the,lazy\n,,.. \nsleeping___dog.";
for (String s: input.split("[\\p{P} \\t\\n\\r]")){
if (s.equals("")) continue;
System.out.println(s);
}
}
}
INPUT:
The
quick!brown - fox jumped?over;the,lazy
,,..
sleeping___dog.
OUTPUT:
The
quick
brown
fox
jumped
over
the
lazy
sleeping
dog
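If you would rather not filter empty tokens in the output loop, one option is to let the regex consume whole runs of delimiters. The following is a minimal sketch rather than the answer's original code: the class name PunctuationSplit and the helper tokens() are made up for illustration, and it assumes that \s (ASCII whitespace by default) is good enough for your input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class PunctuationSplit {
    // Hypothetical helper: splits on runs of Unicode punctuation (\p{P}) or whitespace (\s).
    // The filter drops the empty leading token that split() returns
    // when the input starts with a delimiter.
    static List<String> tokens(String input) {
        return Arrays.stream(input.split("[\\p{P}\\s]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "The\rquick!brown - fox\t\tjumped?over;the,lazy\n,,.. \nsleeping___dog.";
        // prints [The, quick, brown, fox, jumped, over, the, lazy, sleeping, dog]
        System.out.println(tokens(input));
    }
}
```

The + quantifier is what merges neighbouring delimiters such as ",,.. " into a single split point.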

Try something like this:
String myString = "item1, item2, item3";
String[] tokens = myString.split(", ");
for (String t : tokens){
System.out.println(t);
}
/*output
item1
item2
item3
*/

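str.split("[ ,.!?;]+")
would be a good start for English. The delimiters have to go inside a character class because split() treats its argument as a regex; written as a bare string, " ,.!?;" is read as one pattern to match in sequence, not as a set of alternative delimiters. You will need to refine the class based on what you see in your data and which language the text is in.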
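Since split() interprets its argument as a regular expression, a list of delimiter characters only behaves as a set when it is wrapped in square brackets. Here is a small sketch of that idea; the class and method names are invented for the example.

```java
import java.util.Arrays;

class DelimiterClassDemo {
    // Hypothetical helper: treats space , . ! ? ; as interchangeable delimiters.
    // Inside [...] the dot and question mark are literal characters, not regex operators.
    static String[] splitOnDelimiters(String text) {
        return text.split("[ ,.!?;]+");
    }

    public static void main(String[] args) {
        // prints [Hello, world, How, are, you]
        System.out.println(Arrays.toString(splitOnDelimiters("Hello, world! How are you?")));
    }
}
```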

Related

Split a string using split method

I tried to split a string using the split method, but I'm running into a problem.
String str="1-DRYBEANS,2-PLAINRICE,3-COLDCEREAL,4-HOTCEREAL,51-ASSORTEDETHNIC,GOURMET&SPECIALTY";
List<String> zoneArray = new ArrayList<>(Arrays.asList(str.split(",")));
Actual output :
zoneArray = {"1-DRYBEANS","2-PLAINRICE","3-COLDCEREAL","4-HOTCEREAL","51-ASSORTEDETHNIC","GOURMET&SPECIALTY"}
Expected output :
zoneArray = {"1-DRYBEANS","2-PLAINRICE","3-COLDCEREAL","4-HOTCEREAL","51-ASSORTEDETHNIC,GOURMET&SPECIALTY"}
Any help would be appreciated.

Use split(",(?=[0-9])")
This does not split on every comma, only on a comma that is followed by a digit from 0-9. The (?=...) construct is known as a positive lookahead.
Take a look at this code snippet for example:
public class SplitDemo {
    public static void main(String[] args) {
        String str="1-DRYBEANS,2-PLAINRICE,3-COLDCEREAL,4-HOTCEREAL,51-ASSORTEDETHNIC,GOURMET&SPECIALTY";
        // Split on a comma only when the next character is a digit
        String[] array1= str.split(",(?=[0-9])");
        for (String temp: array1){
            System.out.println(temp);
        }
    }
}

Use a look-ahead in your regex: match the comma itself outside the look-ahead, followed by a number inside the look-ahead. \\d+ will suffice for the number. The regex can look like:
String regex = ",(?=\\d+)";
For example:
public class Foo {
public static void main(String[] args) {
String str = "1-DRYBEANS,2-PLAINRICE,3-COLDCEREAL,4-HOTCEREAL,51-ASSORTEDETHNIC,GOURMET&SPECIALTY";
String regex = ",(?=\\d+)";
String[] tokens = str.split(regex);
for (String item : tokens) {
System.out.println(item);
}
}
}
This splits on any comma that is followed by digits, but leaves the digits in the output, since the look-ahead matches them without consuming them.
For more on look-ahead, look-behind and look-around, please check out this relevant tutorial page.
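To make the "does not consume" point concrete, here is a small sketch that compares a pattern that matches the digit directly with one that only looks ahead at it. The class name, method names and the shortened sample string are assumptions made for this example.

```java
import java.util.Arrays;

class LookaheadDemo {
    // The digit is part of the match, so split() throws it away with the comma.
    static String[] consuming(String s) {
        return s.split(",\\d");
    }

    // The digit is only checked by the look-ahead, so it stays in the next token.
    static String[] lookahead(String s) {
        return s.split(",(?=\\d)");
    }

    public static void main(String[] args) {
        String s = "1-A,2-B,C";
        System.out.println(Arrays.toString(consuming(s))); // prints [1-A, -B,C]
        System.out.println(Arrays.toString(lookahead(s))); // prints [1-A, 2-B,C]
    }
}
```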

Regex : Looking for dots in a sentence except inside brackets

I'm looking for a regex to split a Java string on dots in a sentence, except when the dots are between brackets.
That is to say, in this example sentence:
word1.word2.word3[word4.word5[word6.word7]].word8
I would like to split only on the first two dots and the last one (just before "word8").
I managed to get to this regex:
\.(?![^\[]*?\])
But it's not good enough, as it also splits on the dot between words 4 and 5 :-(
Any ideas for solving this particular case?

Judging by the discussions on PerlMonks, I don't think the problem can be solved in Java with a single regex.
If you are okay with using multiple steps, then you could first remove all pairs of brackets (starting with the innermost) and then split the remaining string by dots:
public static void main (String[] args) {
String str = "word1.word2.word3[word4.word5[word6.word7]].word8";
final Pattern BRACKET_PAIR = Pattern.compile("\\[[^\\[\\]]+\\]");
while (BRACKET_PAIR.matcher(str).find()) {
str = BRACKET_PAIR.matcher(str).replaceFirst("");
}
for (String word: str.split("\\.")) {
System.out.println(word);
}
}
Resulting in the output:
word1
word2
word3
word8
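If the bracketed text should stay attached to its token instead of being removed, a different technique is to skip regex altogether and scan the string while tracking bracket depth. The sketch below is only one possible approach, not the method from the answer above; the class and method names are hypothetical, and it assumes brackets are balanced.

```java
import java.util.ArrayList;
import java.util.List;

class BracketAwareSplit {
    // Hypothetical helper: splits on '.' only while outside all [...] pairs.
    static List<String> splitOutsideBrackets(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open brackets
        for (char c : s.toCharArray()) {
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            }
            if (c == '.' && depth == 0) {
                // Top-level dot: close the current token
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        // prints [word1, word2, word3[word4.word5[word6.word7]], word8]
        System.out.println(splitOutsideBrackets("word1.word2.word3[word4.word5[word6.word7]].word8"));
    }
}
```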

Java- I want to change a particular string with another one

import java.util.*;
import java.io.*;
public class OptimusPrime{
public static void main(String[] args){
System.out.println("Please enter the sentence");
Scanner scan= new Scanner(System.in);
String bucky=scan.nextLine();
int pOs=bucky.indexOf("is");
System.out.println(pOs);
if(pOs==-1){
System.out.println("the statement is invalid for the question");
}
else{
String nay=bucky.replace("is", "was");
System.out.println(nay);
}
}
}
Now I know the "replace" method is wrong here, as I want to change only the standalone word "is" and not "is" where it occurs inside other words. I also tried using a SetChar method, but I guess the "String is immutable" concept applies here.
How should I go about it?

Using String.replaceAll() instead lets you use a regex. You can use the predefined character class \W to match a non-word character:
System.out.println("This is not difficult".replaceAll("\\Wis", ""));
Output:
This not difficult
The verb "is" has disappeared, but not the "is" in "This".
Note 1: This also removes the non-word character. If you want to keep it, capture it with parentheses in the regex and reinsert it with $1:
System.out.println("This [is not difficult".replaceAll("(\\W)is", "$1"));
Output:
This [ not difficult
Note 2: If you need to handle a string that begins with "is", this pattern is not enough, but that case is easy to handle with another regex:
System.out.println("is not difficult".replaceAll("^is", ""));
Output :
not difficult

If you use replaceAll instead, you can use \b (a word boundary) to perform a "whole words only" search.
See this example:
public static void main(final String... args) {
System.out.println(replace("this is great", "is", "was"));
System.out.println(replace("crysis", "is", "was"));
System.out.println(replace("island", "is", "was"));
System.out.println(replace("is it great?", "is", "was"));
}
private static String replace(final String source, final String replace, final String with) {
return source.replaceAll("\\b" + replace + "\\b", with);
}
The output is:
this was great
crysis
island
was it great?
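One thing to watch with a helper like this is that the search word is concatenated straight into the regex, so characters such as "." or "$" in either argument would be interpreted specially. A hedged variant is sketched below; the class name WholeWordReplace and the method replaceWord() are invented for illustration, while Pattern.quote and Matcher.quoteReplacement are standard java.util.regex methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Hypothetical helper: replaces 'word' only where it stands alone as a whole word.
    static String replaceWord(String source, String word, String with) {
        // Pattern.quote makes the search word literal; \b anchors it at word boundaries
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        // quoteReplacement keeps '$' and '\' in the replacement from being treated as group references
        return source.replaceAll(regex, Matcher.quoteReplacement(with));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("this is great", "is", "was")); // prints this was great
        System.out.println(replaceWord("island", "is", "was"));        // prints island
    }
}
```

This assumes the search word begins and ends with a word character; \b behaves differently next to punctuation.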

Simpler way:
String nay = bucky.replaceAll(" is ", " was ");
Match word boundary:
String nay = bucky.replaceAll("\\bis\\b", "was");

To replace one string with another, you can use replace().
If your string variable contains this:
bucky ="Android is my friend";
then you can write:
bucky =bucky.replace("is","are");
and bucky will then contain "Android are my friend". Note that replace() substitutes every occurrence of "is", including those inside other words.
Hope this helps.

How to change characters of a string into '*'

So I'm trying to make a simple Wheel of fortune type game. But I'm having a serious issue getting started. I'm just trying to convert my phrase into "*" so that it can't be seen until the user guesses what one of the letters is. Here's what I have so far:
public class Puzzle
{
private String solution="DOG PILE";
private StringBuilder puzzle;
public Puzzle(String solution)
{
int startindex=puzzle.indexOf(solution);
puzzle.replace(startIndex, endIndex, "-");
}
}

Use a regular expression with the replaceAll method:
String hideSolution = solution.replaceAll(".", "-");

Use the Guava library.
Example:
String noDigits = CharMatcher.JAVA_DIGIT.replaceFrom(string, "*"); // star out all digits

You can try something like this
public static String hide(String data, StringBuilder charactersToShow) {
    // Replace every character that is neither whitespace nor already guessed
    return data.replaceAll("[^\\s" + charactersToShow.toString() + "]", "*");
}
public static void main(String[] args) throws Exception {
    StringBuilder guesses = new StringBuilder();
    String solution = "DOG PILE";
    System.out.println(hide(solution, guesses));
    guesses.append('D');
    System.out.println(hide(solution, guesses));
    guesses.append('I');
    System.out.println(hide(solution, guesses));
}
Output:
*** ****
D** ****
D** *I**
A little explanation:
The replaceAll method takes two arguments: a regular expression that describes which part of the String should be replaced, and the replacement. The method returns a new String, so the original String is not changed.
For the regular expression I used a character class [] with negation [^...], so it matches any character that is not in the class. Besides the user's characters I added \\s at the beginning, because it represents any whitespace (spaces, tabs, newlines, and so on), which you probably don't want to replace with *.
You may also want to add ' to that set if you don't want it replaced.
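Because the guessed characters are pasted into a character class, a guess such as ']' or '^' could change the meaning of the regex. If that is a concern, a plain loop avoids the issue. This is a sketch of an alternative technique, not the code from the answer; the names PuzzleMask and mask() are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class PuzzleMask {
    // Hypothetical helper: shows whitespace and guessed characters, hides everything else.
    static String mask(String solution, Set<Character> guessed) {
        StringBuilder sb = new StringBuilder(solution.length());
        for (char c : solution.toCharArray()) {
            boolean reveal = Character.isWhitespace(c) || guessed.contains(c);
            sb.append(reveal ? c : '*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<Character> guessed = new HashSet<>();
        System.out.println(mask("DOG PILE", guessed)); // prints *** ****
        guessed.add('D');
        System.out.println(mask("DOG PILE", guessed)); // prints D** ****
        guessed.add('I');
        System.out.println(mask("DOG PILE", guessed)); // prints D** *I**
    }
}
```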

Split string into individual words Java

I would like to know how to split up a large string into a series of smaller strings or words.
For example:
I want to walk my dog.
I want to have a string: "I",
another string:"want", etc.
How would I do this?

Use the split() method.
E.g.:
String s = "I want to walk my dog";
String[] arr = s.split(" ");
for ( String ss : arr) {
System.out.println(ss);
}

As a more general solution (but ASCII only!), to include any other separators between words (like commas and semicolons), I suggest:
String s = "I want to walk my dog, cat, and tarantula; maybe even my tortoise.";
String[] words = s.split("\\W+");
The regex means that the delimiters are anything that is not a word character (\W), in groups of at least one (+). Because + is greedy, it will take, for instance, ';' and ' ' together as one delimiter.
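A quick sketch of what this pattern returns may help, including one edge case: when the input starts with a delimiter, split() returns an empty string as the first element. The class and method names here are made up for the example.

```java
import java.util.Arrays;

class NonWordSplit {
    // Hypothetical helper: splits on runs of non-word characters (anything outside [A-Za-z0-9_]).
    static String[] words(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        // prints [dog, cat, and, tarantula]
        System.out.println(Arrays.toString(words("dog, cat, and tarantula;")));
        // A leading delimiter produces an empty first token: prints [, hello]
        System.out.println(Arrays.toString(words(", hello")));
    }
}
```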

A regex can also be used to split words.
\w can be used to match word characters ([A-Za-z0-9_]), so that punctuation is removed from the results:
String s = "I want to walk my dog, and why not?";
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group());
}
Outputs:
I
want
to
walk
my
dog
and
why
not
See Java API documentation for Pattern

See my other answer if your phrase contains accented characters:
String[] listeMots = phrase.split("\\P{L}+");
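The difference from \W shows up as soon as the text contains letters outside ASCII. The sketch below compares the two patterns on a short French phrase; the class and method names are assumptions for the example, and the accented letters are written as Unicode escapes so the source file stays ASCII-only.

```java
import java.util.Arrays;

class LetterSplit {
    // By default \W treats accented letters as non-word characters, so they act as delimiters.
    static String[] byNonWord(String s) {
        return s.split("\\W+");
    }

    // \P{L} matches anything that is not a Unicode letter, so accented letters stay in the words.
    static String[] byNonLetter(String s) {
        return s.split("\\P{L}+");
    }

    public static void main(String[] args) {
        String phrase = "O\u00f9 est le caf\u00e9?"; // "Ou est le cafe?" with accents on the u and the e
        System.out.println(Arrays.toString(byNonWord(phrase)));   // words are cut at the accented letters
        System.out.println(Arrays.toString(byNonLetter(phrase))); // words are kept whole
    }
}
```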

Yet another method, using StringTokenizer :
String s = "I want to walk my dog";
StringTokenizer tokenizer = new StringTokenizer(s);
while(tokenizer.hasMoreTokens()) {
System.out.println(tokenizer.nextToken());
}

To treat anything other than upper- and lower-case letters as a separator between words, we can do:
String mystring = "hi, there,hi Leo";
String[] arr = mystring.split("[^a-zA-Z]+");
for(int i = 0; i < arr.length; i += 1)
{
System.out.println(arr[i]);
}
Here the regex means that the separators will be anything that is not an upper- or lower-case letter ([^a-zA-Z]), in groups of at least one (+).

You can use the split(" ") method of the String class to get each word, as in the code below:
String s = "I want to walk my dog";
String []strArray=s.split(" ");
for(int i=0; i<strArray.length;i++) {
System.out.println(strArray[i]);
}

This regex splits on runs of whitespace such as spaces, tabs, and line breaks:
String[] str = s.split("\\s+");
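The practical difference from split(" ") appears when words are separated by more than one space or by tabs. A minimal sketch, with invented class and method names, and with a trim() added as an assumption to avoid an empty first token on input that starts with whitespace:

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Splits on every single space, so consecutive spaces yield empty tokens.
    static String[] bySingleSpace(String s) {
        return s.split(" ");
    }

    // Splits on runs of whitespace; trim() removes leading and trailing whitespace first.
    static String[] byWhitespaceRuns(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String s = "I  want\tto walk";
        System.out.println(bySingleSpace(s).length);                // prints 4 (one token is empty)
        System.out.println(Arrays.toString(byWhitespaceRuns(s)));   // prints [I, want, to, walk]
    }
}
```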

Use split()
String words[] = stringInstance.split(" ");

StringTokenizer separate = new StringTokenizer(s, " ");
String word = separate.nextToken();
System.out.println(word);

Java String split() method example
public class SplitExample{
public static void main(String args[]){
String str="java string split method";
String[] words=str.split("\\s");//splits the string based on whitespace
for(String word:words){
System.out.println(word);
}
}
}

You can use the StringUtils class from Apache Commons Lang:
String[] partsOfString = StringUtils.split("I want to walk my dog", StringUtils.SPACE);

import java.util.StringTokenizer;
class test{
public static void main(String[] args){
StringTokenizer st= new StringTokenizer("I want to walk my dog.");
while (st.hasMoreTokens())
System.out.println(st.nextToken());
}
}

Using Java Stream API:
String sentence = "I want to walk my dog.";
Arrays.stream(sentence.split(" ")).forEach(System.out::println);
Output:
I
want
to
walk
my
dog.
Or
String sentence2 = "I want to walk my dog.";
Arrays.stream(sentence2.split(" ")).map(str -> str.replace(".", "")).forEach(System.out::println);
Output:
I
want
to
walk
my
dog
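As a variation on the stream approach, the punctuation can be handled by the split pattern itself instead of a separate replace step. This sketch uses Pattern.splitAsStream (available since Java 8); the class and method names are hypothetical, and \W+ is assumed to be an acceptable delimiter for the input.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StreamWords {
    // Compiled once and reused: runs of non-word characters act as delimiters.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Hypothetical helper: streams the tokens and drops any empty ones.
    static List<String> words(String sentence) {
        return NON_WORD.splitAsStream(sentence)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [I, want, to, walk, my, dog]
        System.out.println(words("I want to walk my dog."));
    }
}
```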

String[] str = s.split("[^a-zA-Z]+");

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
