Java split by white spaces with condition - java

I want to split a string on whitespace. However, if words are enclosed in quotation marks, they should be treated as a single word.
For example, from Word to split I should get Word, to, split,
but from
"word to" split I should get "word to", split. The quotation marks remain.

Is that what you want??
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TmpTest {
public static void main(String args[]) {
final String regex = "\".*?\"|\\b\\w+\\b";
final String string = "\"word to\" split i should get \"word to2\", split.";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
}
}
}

Here's how you can achieve this:
String str = "\"word to\" split";
List<String> list = new ArrayList<String>();
Matcher m = Pattern.compile("([^\"]\\S*|\".+?\")\\s*").matcher(str);
while (m.find())
list.add(m.group(1)); // Add .replace("\"", "") to remove surrounding quotes.
System.out.println(list);
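As a variation on the two answers above, here is a minimal sketch that matches either a complete double-quoted run or a run of non-whitespace characters. The class name QuotedSplit and the method tokenize are made up for illustration, and the pattern assumes that quotes are balanced and never escaped inside a quoted phrase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class QuotedSplit {
    // Either a double-quoted run (quotes kept) or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Word to split"));
        System.out.println(tokenize("\"word to\" split"));
    }
}
```

Unlike the \b\w+\b alternative in the first answer, this keeps punctuation attached to a word (for example "split." stays one token), which may or may not be what you want.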

Related

Extract values between commas without the quotation marks

Let's say I have a string such as 'John','Smith'. I want my regex to extract the values John and Smith from that string, without the commas and quotation marks. I looked around the site and found a solution that gets rid of the commas, but not the quotation marks.
This is the regex I tried (?:^|(?<=,))[^,]*
With that I get 'John' and 'Smith'. Of course, I could simply iterate over the Matcher like this and remove the quotation marks manually, but I was wondering if there's a more direct solution using regex without having to resort to replaceAll.
Pattern pat = Pattern.compile("(?:^|(?<=,))[^,]*");
Matcher matcher = pat.matcher("'John', 'Smith'");
List<String> matches = new ArrayList<>();
while (matcher.find()) {
matches.add(matcher.group().replaceAll("'", ""));
}
The following regex will work: "[^,']+".
Below is the updated code.
public static void main(String[] args) {
String regex = "[^,']+";
Pattern pat = Pattern.compile(regex);
Matcher matcher = pat.matcher("'John', 'Smith'");
List<String> matches = new ArrayList<>();
while (matcher.find()) {
matches.add(matcher.group());
}
System.out.println(matches);
}
Output (the blank middle entry is the space after the comma in the input string):
[John, , Smith]
I tried this code and it gives output string without single quotes:
public class SubstringExample{
public static void main(String args[]){
String nameStr="'John','Smith'";
String newNameStr = nameStr.replaceAll("\'","");
System.out.println(newNameStr);
}}
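If the goal is to avoid any post-processing, one option is to capture the text between each pair of single quotes and read group(1) instead of the whole match. The sketch below assumes the values never contain a single quote themselves; QuotedValues and extract are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class QuotedValues {
    // Group 1 is whatever sits between a pair of single quotes.
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // the captured text, without the quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("'John', 'Smith'"));
    }
}
```

Because the commas and any spaces around them lie outside the quotes, they are never part of a match, so no blank entries appear.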

Java regular expression to validate and extract some values

I want to extract all three parts of the following string in Java
MS-1990-10
The first part should always be 2 letters (A-Z)
The second part should always be a year
The third part should always be a number
Does anyone know how I can do that using Java's regular expressions?
You can do this using Java's Pattern and Matcher classes with capturing groups:
Pattern datePatt = Pattern.compile("([A-Z]{2})-(\\d{4})-(\\d{2})");
Matcher m = datePatt.matcher("MS-1990-10");
if (m.matches()) {
String g1 = m.group(1);
String g2 = m.group(2);
String g3 = m.group(3);
}
Use Matcher's group() method to retrieve the parts that actually matched.
In Matcher, the text matched inside parentheses is captured and can be retrieved via the group() method. To use parentheses without capturing, use the non-capturing form (?:xxx).
See also Pattern.
public static void main(String[] args) throws Exception {
String[] lines = { "MS-1990-10", "AA-999-12332", "ZZ-001-000" };
for (String str : lines) {
System.out.println(Arrays.toString(parse(str)));
}
}
private static String[] parse(String str) {
String regex = "";
regex = regex + "([A-Z]{2})";
regex = regex + "[-]";
// regex = regex + "([^0][0-9]+)"; // any year, no leading zero
regex = regex + "([12]{1}[0-9]{3})"; // 1000 - 2999
regex = regex + "[-]";
regex = regex + "([0-9]+)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
if (!matcher.matches()) {
return null;
}
String[] tokens = new String[3];
tokens[0] = matcher.group(1);
tokens[1] = matcher.group(2);
tokens[2] = matcher.group(3);
return tokens;
}
This is a way to get all 3 parts with a regex:
public class Test {
public static void main(String... args) {
Pattern p = Pattern.compile("([A-Z]{2})-(\\d{4})-(\\d{2})");
Matcher m = p.matcher("MS-1990-10");
// group() throws IllegalStateException if the match failed, so check first
if (m.matches()) {
for (int i = 1; i <= m.groupCount(); i++)
System.out.println(m.group(i));
}
}
}
String rule = "^[A-Z]{2}-[1-9][0-9]{3}-[0-9]{2}";
Pattern pattern = Pattern.compile(rule);
Matcher matcher = pattern.matcher(s);
This regex matches years between 1000 and 9999; adjust the range as needed.
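The numbered groups used in the answers above can also be given names, which Java supports since version 7. The following is a minimal sketch under a few assumptions that the question does not state: the year is limited to 1000-2999 (as in one of the answers) and the third part is one or more digits. CodeParser and the group names prefix, year and number are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class CodeParser {
    // Named groups make the extraction code self-describing.
    private static final Pattern CODE =
            Pattern.compile("(?<prefix>[A-Z]{2})-(?<year>[12][0-9]{3})-(?<number>[0-9]+)");

    // Returns {prefix, year, number}, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("prefix"), m.group("year"), m.group("number") };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("MS-1990-10")));
        System.out.println(Arrays.toString(parse("AA-999-12332")));
    }
}
```

Using matches() rather than find() means the whole input has to fit the pattern, so the same call both validates and extracts.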

Java string split() Regex

I have a string like this:
["[number][name]statement_1.","[number][name]statement_2."]
I want to get only statement_1 and statement_2. I tried it this way:
String[] statement = message.trim().split("\\s*,\\s*");
but it gives ["[number][name]statement_1." and "[number][name]statement_2."]. How can I get only statement_1 and statement_2?
Match All instead of Splitting
Splitting and Match All are two sides of the same coin. In this case, Match All is easier.
You can use this regex:
(?<=\])[^\[\]"]+(?=\.)
See the matches in the regex demo.
In Java code:
Pattern regex = Pattern.compile("(?<=\\])[^\\[\\]\"]+(?=\\.)");
Matcher regexMatcher = regex.matcher(yourString);
while (regexMatcher.find()) {
// the match: regexMatcher.group()
}
In answer to your question to get both matches separately:
Pattern regex = Pattern.compile("(?<=\\])[^\\[\\]\"]+(?=\\.)");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
String theFirstMatch = regexMatcher.group();
}
if (regexMatcher.find()) {
String theSecondMatch = regexMatcher.group();
}
Explanation
The lookbehind (?<=\]) asserts that what precedes the current position is a ]
[^\[\]"]+ matches one or more chars that are not [, ] or "
The lookahead (?=\.) asserts that the next character is a dot
Reference
Match All and Split are Two Sides of the Same Coin
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
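For reference, here is the lookaround pattern from this answer wrapped in a small runnable class. The sample input is an assumption about what the question intends (two elements ending in statement_1 and statement_2), and StatementExtractor is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class StatementExtractor {
    // Preceded by ']', made of characters other than '[', ']' and '"', followed by '.'.
    private static final Pattern STATEMENT =
            Pattern.compile("(?<=\\])[^\\[\\]\"]+(?=\\.)");

    static List<String> extract(String input) {
        List<String> statements = new ArrayList<>();
        Matcher m = STATEMENT.matcher(input);
        while (m.find()) {
            statements.add(m.group());
        }
        return statements;
    }

    public static void main(String[] args) {
        String s = "[\"[number][name]statement_1.\",\"[number][name]statement_2.\"]";
        System.out.println(extract(s));
    }
}
```

The lookbehind and lookahead consume no characters, so neither the closing bracket nor the trailing dot ends up in the match.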
I somehow don't think that is your actual string, but you may try the following.
String s = "[\"[number][name]statement_1.\",\"[number][name]statement_2.\"]";
String[] parts = s.replaceAll("\\[.*?\\]", "").split("\\W+");
System.out.println(parts[0]); //=> "statement_1"
System.out.println(parts[1]); //=> "statement_2"
Is the string going to be, for example, [50][James]Loves cake?
Scanner scan = new Scanner(System.in);
System.out.println ("Enter string");
String s = scan.nextLine();
int last = s.lastIndexOf("]")+1;
String x = s.substring(last, s.length());
System.out.println (x);
Enter string
[21][joe]loves cake
loves cake
Process completed.
Use a regex instead.
With Java 7
final Pattern pattern = Pattern.compile("(^.*\\])(.+)?");
final String[] strings = { "[number][name]statement_1.", "[number][name]statement_2." };
final List<String> results = new ArrayList<String>();
for (final String string : strings) {
final Matcher matcher = pattern.matcher(string);
if (matcher.matches()) {
results.add(matcher.group(2));
}
}
System.out.println(results);
With Java 8
final Pattern pattern = Pattern.compile("(^.*\\])(.+)?");
final String[] strings = { "[number][name]statement_1.", "[number][name]statement_2." };
final List<String> results = Arrays.stream(strings)
.map(pattern::matcher)
.filter(Matcher::matches)
.map(matcher -> matcher.group(2))
.collect(Collectors.toList());
System.out.println(results);

Pattern.COMMENTS always causing Matcher.find to fail

The following code matches the two expressions and prints success.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String regex = "\\{user_id : [0-9]+\\}";
String string = "{user_id : 0}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find())
System.out.println("Success.");
else
System.out.println("Failure.");
}
}
However, I want white space to not matter, so the following should also print success.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String regex = "\\{user_id:[0-9]+\\}";
String string = "{user_id : 0}";
Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);
Matcher matcher = pattern.matcher(string);
if (matcher.find())
System.out.println("Success.");
else
System.out.println("Failure.");
}
}
The Pattern.COMMENTS flag is supposed to permit white space, but it causes Failure to be printed. It even causes Failure to be printed if the strings are exactly equivalent including white space, like in the first example. For example,
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String regex = "\\{user_id : [0-9]+\\}";
String string = "{user_id : 0}";
Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);
Matcher matcher = pattern.matcher(string);
if (matcher.find())
System.out.println("Success.");
else
System.out.println("Failure.");
}
}
Prints Failure.
Why is this happening and how do I make the Pattern ignore white space?
There is a misunderstanding here. Pattern.COMMENTS allows you to put additional whitespace into your regex to improve its readability, but this whitespace will NOT be matched against the string.
It does not make the regex automatically match whitespace in the input that is not defined in the regex.
Example
With Pattern.COMMENTS you can put whitespace in your regex like this
String regex = "\\{ user_id: [0-9]+ \\}";
to improve readability, but it will not match the string
String string = "{user_id : 0}";
because the regex does not account for the whitespace in the string. If you want to use Pattern.COMMENTS, you need to treat the whitespace you want to match specially. Either escape it
String regex = "\\{ user_id\\ :\\ [0-9]+ \\}";
or you use the whitespace class
String regex = "\\{ user_id \\s:\\s [0-9]+ \\}";
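To make whitespace in the input optional rather than required, \s* can be used in place of \s. The sketch below contrasts the pattern from the question with a tolerant one; CommentsFlagDemo and its field names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration.
class CommentsFlagDemo {
    // COMMENTS strips the unescaped spaces, so this behaves like \{user_id:[0-9]+\}.
    static final Pattern STRIPPED =
            Pattern.compile("\\{user_id : [0-9]+\\}", Pattern.COMMENTS);

    // \s* allows optional whitespace in the input; the plain spaces are layout only.
    static final Pattern TOLERANT =
            Pattern.compile("\\{ \\s* user_id \\s* : \\s* [0-9]+ \\s* \\}", Pattern.COMMENTS);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(found(STRIPPED, "{user_id : 0}")); // false
        System.out.println(found(STRIPPED, "{user_id:0}"));   // true
        System.out.println(found(TOLERANT, "{user_id : 0}")); // true
        System.out.println(found(TOLERANT, "{user_id:0}"));   // true
    }
}
```

The same \s* approach works without the COMMENTS flag as well; the flag only affects how whitespace inside the regex itself is read.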

How to extract uppercase substrings from a String in Java?

I need a piece of code with which I can extract the substrings that are in uppercase from a string in Java.
For example:
"a:[AAAA|0.1;BBBBBBB|-1.90824;CC|0.0]"
I need to extract CC, BBBBBBB and AAAA.
You can do it with String[] split(String regex). The only problem can be with empty strings, but it's easy to filter them out:
String str = "a:[AAAA|0.1;BBBBBBB|-1.90824;CC|0.0]";
String[] substrings = str.split("[^A-Z]+");
for (String s : substrings)
{
if (!s.isEmpty())
{
System.out.println(s);
}
}
Output:
AAAA
BBBBBBB
CC
This should demonstrate the proper syntax and method. More details can be found here http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html and http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html
String myStr = "a:[AAAA|0.1;BBBBBBB|-1.90824;CC|0.0]";
Pattern upperCase = Pattern.compile("[A-Z]+");
Matcher matcher = upperCase.matcher(myStr);
List<String> results = new ArrayList<String>();
while (matcher.find()) {
results.add(matcher.group());
}
for (String s : results) {
System.out.println(s);
}
The [A-Z]+ part is the regular expression which does most of the work. There are a lot of strong regular expression tutorials if you want to look more into it.
If you just want to extract every run of uppercase letters, use [A-Z]+. If you want only substrings that are entirely uppercase, so that a word containing lowercase letters is skipped (HELLO is OK but Hello is not), use \b[A-Z]+\b.
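The difference between the two patterns is easiest to see side by side. This is a minimal sketch; UppercaseRuns and findAll are illustrative names, and the input "HELLO Hello" is just a made-up test string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class UppercaseRuns {
    // Collects every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[A-Z]+", "HELLO Hello"));       // [HELLO, H]
        System.out.println(findAll("\\b[A-Z]+\\b", "HELLO Hello")); // [HELLO]
    }
}
```

Note that \b treats digits and underscores as word characters too, so a token such as AB1 would not match the second pattern.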
I think you should do a replace all regular expression to turn the character you don't want into a delimiter, perhaps something like this:
str.replaceAll("[^A-Z]+", " ")
Trim any leading or trailing spaces.
Then, if you wish, you can call str.split(" ")
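Put together, the replace, trim and split steps might look like the sketch below. UppercaseSplit and extract are illustrative names, and the empty-array return for input without any uppercase letters is an added assumption, since split on an empty string would otherwise yield one empty element.

```java
import java.util.Arrays;

// Hypothetical helper for illustration.
class UppercaseSplit {
    static String[] extract(String input) {
        // Turn every run of non-uppercase characters into a single space, then trim the ends.
        String cleaned = input.replaceAll("[^A-Z]+", " ").trim();
        // "".split(" ") would return one empty string, so handle that case explicitly.
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("a:[AAAA|0.1;BBBBBBB|-1.90824;CC|0.0]")));
    }
}
```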
This is probably what you're looking for:
import java.util.List;
import java.util.Vector;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatcherDemo {
private static final String REGEX = "[A-Z]+";
private static final String INPUT = "a:[AAAA|0.1;BBBBBBB|-1.90824;CC|0.0]";
public static void main(String[] args) {
Pattern p = Pattern.compile(REGEX);
// get a matcher object
Matcher m = p.matcher(INPUT);
List<String> sequences = new Vector<String>();
while(m.find()) {
sequences.add(INPUT.substring(m.start(), m.end()));
}
}
}
