Splitting a string in java where the delimiter is a word - java

I have a string (a list of author names for a book) in the following format:
author_name1, author_name2, author_name3 and author_name4
How can I parse the string so that I get the list of author names as an array of String? The delimiters in this case are , and the word and. I'm not sure how I can split the string based on these delimiters (since one of the delimiters here is a word and not a single character).

You can use myString.split(",|and") it will do what you want :)

You should use regular expressions:
"someString".split("(,|and)")

Try:
yourString.split("\\s*(,|and)\\s*")
\\s* means zero or more whitespace characters (so the surrounding spaces aren't included in your split).
(,|and) means , or and.
Test (Arrays.toString prints the array in the form - [element1, element2, ..., elementN]).
Java regex reference.
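One caveat with the patterns above: a bare and alternative also matches inside names, so "Alexander".split("and") gives [Alex, er]. Here is a minimal sketch that only treats and as a delimiter when it stands alone between whitespace. The class and method names are made up for illustration, and the handling of an optional ", and" is an assumption about what the input may look like:

```java
import java.util.Arrays;

final class AuthorSplit {
    // Delimiter is either a comma (optionally followed by "and "),
    // or the word "and" with whitespace on both sides.
    static String[] splitAuthors(String authors) {
        return authors.trim().split("\\s*,\\s*(?:and\\s+)?|\\s+and\\s+");
    }

    public static void main(String[] args) {
        String s = "Alexander Pope, Hans Andersen and Amanda Ross";
        System.out.println(Arrays.toString(splitAuthors(s)));
        // prints [Alexander Pope, Hans Andersen, Amanda Ross]
    }
}
```

The match is case-sensitive, so a name such as "And One" is left alone, but an author whose name really contains a standalone lowercase "and" would still be split.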

I think you need to include the regex OR operator:
String[]tokens = someString.split(",|and");

Related

Java - Splitting String [duplicate]

I am wondering if I am going about splitting a string on a . the right way? My code is:
String[] fn = filename.split(".");
return fn[0];
I only need the first part of the string, that's why I return the first item. I ask because I noticed in the API that . means any character, so now I'm stuck.
split() accepts a regular expression, so you need to escape . to not consider it as a regex meta character. Here's an example :
String[] fn = filename.split("\\.");
return fn[0];
I see only solutions here but no full explanation of the problem so I decided to post this answer
Problem
You need to know a few things about text.split(delim). The split method:
accepts as its argument a regular expression (regex) that describes the delimiter on which we want to split,
if delim exists at the end of text, as in a,b,c,, (where the delimiter is ,), split will at first create an array like ["a" "b" "c" "" ""], but since in most cases we don't really need these trailing empty strings, it also removes them automatically for us. So it creates another array without these trailing empty strings and returns it.
You also need to know that the dot . is a special character in regex. It represents any character (except line separators, but this can be changed with the Pattern.DOTALL flag).
So for a string like "abc", if we split on ".", the split method will
create an array like ["" "" "" ""],
but since this array contains only empty strings and they are all trailing, they will be removed (as described in the second point above),
which means the result is an empty array [] (with no elements, not even an empty string), so we can't use fn[0] because there is no index 0.
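The behaviour described above is easy to check. This is only a small sketch (the class name is arbitrary); the two-argument split(regex, limit) with a negative limit is the standard way to keep the trailing empty strings:

```java
import java.util.Arrays;

final class TrailingEmptyDemo {
    public static void main(String[] args) {
        // Unescaped dot: every character is a delimiter, so only empty strings
        // remain, and split removes trailing empty strings -> empty array.
        System.out.println("abc".split(".").length);       // prints 0

        // A negative limit keeps the trailing empty strings.
        System.out.println("abc".split(".", -1).length);   // prints 4

        System.out.println(Arrays.toString("a,b,c,,".split(",")));      // prints [a, b, c]
        System.out.println(Arrays.toString("a,b,c,,".split(",", -1)));  // prints [a, b, c, , ]
    }
}
```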
Solution
To solve this problem you simply need to create a regex that represents a literal dot. To do so we need to escape that .. There are a few ways to do it, but the simplest is probably using \ (which in a String needs to be written as "\\" because \ is also special there and requires another \ to escape it).
So the solution to your problem may look like
String[] fn = filename.split("\\.");
Bonus
You can also use other ways to escape that dot like
using character class split("[.]")
wrapping it in quote split("\\Q.\\E")
using proper Pattern instance with Pattern.LITERAL flag
or simply use split(Pattern.quote(".")) and let the Pattern class do the escaping for you.
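To compare the escaping options listed above side by side, here is a short sketch (class and method names are just for illustration). All five variants should split on a literal dot and give the same result:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LiteralDotSplit {
    // Returns the result of each escaping variant for the same input.
    static String[][] allVariants(String text) {
        return new String[][] {
            text.split("\\."),                                 // backslash escape
            text.split("[.]"),                                 // character class
            text.split("\\Q.\\E"),                             // \Q...\E quoting
            text.split(Pattern.quote(".")),                    // Pattern.quote
            Pattern.compile(".", Pattern.LITERAL).split(text)  // LITERAL flag
        };
    }

    public static void main(String[] args) {
        for (String[] parts : allVariants("a.b.c")) {
            System.out.println(Arrays.toString(parts)); // prints [a, b, c] each time
        }
    }
}
```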
Split uses regular expressions, where '.' is a special character meaning anything. You need to escape it if you actually want it to match the '.' character:
String[] fn = filename.split("\\.");
(one '\' to escape the '.' in the regular expression, and the other to escape the first one in the Java string)
Also, I wouldn't suggest returning fn[0], since if you have a file named something.blabla.txt, which is a valid name, you won't be returning the actual file name. Instead I think it's better to use:
int idx = filename.lastIndexOf('.');
return filename.substring(0, idx);
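Note that lastIndexOf returns -1 when the name has no dot at all, and substring(0, -1) then throws StringIndexOutOfBoundsException. A guarded sketch follows; baseName is a hypothetical helper name, and treating a leading dot (as in ".bashrc") as "no extension" is an assumption of this example, not something the question specifies:

```java
final class BaseName {
    // Hypothetical helper: strips the last extension, if there is one.
    static String baseName(String filename) {
        int idx = filename.lastIndexOf('.');
        // idx == -1: no dot at all; idx == 0: name starts with a dot (e.g. ".bashrc").
        // Treating both as "no extension" is an assumption of this sketch.
        if (idx <= 0) {
            return filename;
        }
        return filename.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(baseName("something.blabla.txt")); // prints something.blabla
        System.out.println(baseName("README"));               // prints README
    }
}
```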
The String#split(String) method uses regular expressions.
In regular expressions, the "." character means "any character".
You can avoid this behavior by either escaping the "."
filename.split("\\.");
or telling the split method to split at a character class:
filename.split("[.]");
Character classes are collections of characters. You could write
filename.split("[-.;ld7]");
and filename would be split at every "-", ".", ";", "l", "d" or "7". Inside character classes, the "." is not a special character ("metacharacter").
Since the dot (.) is a special character and the split method of String expects a regular expression, you need to escape it like this:
String[] fn = filename.split("\\.");
return fn[0];
In a regex, special characters need to be escaped with a "\", but since "\" is also a special character in Java string literals, you need to escape it again with another "\"!
String str="1.2.3";
String[] cats = str.split(Pattern.quote("."));
Wouldn't it be more efficient to use
filename.substring(0, filename.indexOf("."))
if you only want what's up to the first dot?
Usually it's NOT a good idea to escape it by hand. There is a method in the Pattern class for this task:
java.util.regex
static String quote(String s)
The split method takes a regex as its argument... Simply change "." to "\\."
The solution that worked for me is the following
String[] fn = filename.split("[.]");
Note: Further care should be taken with this snippet, even after the dot is escaped!
If filename is just the string ".", then fn will still end up with length 0 and fn[0] will still throw an exception!
This is because, if the pattern matches at least once, split will discard all trailing empty strings (thus also the one before the dot!) from the array, leaving an empty array to be returned.
Using Apache Commons IO it's simplest:
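One way around this edge case, sketched here with a made-up helper name, is the two-argument split with a positive limit: with limit 2 the pattern is applied at most once and trailing empty strings are kept, so the array always has at least one element.

```java
final class FirstPart {
    // Hypothetical helper: everything before the first dot.
    static String firstPart(String filename) {
        // limit = 2: at most one split, and trailing empty strings are kept,
        // so index 0 always exists.
        return filename.split("\\.", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(".".split("\\.").length);          // prints 0
        System.out.println(firstPart(".").isEmpty());         // prints true
        System.out.println(firstPart("report.final.txt"));    // prints report
    }
}
```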
File file = ...
FilenameUtils.getBaseName(file.getName());
Note, it also extracts a filename from full path.
split takes a regex as its argument. So you should pass "\\." instead of "." because "." is a metacharacter in regex (and the backslash itself must be doubled inside a Java string literal).

how to break the string using keywords using regex

I have a scenario where I need to break the input string below into pieces based on keywords, using regex.
Keywords are UPRCAS, REPLC, LOWCAS and TUPIL.
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
The output should be as follows
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD
How can I achieve this using Java regex?
I have tried splitting on '-' and using regex, but both approaches give an array of strings, and I then have to process each string again and combine 3 strings to form UPRCAS-0004-abcd. I feel this is not an efficient way to do it, as it takes an extra array and processes the pieces back together.
String[] tokens = input.split("-");
String[] r = input.split("(?=\\p{Upper})");
Please let me know if we can split the string using regex based on the keywords. Basically I need to extract the string between the keyword boundaries.
Edited question after understanding the limitation of the existing problem
The regex should be generic enough to extract the strings from the input between the UPPERCASE keywords
The regex should not contain the keywords themselves.
I understand that it is a bad idea to add a new keyword to the regex pattern every time. I would like the solution to be as generic as possible.
Thanks all for your time. Really appreciate it.
Split using the following regex:
(?=UPRCAS|REPLC|LOWCAS|TUPIL)
The (?=xxx) is a zero-width positive lookahead, meaning that it matches the empty space immediately preceding one of the 4 keywords.
See Regular-Expressions.info for more information: Lookahead and Lookbehind Zero-Length Assertions
Test
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
String[] output = input.split("(?=UPRCAS|REPLC|LOWCAS|TUPIL)");
for (String value : output)
System.out.println(value);
Output
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD
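If the keyword list changes over time, the lookahead can be built from a collection instead of being hard-coded. This is only a sketch: KeywordSplit and splitBefore are made-up names, and it assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeywordSplit {
    // Splits the input at every position that is immediately followed by a keyword.
    static String[] splitBefore(String input, List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)               // treat each keyword literally
                .collect(Collectors.joining("|"));
        return input.split("(?=" + alternation + ")");
    }

    public static void main(String[] args) {
        String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
        List<String> keywords = Arrays.asList("UPRCAS", "REPLC", "LOWCAS", "TUPIL");
        for (String value : splitBefore(input, keywords)) {
            System.out.println(value);
        }
    }
}
```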
You can try this regex:
\w+-\w+-(?:[a-z0-9]+|[A-Z]+)
Demo: https://regex101.com/r/etKBjI/3
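That pattern is meant for matching rather than splitting, so it would be used with a Matcher loop. A rough sketch (the class and method names are arbitrary); note it relies on each value being either lowercase letters and digits or all uppercase, as in the sample input, so a mixed-case value would be cut short:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordMatcher {
    // keyword - number - value, where the value is lowercase/digits or all uppercase
    private static final Pattern RECORD =
            Pattern.compile("\\w+-\\w+-(?:[a-z0-9]+|[A-Z]+)");

    // Collects every non-overlapping match, left to right.
    static List<String> records(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
        records(input).forEach(System.out::println);
    }
}
```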

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts, which included information about the second parameter of split. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class; otherwise it is treated as a range operator, as in A-Z or 0-9.
You must use the quantifier + to match 1 or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (and keep the hyphen at the end of the character class, so that .-\\' is not read as a range):
"[\\s,.'\\[\\]\\(\\)-]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
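To see the effect of the quantifier, here is a small comparison sketch (the class name is arbitrary). Without +, the comma and the space after "day" are two separate delimiters with an empty string between them; with +, they are consumed as one delimiter:

```java
import java.util.Arrays;

final class QuantifierDemo {
    static final String SENTENCE = "This is a very nice day, isn't it";

    public static void main(String[] args) {
        // One delimiter character per match: ", " produces an empty token.
        String[] single = SENTENCE.split("[\\s,.\\-\\[\\]()']");
        // One or more delimiter characters per match: no empty token in the middle.
        String[] run = SENTENCE.split("[\\s,.\\-\\[\\]()']+");

        System.out.println(single.length);        // prints 10
        System.out.println(run.length);           // prints 9
        System.out.println(Arrays.toString(run)); // prints [This, is, a, very, nice, day, isn, t, it]
    }
}
```

Because ' is in the delimiter set, "isn't" is split into "isn" and "t". Also note that a delimiter at the very start of the input still produces one leading empty string, even with the quantifier.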
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
Splitter
.onPattern("([,.|])")
.trimResults()
.omitEmptyStrings()
.split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
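If adding Guava is not an option, roughly the same trim-and-omit-empty behaviour can be sketched with the plain JDK (Java 8 streams). The helper name below is made up, and this is an approximation of the Splitter chain above, not its actual implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class OmitEmpty {
    // Splits on the regex, trims each piece, and drops pieces that end up empty.
    static List<String> splitTrimOmitEmpty(String text, String regex) {
        return Arrays.stream(text.split(regex))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitTrimOmitEmpty("foo,bar,, qux", ","));         // prints [foo, bar, qux]
        System.out.println(splitTrimOmitEmpty("foo|bar,, qux.hi", "[,.|]"));  // prints [foo, bar, qux, hi]
    }
}
```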

Splitting a string in java on more than one symbol

I want to split a string whenever one of the following symbols is encountered: "+,-,*,/,="
I am using the split function, but this function can take only one argument. Moreover, it is not working on "+".
I am using following code:-
Stringname.split("Symbol");
Thanks.
String.split takes a regular expression as argument.
This means you can combine several symbols or text patterns in a single argument in order to split your String.
See documentation here.
Here's an example in your case:
String toSplit = "a+b-c*d/e=f";
String[] splitted = toSplit.split("[-+*/=]");
for (String split: splitted) {
System.out.println(split);
}
Output:
a
b
c
d
e
f
Notes:
Reserved characters for Patterns must be double-escaped with \\. Edit: Not needed here.
The [] brackets in the pattern indicate a character class.
More on Patterns here.
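As for why a bare "+" does not work: + is a quantifier in regex, so on its own it is not a valid pattern and split throws a PatternSyntaxException. A small sketch (class and method names are arbitrary) showing the failure and two ways to split on a literal plus:

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PlusSplit {
    // Returns true if splitting on an unescaped "+" is rejected as an invalid regex.
    static boolean throwsOnBarePlus() {
        try {
            "a+b".split("+");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOnBarePlus());                                // prints true
        System.out.println(Arrays.toString("a+b".split("\\+")));               // prints [a, b]
        System.out.println(Arrays.toString("a+b".split(Pattern.quote("+"))));  // prints [a, b]
    }
}
```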
You can use a regular expression:
String[] tokens = input.split("[+*/=-]");
Note: - should be placed in first or last position to make sure it is not considered as a range separator.
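To illustrate what goes wrong when the hyphen sits in the middle: in "[+-/]" the hyphen defines the range from + to /, which also covers the comma and the dot. A quick sketch (the input string is just an example):

```java
import java.util.Arrays;

final class HyphenRangeDemo {
    public static void main(String[] args) {
        // "+-/" is read as a range, so '.' (which lies between '+' and '/') is a delimiter too.
        System.out.println(Arrays.toString("1.5+2".split("[+-/]")));  // prints [1, 5, 2]

        // Hyphen first: it is a literal, and '.' is no longer matched.
        System.out.println(Arrays.toString("1.5+2".split("[-+/]")));  // prints [1.5, 2]
    }
}
```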
You need a regular expression. Additionally, you need the regex OR operator:
String[] tokens = Stringname.split("\\+|\\-|\\*|\\/|\\=");
For that, you need to use an appropriate regex. Some of the symbols you listed are reserved in regex, so you'll have to escape them with \.
A very baseline expression would be \+|\-|/|\*|= (written as "\\+|\\-|/|\\*|=" in a Java string literal). It is relatively easy to understand: each symbol is separated by the | (or) symbol, and the symbols that are special in regex are escaped with \. If, for example, you wanted to add ^ as well, all you would need to do is append |\^ to that expression.
For testing and quick expressions, I like to use www.regexpal.com
