regular expression to split the string in java - java

I want to split a string such as [AO_12345678, Real Estate] into AO_12345678 and Real Estate.
How can I do this in Java using a regex?
The main issue I'm facing is avoiding the "[" and "]".
Please help.

Does it really have to be a regex?
If not:
String s = "[AO_12345678, Real Estate]";
String[] split = s.substring(1, s.length()-1).split(", ");
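If it does have to be a regex, here is a minimal sketch that strips the brackets with one expression and splits with another. The class and method names (BracketSplit, splitBracketed) are made up for illustration, and it assumes at most one leading "[" and one trailing "]".

```java
class BracketSplit {
    // Hypothetical helper, not from the original answers.
    // Step 1: remove a leading "[" and a trailing "]" if present.
    // Step 2: split on a comma with optional whitespace around it.
    static String[] splitBracketed(String input) {
        String inner = input.trim().replaceAll("^\\[|\\]$", "");
        return inner.split("\\s*,\\s*");
    }
}
```

Calling BracketSplit.splitBracketed("[AO_12345678, Real Estate]") should give the two parts without brackets or the space after the comma.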

I'd go the pragmatic way:
String org = "[AO_12345678, Real Estate]";
String plain = org;
// strip the leading "[" and the trailing "]" only if they are present
if (plain.startsWith("[")) {
    plain = plain.substring(1);
}
if (plain.endsWith("]")) {
    plain = plain.substring(0, plain.length() - 1);
}
// the second element keeps the space after the comma; trim() it if needed
String[] result = plain.split(",");
If the string is always surrounded with '[]' you can just substring it without checking.

One easy way, assuming the format of all your inputs is consistent, is to ignore regex altogether and just split it. Something like the following would work:
String[] parts = input.split(","); // parts is ["[AO_12345678", "Real Estate]"]
String firstWithoutBrace = parts[0].substring(1);
String secondWithoutBrace = parts[1].substring(0, parts[1].length() - 1);
String first = firstWithoutBrace.trim();
String second = secondWithoutBrace.trim();
Of course you can tailor this as you wish - you might want to check whether the braces are present before removing them, for example. Or you might want to keep any spaces before the comma as part of the first string. This should give you a basis to modify to your specific requirements however.
And in a simple case like this I'd much prefer code like the above to a regex that extracted the two strings - I consider the former much clearer!

You can also use StringTokenizer. Here is the code:
String str = "[AO_12345678, Real Estate]";
StringTokenizer st = new StringTokenizer(str, "[],", false);
String s1 = st.nextToken();
String s2 = st.nextToken().trim(); // trim() drops the space after the comma
This gives:
s1=AO_12345678
s2=Real Estate
Refer to the Javadoc to read more about StringTokenizer:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html

Another option using regular expressions (RE) capturing groups:
private static void extract(String text) {
Pattern pattern = Pattern.compile("\\[(.*),\\s*(.*)\\]");
Matcher matcher = pattern.matcher(text);
if (matcher.find()) { // or .matches for matching the whole text
String id = matcher.group(1);
String name = matcher.group(2);
// do something with id and name
System.out.printf("ID: %s%nName: %s%n", id, name);
}
}
If speed or memory is a concern, the RE can be optimized by using possessive quantifiers instead of greedy ones:
"\\[([^,]*+),\\s*+([^\\]]*+)\\]"

Related

How to get an array of strings like ["#{xxxx}","#{yyyy}"] from a string like "abc#{xxxx}def#{yyyy}ghi" using Java?

I'm not good at English, so it takes me some effort to express my question.
I want to extract the UEL expressions, and I think there may be an existing method that handles this.
Scanner sc = new Scanner(System.in);
String input = sc.nextLine();
// drop the text before the first "#" and after the last "}"
String formattedInput = input.substring(input.indexOf("#"), input.lastIndexOf("}") + 1);
// regex for letters enclosed between "}" and "#"
String regex = "(?<=[}])[A-Za-z]*(?=[#])";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(formattedInput);
// remove the characters between "}" and "#{"
if (m.find()) {
    formattedInput = formattedInput.replaceAll(regex, "");
}
System.out.println(formattedInput);
Input: abc#{xxxx}def#{yyyy}
Output: #{xxxx}#{yyyy}
I am not really sure what you are asking because the question is not clearly worded, but this code removes the text that is not enclosed in a #{} tag (the regex only handles letters between two tags). You can then split the resulting string into an array. I hope this helps.
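If the goal is to get the expressions directly as a list, a Matcher.find() loop may be simpler than removing everything else first. This is only a sketch: the class name UelExtractor and the method extract are made up, and it assumes an expression never contains a "}" inside it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UelExtractor {
    // Matches "#{", then any run of characters other than "}", then "}".
    private static final Pattern EXPR = Pattern.compile("#\\{[^}]*\\}");

    // Collects every "#{...}" occurrence in order of appearance.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = EXPR.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Use found.toArray(new String[0]) if an array is really needed.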

Split a string based on pattern and merge it back

I need to split a string based on a pattern and then merge part of it back together.
For example, below are the actual and expected strings.
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the parts in an array and iterate over it to rebuild the string. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Though I can access the parts by index, is there an easier way to do this? Please advise.
You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]). Which means there is no reason to have more than one . in it. You also don't need to escape .s in a char class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.
I would use a replaceAll in this case:
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This replaces everything from the first . through the last . with a single .
You could also use split:
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word
public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);
I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" every time, this code will be very efficient, as pattern compilation is the most expensive step. The java.lang.String.split() method that other answers suggest calls the same Pattern.compile() internally for most patterns (OpenJDK has a fast path only for single-character delimiters, including escaped ones such as "\\."), so in general it repeats the expensive compilation on each invocation. See "java.util.regex - importance of Pattern.compile()?". So it is usually better to compile the Pattern once, cache it, and reuse it.
matcher.group(1) refers to the first () group, which is "(\w*\.)".
matcher.group(2) refers to the second one, which is "(\w*)".
We don't use it here, but note that group(0) is the match for the whole regex.
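A sketch of what caching the compiled pattern could look like. The class name FirstAndLast and the method shorten are made up for illustration, and returning the input unchanged when it has fewer than two dots is an assumption, not something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAndLast {
    // Compiled once when the class is loaded and reused by every call.
    private static final Pattern FIRST_LAST =
            Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    // Returns "first.last"; the input is returned unchanged if it does not match
    // (for example when it contains fewer than two dots).
    static String shorten(String dotted) {
        Matcher m = FIRST_LAST.matcher(dotted);
        return m.matches() ? m.group(1) + m.group(2) : dotted;
    }
}
```

A Pattern is immutable and safe to share between threads; only the Matcher has to be created per call.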

Change tags in symbol Pattern/Matcher

This code works fine:
final String result = myString.replaceAll("<tag1>", "{").replaceAll("<tag2>", "}");
but I have to parse big files, so I'm wondering whether I can call Pattern.compile("REGEX") before the while loop:
Pattern p = Pattern.compile("REGEX");
while(scan.hasNextLine()){
    final String myWorkLine = scan.nextLine();
    p.matcher(myWorkLine).replaceAll("$1"); // or other value
    ..;
}
I expect a faster result because the regex is compiled once and only once.
EDIT
I want to put (if it is possible) the replaceAll(..).replaceAll(..) model in a Pattern, and have tag1==>{, and tag2==>}.
Question: is the outside-loop Pattern model faster than the inside-loop replaceAll.replaceAll model?
To answer your original question: yes, you could do that, and it would indeed be faster than your original code if you apply the same regular expression(s) multiple times in a loop. Your loop should be rewritten like this:
Pattern p1 = Pattern.compile("REGEX1");
Pattern p2 = Pattern.compile("REGEX2");
while (scan.hasNextLine()) {
    String myWorkLine = scan.nextLine();
    myWorkLine = p1.matcher(myWorkLine).replaceAll("replacement1");
    myWorkLine = p2.matcher(myWorkLine).replaceAll("replacement2");
    ...;
}
But if you're not using regular expressions, as your first example suggests ("<tag1>"), then don't use String.replaceAll(String regex, String replacement), as it is slower because of the regular expression. Instead use String.replace(CharSequence target, CharSequence replacement), which treats the target as a literal string rather than a regular expression and is faster.
Example:
"ABAP is fun! ABAP ABAP ABAP".replace("ABAP", "Java");
See: Java Docs for String.replace
It's not nice changing your question that radically, but ok, here again an answer for your regular expression:
String s1
= "You can <bold>have nice weather</bold>, but <bold>not</bold> always!";
//EDIT: the regex was 'overengineered', and .?? should have been .*?
//String s2 = s1.replaceAll("(.*?)<bold>(.*?)</bold>(.??)", "$1{$2}$3");
String s2 = s1.replaceAll("<bold>(.*?)</bold>", "{$1}");
System.out.println(s2);
Output: You can {have nice weather}, but {not} always!
Here is the loop with this new regex, and yes, this would be faster than the original loop:
//EDIT: the regex was 'overengineered'
Pattern p = Pattern.compile("<bold>(.*?)</bold>");
while (scan.hasNextLine()) {
    String myWorkLine = scan.nextLine();
    myWorkLine = p.matcher(myWorkLine).replaceAll("{$1}");
    ...;
}
EDIT:
Here is the description of the Java RegEx syntax constructs.
replaceAll uses regex Patterns. From the java.lang.String source code:
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
Edit1: Please stop changing what you're asking. Pick a question and stick with it.
Edit2:
If you're really sure you want to do it this way, compiling a regex outside of the loop, in the simplest case you'd need two different patterns:
Pattern tag1Pattern = Pattern.compile("<tag1>");
Pattern tag2Pattern = Pattern.compile("<tag2>");
while( scan.hasNextLine() ) {
    String line = scan.nextLine();
    String modifiedLine = tag1Pattern.matcher(line).replaceAll("{");
    // apply the second pattern to the result of the first, not to the raw line
    modifiedLine = tag2Pattern.matcher(modifiedLine).replaceAll("}");
    ...
}
You're still applying a pattern matcher twice per line, so if there are any performance hits, that's why.
Without knowing what your data looks like, it's hard to give you a more precise answer or better regex. Unless you've edited your question (again) while I was writing this.
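If the two passes per line turn out to matter, one option is a single pattern with an alternation, choosing the replacement per match. This is a sketch only: the class TagReplacer and the method replaceTags are made-up names, and the literal tags <tag1> and <tag2> are taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagReplacer {
    // One compiled pattern that matches either tag; reused for every line.
    private static final Pattern TAGS = Pattern.compile("<tag1>|<tag2>");

    // Scans the line once, replacing <tag1> with "{" and <tag2> with "}".
    static String replaceTags(String line) {
        Matcher m = TAGS.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group().equals("<tag1>") ? "{" : "}";
            // quoteReplacement keeps '$' and '\' literal if the replacements ever change
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Whether this is actually faster than two String.replace calls depends on the data, so it is worth measuring on a real file.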

Optionally using String.split(), split a string at the last occurrence of a delimiter

I have a string that matches this regular expression: ^.+:[0-9]+(\.[0-9]+)*/[0-9]+$ which can easily be visualized as (Text):(Double)/(Int). I need to split this string into the three parts. Normally this would be easy, except that the (Text) may contain colons, so I cannot split on any colon - but rather the last colon.
The .+ is greedy, so it already does a pretty neat job of this, but it won't work as a regular expression passed to String.split() because it will eat my (Text) as part of the delimiter. Ideally I'd like something that returns a String[] with three strings. I'm 100% fine with not using String.split() for this.
I don't like regex (just kidding I do but I'm not very good at it).
String s = "asdf:1.0/1";
// the number parts contain no ':' or '/', so the last occurrences are the separators
String text = s.substring(0, s.lastIndexOf(":"));
String doub = s.substring(s.lastIndexOf(":") + 1, s.lastIndexOf("/"));
String inte = s.substring(s.lastIndexOf("/") + 1);
Why don't you just use a straight up regular expression?
Pattern p = Pattern.compile("^(.*):([\\d\\.]+)/(\\d+)$");
Matcher m = p.matcher( someString );
if (m.find()) {
m.group(1); // returns the text before the colon
m.group(2); // returns the double between the colon and the slash
m.group(3); // returns the integer after the slash
}
Or similar. The pattern ^(.*):([\d\.]+)/(\d+)$ assumes that you actually have values in all three positions, and will allow just a period/fullstop in the double position, so you may want to tweak it to your specifications.
String.split() is typically used in simpler scenarios where the delimiter and formatting are more consistent and when you don't know how many elements you are going to be splitting.
Your use case calls for a plain old regular expression. You know the formatting of the string, and you know you want to collect three values. Try something like the following.
Pattern p = Pattern.compile("(.+):([0-9\\.]+)/([0-9]+)$");
Matcher m = p.matcher(myString);
if (m.find()) {
String myText = m.group(1);
String myFloat = m.group(2);
String myInteger = m.group(3);
}
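Since the question asks for a String[] with three strings, here is a sketch that wraps the same idea in a method. It reuses the asker's own validation regex with capturing groups added; the class name TextDoubleInt, the method name split, and returning null on a non-matching input are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextDoubleInt {
    // The regex from the question with three capturing groups added.
    private static final Pattern PARTS =
            Pattern.compile("^(.+):([0-9]+(?:\\.[0-9]+)*)/([0-9]+)$");

    // Returns {text, double, int}, or null when the input does not match.
    static String[] split(String input) {
        Matcher m = PARTS.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Because the first group is greedy and the number part cannot contain a colon, colons inside the text end up in the first element.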

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)
I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say:
"Searches a String for substrings delimited by a start and end tag, returning all matching substrings in an array."
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
:escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
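A shorter variant of the same idea, sketched with the standard Pattern.quote instead of a hand-written escape routine. The class name DelimitedTokens is made up, and the reluctant (.*?) group assumes tokens do not span line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedTokens {
    // Returns the text between each start marker and the first end marker after it.
    static List<String> tokenize(String content, String open, String close) {
        // Pattern.quote treats both delimiters as literals, so "[" and "]" need no manual escaping.
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(content);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }
}
```

For the string in the question, the second token keeps its extra start marker because the search for the next match resumes after the previous end marker.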
StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
The normal StringTokenizer won't work for this requirement; you would have to tweak it or write your own.
There's one way you can do this. It isn't particularly pretty. It involves going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This is best done with a resizable data structure such as a List rather than an array, since arrays have a fixed length.
Another solution, which may be possible, is to use a regex with the String's split method. The only problem I have is coming up with a regex that would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
    int so = 0, eo;
    List<String> lst = new ArrayList<String>();
    // find each start marker, then the first end marker after it
    while (so < txt.length() && (so = txt.indexOf(str, so)) != -1) {
        so += str.length();
        if (so < txt.length() && (eo = txt.indexOf(end, so)) != -1) {
            lst.add(txt.substring(so, eo));
            so = eo + end.length();
        }
    }
    return lst;
}
The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]
