I need to split a string on a pattern and then merge part of it back together.
For example, below are the actual and expected strings.
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the parts in an array and iterate over it to rebuild the string. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Although I can access the parts by index, is there an easier way to do this? Please advise.
You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]), which means there is no reason to have more than one . in it. You also don't need to escape a . inside a character class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.
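As a rough sketch of that last step (the class and method names are made up for this example, not taken from the question), splitting on the simplified character class and joining the first and last parts could look something like this:

```java
final class FirstAndLast {
    // Hypothetical helper: keeps only the first and last parts of the input.
    static String firstAndLast(String input) {
        // "[.\\s]+" is the simplified equivalent of the regex from the question.
        String[] parts = input.split("[.\\s]+");
        if (parts.length < 2) {
            return input; // fewer than two parts, nothing to merge
        }
        return parts[0] + "." + parts[parts.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(firstAndLast("abc.def.ghi.jkl.mno")); // abc.mno
    }
}
```

Note that the parts are always joined with a dot here, even if the original separator was whitespace.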
I would use a replaceAll in this case
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This will replace everything between the first and last . (including both dots) with a single .
You could also use split
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word
public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);
I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" over and over, this code will be very efficient, as pattern compilation is the most expensive step. The java.lang.String.split() method that other answers suggest uses Pattern.compile() internally unless the delimiter qualifies for its fast path (in OpenJDK, a single character that is not a regex metacharacter, or a backslash-escaped non-alphanumeric character such as "\\."). For any other pattern it repeats the expensive compilation on each invocation of the method. See java.util.regex - importance of Pattern.compile()?. So it is much better to compile the Pattern once, cache it, and reuse it.
matcher.group(1) refers to the first group of () which is "(\w*\.)"
matcher.group(2) refers to the second one which is "(\w*)"
Although we don't use it here, note that group(0) is the match for the whole regex.
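A minimal sketch of the caching idea, assuming a static final field is an acceptable place to keep the compiled pattern (the class, field, and method names are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CachedPatternExample {
    // Compiled once when the class is initialized and reused for every call.
    private static final Pattern FIRST_AND_LAST =
            Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    // Hypothetical helper: returns the input unchanged when it does not match.
    static String firstAndLast(String input) {
        Matcher matcher = FIRST_AND_LAST.matcher(input);
        return matcher.matches() ? matcher.group(1) + matcher.group(2) : input;
    }

    public static void main(String[] args) {
        System.out.println(firstAndLast("abc.def.ghi.jkl.mno")); // abc.mno
    }
}
```

Pattern instances are immutable and safe to share between threads; only the Matcher has to be created per call.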
I have a string consisting of 18 characters, e.g. 'abcdefghijklmnopqr'. I need to add a blank space after the 5th character, then after the 9th character, and after the 15th character, making it look like 'abcde fghi jklmno pqr'. Can I achieve this using a regular expression?
Regular expressions are not my cup of tea, so I need help from the regex gurus out here. Any help is appreciated.
Thanks in advance
A regex only finds matches in a string; it cannot perform a replacement on its own. You could use a regex to find a certain matching substring and replace that, but you would still need a separate method for the replacement (making it a two-step algorithm).
Since you're not looking for a pattern in your string, but rather just the n-th char, a regex wouldn't be of much use; it would make the solution unnecessarily complex.
Here are some ideas on how you could implement a solution:
Use an array of characters to avoid creating redundant strings: create a character array and copy the characters from the string before the given position, put the new character at the position, copy the rest of the characters from the String, and so on until you reach the end of the string. After that, construct the final string from that array.
Use the substring() method: concatenate the substring before the position, the new character, the substring after the position and before the next position, and so on, until reaching the end of the original string.
Use a StringBuilder and its insert() method.
Note that:
The first idea might not be a suitable solution for very large strings. It needs an auxiliary array, which takes additional space.
The second idea creates redundant strings. Strings are immutable in Java, so every substring and concatenation creates a new object. Creating temporary strings should be avoided.
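The StringBuilder idea could be sketched roughly as follows. The class and method names are hypothetical, and the positions are assumed to be given in ascending order as indices into the original string:

```java
final class InsertSpaces {
    // Hypothetical helper: inserts a space before each given index of the input.
    // Assumes positions are sorted in ascending order.
    static String insertSpaces(String input, int... positions) {
        StringBuilder sb = new StringBuilder(input);
        // Go from the last position to the first so that earlier insertions
        // do not shift the indices that are still to be processed.
        for (int i = positions.length - 1; i >= 0; i--) {
            sb.insert(positions[i], ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: abcde fghi jklmno pqr
        System.out.println(insertSpaces("abcdefghijklmnopqr", 5, 9, 15));
    }
}
```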
Yes, you can use regex groups to achieve that. Something like this:
final Pattern pattern = Pattern.compile("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})");
final Matcher matcher = pattern.matcher("abcdefghijklmnopqr");
if (matcher.matches()) {
    // group(0) is the whole match, so the capturing groups start at index 1
    String first = matcher.group(1);
    String second = matcher.group(2);
    String third = matcher.group(3);
    String fourth = matcher.group(4);
    return first + " " + second + " " + third + " " + fourth;
} else {
    throw new SomeException();
}
Note that pattern should be a constant, I used a local variable here to make it easier to read.
Compared to substrings, which would also work to achieve the desired result, a regex also allows you to validate the format of your input data. In the provided example you check that it's an 18-character string composed of only lowercase letters.
If you had a more interesting example, with for instance a mix of letters and digits, you could check with the regex that each group contains the correct type of data.
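To illustrate that point with a made-up format (three uppercase letters, four digits, two uppercase letters; this format and all names below are assumptions for the example, not something from the question), a sketch might look like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MixedFormat {
    // Hypothetical format: 3 uppercase letters, 4 digits, 2 uppercase letters.
    private static final Pattern CODE =
            Pattern.compile("([A-Z]{3})(\\d{4})([A-Z]{2})");

    // Validates the input and inserts a space between the three groups.
    static String format(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + input);
        }
        return m.group(1) + " " + m.group(2) + " " + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(format("ABC1234XY")); // ABC 1234 XY
    }
}
```

An input such as "ABCD234XY" is rejected because the fourth character is not a digit.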
You can also do a simpler version where you just replace with:
"abcdefghijklmnopqr".replaceAll("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})", "$1 $2 $3 $4")
But you lose the benefit of checking, because if the string doesn't match the format it is simply left unchanged, and this is less efficient than substrings.
Here is an example solution that copies the characters into a StringBuilder instead, which would be more efficient if you don't care about checking:
final Set<Integer> breaks = Set.of(5, 9, 15);
final String str = "abcdefghijklmnopqr";
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
if (breaks.contains(i)) {
stringBuilder.append(' ');
}
stringBuilder.append(str.charAt(i));
}
return stringBuilder.toString();
It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing when dealing with multiple dots in the string and getting the value between two particular dots.
I have got the package name as:
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I also tried using substring and indexOf, but couldn't get the intended answer. Because package names in Android vary in the number of dots and characters, I cannot use a fixed index. Please suggest any idea.
As you probably know (based on the .* part in your regex), the dot . is a special character in regular expressions representing any character (except line separators). So to make a dot represent only a literal dot you need to escape it. To do so you can place \ before it, or place it inside a character class [.].
Also, to get only the part matched by the parentheses (.*) you need to select it with the proper group index, which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// the ? after .* is what makes the quantifier reluctant
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the fully qualified name of the class, which includes the package name. (Runner.class.toString() would prefix the name with "class ".)
To get the part after 'com' you can use Runner.class.getName().split("\\.") and then iterate over the returned array with a boolean flag. The dot has to be escaped because split takes a regex.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for (String part : parts) {
    if (part.equals("com")) {
        break;
    }
    ++i; // count the parts that come before "com"
}
// assumes "com" was found and is not the last part
String result = parts[i+1];
private String getStringAfterComDot(String packageName) {
    String strArr[] = packageName.split("\\.");
    // stop one element early so strArr[i+1] is always in bounds
    for(int i=0; i<strArr.length-1; i++){
        if(strArr[i].equals("com"))
            return strArr[i+1];
    }
    return "";
}
I have done heaps of projects involving website scraping, and I just had to create my own functions/utils to get the job done. Regex can be overkill if you just want to extract a substring from a given string like the one you have. Below is the function I normally use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()); // start searching right after sBefore
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");
I want to remove any substring(s) in a string that begins with 'galery' and ends with 'jssdk));'
For instance, consider the following string:
Galery something something.... jssdk));
I need an algorithm that removes 'something something....' and returns 'Galery jssdk));'
This is what I've done, but it does not work.
newsValues[1].replaceAll("Galery.*?jssdK));", "");
This could probably be improved; I wrote it quickly:
public static String replaceMatching(String input, String lowerBound, String upperBound){
    // Pattern.quote treats the bounds as literal text, so characters like ')' need no escaping
    Pattern p = Pattern.compile(".*?"+Pattern.quote(lowerBound)+"(.*?)"+Pattern.quote(upperBound)+".*?");
    Matcher m = p.matcher(input);
    String textToRemove = "";
    while(m.find()){
        textToRemove = m.group(1);
    }
    return input.replace(textToRemove, "");
}
UPDATE: Thanks for accepting the answer, but here is a shorter, revised version:
public static String replaceMatching2(String input, String lowerBound, String upperBound){
    // Pattern.quote treats the bounds as literal text
    String result = input.replaceAll("(.*?" + Pattern.quote(lowerBound) + ")" + "(.*?)" + "(" + Pattern.quote(upperBound) + ".*)", "$1$3");
    return result;
}
The idea is pretty simple actually: split the String into 3 groups and replace those 3 groups with the first and third, dropping the second one.
You are almost there, but that will remove the entire string. If you want to remove anything between Galery and jssdK));, you will have to do something like so:
String newStr = newsValues[1].replaceAll("(Galery)(.*?)(jssdK\\)\\);)","$1$3");
This will put the strings into groups and will then use these groups to replace the entire string. Note that in regex syntax, the ) is a special character so it needs to be escaped.
String str = "GaleryABCDEFGjssdK));";
String newStr = str.replaceAll("(Galery)(.*?)(jssdK\\)\\);)","$1$3");
System.out.println(newStr);
This yields: GaleryjssdK));
I know that the solution presented by @amit is simpler; however, I thought it would be a good idea to show you a useful way in which you can use the replaceAll method.
The simplest solution is to replace the matched text with just the "edges", effectively "removing" [1] everything between them.
newsValues[1] = newsValues[1].replaceAll("Galery.*?jssdK\\)\\);", "GaleryjssdK));");
[1]: I put "removing" in quotes because it is not exactly a removal. Remember that strings are immutable, so replaceAll creates a new object without the "removed" part, and the result has to be assigned. The parentheses are escaped because ) is a special character in a regex.
newsValues[1] = newsValues[1].substring(0, 6) + newsValues[1].substring(newsValues[1].length() - 8);
This simply concatenates the leading "Galery" (6 characters) and the trailing "jssdK));" (8 characters), ignoring everything in between. It assumes the string starts and ends with exactly those markers. More importantly, if that is always the case, you can simply assign newsValues[1] = "GaleryjssdK));".
For the string value "ABCD_12" (including quotes), I would like to extract only the content and exclude out the double quotes i.e. ABCD_12 . My code is:
private static void checkRegex()
{
final Pattern stringPattern = Pattern.compile("\"([a-zA-Z_0-9])+\"");
Matcher findMatches = stringPattern.matcher("\"ABC_12\"");
if (findMatches.matches())
System.out.println("Match found" + findMatches.group(0));
}
Now I have tried doing findMatches.group(1);, but that only returns the last character in the string (I did not understand why !).
How can I extract only the content leaving out the double quotes?
Try this regex:
Pattern.compile("\"([a-zA-Z_0-9]+)\"");
OR
Pattern.compile("\"([^\"]+)\"");
The problem in your code is the + misplaced outside the closing parenthesis. The capturing group then matches only one character per repetition, and since a repeated group keeps only its last iteration, you end up with just the last character.
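A small sketch comparing the two placements of the + may make the difference easier to see. The class and helper names are invented for the example; the input is the one from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // Hypothetical helper: returns group(1) if the whole input matches, else null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "\"ABC_12\"";
        // + outside the group: the group is repeated and keeps only its last iteration.
        System.out.println(firstGroup("\"([a-zA-Z_0-9])+\"", input)); // 2
        // + inside the group: the group captures the whole run of characters.
        System.out.println(firstGroup("\"([a-zA-Z_0-9]+)\"", input)); // ABC_12
    }
}
```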
A nice simple (read: non-regex) way to do this is:
String myString = "\"ABC_12\"";
String myFilteredString = myString.replace("\"", ""); // replace() takes literal text, replaceAll() would treat it as a regex
System.out.println(myFilteredString);
gets you
ABC_12
You should change your pattern to this:
final Pattern stringPattern = Pattern.compile("\"([a-zA-Z_0-9]+)\"");
Note that the + sign was moved inside the group, since you want the character repetition to be part of the group. In the code you posted, what you were actually searching for was a repetition of the group, which consisted of a single occurrence of a single character in [a-zA-Z_0-9].
If your pattern is strictly any text in between double quotes, then you may be better off using substring:
String str = "\"ABC_12\"";
System.out.println(str.substring(1, str.lastIndexOf('\"')));
Assuming it is a bit more complex (double quotes in between a larger string), you can use the split() function in the Pattern class and use \" as your regex - this will split the string around the \" so you can easily extract the content you want
Pattern p = Pattern.compile("\"");
// Split input with the pattern
String[] result = p.split(str);
for (int i = 0; i < result.length; i++) {
    System.out.println(result[i]);
}
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html#split%28java.lang.CharSequence%29
Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)
I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
:escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
A normal StringTokenizer won't work for this requirement; you would have to tweak it or write your own.
There's one way you can do this. It isn't particularly pretty. What it involves is going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This would be best done using a resizable data structure such as a List rather than an array, since arrays have a fixed length.
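A minimal sketch of that character-by-character scan, assuming single-character delimiters (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

final class BracketScanner {
    // Hypothetical helper: collects the text between each opening delimiter
    // and the next closing delimiter. An unterminated token is dropped.
    static List<String> scan(String input, char open, char close) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = null; // null while we are outside a token
        for (char c : input.toCharArray()) {
            if (current == null) {
                if (c == open) {
                    current = new StringBuilder(); // start a new token
                }
            } else if (c == close) {
                tokens.add(current.toString());    // token is complete
                current = null;
            } else {
                current.append(c); // includes any further 'open' characters
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("hello[world]this[[is]me", '[', ']')); // [world, [is]
    }
}
```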
Another solution, which may be possible, is to use a regex with the String split method. The only problem I have is coming up with a regex that would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
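In Java, that expression could be applied roughly like this (a sketch only; the class and method names are assumptions for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketRegex {
    // The regex from above, with backslashes doubled for a Java string literal.
    private static final Pattern TOKEN = Pattern.compile("(.*?\\[(.*?)\\])");

    // Hypothetical helper: returns the second capture of every match.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(2)); // text between the brackets
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("hello[world]this[[is]me")); // [world, [is]
    }
}
```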
StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
    int so = 0, eo;
    List<String> lst = new ArrayList<String>();
    // find each start marker, then the next end marker after it
    while (so < txt.length() && (so = txt.indexOf(str, so)) != -1) {
        so += str.length();
        if (so < txt.length() && (eo = txt.indexOf(end, so)) != -1) {
            lst.add(txt.substring(so, eo));
            so = eo + end.length();
        }
    }
    return lst;
}
The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]