Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)

I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)

Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
// escape lastToken inside the character class too, so "]" or "\" cannot break it
:escapeRegexp(firstToken) + "([^"+escapeRegexp(lastToken)+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");

StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.

A normal StringTokenizer won't work for this requirement; you would have to tweak it or write your own.

There's one way you can do this, though it isn't particularly pretty: go through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This is best done with a resizable data structure such as a List rather than an array, since arrays have a fixed length.
Another solution which may be possible is to use a regex with the String's split method. The only problem I have is coming up with a regex that splits the way you want. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
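The character-by-character scan described above could be sketched roughly like this. The class and method names are made up for illustration, single-character delimiters are assumed, and an unterminated token at the end of the input is simply dropped.

```java
import java.util.ArrayList;
import java.util.List;

class BracketScanner {
    // Collects the text between each 'open' character and the next 'close' character.
    static List<String> scan(String text, char open, char close) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = null; // null means "not inside a token"
        for (char c : text.toCharArray()) {
            if (current == null) {
                if (c == open) {
                    current = new StringBuilder(); // start a new token
                }
            } else if (c == close) {
                tokens.add(current.toString());    // token finished
                current = null;
            } else {
                current.append(c);                 // also keeps nested 'open' characters
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [world, [is]
        System.out.println(scan("hello[world]this[[is]me", '[', ']'));
    }
}
```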

Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
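As a rough sketch of how that expression might be used from Java (the class and method names here are hypothetical), the second capture group is collected for every match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketRegex {
    // Same expression as above, written as a Java string literal.
    private static final Pattern BRACKETED = Pattern.compile("(.*?\\[(.*?)\\])");

    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = BRACKETED.matcher(text);
        while (m.find()) {
            tokens.add(m.group(2)); // group 2 is the text between the brackets
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [world, [is]
        System.out.println(extract("hello[world]this[[is]me"));
    }
}
```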

StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
int so=0,eo;
List<String> lst=new ArrayList<String>();
while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
so+=str.length(); // skip past the start delimiter
if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
lst.add(txt.substring(so,eo)); // text between the delimiters
so=eo+end.length();
}
}
return lst;
}

The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]

Related

Split a string based on pattern and merge it back

I need to split a string based on a pattern and then merge part of it back together.
For example, below are the actual and expected strings:
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the parts in an array and iterate over it to build the result. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Though I can access the parts by index, is there an easier way to do this? Please advise.

You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]). Which means there is no reason to have more than one . in it. You also don't need to escape .s in a char class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.

I would use a replaceAll in this case
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This will replace everything from the first . to the last . (inclusive) with a single .
You could also use split
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word

public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);

I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" everywhere, this code will be very efficient, since pattern compilation is the most expensive step. The java.lang.String.split() method that other answers suggest uses the same Pattern.compile() internally if the pattern length is greater than 1, meaning that it repeats this expensive compilation on each invocation of the method. See java.util.regex - importance of Pattern.compile()?. So it is much better to compile the Pattern once, cache it, and reuse it.
matcher.group(1) refers to the first group of (), which is "(\w*\.)"
matcher.group(2) refers to the second one, which is "(\w*)"
Even though we don't use it here, note that group(0) is the match for the whole regex.
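A minimal sketch of the caching idea, assuming the pattern is kept in a static final field (the class and method names are invented for the example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAndLast {
    // Compiled once when the class is loaded and reused by every call.
    private static final Pattern FIRST_AND_LAST =
            Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    // Returns "first.last" for dotted input, or the input unchanged if it does not match.
    static String shorten(String dotted) {
        Matcher m = FIRST_AND_LAST.matcher(dotted);
        return m.matches() ? m.group(1) + m.group(2) : dotted;
    }

    public static void main(String[] args) {
        // prints abc.mno
        System.out.println(shorten("abc.def.ghi.jkl.mno"));
    }
}
```

Pattern instances are immutable and safe to share; only the Matcher is created per call.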

Add escape "\" in front of special character for a string

I have a simple SQL query where I check whether the query matches any of the fields I have. I'm using a LIKE statement for this. One of my fields can contain special characters, and so can the search query. So I'm looking for a solution that adds an escape "\" in front of each special character.
query = "hello+Search}query"
I need the above to change to
query = "hello\+Search\}query"
Is there a simple way of doing this other than searching for each special character separately and adding the "\"? Without the escape character I get the error message
java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 0
Thanks in advance

Decide which special characters you want to escape and just call
query.replace("}", "\\}")
You may keep all special characters you allow in some array then iterate it and replace the occurrences as exemplified.
This method replaces all regex meta characters.
public String escapeMetaCharacters(String inputString){
final String[] metaCharacters = {"\\","^","$","{","}","[","]","(",")",".","*","+","?","|","<",">","-","&","%"};
for (int i = 0 ; i < metaCharacters.length ; i++){
if(inputString.contains(metaCharacters[i])){
inputString = inputString.replace(metaCharacters[i],"\\"+metaCharacters[i]);
}
}
return inputString;
}
You could use it as query=escapeMetaCharacters(query);
Don't think that any library you would find would do anything more than that. At best it defines a complete list of specialCharacters.
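If the real goal is only to stop the regex engine from interpreting the query (which is what the PatternSyntaxException suggests), the JDK's own Pattern.quote may be enough. The sketch below assumes the query ends up in a java.util.regex call; the class and method names are hypothetical. Note that Pattern.quote wraps the text in \Q...\E rather than producing the hello\+Search\}query form shown in the question.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if 'field' contains 'query' as literal text, with no regex interpretation.
    static boolean containsLiteral(String field, String query) {
        Pattern literal = Pattern.compile(Pattern.quote(query));
        return literal.matcher(field).find();
    }

    public static void main(String[] args) {
        String query = "hello+Search}query";
        System.out.println(Pattern.quote(query));                                // \Qhello+Search}query\E
        System.out.println(containsLiteral("say hello+Search}query now", query)); // true
        System.out.println(containsLiteral("helloooSearch}query", query));        // false
    }
}
```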

There is actually a better way of doing this in a sleek manner.
String REGEX = "[\\[+\\]+:{}^~?\\\\/()><=\"!]";
StringUtils.replaceAll(inputString, REGEX, "\\\\$0");

You need to use \\ to introduce a \ into a string literal; that is you need to escape the \. (A single backslash is used to introduce special characters into a string: e.g. \t is a tab.)
query = "hello\\+Search\\}query" is what you need.

I had to do the same thing in JavaScript and came up with the solution below. I think it might help someone.
function escapeSpecialCharacters(s){
let arr = s.split('');
arr = arr.map(function(d){
return d.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\'+d)
});
let reg = new RegExp(arr.join(''));
return reg;
}
let newstring = escapeSpecialCharacters("hello+Search}query");

If you want to use Java 8+ and Streams, you could do something like:
private String escapeSpecialCharacters(String input) {
List<String> specialCharacters = Arrays.asList("\\","^","$","{","}","[","]","(",")",".","*","+","?","|","<",">","-","&","%"); // plain JDK, no Guava needed
return Arrays.stream(input.split("")).map((c) -> {
if (specialCharacters.contains(c)) return "\\" + c;
else return c;
}).collect(Collectors.joining());
}

The simple version ( without deprecated StringUtils.replaceAll ):
String regex = "[\\[+\\]+:{}^~?\\\\/()><=\"!]";
String query = "hello+Search}query";
String replaceAll = query.replaceAll(regex, "\\\\$0");

Java String- How to get a part of package name in android?

It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing to deal with multiple dots in the string and get the value between two particular dots.
I have got the package name as:
au.com.newline.myact
I need to get the value between "com." and the next dot (.), in this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I also tried using substring and indexOf, but couldn't get the intended result. Because package names in Android vary in the number of dots and characters, I cannot use a fixed index. Please suggest any ideas.

As you probably know (based on the .* part in your regex), the dot . is a special character in regular expressions that represents any character (except line separators). So to make a dot match only a literal dot you need to escape it. To do so you can place \ before it, or place it inside a character class: [.].
Also, to get only the part matched by the parentheses (.*) you need to select it with the proper group index, which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only the one element before the next . then you need to change the greedy behaviour of the * quantifier in .* to reluctant by adding ? after it, like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// the ? after .* makes the quantifier reluctant
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
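To see the difference between the three variants, here is a small sketch that runs them against a longer package name. The helper name firstGroup and the class name are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PackagePart {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String name = "au.com.newline.myact.modelact";
        System.out.println(firstGroup("com[.](.*)[.]", name));     // greedy: newline.myact
        System.out.println(firstGroup("com[.](.*?)[.]", name));    // reluctant: newline
        System.out.println(firstGroup("com[.]([^.]*)[.]", name));  // negated class: newline
    }
}
```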
If you don't want to use regex, you can simply use the indexOf method to locate the position of com. and of the next . after it. Then you can substring the part you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);

You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the fully qualified name of the class, which contains the package name. (Runner.class.toString() would prefix the name with "class ".)
To get the part after 'com' you can use Runner.class.getName().split("\\.") (the dot must be escaped because split takes a regex) and then iterate over the returned array with a boolean flag.

All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for(String part : parts) {
if(part.equals("com")) {
break;
}
++i;
}
// assumes "com" is present and is not the last part
String result = parts[i+1];

private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
// stop one short of the end so strArr[i+1] always exists
for(int i=0; i<strArr.length-1; i++){
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}

I have done heaps of projects involving website scraping, and I usually
just create my own function/utility to get the job done. Regex might
be overkill sometimes if you just want to extract a substring from
a given string like the one you have. Below is the function I normally
use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()); // start searching right after sBefore
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");

How to replace all numbers in java string

I have strings like this: String s="ram123", d="ram varma656887";
I want strings like ram and ram varma, so how do I separate the letters from the digits in the combined string?
I am trying to use a regex, but it is not working:
PersonName.setText(cursor.getString(cursor.getColumnIndex(cursor
.getColumnName(1))).replaceAll("[^0-9]+"));

The correct RegEx for selecting all numbers would be just [0-9], you can skip the +, since you use replaceAll.
However, your usage of replaceAll is wrong, it's defined as follows: replaceAll(String regex, String replacement). The correct code in your example would be: replaceAll("[0-9]", "").

You can use the following regex: \d for representing numbers. In the regex that you use, you have a ^ which will check for any characters other than the charset 0-9
String s="ram123";
System.out.println(s);
/* You don't need the + because you are using the replaceAll method */
s = s.replaceAll("\\d", ""); // or you can also use [0-9]
System.out.println(s);

To remove the numbers, the following code will do the trick (assign the result, since Strings are immutable):
stringname = stringname.replaceAll("[0-9]","");

Please do as follows
String name = "ram varma656887";
name = name.replaceAll("[0-9]","");
System.out.println(name);//ram varma
alternatively you can do as
String name = "ram varma656887";
name = name.replaceAll("\\d","");
System.out.println(name);//ram varma
also something like given will work for you
String given = "ram varma656887";
String[] arr = given.split("\\d");
String data = new String();
for(String x : arr){
data = data+x;
}
System.out.println(data);//ram varma

I think you missed the second argument of replaceAll. You need to pass an empty string as the second argument instead of leaving it out.
Try
replaceAll(<your regexp>,"")

you can use Java - String replaceAll() Method.
This method replaces each substring of this string that matches the given regular expression with the given replacement.
Here is the syntax of this method:
public String replaceAll(String regex, String replacement)
Here is the detail of parameters:
regex -- the regular expression to which this string is to be matched.
replacement -- the string which would replace found expression.
Return Value:
This method returns the resulting String.
for your question use this
String s = "ram123", d = "ram varma656887";
System.out.println("s" + s.replaceAll("[0-9]", ""));
System.out.println("d" + d.replaceAll("[0-9]", ""));

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?

What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
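For what it's worth, here is a small sketch that plugs the pattern above into replaceAll. It is hard-wired to the word abc, exactly like the pattern itself, and the class name is only illustrative.

```java
class NotAbc {
    // Each alternative matches exactly one character that is not part of an "abc".
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    static String plusOut(String s) {
        return s.toLowerCase().replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        // prints ++++abc+++
        System.out.println(plusOut("abXYabcXYZ"));
    }
}
```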

[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
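The first of those two procedures (the indexOf loop) might look roughly like this in Java. This is only a sketch: the names are invented, the search is case-sensitive, and an empty word is rejected because indexOf would otherwise never advance.

```java
class PlusOutLoop {
    // Replaces every character that is not part of an occurrence of 'word' with '+'.
    static String plusOut(String input, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        StringBuilder out = new StringBuilder();
        int pos = 0;
        int hit;
        while ((hit = input.indexOf(word, pos)) != -1) {
            for (int i = pos; i < hit; i++) {
                out.append('+');        // characters before the occurrence
            }
            out.append(word);           // keep the occurrence itself
            pos = hit + word.length();  // skip to just after it
        }
        for (int i = pos; i < input.length(); i++) {
            out.append('+');            // whatever is left after the last occurrence
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints ++++abc+++
        System.out.println(plusOut("abxyabcxyz", "abc"));
    }
}
```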

Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.

Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}

Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}

Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
