Regular expression extracting a string from url - java

I am trying to extract my account ID from a URL for further validation.
Here are some sample URLs:
http://localhost:8024/accounts/u8m21ercgelj/
http://localhost:8024/accounts/u8m21ercgelj
http://localhost:8024/accounts/u8m21ercgelj/users?
What I need is to extract u8m21ercgelj from the URL. I tried the code below, but it fails for cases like http://localhost:8024/accounts/u8m21ercgelj, i.e. without a / at the end.
public String extractAccountIdFromURL(String url) {
String accountId = null;
if ( url.contains("accounts")) {
Pattern pattern = Pattern.compile("[accounts]/(.*?)/");
Matcher matcher = pattern.matcher(url);
while (matcher.find()) {
accountId = matcher.group(1);
}
}
return accountId;
}
Can anyone help me?

[accounts] doesn't match the word accounts. Because [...] is a character class, it matches a single character that is one of a, c, o, u, n, t or s (repeating a character inside the class changes nothing). So drop the [ and ] and put a / on each side instead, since you most likely want to accept only /accounts/ and not something like /specialaccounts/.
It looks like you just want to find the next non-/ section after /accounts/. In that case you can just use /accounts/([^/]+)
If you are sure that there will be only one /accounts/ section in the URL, you can (and for more readable code should) change your while to an if or even a conditional operator. There is also no need for contains("/accounts/"), since it only adds an extra traversal of the string that find() performs anyway.
It doesn't look like your method is using any data held by your class (any fields) so it could be static.
Demo:
// reuse the compiled regex; there is no point in compiling it on every call
private static Pattern pattern = Pattern.compile("/accounts/([^/]+)");
public static String extractAccountIdFromURL(String url) {
Matcher matcher = pattern.matcher(url);
return matcher.find() ? matcher.group(1) : null;
}
public static void main(java.lang.String[] args) throws Exception {
String examples =
"http://localhost:8024/accounts/u8m21ercgelj/\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj/users?";
for (String url : examples.split("\\R")){// split on line separator like `\r\n`
System.out.println(extractAccountIdFromURL(url));
}
}
Output:
u8m21ercgelj
u8m21ercgelj
u8m21ercgelj

Your regex is written so that it expects a trailing slash; that's what the slash after the (.*?) means.
You should change this so that it accepts either the trailing slash or the end of the string. (/|$) should work in this case. Since [accounts] is a character class rather than the word accounts, it is best to drop the square brackets as well, meaning your regex would be /accounts/(.*?)(/|$)
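A minimal sketch of the "trailing slash or end of string" idea, assuming the literal prefix /accounts/ is used instead of the [accounts] character class. The class name AccountIdSketch and the method name extract are made up for illustration, and returning null when nothing matches is an assumption carried over from the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountIdSketch {
    // Literal "/accounts/", then a reluctant capture that stops at the
    // next "/" or at the end of the input, whichever comes first.
    private static final Pattern ACCOUNT_ID = Pattern.compile("/accounts/(.*?)(/|$)");

    // Hypothetical helper: returns the account id, or null if the URL has no /accounts/ section.
    static String extract(String url) {
        Matcher matcher = ACCOUNT_ID.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj/"));
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj"));
        System.out.println(extract("http://localhost:8024/accounts/u8m21ercgelj/users?"));
    }
}
```

Compared with /accounts/([^/]+), this variant also accepts an empty id (for a URL that ends in /accounts/), which may or may not be what you want.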

Related

Matching (Basic) Function Declarations

I want to retrieve all function definitions individually from a source code file. Ultimately, I want to just retrieve all function names. Source files are of the following form:
#include bla
first_function_name()
{
}
second_function_name(first_parameter, second_parameter)
{
i = 0;
}
Note that there are no access modifiers and return types, this is NOT for parsing the Java programming language.
I want to implement the solution via regular expression. So far I managed to match function definitions, however I'm having the problem that the regular expression doesn't only match a single function but also the ones coming afterwards. Basically, it doesn't end at the closing brace. I tried using the $ symbol but it's also not ending the regular expression.
The regular expressions I'm currently using look like this:
private static final String FUNCTION_NAME_MATCHER = "[a-zA-Z]\\w*";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "[(].*[)].*[\\{]([^\\}]*)?[\\}]";
How do I stop it from matching the following function(s) as well? It should match twice for the above example functions but instead it only matches once (both function definitions at once).
The method for getting a list of matched function definitions looks like this:
public List<String> getMatches()
{
List<String> matchedResults = new ArrayList<>();
Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
while (matcher.find())
{
String functionDefinition = matcher.group();
String functionName = functionDefinition.split(FUNCTION_NAME_MATCHER)[0];
matchedResults.add(functionName);
}
return matchedResults;
}
Try this
private static final String FUNCTION_NAME_MATCHER = "([a-zA-Z]\\w*)";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "\\([^)]*\\)\\s*\\{[^}]*\\}";
public static List<String> getMatches()
{
List<String> matchedResults = new ArrayList<>();
Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
while (matcher.find())
{
matchedResults.add(matcher.group(1));
}
return matchedResults;
}
* is greedy: it consumes as many matching characters as it can. Right now the [(].*[)] part is consuming everything from the first ( in the first function all the way to the last ) in the second. You want to make it reluctant, so that it only consumes a character if it needs to. Do so by changing all the .* to .*?
Also, you probably want to match only whitespace between the function declaration and body, so you should replace [)].*[\\{] with [)]\\s*[\\{]
If you enclose the FUNCTION_NAME_MATCHER and the arguments with ( and ) it will be captured into a capture group so you can extract it.
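The difference between the greedy and the reluctant version can be checked with a small sketch like the following. The class name FunctionNameSketch, the constants and the names helper are hypothetical; the greedy pattern mirrors the one from the question (with the name captured), and the reluctant one applies the changes described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionNameSketch {
    // Function name, wrapped in a capturing group so it can be read via group(1).
    static final String NAME = "([a-zA-Z]\\w*)";

    // Greedy: .* runs from the first "(" to the last ")" in the whole file.
    static final String GREEDY = "(?s)" + NAME + "[(].*[)].*[\\{]([^\\}]*)?[\\}]";

    // Reluctant arguments, and only whitespace allowed between ")" and "{".
    static final String RELUCTANT = "(?s)" + NAME + "[(](.*?)[)]\\s*[\\{]([^\\}]*)?[\\}]";

    // Collects group(1) of every match of the given regex in the source text.
    static List<String> names(String regex, String source) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(source);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "#include bla\n"
                + "first_function_name()\n{\n}\n"
                + "second_function_name(first_parameter, second_parameter)\n{\ni = 0;\n}\n";
        System.out.println(names(GREEDY, source));    // one match spanning both functions
        System.out.println(names(RELUCTANT, source)); // one match per function
    }
}
```

Like the patterns in this thread, the sketch assumes function bodies contain no nested braces.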
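First, you'd want to match the whole function, to avoid matching function calls & duplicates (the \s* allows the line break between the closing parenthesis and the opening brace):
[^\s]*\(([^}]*)\)\s*\{([^}]*)}
Then, you want to split this up to get the name. Note that split takes a regex, so the ( has to be escaped:
String matchedName = matchedFunction.split("\\(")[0];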
And there you go! It's all done and dusted!

Java String- How to get a part of package name in android?

It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing when dealing with multiple dots in the string and getting the value between two particular dots.
I have got the package name as :
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I also tried using substring and indexOf, but couldn't get the intended result. Because Android package names vary in the number of dots and characters, I cannot use a fixed index. Please suggest any ideas.
As you probably know (based on the .* part in your regex), the dot . is a special character in regular expressions representing any character (except line separators). So to make a dot represent only a literal dot you need to escape it. To do so you can place \ before it (written as \\. in a Java string literal), or place it inside a character class: [.].
Also to get only part from parenthesis (.*) you need to select it with proper group index which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// the ? after * makes the quantifier reluctant
Another approach is to accept only non-dot characters instead of .*. They can be represented by a negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the full name of the class, which includes the package name. (toString() would return "class com.some.package.Runner" instead.)
To get something after 'com' you can use Runner.class.getName().split("\\.") and then iterate over the returned array with a boolean flag. The dot must be escaped because split takes a regex.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for (String part : parts) {
    if (part.equals("com")) {
        break;
    }
    ++i;
}
// i now indexes "com"; this assumes "com" is present and is not the last part
String result = parts[i + 1];
private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
for(int i=0; i<strArr.length; i++){
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects involving website scraping, and I just had to write my own functions/utils to get the job done. Regex can be overkill if you just want to extract a substring from a given string like the one you have. Below is the function I normally use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()+1);
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");

Java and Regex, get a substring which matches

I want to match the following pattern:
[0-9]*-[0-9]*-[BL]
and apply the pattern to this string:
123-456-L-234
which should become
123-456-L.
Here's my code:
HelperRegex{
..
final static Pattern KEY = Pattern.compile("\\d*-\\d*-[BL]");
public static String matchKey(String key) {
return KEY.matcher(key).toMatchResult().group(0);
}
JUnit:
@Test
public final void testMatchKey() {
Assert.assertEquals("453-04430-B", HelperRegex.matchKey("453-04430-B-1"));
}
A "no match found" exception is thrown.
I've checked my regex with "The Regex Coach" and it does not seem broken; it matches all the test strings.
Never mind all that complexity. You only need one line:
String match = input.replaceAll(".*?([0-9]*-[0-9]*-[BL])?.*", "$1");
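This will produce a blank string if the pattern is not found. Because the leading .*? never needs to expand, it only extracts a key located at the very start of the input, as in your example.
If it were me, I would in-line this and not even have a separate method.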
You need to create the group you want to retrieve with () and make sure your regex matches the whole string, since matches() only succeeds on the entire input (note that group 0 is the whole match, so what you want is group 1):
String key = "453-04430-B-1";
Pattern pattern = Pattern.compile("(\\d*-\\d*-[BL]).*");
Matcher m = pattern.matcher(key);
if (m.matches())
System.out.println(m.group(1)); //prints 453-04430-B
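The exception in the question arises because no match was ever attempted: group() is only valid after a successful find(), matches() or lookingAt(). A minimal sketch of the original helper using find(), which looks for the pattern anywhere in the input, might look like this. The class name KeyMatchSketch is made up, and returning null when there is no match is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyMatchSketch {
    // Same pattern as in the question.
    static final Pattern KEY = Pattern.compile("\\d*-\\d*-[BL]");

    // Returns the first substring matching KEY, or null if there is none.
    static String matchKey(String key) {
        Matcher matcher = KEY.matcher(key);
        // find() must succeed before group() may be called.
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchKey("453-04430-B-1"));
        System.out.println(matchKey("123-456-L-234"));
    }
}
```

With find() there is no need for a capturing group or a trailing .*, because group() already returns only the matched part.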

Java Regexp capturing group includes space, why?

I am trying to parse this string,
"斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /to chop the nail and slice the iron (idiom)/resolute and decisive/unhesitating/definitely/without any doubt/";
With this code
private static final Pattern TRADITIONAL = Pattern.compile("(.*?) ");
private String extractSinglePattern(String row, Pattern pattern) {
Matcher matcher = pattern.matcher(row);
if (matcher.find()) {
return matcher.group();
}
return null;
}
However, for some reason the string returned contains a space at the end
org.junit.ComparisonFailure: expected:<斬釘截鐵[]> but was:<斬釘截鐵[ ]>
Is there something wrong with my pattern?
I have also tried
private static final Pattern TRADITIONAL = Pattern.compile("(.*?)\\s");
but to no avail
I have also tried matching with two spaces at the end of the pattern, but it doesn't match (there is only one space).
You're using Matcher.group() which is documented as:
Returns the input subsequence matched by the previous match.
The match includes the space. The capturing group within the match doesn't, but you haven't asked for that.
If you change your return statement to:
return matcher.group(1);
then I believe it'll do what you want.
Use this regular expression: (.+?)(?=\s+). The lookahead (?=\s+) requires whitespace after the match without making it part of the match.
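The two suggestions can be compared side by side in a short sketch. The class name GroupSketch and the helper names are hypothetical, and the input row is a shortened version of the string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Pattern from the question: the trailing space is part of the match.
    static final Pattern WITH_SPACE = Pattern.compile("(.*?) ");
    // Lookahead variant: whitespace is required but not consumed.
    static final Pattern LOOKAHEAD = Pattern.compile("(.+?)(?=\\s+)");

    // group() is the whole match, i.e. group 0.
    static String wholeMatch(Pattern pattern, String row) {
        Matcher matcher = pattern.matcher(row);
        return matcher.find() ? matcher.group() : null;
    }

    // group(1) is only what the first pair of parentheses captured.
    static String firstGroup(Pattern pattern, String row) {
        Matcher matcher = pattern.matcher(row);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /to chop the nail and slice the iron (idiom)/";
        System.out.println("[" + wholeMatch(WITH_SPACE, row) + "]"); // includes the trailing space
        System.out.println("[" + firstGroup(WITH_SPACE, row) + "]"); // no trailing space
        System.out.println("[" + wholeMatch(LOOKAHEAD, row) + "]");  // no trailing space
    }
}
```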

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
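As a rough check of the lookaround pattern above, the following sketch applies it with replaceAll. The class name NotAbcSketch is made up, the pattern is hard-wired to the target word abc, and lowercasing the input first mirrors the question's code.

```java
class NotAbcSketch {
    // Each alternative matches exactly one character that is not part of an "abc" occurrence.
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    // Replaces every character outside an "abc" occurrence with '+'.
    static String plusOut(String text) {
        return text.toLowerCase().replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // ++++abc+++
    }
}
```

The lookbehinds inspect the original input rather than the text already replaced, which is why a b that follows a replaced a is still judged by its original neighbour.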
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right parenthesis"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc":
    find the next occurrence of "abc"
    append to the output a string containing as many "+"s as there are characters before the "abc"
    append "abc" to the output string
    skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
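The first of the two step lists above (the loop over occurrences of "abc") could be sketched in Java roughly as follows. The class name PlusOutIndexOfSketch and the helper appendPluses are hypothetical, and the handling of an empty target word is an assumption added to avoid an endless loop.

```java
class PlusOutIndexOfSketch {
    // Appends count '+' characters to the output.
    private static void appendPluses(StringBuilder out, int count) {
        for (int i = 0; i < count; i++) {
            out.append('+');
        }
    }

    // Keeps every occurrence of word and replaces all other characters with '+'.
    static String plusOut(String input, String word) {
        StringBuilder out = new StringBuilder();
        if (word.isEmpty()) {          // assumption: nothing to keep, so mask everything
            appendPluses(out, input.length());
            return out.toString();
        }
        int pos = 0;                   // start of the part not yet processed
        int hit;
        while ((hit = input.indexOf(word, pos)) >= 0) {
            appendPluses(out, hit - pos);  // characters before the occurrence
            out.append(word);              // the occurrence itself
            pos = hit + word.length();     // skip to just after it
        }
        appendPluses(out, input.length() - pos); // whatever is left
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abxyabcxyz", "abc")); // ++++abc+++
    }
}
```

No regex is involved, so the target word needs no escaping.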
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...
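To illustrate the remark about Pattern.quote(...), here is a sketch of the same Pattern/Matcher approach with the search word quoted before it is placed into the regex. The class name QuotedPlusOutSketch is made up, and the sketch assumes a non-empty search word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPlusOutSketch {
    static String plusOut(String str, String word) {
        // Quote the word so that meta-characters such as '.' are taken literally.
        String quoted = Pattern.quote(word);
        String regex = String.format("((?:(?!%s).)++)|%s", quoted, quoted);
        Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
        StringBuilder builder = new StringBuilder();
        while (m.find()) {
            // Group 1 is null when the second alternative (the word itself) matched.
            builder.append(m.group(1) == null ? word : m.group().replaceAll(".", "+"));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Without quoting, "a.c" would also match "abc".
        System.out.println(plusOut("xxa.cyyabc", "a.c")); // ++a.c+++++
    }
}
```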
