Matching (Basic) Function Declarations - java

I want to retrieve all function definitions individually from a source code file. Ultimately, I want to just retrieve all function names. Source files are of the following form:
#include bla
first_function_name()
{
}
second_function_name(first_parameter, second_parameter)
{
i = 0;
}
Note that there are no access modifiers and return types, this is NOT for parsing the Java programming language.
I want to implement the solution via regular expression. So far I managed to match function definitions, however I'm having the problem that the regular expression doesn't only match a single function but also the ones coming afterwards. Basically, it doesn't end at the closing brace. I tried using the $ symbol but it's also not ending the regular expression.
The regular expressions I'm currently using look like this:
private static final String FUNCTION_NAME_MATCHER = "[a-zA-Z]\\w*";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "[(].*[)].*[\\{]([^\\}]*)?[\\}]";
How do I stop it from matching the following function(s) as well? It should match twice for the above example functions but instead it only matches once (both function definitions at once).
The method for getting a list of matched function definitions looks like this:
public List<String> getMatches()
{
List<String> matchedResults = new ArrayList<>();
Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
while (matcher.find())
{
String functionDefinition = matcher.group();
String functionName = functionDefinition.split(FUNCTION_NAME_MATCHER)[0];
matchedResults.add(functionName);
}
return matchedResults;
}

Try this
private static final String FUNCTION_NAME_MATCHER = "([a-zA-Z]\\w*)";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "\\([^)]*\\)\\s*\\{[^}]*\\}";
public static List<String> getMatches()
{
List<String> matchedResults = new ArrayList<>();
Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
while (matcher.find())
{
matchedResults.add(matcher.group(1));
}
return matchedResults;
}

* is greedy, it will select every possible matching character that it can find. Right now the [(].*[)] part is consuming everything starting at the first ( in the first function all the way to the last ) in the second. You want to make it reluctant, where it will only consume a character if it needs to. Do so by changing all the .* to .*?
Also, you probably want to match only whitespace between the function declaration and body, so you should replace [)].*[\\{] with [)]\\s*[\\{]
If you enclose the FUNCTION_NAME_MATCHER and the arguments with ( and ) it will be captured into a capture group so you can extract it.
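The three changes above (reluctant quantifier, whitespace-only gap before the brace, capture group around the name) could be combined roughly as follows. This is a minimal sketch rather than the asker's final code: the class name FunctionNameDemo and the method functionNames are made up for illustration, and it assumes function bodies contain no nested braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FunctionNameDemo {
    // Group 1 captures the function name. The parameter list uses the
    // reluctant .*? and only whitespace may sit between ")" and "{".
    // Assumption: bodies have no nested braces, so [^}]* is enough.
    private static final Pattern FUNCTION = Pattern.compile(
            "(?s)([a-zA-Z]\\w*)[(].*?[)]\\s*[{][^}]*[}]");

    static List<String> functionNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher matcher = FUNCTION.matcher(source);
        while (matcher.find()) {
            names.add(matcher.group(1)); // name only, no split needed
        }
        return names;
    }

    public static void main(String[] args) {
        String source = "#include bla\n"
                + "first_function_name()\n{\n}\n"
                + "second_function_name(first_parameter, second_parameter)\n{\ni = 0;\n}\n";
        // prints [first_function_name, second_function_name]
        System.out.println(functionNames(source));
    }
}
```

Each call to find() now stops at the first closing brace, so the two sample functions produce two separate matches.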

First, you'd want to match the whole function, to avoid matching function calls & duplicates:
[^\s]*\(([^}]*)\)\s*\{([^}]*)}
Then, you want to split this up to get the name:
String matchedName = matchedFunction.split("\\(")[0];
And there you go! It's all done and dusted!

Related

Extract file extension from String using regex

I have the following String:
"data:audio/mp3;base64,ABC..."
And I'm extracting the file extension (in this case "mp3") out of it.
The String varies accordingly to the file type. Some examples:
"data:image/jpeg;base64,ABC..."
"data:image/png;base64,ABC..."
"data:audio/wav;base64,ABC..."
"data:audio/mp3;base64,ABC..."
Here's how I've done:
public class Test {
private static final String BASE64_HEADER_EXP = "^data:.+;base64,";
private static final Pattern PATTERN_BASE64_HEADER = Pattern.compile(BASE64_HEADER_EXP);
private String data;
private String fileName;
public String getFileName() {
Matcher base64HeaderMatcher = PATTERN_BASE64_HEADER.matcher(data);
return String.format("%s.%s", getFilenameWithoutExtension(), getExtension(base64HeaderMatcher));
}
private String getFilenameWithoutExtension() {
return fileName.split("\\.")[0];
}
private String getExtension(Matcher base64HeaderMatcher) {
if (base64HeaderMatcher.find()) {
String base64Header = base64HeaderMatcher.group(0);
return base64Header.split("/")[1].split(";")[0];
}
return fileName.split("\\.")[1];
}
}
What I want is a way to do it without having to split and access array positions like I'm doing above. Maybe extract the extension using a regex expression.
I'm able to do it on RegExr site using this expression:
(?<=^data:.*/)(.*)(?=;)
But when I try to use the same regex in Java, I get the error "Require that the characters immediately before the position do" because, apparently, Java doesn't support unbounded repetition inside lookbehind.
How about using capturing groups?
private static final String BASE64_HEADER_EXP = "^data:[^/]+/([^;]+);base64,";
This way you can use base64HeaderMatcher.group(1) and get file type.
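A minimal sketch of how that capturing-group pattern might be used, assuming the header always has the data:type/subtype;base64, shape shown in the question. The class name DataUriExtension and the method extension are hypothetical names chosen for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataUriExtension {
    // Group 1 is the part between "/" and ";", e.g. "mp3".
    private static final Pattern HEADER =
            Pattern.compile("^data:[^/]+/([^;]+);base64,");

    // Returns the subtype from the header, or empty if the header is absent.
    static Optional<String> extension(String data) {
        Matcher matcher = HEADER.matcher(data);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extension("data:audio/mp3;base64,ABC..."));  // Optional[mp3]
        System.out.println(extension("data:image/jpeg;base64,ABC...")); // Optional[jpeg]
        System.out.println(extension("no header here"));                // Optional.empty
    }
}
```

Returning an Optional makes the "no header" case explicit, so the caller can fall back to the extension taken from the file name.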
This should do it for the examples you gave:
(?<=data:)(?:[A-Za-z]+)/(.*?);
Explanation:
Positive look-behind
(?<=data:)
Non-capturing group to account for image, audio, etc.
(?:[A-Za-z]+)
Match / literally, capture group for file extension, match ; literally
/(.*?);
"Strings in Java have built-in support for regular expressions. Strings have four built-in methods for regular expressions, i.e., the matches(), split(), replaceFirst() and replaceAll() methods."
-http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
Using this information, we can quickly build a regex and test it against our string.
//In regex each set of () represents a capture field which can later be
//referenced with $1, $2 etc..
//The below regex breaks the string into four fields
String pattern="(^data:)(\\w+?/)(\\w+?)(;.*$)";
//First Field
//This field matches the start of a line (^) followed by "data:"
//Second Field
//This matches any word character (\\w), one or more (+), followed by a "/"
//the "?" symbol after the + means match reluctantly, i.e. match as few
//characters
//as possible. This field will effectively capture a series of letters
//followed by a slash
//Third Field
//This is the field we want to capture and we will reference with $3
//it matches any wordCharacter(\\w), one or more reluctantly
//Fourth Field
//This captures the rest of the string including the ";"
//Now to extract the extension from this test string
String test="data:image/jpeg;base64,ABC...";
String testExtension="";
//Replace the contents of testExtension with the 3rd capture field of
//our regex pattern applied to our test string like so
testExtension = test.replaceAll(pattern, "$3");
//This invokes the String class replaceAll() method
//And now our string testExtension should contain "jpeg"

Regular expression extracting a string from url

I am trying to extract my account id from a URL for other validations.
Here are my sample URLs:
http://localhost:8024/accounts/u8m21ercgelj/
http://localhost:8024/accounts/u8m21ercgelj
http://localhost:8024/accounts/u8m21ercgelj/users?
What I need is to extract u8m21ercgelj from the URL. I tried the code below, but it fails for cases like http://localhost:8024/accounts/u8m21ercgelj
i.e. without a / at the end.
public String extractAccountIdFromURL(String url) {
String accountId = null;
if ( url.contains("accounts")) {
Pattern pattern = Pattern.compile("[accounts]/(.*?)/");
Matcher matcher = pattern.matcher(url);
while (matcher.find()) {
accountId = matcher.group(1);
}
}
return accountId;
}
Can anyone help me?
[accounts] does not match the word accounts; it matches a single character that is one of a, c, o, u, n, t or s (repeating a character changes nothing), because [...] is a character class. So get rid of the [ and ] and replace them with /, giving /accounts/, since you most likely don't want to accept cases like /specialaccounts/ but only /accounts/.
It looks like you just want to find next non-/ section after /accounts/. In that case you can just use /accounts/([^/]+)
If you are sure that there will be only one /accounts/ section in URL you can (and for more readable code should) change your while to if or even conditional operator. Also there is no need for contains("/accounts/") since it just adds additional traversing over entire string which can be done in find().
It doesn't look like your method is using any data held by your class (any fields) so it could be static.
Demo:
//we should reuse the compiled regex, there is no point in compiling it many times
private static Pattern pattern = Pattern.compile("/accounts/([^/]+)");
public static String extractAccountIdFromURL(String url) {
Matcher matcher = pattern.matcher(url);
return matcher.find() ? matcher.group(1) : null;
}
public static void main(java.lang.String[] args) throws Exception {
String examples =
"http://localhost:8024/accounts/u8m21ercgelj/\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj\r\n" +
"http://localhost:8024/accounts/u8m21ercgelj/users?";
for (String url : examples.split("\\R")){// split on line separator like `\r\n`
System.out.println(extractAccountIdFromURL(url));
}
}
Output:
u8m21ercgelj
u8m21ercgelj
u8m21ercgelj
Your regex is written such that it expects a trailing slash; that's what the slash after the (.*?) means.
You should change this so that it accepts either a trailing slash or the end of the string. (/|$) should work in this case, meaning your regex would be accounts/(.*?)(/|$) (with the [ and ] removed, since [accounts] is a character class that matches only a single character).

What is the regular expression (regex) for a certain piece of text

I'm learning regular expressions and I'd like to know what the regular expression to grab the "path=/return/..." for the following text:
StartTopic topic=testParser, multiCopy=false, required=true,
all=false, path=/Return/ReturnData/IRSW2
Any help or pointers to good sites would be much appreciated
Edit:
if(c.getNodeType() ==Node.ELEMENT_NODE) {
Element cell = (Element) c;
String value = cell.getAttribute("value");
// So value here equals StartTopic topic=testParser, multiCopy=false, required=true, all=false, path=/Return/ReturnData/IRSW2
if(value.contains("path")) {
// I want just this part "path=/Return/ReturnData/IRSW2"
// Inside regex.txt I have this text "/path=([^\,]+)/"
String str = FileUtils.readFileToString(new File("regex.txt"));
String escapedChars = StringEscapeUtils.escapeJava(str);
Matcher matcher = Pattern.compile(value+escapedChars).matcher(value);
System.out.println(matcher);
// never enters this if block
if(matcher.matches()) {
System.out.println(matcher);
}
Output:
Test!
java.util.regex.Matcher[pattern=StartTopic topic=testParser, multiCopy=false, required=true, all=false, path=/Return/ReturnData/IRSW2/path=([^\\,]+)/ region=0,101 lastmatch=]
Just to be clear, I want to do something very simple.
I have a String object "value" which contains a bunch of text. I want only one particular piece of it: the part that begins with "path=" and everything after it ("/Return/[...etc]"). I want to use this text to construct an XML file.
This is one of the many ways
public static void main(String[] args) {
String str = "StartTopic topic=testParser, multiCopy=false, required=true, all=false, path=/Return/ReturnData/IRSW2";
System.out.println(str.replaceAll(".*\\s+(path=.*)$","$1"));
}
Output:
path=/Return/ReturnData/IRSW2
Other ways include split based on "," and grab the last element. Or just capturing the group which starts with path...
/path=([^\,]+)/
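Assuming that , is the separator between name-value pairs, this pattern captures the value of the path parameter. The surrounding / characters are regex delimiters from other languages; in a Java pattern string they would be matched literally, so leave them out.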
( ) - defining inside is a group match
[^\,] - every char except ,
+ one or more chars
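In Java this might look like the sketch below. It assumes the value looks like the sample in the question and uses find() rather than matches(), because matches() only succeeds when the whole input matches the pattern. The class name PathExtractor and the method extractPath are illustrative, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathExtractor {
    // No surrounding slashes: Java patterns have no delimiters.
    // Group 0 is "path=..." and group 1 is just the value.
    private static final Pattern PATH = Pattern.compile("path=([^,]+)");

    // Returns the "path=..." fragment, or null when the value has no path.
    static String extractPath(String value) {
        Matcher matcher = PATH.matcher(value);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String value = "StartTopic topic=testParser, multiCopy=false, "
                + "required=true, all=false, path=/Return/ReturnData/IRSW2";
        System.out.println(extractPath(value)); // path=/Return/ReturnData/IRSW2
    }
}
```

Use matcher.group(1) instead of matcher.group() if only /Return/ReturnData/IRSW2 is wanted.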

Java String- How to get a part of package name in android?

It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing to deal with multiple dots in the string and get the value between two particular dots.
I have got the package name as :
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I tried using substrings and IndexOf also. But couldn't get the intended answer. Because the package name in android varies by different number of dots and characters, I cannot use fixed index. Please suggest any idea.
As you probably know (based on the .* part of your regex), the dot . is a special character in regular expressions that represents any character (except line separators). To make a dot match only a literal dot, you need to escape it: either place \ before it or put it inside a character class, [.].
Also to get only part from parenthesis (.*) you need to select it with proper group index which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// ^
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
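To see the difference between the three variants, here is a small sketch that runs each one against a longer package name, where the greedy version visibly over-matches. The class name QuantifierDemo and the helper firstGroup are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String pkg = "au.com.newline.myact.modelact";
        // Greedy: .* runs to the last dot.
        System.out.println(firstGroup("com[.](.*)[.]", pkg));    // newline.myact
        // Reluctant: .*? stops at the first dot.
        System.out.println(firstGroup("com[.](.*?)[.]", pkg));   // newline
        // Negated class: cannot cross a dot at all.
        System.out.println(firstGroup("com[.]([^.]*)[.]", pkg)); // newline
    }
}
```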
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the full name of the class, which includes the package name.
To get the part after 'com' you can use Runner.class.getName().split("\\.") (the dot must be escaped, because split takes a regex) and then iterate over the returned array with a boolean flag.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for(String part : parts) {
if(part.equals("com")) {
break;
}
++i;
}
// i is now the index of "com", so the next element is the one we want
String result = parts[i+1];
private String getStringAfterComDot(String packageName) {
String[] strArr = packageName.split("\\.");
// stop one element early so strArr[i+1] is always in bounds
for(int i=0; i<strArr.length-1; i++){
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects involving website scraping, and I
usually just write my own function/utils to get the job done. Regex can
be overkill if you just want to extract a substring from
a given string like the one you have. Below is the function I normally
use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length());
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)
I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
:escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
A normal StringTokenizer won't work for this requirement; you have to tweak it or write your own.
There's one way you can do this. It isn't particularly pretty. It involves going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This is best done with a resizable data structure such as a list rather than an array, since arrays have a fixed length.
Another possible solution is to use a regex with String's split method. The only problem I have is coming up with a regex which would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
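A sketch of the same idea generalized to arbitrary start and end strings, using Pattern.quote so the delimiters are treated literally. The class name TokenFinder and the method tokens are made-up names. It assumes tokens do not span line breaks, because . does not match line terminators by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenFinder {
    // Collects the text between each open/close pair, scanning left to right.
    static List<String> tokens(String text, String open, String close) {
        // Pattern.quote escapes regex metacharacters such as "[" and "]".
        Pattern pattern = Pattern.compile(
                Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher matcher = pattern.matcher(text);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [world, [is]
        System.out.println(tokens("hello[world]this[[is]me", "[", "]"));
    }
}
```

Because the group is reluctant, each token ends at the first closing delimiter, which yields "world" and "[is" for the sample string from the question.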
StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
int so=0,eo;
List<String> lst=new ArrayList<>();
while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
so+=str.length();
if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
lst.add(txt.substring(so,eo)); // token between the start and end markers
so=eo+end.length();
}
}
return lst;
}
The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]
