Extract file extension from String using regex - java

I have the following String:
"data:audio/mp3;base64,ABC..."
And I'm extracting the file extension (in this case "mp3") out of it.
The String varies according to the file type. Some examples:
"..."
"..."
"data:audio/wav;base64,ABC..."
"data:audio/mp3;base64,ABC..."
Here's how I've done it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
    private static final String BASE64_HEADER_EXP = "^data:.+;base64,";
    private static final Pattern PATTERN_BASE64_HEADER = Pattern.compile(BASE64_HEADER_EXP);
    private String data;
    private String fileName;
    public String getFileName() {
        Matcher base64HeaderMatcher = PATTERN_BASE64_HEADER.matcher(data);
        return String.format("%s.%s", getFilenameWithoutExtension(), getExtension(base64HeaderMatcher));
    }
    private String getFilenameWithoutExtension() {
        return fileName.split("\\.")[0];
    }
    private String getExtension(Matcher base64HeaderMatcher) {
        if (base64HeaderMatcher.find()) {
            // Take the text between "/" and ";" in the matched header
            String base64Header = base64HeaderMatcher.group(0);
            return base64Header.split("/")[1].split(";")[0];
        }
        // No header found: fall back to the extension of the file name
        return fileName.split("\\.")[1];
    }
}
What I want is a way to do it without having to split and access array positions like I'm doing above. Maybe extract the extension using a regular expression.
I'm able to do it on RegExr site using this expression:
(?<=^data:.*/)(.*)(?=;)
But when I try to use the same regex in Java, I get the error "Require that the characters immediately before the position do" because, apparently, Java doesn't support repetition inside lookbehind:

How about using capturing groups?
private static final String BASE64_HEADER_EXP = "^data:[^/]+/([^;]+);base64,";
This way you can use base64HeaderMatcher.group(1) and get file type.
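A minimal sketch of the capture-group approach, assuming the header always has the form data:<type>/<subtype>;base64,. The class name DataUriExtension and the method extensionOf are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataUriExtension {
    // Group 1 captures everything between the first "/" and the following ";".
    private static final Pattern BASE64_HEADER =
            Pattern.compile("^data:[^/]+/([^;]+);base64,");

    // Returns the subtype (e.g. "mp3"), or null when the header is absent.
    static String extensionOf(String data) {
        Matcher m = BASE64_HEADER.matcher(data);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("data:audio/mp3;base64,ABC")); // mp3
        System.out.println(extensionOf("data:audio/wav;base64,ABC")); // wav
        System.out.println(extensionOf("no header here"));            // null
    }
}
```

Returning null for a missing header is only one option; the original code falls back to the extension of fileName instead.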

This should do it for the examples you gave:
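(?<=data:)(?:[A-Za-z]+)/(.*?);
Explanation:
Positive look-behind
(?<=data:)
Non-capturing group to account for image, audio, etc. ([A-Za-z] rather than [A-z], because the range A-z also includes the characters [ \ ] ^ _ and `)
(?:[A-Za-z]+)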
Match / literally, capture group for file extension, match ; literally
/(.*?);

"Strings in Java have built-in support for regular expressions. Strings have four built-in methods for regular expressions, i.e., the matches(), split(), replaceFirst() and replaceAll() methods."
-http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
Using this info, we can quickly make a regex and test it against our string.
//In a regex, each set of () represents a capture field which can later be
//referenced with $1, $2, etc.
//The regex below breaks the string into four fields
String pattern="(^data:)(\\w+?/)(\\w+?)(;.*$)";
//First Field
//This field matches the start of a line (^) followed by "data:"
//Second Field
//This matches any word character (\\w), one or more (+), followed by a "/"
//The "?" symbol after the + means match reluctantly, i.e. match as few
//characters as possible. This field will effectively capture a series
//of letters followed by a slash
//Third Field
//This is the field we want to capture and we will reference it with $3
//It matches any word character (\\w), one or more, reluctantly
//Fourth Field
//This captures the rest of the string including the ";"
//Now to extract the extension from this test string
String test="...";
String testExtension="";
//Replace the contents of testExtension with the 3rd capture field of
//our regex pattern applied to our test string like so
testExtension = test.replaceAll(pattern, "$3");
//This invokes the String class replaceAll() method
//And now our string testExtension should contain "jpeg"

Related

Regular expression extracting a string from url

I am trying to extract my account ID from a URL for other validations.
See my URL samples:
http://localhost:8024/accounts/u8m21ercgelj/
http://localhost:8024/accounts/u8m21ercgelj
http://localhost:8024/accounts/u8m21ercgelj/users?
What I need is to extract u8m21ercgelj from the URL. I tried the code below, but it fails for cases like http://localhost:8024/accounts/u8m21ercgelj
i.e. without a / at the end.
public String extractAccountIdFromURL(String url) {
    String accountId = null;
    if ( url.contains("accounts")) {
        Pattern pattern = Pattern.compile("[accounts]/(.*?)/");
        Matcher matcher = pattern.matcher(url);
        while (matcher.find()) {
            accountId = matcher.group(1);
        }
    }
    return accountId;
}
Can anyone help me?
[accounts] doesn't try to find the word accounts, but a single character which is either a, c (repeating a character doesn't change anything), o, u, n, t or s, because [...] is a character class. So get rid of those [ and ] and put a / in front instead, since you most likely don't want to accept cases like /specialaccounts/ but only /accounts/.
It looks like you just want to find the next non-/ section after /accounts/. In that case you can just use /accounts/([^/]+)
If you are sure that there will be only one /accounts/ section in the URL, you can (and for more readable code should) change your while to an if or even a conditional operator. Also there is no need for contains("accounts"), since it only adds an extra pass over the entire string, which find() already performs.
It doesn't look like your method uses any data held by your class (any fields), so it could be static.
Demo:
// we should reuse the compiled regex; there is no point in compiling it many times
private static Pattern pattern = Pattern.compile("/accounts/([^/]+)");
public static String extractAccountIdFromURL(String url) {
    Matcher matcher = pattern.matcher(url);
    return matcher.find() ? matcher.group(1) : null;
}
public static void main(java.lang.String[] args) throws Exception {
    String examples =
            "http://localhost:8024/accounts/u8m21ercgelj/\r\n" +
            "http://localhost:8024/accounts/u8m21ercgelj\r\n" +
            "http://localhost:8024/accounts/u8m21ercgelj/users?";
    for (String url : examples.split("\\R")) { // split on line separators such as `\r\n`
        System.out.println(extractAccountIdFromURL(url));
    }
}
Output:
u8m21ercgelj
u8m21ercgelj
u8m21ercgelj
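The character-class point can be checked in isolation. The sketch below is only an illustration: the class CharacterClassDemo, the helper firstMatch and the second URL (http://host/specialaccounts/abc/) are made up for this example and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String fromQuestion = "http://localhost:8024/accounts/u8m21ercgelj/";
        String madeUp = "http://host/specialaccounts/abc/";

        // [accounts]/ matches ONE character from the set followed by a slash.
        System.out.println(firstMatch("[accounts]/", fromQuestion)); // s/
        System.out.println(firstMatch("[accounts]/", madeUp));       // t/

        // /accounts/ matches the literal path segment only.
        System.out.println(firstMatch("/accounts/", fromQuestion));  // /accounts/
        System.out.println(firstMatch("/accounts/", madeUp));        // null
    }
}
```

In the made-up URL the character class already matches the "t/" at the end of the host name, so anything captured after it would start in the wrong place.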
Your regex is written such that it expects a trailing slash; that's what the slash after the (.*?) means.
You should change this so that it accepts either a trailing slash or the end of the string. (/|$) should work in this case. Since [accounts] is a character class that matches only a single character, the literal /accounts/ should be used as well, meaning your regex would be /accounts/(.*?)(/|$)
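A small sketch of the "slash or end of string" idea, run against the three URLs from the question. The class name TrailingSlashOptional and the method accountId are hypothetical, and the sketch assumes the literal prefix /accounts/ together with a non-capturing group (?:/|$) so that group 1 is the only capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingSlashOptional {
    // The reluctant (.*?) stops at the first "/" or, failing that, at the end of input.
    private static final Pattern ACCOUNT = Pattern.compile("/accounts/(.*?)(?:/|$)");

    // Returns the path segment after /accounts/, or null if there is none.
    static String accountId(String url) {
        Matcher m = ACCOUNT.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(accountId("http://localhost:8024/accounts/u8m21ercgelj/"));
        System.out.println(accountId("http://localhost:8024/accounts/u8m21ercgelj"));
        System.out.println(accountId("http://localhost:8024/accounts/u8m21ercgelj/users?"));
    }
}
```

All three calls print u8m21ercgelj. Compared with /accounts/([^/]+), this variant also matches when the segment is empty, which may or may not be wanted.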

Matching (Basic) Function Declarations

I want to retrieve all function definitions individually from a source code file. Ultimately, I want to just retrieve all function names. Source files are of the following form:
#include bla
first_function_name()
{
}
second_function_name(first_parameter, second_parameter)
{
i = 0;
}
Note that there are no access modifiers and return types, this is NOT for parsing the Java programming language.
I want to implement the solution via regular expression. So far I managed to match function definitions, however I'm having the problem that the regular expression doesn't only match a single function but also the ones coming afterwards. Basically, it doesn't end at the closing brace. I tried using the $ symbol but it's also not ending the regular expression.
The regular expressions I'm currently using look like this:
private static final String FUNCTION_NAME_MATCHER = "[a-zA-Z]\\w*";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "[(].*[)].*[\\{]([^\\}]*)?[\\}]";
How do I stop it from matching the following function(s) as well? It should match twice for the above example functions but instead it only matches once (both function definitions at once).
The method for getting a list of matched function definitions looks like this:
public List<String> getMatches()
{
    List<String> matchedResults = new ArrayList<>();
    Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
    while (matcher.find())
    {
        String functionDefinition = matcher.group();
        String functionName = functionDefinition.split(FUNCTION_NAME_MATCHER)[0];
        matchedResults.add(functionName);
    }
    return matchedResults;
}
Try this
private static final String FUNCTION_NAME_MATCHER = "([a-zA-Z]\\w*)";
private static final String FUNCTION_MATCHER = "(?s)" + FUNCTION_NAME_MATCHER + "\\([^)]*\\)\\s*\\{[^}]*\\}";
public static List<String> getMatches()
{
    List<String> matchedResults = new ArrayList<>();
    Matcher matcher = Pattern.compile(FUNCTION_MATCHER).matcher(sourceFile);
    while (matcher.find())
    {
        // Group 1 is the function name captured by FUNCTION_NAME_MATCHER
        matchedResults.add(matcher.group(1));
    }
    return matchedResults;
}
* is greedy: it consumes as many matching characters as it can. Right now the [(].*[)] part consumes everything from the first ( in the first function all the way to the last ) in the second. You want to make it reluctant, so that it only consumes a character if it needs to. Do so by changing every .* to .*?
Also, you probably want to match only whitespace between the function declaration and body, so you should replace [)].*[\\{] with [)]\\s*[\\{]
If you enclose the FUNCTION_NAME_MATCHER and the arguments in ( and ), they will be captured into capture groups so you can extract them.
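To see the difference between greedy and reluctant quantifiers, the sketch below counts matches of both variants on the two sample functions from the question. The class name GreedyVsReluctant, the helper countMatches and the constant names are hypothetical; the "#include bla" line is left out of the sample for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    static final String SOURCE =
            "first_function_name()\n{\n}\n"
            + "second_function_name(first_parameter, second_parameter)\n{\ni = 0;\n}\n";

    // The pattern from the question: both .* are greedy.
    static final String GREEDY =
            "(?s)[a-zA-Z]\\w*[(].*[)].*[\\{]([^\\}]*)?[\\}]";

    // Reluctant .*? for the parameters, and only whitespace before the body.
    static final String RELUCTANT =
            "(?s)[a-zA-Z]\\w*[(].*?[)]\\s*[\\{]([^\\}]*)?[\\}]";

    // Counts how many separate matches the regex finds in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(GREEDY, SOURCE));    // 1
        System.out.println(countMatches(RELUCTANT, SOURCE)); // 2
    }
}
```

The greedy pattern swallows both definitions in a single match, while the reluctant one yields one match per function.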
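First, you'd want to match the whole function, to avoid matching function calls & duplicates (the \s* allows the line break between the parameter list and the opening brace):
[^\s]*\(([^}]*)\)\s*\{([^}]*)}
Then, you want to split this up to get the name. The ( has to be escaped, because split() takes a regex:
String matchedName = matchedFunction.split("\\(")[0];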
And there you go! It's all done and dusted!

Java Reg Expression to wrap HTML tag around text

I'm working on a text to HTML parser. I'm using the "##" notation to mark a bold word. Ex.
Example ##Bold text in a paragraph
Turns to:
Example <strong>Bold</strong> text in a paragraph
The following code works, however I've found out that it works just on the last Bold notation found:
private static String escapeBold(String sCurrentLine) {
    if (sCurrentLine.indexOf("##") < 0) {
        return sCurrentLine;
    }
    String newString = null;
    String oldString = null;
    String chars[] = sCurrentLine.split(" ");
    for (String s : chars) {
        if (s.startsWith("##")) {
            newString = "<strong>" + s.replaceAll("##", "") + "</strong>";
            oldString = s;
        }
    }
    return (sCurrentLine.replaceAll(oldString, newString));
}
Is there a simpler way to do it, maybe with a regex?
Thanks!
It looks like your method could be written as
private static String escapeBold(String sCurrentLine) {
    return sCurrentLine.replaceAll("##(\\w+)", "<strong>$1</strong>");
}
It will find each ##someWord and place the someWord part in group 1. In the replacement we use the match stored in group 1 via $1 and simply surround it with <strong> tags.
To understand this code you need to know that replaceAll(regex, replacement) uses a regular expression (regex) to find the parts we want to modify, and replacement describes how we want to modify them.
In a regex, \\w represents the characters a-z, A-Z, 0-9 and _. If you want to include other characters, you can create your own character class, or use \\S, which represents all non-whitespace characters.
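The practical difference between \\w and \\S shows up as soon as a marked word contains something like a hyphen. The following is a small sketch; the class name BoldMarkup, the two method names and the sample sentences are hypothetical.

```java
class BoldMarkup {
    // Wraps the run of word characters (a-z, A-Z, 0-9, _) that follows "##".
    static String boldWordChars(String line) {
        return line.replaceAll("##(\\w+)", "<strong>$1</strong>");
    }

    // Wraps everything up to the next whitespace character after "##".
    static String boldNonWhitespace(String line) {
        return line.replaceAll("##(\\S+)", "<strong>$1</strong>");
    }

    public static void main(String[] args) {
        // Every ## marker on the line is replaced, not only the last one.
        System.out.println(boldWordChars("Example ##Bold text in a ##paragraph"));
        // \\w+ stops at the hyphen, \\S+ continues to the next space.
        System.out.println(boldWordChars("a ##well-known fact"));
        System.out.println(boldNonWhitespace("a ##well-known fact"));
    }
}
```

With \\S+ any trailing punctuation (for example a full stop directly after the word) ends up inside the <strong> tags as well, so which class fits depends on the input.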

Regex to get the string after # sign

I have a string like the following:
#78517700-1f01-11e3-a6b7-3c970e02b4ec, #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec ....
I want to extract the string after #.
My current code is as follows:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#[^\\s]+");
Matcher m = PATTERN_LOGIN.matcher("#78517700-1f01-11e3-a6b7-3c970e02b4ec , #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec");
while (m.find()) {
    String mentionedLogin = m.group();
    .......
}
... but m.group() gives me #78517700-1f01-11e3-a6b7-3c970e02b4ec but I wanted 78517700-1f01-11e3-a6b7-3c970e02b4ec
You should use the regex "#([^\\s]+)" and then m.group(1), which returns what was captured by the capturing parentheses ().
m.group() or m.group(0) returns the full string matched by your regex.
I would modify the pattern so that only the part after the hash sign is captured:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#([^\\s]+)");
So the first group will contain only the GUID.
Correct answers are given in other responses. I will add some clarification. Your code is working correctly, as expected.
Your regex means: match a string which starts with # followed by one or more characters which aren't whitespace. So if you omit the parentheses, you get the full string, as expected.
The parentheses, as mentioned in other responses, are used for marking capturing groups. In layman's terms: in addition to the overall match, the regex engine remembers which part of the input was matched by each parenthesized group, numbering the groups by their opening parenthesis from left to right.
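A short sketch of reading group 1 in a loop. One detail worth noting: [^\\s]+ (equivalently \\S+) also swallows a comma that directly follows an ID, as in the sample string. The sketch therefore excludes commas from the captured text, which is an assumption about the intended result. The class name HashTokens, the method idsIn and the shortened sample IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTokens {
    // Group 1: everything after "#" up to the next whitespace or comma.
    private static final Pattern TOKEN = Pattern.compile("#([^\\s,]+)");

    // Collects the text after each "#" without the "#" itself.
    static List<String> idsIn(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // m.group() would return the whole match including the "#".
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(idsIn("#a1-b2 , #c3-d4, #e5-f6")); // [a1-b2, c3-d4, e5-f6]
    }
}
```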

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)
I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
Here is how I would do it to avoid a dependency on Commons Lang.
public static String escapeRegexp(String regexp){
    String specChars = "\\$.*+?|()[]{}^";
    String result = regexp;
    for (int i=0;i<specChars.length();i++){
        Character curChar = specChars.charAt(i);
        result = result.replaceAll(
            "\\"+curChar,
            "\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
    }
    return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(content);
    List<String> result = new ArrayList<String>();
    while (m.find()) {
        result.add(m.group(group));
    }
    return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
    String regexp = lastToken.length()>1
        ?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
        :escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
    return findGroup(content, regexp, 1);
}
Use it like this:
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
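As an alternative to hand-written escaping, the standard library's Pattern.quote turns an arbitrary string into a literal pattern. The sketch below is a variation on the approach above, not the answer's own code; the class name QuotedTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizer {
    // Returns the text between each open/close pair, scanning left to right.
    static List<String> tokenize(String content, String open, String close) {
        // Pattern.quote makes the delimiters literal; (.*?) is reluctant, so
        // each token ends at the first closing delimiter after the opener.
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(content);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello[world]this[[is]me", "[", "]")); // [world, [is]
    }
}
```

Because . does not match line terminators by default, tokens spanning several lines would need the Pattern.DOTALL flag.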
StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
A normal StringTokenizer won't work for this requirement; you would have to tweak it or write your own.
There's one way you can do this. It isn't particularly pretty. It involves going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This would be best done using a resizable data structure such as a list rather than an array, since arrays have a fixed length.
Another solution, which may be possible, is to use regexes with the String's split method. The only problem I have is coming up with a regex which would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.
Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
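public List<String> extractTokens(String txt, String str, String end) {
    int so = 0, eo; // so: start offset, eo: end offset
    List<String> lst = new ArrayList<>();
    // Find the next start delimiter, beginning at the current offset
    while (so < txt.length() && (so = txt.indexOf(str, so)) != -1) {
        so += str.length(); // skip over the start delimiter
        // Find the first end delimiter after the start delimiter
        if (so < txt.length() && (eo = txt.indexOf(end, so)) != -1) {
            lst.add(txt.substring(so, eo));
            so = eo + end.length(); // continue after the end delimiter
        }
    }
    return lst;
}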
The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]
