Regex in GWT to match URLs - java

I implemented the Pattern class as shown here:
http://www.java2s.com/Code/Java/GWT/ImplementjavautilregexPatternwithJavascriptRegExpobject.htm
And I would like to use the following regex to match urls in my String:
(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?
Unfortunately, the Java compiler fails to parse that string because it does not use valid Java escape sequences (the above is technically a URL pattern for JavaScript, not Java).
At the end of the day, I'm looking for a regex pattern that will both compile in Java and execute in JavaScript correctly.

You will have to use JSNI to do the regex evaluation in JavaScript. If you write the regex with escaped backslashes, it will be converted to JavaScript as it is and will be invalid. It will still work in Hosted or Dev mode, since that is still running Java bytecode, but not in the compiled application.
A simple JSNI example to test if a given string is a valid URL:
// Java method
public native boolean isValidUrl(String url) /*-{
var pattern = /(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/;
return pattern.test(url);
}-*/;
There may be other irregularities between the Java and Javascript regex engines, so it's better to offload it completely to Javascript at least for moderately complex regexes.

The pattern itself looks fine; I guess the problem is backslash escaping.
Please take a look at http://www.regular-expressions.info/java.html
In literal Java strings the backslash
is an escape character. The literal
string "\\" is a single backslash. In
regular expressions, the backslash is
also an escape character. The regular
expression \\ matches a single
backslash. This regular expression as
a Java string, becomes "\\\\". That's
right: 4 backslashes to match a single
one.
So, if you reuse your JavaScript regex in Java, you need to replace each \ with \\, and vice versa.
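As a rough sketch of that doubling rule on a plain JVM (not GWT client code), the question's pattern can be written as a Java string literal by doubling every backslash. The class and method names below are made up for illustration, and the pattern assumes the user-info group ends in @.

```java
import java.util.regex.Pattern;

class UrlPatternSketch {
    // Hypothetical helper: the JavaScript regex from the question with every
    // backslash doubled so that it is a valid Java string literal.
    // Assumption: the user-info group ends in '@'.
    static final Pattern URL = Pattern.compile(
        "(http|https):\\/\\/(\\w+:{0,1}\\w*@)?(\\S+)(:[0-9]+)?(\\/|\\/([\\w#!:.?+=&%@!\\-\\/]))?");

    // True if the text contains something the pattern accepts as a URL.
    static boolean containsUrl(String text) {
        return URL.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsUrl("see http://example.com/path for details")); // true
        System.out.println(containsUrl("no link here"));                            // false
    }
}
```

This only shows that the doubled form compiles and matches under java.util.regex; whether the same string behaves identically once translated to JavaScript is a separate question.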

I don't know exactly how this would help but here is the exact function you requested in Javascript. I guess using JSNI like Anurag said will help.
var urlPattern = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:#/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?#]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";
function isValidURL(url) {
// Anchor a local copy so repeated calls do not keep modifying the shared pattern.
var regex = new RegExp("^" + urlPattern + "$");
return regex.test(url);
}
Like @S.Mark said, I basically took the "Java" way of writing regular expressions and used it in JavaScript.
In Java, you would do it the following way (note that the expression is the same).
String urlPattern = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:#/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?#]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";
Hope this helps. PS: this regular expression works and even validates sites pointing to localhost:port, where port is any numeric port number.

Related

Translate php regex to java

I am having trouble translating this PHP regex /^([-\.\w]+)$/ to a Java regex.
I tried ^([-\\.\\w]+)$ but it doesn't work.
The regex is used to validate a string used as a file name.
In PHP, têst.ext is not allowed, but in Java it is.
In java, it would be:
str.matches("[-.\\w]+")
There is no need to escape the dot in a character class in any language/tool.
There is no need to use ^ or $ with java's String#matches() because it's implied (the whole string must match)
There is no need to create a group (the brackets)
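A minimal sketch of the suggested call, assuming a helper named isValidName (the name is made up here; the pattern is the one given above):

```java
class FileNameCheck {
    // Hypothetical helper; String#matches requires the whole input to match,
    // so no ^ or $ anchors and no capturing group are needed.
    static boolean isValidName(String name) {
        return name.matches("[-.\\w]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidName("test-1.ext"));   // true
        System.out.println(isValidName("bad name.ext")); // false: contains a space
    }
}
```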

Is there any need to escape the slash('/') character for regular expressions in Java

I have the following code snippet:
Pattern patternOfSlashContainingBackSlash=Pattern.compile("\\/");
Pattern patternOfSlashNotContainingBackSlash=Pattern.compile("/");
String slash = "/";
Matcher matcherOfSlashContainingBackSlash = patternOfSlashContainingBackSlash.matcher(slash);
Matcher matcherOfSlashNotContainingBackSlash = patternOfSlashNotContainingBackSlash.matcher(slash);
//both patterns match the slash
System.out.println(matcherOfSlashContainingBackSlash.matches());
System.out.println(matcherOfSlashNotContainingBackSlash.matches());
My questions:
What is the difference (from Java perspective) between the two patterns, or is there any difference?
Is the '/' character just a plain character for regex(not a special character like ']' is) ,from Java perspective?
The java version on which I run this is 1.8
This question is different from the others, since it makes it clear that the patterns "\\/" and "/" are the same for Java programming language.
Thank you very much!
/ is not special in Java regex, it is in JavaScript where we have syntax /regex/flags. Java allows escaping characters even if they don't require it, probably to make such regex more portable. So Pattern.compile("\\/") and Pattern.compile("/") will behave the same.
BTW ] by itself is not special character. If it is not part of [...] construct you don't need to escape it, but you are allowed to (at least in Java, I am not sure about other regex flavors).
/ is not a special character in Java regexes. "\\/" is equivalent to "/".
The forward slash / character is not a command character in Java Pattern representations (nor in normal Strings), and does not require escaping.
The back-slash character \ requires escaping in Java Strings, as it is used to encode special sequences such as newline ("\n").
E.g. String singleBackslash = "\\";
As the back-slash is also used to signify special constructs in Java Pattern representations, it may require double escaping in Pattern definitions.
E.g. Pattern singleBackSlash = Pattern.compile("\\\\");
See API for examples of Pattern constructs involving the back-slash.
There is no need to escape the slash character. The reference for Java's regex syntax is the Javadoc for class Pattern. There, it lists all the characters that have special meaning. The slash is not present in that list.
I think that the reason why people escape the / character is that they come from different regex flavours (like Javascript) where you can define a pattern enclosing it between two slashes and if you want to match one slash you need to escape it.
var pattern = /\d\/\d/; // Valid pattern of 2 digits divided by a slash, which must be escaped.
From a Java perspective, though, it is equivalent and thus there's no need to escape it.
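A short check of the points made in these answers, written as a sketch with a made-up wrapper method: the escaped and unescaped slash match the same input, and so does a closing bracket outside a character class.

```java
import java.util.regex.Pattern;

class SlashEscapeDemo {
    // Hypothetical wrapper around Pattern.matches, used only for this demo.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Escaped and unescaped forward slash behave the same.
        System.out.println(fullMatch("\\/", "/")); // true
        System.out.println(fullMatch("/", "/"));   // true
        // A closing bracket outside [...] is literal with or without the escape.
        System.out.println(fullMatch("]", "]"));   // true
        System.out.println(fullMatch("\\]", "]")); // true
    }
}
```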

replace special character String with another Special character

I have a String which is a path taken dynamically from my system.
I store it in a String.
C:\Users\SXR8036\Downloads\LANE-914.xls
I need to pass this path to a function that reads an Excel file, but it needs the backslashes to be replaced with forward slashes.
I want something like C:/Users/SXR8036/Downloads/LANE-914.xls
i.e. all backslashes replaced with forward slashes.
With the String replace method I am only able to replace a-z characters; it shows an error when I replace special characters.
something.replaceAll("[^a-zA-Z0-9]", "/");
I have to pass the String name to read a file.
It's better in this case to use non-regex replace() instead of regex replaceAll(). You don't need regular expressions for this replacement and it complicates things because it needs extra escapes. Backslash is a special character in Java and also in regular expressions, so in Java if you want a straight backslash you have to double it up \\ and if you want a straight backslash in a regular expression in Java you have to quadruple it \\\\.
something = something.replace("\\", "/");
Behind the scenes, replace(String, String) uses regular expression patterns (at least in older Oracle JDK releases such as Java 8), so it has some overhead. In your specific case, you can actually use single character replacement, which may be more efficient (not that it probably matters!):
something = something.replace('\\', '/');
If you were to use regular expressions:
something = something.replaceAll("\\\\", "/");
Or:
something = something.replaceAll(Pattern.quote("\\"), "/");
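To compare the variants above side by side, here is a small sketch (class and method names are illustrative only) that applies each one to the path from the question; all four should produce the same result.

```java
import java.util.regex.Pattern;

class BackslashToSlash {
    // Non-regex, single character replacement.
    static String viaReplaceChar(String s)   { return s.replace('\\', '/'); }
    // Non-regex, string replacement.
    static String viaReplaceString(String s) { return s.replace("\\", "/"); }
    // Regex: the literal "\\\\" is the two-character regex \\ (one literal backslash).
    static String viaRegex(String s)         { return s.replaceAll("\\\\", "/"); }
    // Regex with the backslash quoted by Pattern.quote.
    static String viaQuotedRegex(String s)   { return s.replaceAll(Pattern.quote("\\"), "/"); }

    public static void main(String[] args) {
        String path = "C:\\Users\\SXR8036\\Downloads\\LANE-914.xls";
        System.out.println(viaReplaceChar(path));
        System.out.println(viaReplaceString(path));
        System.out.println(viaRegex(path));
        System.out.println(viaQuotedRegex(path));
    }
}
```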
To replace backslashes with replaceAll you'll have to escape them properly in the regular expression that you are using.
In your case the correct expression would be:
final String path = "C:\\Users\\SXR8036\\Downloads\\LANE-914.xls";
final String normalizedPath = path.replaceAll("\\\\", "/");
As the backslash itself is the escape character in Java Strings it needs to be escaped twice to work as desired.
In general you can pass very complex regular expressions to String.replaceAll. See the JavaDocs of java.lang.String.replaceAll and especially java.util.regex.Pattern for more information.

Why String.replaceAll() in java requires 4 slashes "\\\\" in regex to actually replace "\"?

I recently noticed that String.replaceAll(regex,replacement) behaves very weirdly when it comes to the escape character "\" (backslash).
For example consider there is a string with filepath - String text = "E:\\dummypath"
and we want to replace the "\\" with "/".
text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.
If we want to implement the same functionality with replaceAll() we need to write it as,
text.replaceAll("\\\\","/")
One notable difference is that replaceAll() takes a regex as its first argument, whereas replace() takes plain character sequences!
But text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/")
Digging Deeper:
Even more weird behaviors can be observed when we try some other inputs.
Let's assign text="Hello\nWorld\n"
Now,
text.replaceAll("\n","/"), text.replaceAll("\\n","/") and text.replaceAll("\\\n","/") all give the same output: Hello/World/
I feel Java has really messed up regex in the best possible way! No other language seems to have these playful behaviors in regex. Is there any specific reason why Java behaves like this?
You need to escape twice: once for Java, once for the regex.
The Java string literal
"\\\\"
produces the regex text
\\ (two characters)
and the regex engine treats the first backslash as an escape, so it matches
\ (one character)
@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character in both Java string literals, and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
But why is it like that?
It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. Awkwardness of double escaping didn't become apparent in Java until they added regex support in the form of the Pattern class ... in Java 1.4.
So how do other languages manage to avoid this?
They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. In C# and Python, they provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)
Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?
The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.
The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.
The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
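The three cases described above can be checked with a small sketch like the following (class and method names are made up for illustration). It also shows that a lone backslash is rejected as a regex, which is the exception reported in the question.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NewlineEscapeDemo {
    // Replaces every match of the given regex with "/".
    static String replaceNewlines(String regex, String text) {
        return text.replaceAll(regex, "/");
    }

    // True if the regex text compiles under java.util.regex.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "Hello\nWorld\n";
        System.out.println(replaceNewlines("\n", text));    // regex is a raw newline character
        System.out.println(replaceNewlines("\\n", text));   // regex is \n, the newline escape
        System.out.println(replaceNewlines("\\\n", text));  // regex is \ followed by a raw newline
        System.out.println(isValidRegex("\\"));             // false: a lone backslash escapes nothing
    }
}
```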
1) Let's say you want to replace a single \ using Java's replaceAll method:
\
˪--- 1) the final backslash
2) Java's replaceAll method takes a regex as first argument. In a regex literal, \ has a special meaning, e.g. in \d which is a shortcut for [0-9] (any digit). The way to escape a metachar in a regex literal is to precede it with a \, which leads to:
\ \
| ˪--- 1) the final backslash
|
˪----- 2) the backslash needed to escape 1) in a regex literal
3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/). But in a string literal, \ also has a special meaning, e.g. in \n (a new line) or \t (a tab). The way to escape a metachar in a string literal is to precede it with a \, which leads to:
\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
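The layering in the diagram can be made visible with a short sketch (names are illustrative): the four-backslash literal is a two-character string, and as a regex that string matches exactly one backslash.

```java
import java.util.regex.Pattern;

class EscapeLayers {
    // Four backslashes in source code; the compiler turns them into two characters.
    static final String REGEX_TEXT = "\\\\";

    // Number of characters the regex engine actually receives.
    static int regexLength() {
        return REGEX_TEXT.length();
    }

    // The input "\\" is a single backslash character.
    static boolean matchesSingleBackslash() {
        return Pattern.matches(REGEX_TEXT, "\\");
    }

    public static void main(String[] args) {
        System.out.println(regexLength());            // 2
        System.out.println(matchesSingleBackslash()); // true
    }
}
```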
This is because Java tries to give \ a special meaning in the replacement string, so that \$ will be a literal $ sign, but in the process they seem to have removed the actual special meaning of \.
While text.replaceAll("\\\\","/") can at least be considered okay in some sense (though it is not absolutely right), it seems even odder that all three calls, text.replaceAll("\n","/"), text.replaceAll("\\n","/") and text.replaceAll("\\\n","/"), give the same output. That contradicts the restriction placed on text.replaceAll("\\","/").
Java didn't mess up regular expressions. Rather, Java likes to mess with coders by trying to do something unique and different when it is not at all required.
One way around this problem is to replace backslash with another character, use that stand-in character for intermediate replacements, then convert it back into backslash at the end. For example, to convert the literal text \r\n (backslash, r, backslash, n) to \n:
String out = in.replace('\\','#').replaceAll("#r#n","#n").replace('#','\\');
Of course, that won't work very well if you choose a replacement character that can occur in the input string.
I think Java really messed with regular expressions in String.replaceAll().
Other than Java, I have never seen a language parse regular expressions this way. You will be confused if you have used regex in other languages.
If you need "\\" in the replacement string, you can use java.util.regex.Matcher.quoteReplacement(String):
text.replaceAll("/", Matcher.quoteReplacement("\\"));
Using this Matcher method, you get the expected result.
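A minimal sketch of quoteReplacement in use, assuming two hypothetical helper methods. The second helper shows that the same call also protects a $ in the replacement, which replaceAll would otherwise read as a group reference.

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // Replaces every forward slash with a single backslash.
    static String slashToBackslash(String s) {
        return s.replaceAll("/", Matcher.quoteReplacement("\\"));
    }

    // Replaces regex matches with the given text, taken literally.
    static String replaceLiterally(String s, String regex, String replacement) {
        return s.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(slashToBackslash("a/b/c"));                    // a\b\c
        System.out.println(replaceLiterally("price: X", "X", "$5"));      // price: $5
    }
}
```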

Regexp to match Javascript string literals with a specific keyword using Java

I'm trying to match chunks of JS code and extract string literals that contain a given keyword using Java.
After trying to come up with my own regexp to do this, I ended up modifying this generalized string-literal matching regexp (Pattern.COMMENTS used when building the patterns in Java):
(["'])
(?:\\?+.)*?
\1
to the following
(["'])
(?:\\?+.)*?
keyword
(?:\\?+.)*?
\1
The test cases:
var v1 = "test";
var v2 = "testkeyword";
var v3 = "test"; var v4 = "testkeyword";
The regexp correctly doesn't match line 1 and correctly matches line 2.
However, in line 3, instead of just matching "testkeyword", it matches the chunk
"test"; var v4 = "testkeyword"
which is wrong - the regexp matched the first double quote and did not terminate at the second double quote, going all the way till the end of line.
Does anyone have any ideas on how to fix this?
PS: Please keep in mind that the Regexp has to correctly handle escaped single and double quote characters inside of string literals (which the generalized matcher already did).
How about this modification:
(?:
"
(?:\\"|[^"\r\n])*
keyword
(?:\\"|[^"\r\n])*
"
|
'
(?:\\'|[^'\r\n])*
keyword
(?:\\'|[^'\r\n])*
'
)
After much revision (see edit history, viewers at home :), I believe this is my final answer:
(?:
"
(?:\\?+"|[^"])*
keyword
(?:\\?+"|[^"])*
"
|
'
(?:\\?+'|[^'])*
keyword
(?:\\?+'|[^'])*
'
)
You need to write two patterns, one for single-quoted and one for double-quoted strings, because a character class such as [^"] cannot refer back to whichever quote opened the string. Then you can or them together with |.
Consider using code from Rhino -- JS in Java -- to get the real String literals.
Or, if you want to use regex, consider one find for the whole literal, then a nested test if the literal contains 'keyword'.
I think Tim's construction works, but I wouldn't bet on it in all situations, and the regex would have to get insanely unwieldy if it had to deal with literals that don't want to be found (as if trying to sneak by your testing). For example:
var v5 = "test\x6b\u0065yword"
Separate from any solution, my secret weapon for interactively working out regexes is a tool I made called Regex Powertoy, which unlike many such utilities runs in any browser with Java applet support.
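The two-step idea mentioned above (one find for the whole literal, then a nested test for the keyword) could be sketched roughly as follows. The class name, method name and the exact literal pattern are assumptions made for illustration; the pattern does not understand comments or regex literals in the JavaScript source, and it will not see through escapes such as \x6b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordLiteralFinder {
    // One alternative per quote style. "\\\\." consumes a backslash plus the
    // character after it, so an escaped quote does not end the literal.
    private static final Pattern STRING_LITERAL = Pattern.compile(
        "\"(?:\\\\.|[^\"\\\\\\r\\n])*\"|'(?:\\\\.|[^'\\\\\\r\\n])*'");

    // Step 1: find each complete literal. Step 2: keep those containing the keyword.
    static List<String> literalsContaining(String js, String keyword) {
        List<String> hits = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(js);
        while (m.find()) {
            String literal = m.group();
            if (literal.contains(keyword)) {
                hits.add(literal);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(literalsContaining(line, "keyword")); // ["testkeyword"]
    }
}
```

Because each literal is consumed whole before the keyword test runs, a match can no longer start in one literal and end in another, which was the failure on the third test line.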
A grammar to construct a string literal would look roughly like this:
string-literal ::= quote text quote
text ::= character text
| character
character ::= non-quote
| backslash quote
with non-quote, backslash, and quote being terminals.
A grammar is regular if it is context free (i.e. the left hand side of all rules is always a single non-terminal) and the right hand side of all rules is always either empty, a terminal, or a terminal followed by a non-terminal.
You may notice that the first rule given above has a terminal followed by a nonterminal followed by a terminal. This is thus not a regular grammar.
A regular expression is an expression that can parse regular languages (languages that can be constructed by a regular grammar). It is not possible to parse non-regular languages with regular expressions.
The difficulty you have in finding a suitable regular expression stems from the fact that a suitable regular expression doesn't exist. You will never arrive at code that is obviously correct, this way.
It is much easier to write a simple parser along the lines of above rules. Since the text contained by your string literals is regular, you can use a simple regular expression to look for your keyword---after you extracted that text from its surroundings.
