Get domain name from URL without java.net in Java - java

I would like to know whether GWT offers an easy way to get the host name from a URL without java.net.*, since the client side doesn't support that package.
Input ejw.example.com/anotherexample?fun=no
output example.com
Input https://www3.example.com/yeteagainanotherexample?fun=miserable
output example.com

com.google.gwt.user.client.Window.Location.getHost()

Try com.google.gwt.regexp.shared.RegExp, which provides regular expressions with features like JavaScript's RegExp, plus JavaScript String's replace and split methods (which can take a RegExp parameter).
Try the regex below, which captures everything between the first dot and the next forward slash; the result is in capture group 1.
https?://\w+\.(.*?)/
Here is a demo on debuggex and regexr.
Sample code:
String url = "https://www3.example.com/yeteagainanotherexample?fun=miserable";
// exec returns null when nothing matches; group 1 holds "example.com" here
System.out.println(RegExp.compile("https?://\\w+\\.(.*?)/").exec(url).getGroup(1));
Note: I am not good at regex, so please adapt the pattern to your needs. As written it requires a scheme, so the first input in the question (which has no http://) will not match. Read more about:
What is a good regular expression to match a URL?
JavaScript Regex to match a URL in a field of text
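If a regex feels too fragile, the same result can be sketched with plain String methods only, which should avoid both java.net and java.util.regex. This is a rough sketch, not a full URL parser: the class and method names are made up for illustration, and it assumes the wanted domain is always the last two labels of the host (which is wrong for hosts like example.co.uk) and that the URL carries no user info or IPv6 literal.

```java
// Hypothetical helper: relies only on java.lang.String methods.
final class DomainSketch {

    // Returns the last two dot-separated labels of the host, e.g. "example.com".
    // Assumption: the registrable domain is exactly two labels long.
    static String baseDomain(String url) {
        String rest = url;

        // Drop the scheme ("https://") only if "://" contains the first slash,
        // so a "://" inside a query string is not mistaken for a scheme.
        int scheme = rest.indexOf("://");
        if (scheme > 0 && rest.indexOf('/') == scheme + 1) {
            rest = rest.substring(scheme + 3);
        }

        // Cut at the first character that ends the host part.
        int end = rest.length();
        for (char stop : new char[] {'/', '?', '#', ':'}) {
            int i = rest.indexOf(stop);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String host = rest.substring(0, end);

        // Keep only the last two labels.
        int lastDot = host.lastIndexOf('.');
        if (lastDot <= 0) {
            return host; // no dot at all, e.g. "localhost"
        }
        int prevDot = host.lastIndexOf('.', lastDot - 1);
        return prevDot < 0 ? host : host.substring(prevDot + 1);
    }

    public static void main(String[] args) {
        // Both inputs from the question print "example.com".
        System.out.println(baseDomain("ejw.example.com/anotherexample?fun=no"));
        System.out.println(baseDomain("https://www3.example.com/yeteagainanotherexample?fun=miserable"));
    }
}
```

Unlike the regex above, this also handles the first input, which has no scheme.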

Related

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern in raw form, so don't forget to escape the double quotes and double the backslashes before using it in a Java string.
The main point of this pattern is that the two URLs are compared without the leading http://, so links where only one side carries the scheme are matched as well.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
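A minimal sketch of the back-reference idea in Java follows. It assumes the links have the simple shape of a double-quoted href followed directly by the link text, and the class and method names are invented for the example. The match is case-sensitive on purpose, since the question asks that any difference between href and text prevents the replacement.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the \1 back-reference.
final class Delinkify {

    // Group 1 captures the quoted href value; \1 then requires the link text
    // to be exactly the same string. Extra attributes before or after href are tolerated.
    private static final Pattern SELF_LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>\\1</a>");

    // Replaces each self-describing link with the bare URL; other links are left alone.
    static String delinkify(String html) {
        return SELF_LINK.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String in = "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book.";
        System.out.println(delinkify(in)); // I visited http://www.amazon.com/ to buy a book.
    }
}
```

A link whose text differs from its href, such as a "click here" link, does not match and is returned unchanged.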

Regex matches only part of a URL - why?

I am very weak at regex, and the regex I am using (found on the internet) only partially solves my problem. I need to wrap a URL from text input in an anchor tag using Java. Here is my code:
String text ="Hi please visit www.google.com";
String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";
String s = text.replaceAll(reg, "<a href='$1'>$1</a>");
System.out.println(""+s);
The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?
I need to parse a text and display a URL entered from text field as hot link in a jsp page. The actual output expected would be
Hi please visit <a href='www.google.com'>www.google.com</a>
Edit
Following regex
(http(s)?://)?(www(\.\w+)+[^\s.,"']*)
works like a charm for URLs ending in .com but fails for other extensions like .jsp. Is there any way to make it work for all sorts of extensions?
To answer your question why the regex doesn't work: It doesn't observe Java's regex syntax rules.
Specifically:
[^[:punct:]\s]
doesn't work as you expect it to because Java doesn't recognize POSIX shorthands like [:punct:]. Instead, it treats that as a nested character class. That again leads to the ^ becoming illegal in that context, so Java ignores it, leaving you with a character class that matches the same as
[:punct\s]
which only matches the c of com, therefore ending your match there.
As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry Detecting URLs in a block of text. You'll need to decide for yourself how sensitive and how specific you want to make your regex.
For example, the solution proposed at the end of the post would translate to Java as
String resultString = subjectString.replaceAll(
    "(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "      [-A-Z0-9+&@\\#/%=~_|$?!:,.])*\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "      [A-Z0-9+&@\\#/%=~_|$])", "$0");
Java recognises posix expressions (see javadoc), but the syntax is a little different. It looks like this instead:
\p{Punct}
But I would simplify your regex for a URL to:
(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))
And elaborate it only if you find a test case that breaks it.
As a Java line it would be:
text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "$3");
Note the neat capture of the "s" in "https" (if found) that is restored if required.
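The replacement string in the line above seems to have lost its markup, so here is a hedged sketch of how the anchor could be built with this pattern. The single-quoted anchor format is copied from the question, and writing the scheme back as http$2:// is an assumption about how the captured "s" is meant to be restored. The class and method names are made up. In Java, a group that did not take part in the match contributes an empty string to the replacement, so $2 simply vanishes for plain http or scheme-less URLs.

```java
// Hypothetical helper: wraps www-style URLs and dotted IPv4 addresses in an anchor tag.
final class Linkify {

    // Group 2 is the optional "s" of https, group 3 is the URL without its scheme.
    private static final String URL_REGEX =
            "(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))";

    static String linkify(String text) {
        // The href always gets a scheme; the visible text stays scheme-less.
        return text.replaceAll(URL_REGEX, "<a href='http$2://$3'>$3</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        System.out.println(linkify("See https://www.example.com/index.jsp."));
    }
}
```

The negative look-behind at the end keeps a sentence-ending full stop out of the link, which is why the .jsp case works.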

Regex to extract valid Http or Https

I'm currently having some issues with a regex to extract a URL.
I want my regex to take URLS such as:
http://stackoverflow.com/questions/ask
https://stackoverflow.com
http://local:1000
https://local:1000
Through some tutorials, I've learned that this regex will find all the above: ^(http|https)\://.*$ However, it will also accept http://local:1000;http://invalid http://khttp:// as a single string when it shouldn't accept it at all.
I understand that my expression isn't written to exclude this, but my issue is I cannot think of how to write it so it checks for this scenario.
Any help is greatly appreciated!
Edit:
Looking at my issue, it seems I could solve it by checking that '//' doesn't occur in the string after the initial http:// or https://. Any ideas on how to implement this?
Sorry, this will be done in Java.
I also need to add the following constraint: a string such as http://local:80/test:90 should fail because of the duplicate port. In other words, a valid string may contain only two : symbols in total, one after http/https and one before the port.
This will only produce a match if there is no :// after its first appearance in the string.
^https?:\/\/(?!.*:\/\/)\S+
Note that trying to parse a valid URL from within a string is very complex (see In search of the perfect URL validation regex), so the above does not attempt to do that.
It will just match the protocol and the non-space characters that follow.
In Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern reg = Pattern.compile("^https?:\\/\\/(?!.*:\\/\\/)\\S+");
Matcher m = reg.matcher("http://somesite.com");
if (m.find()) {
    System.out.println(m.group());
} else {
    System.out.println("No match");
}
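The pattern above does not cover the extra constraint from the question (no second port, as in http://local:80/test:90). One possible way to sketch that is shown below, next to the original pattern for comparison. The stricter pattern and all names are my own assumptions, not part of the answer: it allows a colon only directly before a numeric port, which also rejects legitimate URLs that carry a colon in the path or query.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing the answer's pattern with a stricter variant.
final class UrlCheck {

    // Pattern from the answer: a scheme, then no further "://" anywhere in the string.
    private static final Pattern NO_SECOND_SCHEME =
            Pattern.compile("^https?://(?!.*://)\\S+");

    // Assumed stricter variant: host, optional numeric port, optional path without ':'.
    private static final Pattern SINGLE_PORT =
            Pattern.compile("^https?://(?!.*://)[^\\s:/]+(?::\\d+)?(?:/[^\\s:]*)?$");

    static boolean looseMatch(String s) {
        return NO_SECOND_SCHEME.matcher(s).find();
    }

    static boolean strictMatch(String s) {
        return SINGLE_PORT.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stackoverflow.com/questions/ask",
            "https://local:1000",
            "http://local:80/test:90",
            "http://local:1000;http://invalid http://khttp://"
        };
        for (String s : samples) {
            System.out.println(s + " loose=" + looseMatch(s) + " strict=" + strictMatch(s));
        }
    }
}
```

The four valid URLs from the question pass both checks, the string with repeated schemes fails both, and http://local:80/test:90 passes only the loose one.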
Check your programming language to see if it already has a parser. E.g. php has parse_url()
From http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This may change based on the programming language/tool
/[A-Za-z]+://[A-Za-z0-9-]+.[A-Za-z0-9-:%&;?#/.=]+/g

Need to write a java regex expression which matches urls with http or https but does not contains specific file extensions

I need to create a java regex expression which matches URLS with http or https but should not match urls with specific file extensions.
I can get the URLs with http or https using the following expression, but I am unable to complete the second part, which is eliminating URLs with certain extensions (e.g. js, css, jpg).
I guess I need to use negations but I am not sure how to do it.
String regex = "\\s*(?i)(http|https)\\s*://\\s*((\"[^\"]*\"|'[^']*'|([^'\">\\s]+)))";
Please help me to modify this regex to meet this requirement.
An easy way to implement this in Java is to use the Pattern class (from java.util.regex). To accomplish what you're suggesting, you could use two separate regex objects to check the conditions for the URL. For example (using the string regex from your question):
// Needs: import java.util.Scanner; import java.util.regex.Pattern;
Scanner in = new Scanner(System.in);
String input = in.nextLine();
Pattern one = Pattern.compile(regex);
// Backslashes must be doubled inside a Java string literal.
Pattern two = Pattern.compile("([^\\s]+(\\.(?i)(js|css|jpg|etc))$)");
if (one.matcher(input).matches() && !two.matcher(input).matches())
    System.out.println("It matches!");
else
    System.out.println("Nope!");
In short, using two Pattern objects makes your code more readable and easy to manage, since you're considering multiple aspects about an input string of a URL.
You need an anchor to look behind - see regex to match url that should give you the expression you need. The regex you have currently will match malformed urls with disallowed characters.
Here's a good site to check your expressions: http://www.regexplanet.com/advanced/java/index.html
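For the negation the question asks about, a negative look-ahead can do both checks in one pattern. The following is only a sketch under a few assumptions: the URL is the whole input, the extension list is just the three examples from the question, and an extension still counts when it is followed by a query string or fragment. The class and method names are invented.

```java
import java.util.regex.Pattern;

// Hypothetical helper: accepts http/https URLs unless they end in a blocked extension.
final class ExtensionFilter {

    // (?! ... ) fails the match when the URL ends in .js, .css or .jpg,
    // optionally followed by ?query or #fragment.
    private static final Pattern ALLOWED_URL = Pattern.compile(
            "(?i)^https?://(?!\\S*\\.(?:js|css|jpg)(?:[?#]\\S*)?$)\\S+$");

    static boolean isAllowed(String url) {
        return ALLOWED_URL.matcher(url.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://example.com/index.jsp"));   // true
        System.out.println(isAllowed("http://example.com/app.js"));       // false
        System.out.println(isAllowed("http://example.com/style.css?v=2")); // false
    }
}
```

Note that .jsp is still accepted, because the look-ahead requires the blocked extension to be followed by the end of the URL, a ? or a #.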

Java regex to retain specific closing tags

I'm trying to write a regex to remove all but a handful of closing xml tags.
The code seems simple enough:
String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");
However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.
I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).
You probably shouldn't use regex for this task, but let's see what happens...
Your problem is that you are using a negative character class, and inside character classes you can't write complex expressions - only characters. You could try a negative lookahead instead:
"</(?!a|em|li).*?>"
But this won't handle a number of cases correctly:
Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
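As an illustration, two of the cases listed above (tags that merely start with a, em or li, and capital letters) can be handled by letting the look-ahead check the complete tag name. This is a sketch with made-up names, and it still does nothing about comments or tags inside attribute values, so the parser advice stands.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes every closing tag except </a>, </em> and </li>.
final class ClosingTagFilter {

    // The look-ahead requires name + optional spaces + '>' so that only the exact
    // names are protected; </article> or </link> are still removed.
    private static final Pattern OTHER_CLOSING_TAGS =
            Pattern.compile("</(?!(?:a|em|li)\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String markup) {
        return OTHER_CLOSING_TAGS.matcher(markup).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<body><xml>some stuff</xml></body>")); // <body><xml>some stuff
        System.out.println(strip("<p><a>x</a><em>y</em><article>z</article></p>"));
    }
}
```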
I would really use a proper parser for this (e.g. JTidy). You can't parse XML/HTML using regular expressions as it's not regular, and no end of edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable 3rd party library (see above) and configure your output accordingly.
See this answer for more passionate info re. parsing XML/HTML via regexps.
You cannot use an alternation inside a character class. A character class always matches a single character.
You likely want to use a negative lookahead or lookbehind instead:
"</(?!a|em|li).*?>"
