I am creating a simple utility to retrieve all HTTP URLs from a webpage.
Initially I had planned to use an HTML parsing library to parse out the href attributes, but then I learned that I also need to retrieve URLs contained inside scripts (example below). So I started experimenting with a regular expression to extract every HTTP URL from the page, but for some reason my regular expression is not working properly.
The URL can be inside JavaScript, for example:
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
My program:
try {
    BufferedReader in = new BufferedReader(new FileReader("c:\\sample\\sample.html"));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
        String pattern = "http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?";
        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);
        // Now create a Matcher object.
        Matcher m = r.matcher(inputLine.replaceAll("http://", "\nhttp://"));
        while (!m.hitEnd()) {
            if (m.find()) {
                System.out.println("Found value: " + m.group(0));
            } else {
                //System.out.println("NO MATCH");
            }
        }
    }
    in.close();
} catch (Exception e) {
    e.printStackTrace();
}
Can someone help me fix this issue, or let me know the best way to retrieve all URLs from a web page?
Description
Your expression has a typo: the ? in http? makes the final p optional, when it is the s that should be optional:
https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?
^
Also I recommend:
replacing the (...) capture groups with non-capturing groups like (?:...)
you don't need to escape a . inside a character class such as [.]
adding a check to ensure you're not capturing the closing quote surrounding your URL
rewriting the section that looks for /folder/subfolder parts as a repeating non-capturing group matching an initial slash followed by the folder name
regex: https?:\/\/(?:[\w-]+.)+(?::\d+)?(?:\/[\w\/_.]*)*?(?:\?\S+)?(?=['"\s])
as a Java string: "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])"
Example
Sample Text
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
<a class="test" href="http://blablablablabla.com">Third Link</a>
Matches
[0] => http://test.com/test/test
[1] => http://blablablablabla.com
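For completeness, here is a minimal sketch of plugging the recommended pattern into a Java program. It reads the whole file at once rather than line by line (the file path comes from the question; the whole-file read, class name, and surrounding structure are just one possible approach, not part of the original answer):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    public static void main(String[] args) throws Exception {
        // Read the whole file at once so several URLs on one line are not missed
        String html = new String(Files.readAllBytes(Paths.get("c:\\sample\\sample.html")));

        // The recommended pattern from the answer above, as a Java string
        Pattern p = Pattern.compile(
                "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])");

        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println("Found value: " + m.group());
        }
    }
}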
Try using this:
\A'http:\/\/[\w\W]+'\z
This checks that your URL starts with http:// and that the quoted string spans the input from start to end. Since almost anything can appear inside a URL nowadays, special characters like ?:,-_/\ as well as digits have to be allowed in between.
So this will get you all the URLs present in the file.
Related
I have a String, and I want to validate whether it is a valid CSS value or not. In the documentation of AntiSamy, I found that I might be able to use CSSValidator.isValidProperty (http://javadox.com/org.owasp/antisamy/1.4/org/owasp/validator/css/CssValidator) to do so. However, the second parameter requires a LexicalUnit.
Is there another way to validate a String with AntiSamy?
I think what you want is the CssScanner.
/****** pull out style tag from html *****/
Pattern p = Pattern.compile("<style>([\\s\\S]+?)</style>");
Matcher m = p.matcher(validHTML);
// if we find a match, get the group
if (m.find()) {
    // get the matching group
    codeGroup = m.group(1);
}

/****** block for checking all css for validity *****/
InternalPolicy policy = null;
try {
    policy = (InternalPolicy) InternalPolicy.getInstance("antisamy-ebay.xml");
} catch (PolicyException e) {
    e.printStackTrace();
}
ResourceBundle messages = ResourceBundle.getBundle("AntiSamy", Locale.getDefault());
CssScanner scanner = new CssScanner(policy, messages);
CleanResults results = scanner.scanStyleSheet(codeGroup, Integer.MAX_VALUE);
validCSS = results.getCleanHTML().toString();
That is the part of the code that worked for me. Let me know if any of this does not work for you. I have variables declared at the top of my code because I am also handling HTML validation there, so some variables are not shown in this snippet, but it should point you in the right direction. You also need a policy in place; I chose the eBay policy, which provides the whitelist of what the CSS will allow in the resulting output. I have not used the CssValidator, so I am not sure how they compare, but CssScanner does a great job of giving back clean CSS.
I got the following regex working to search for video links in a page
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately it does not stop at the end of the link if there is another match right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after applying the regex, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!
Here is how you can do something similar with the Jsoup parser.
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();

Document doc = Jsoup.parse(htmlString);
// or to get the content of some page use
// Document doc = Jsoup.connect("http://example.com/").get();

Elements elements = doc.select("a[href]"); // find all anchors with an href attribute
for (Element el : elements) {
    URL url = new URL(el.attr("href"));
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: " + new File(url.getPath()).getName());
        System.out.println("------");
    }
}
I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
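As a minimal usage sketch of that pattern (the sample HTML string and class name below are made up for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkFinder {
    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        Pattern p = Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group()); // prints http://somevideo.flv
        }
    }
}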
If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This matches "href=" followed by either a double-quote or single-quote character, lazily captures everything up to the file extension, and finally requires a closing character matching the starting quote (the \1 backreference). Your href attribute can then be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
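A minimal sketch of extracting the value with this pattern might look like the following (the sample HTML and class name are made up; note the backslashes are doubled in the Java string):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/videos/somevideo.flv\">somevideoname.avi</a>";
        Pattern p = Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");
        Matcher m = p.matcher(html);
        if (m.find()) {
            // file path and name, a period, then the file extension
            String link = m.group(2) + "." + m.group(3);
            System.out.println(link); // http://example.com/videos/somevideo.flv
        }
    }
}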
Your regex is greedy.
Limit its greediness, for example like this:
(http(s?):/)(/[^/]+?)\\S+.\\.(?:avi|flv|mp4)
I am parsing pages for email data. How would I get a hidden email address which is generated using JavaScript? This is the page I am parsing.
If you take a look at the HTML source (using Firebug or something else), you will see that it is a link tag generated inside a div named sobi2Details_field_email and set to display:none.
This is my code for now, but the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5 = doc.getElementById("sobi2Details_field_email");
if (e5 != null)
{
    emaildata = e5.child(1).absUrl("href").toString();
}
System.out.println(emaildata);
You need to do several steps because Jsoup doesn't allow you to execute JavaScript.
I reverse engineered it and this is what came out:
public static void main(final String[] args) throws IOException
{
    final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
    final Document doc = Jsoup.connect(url).get();
    final Element e5 = doc.getElementById("sobi2Details_field_email");
    System.out.println("--- this is how we start");
    System.out.println(e5 + "\n\n\n\n");

    // remove the xml encoding
    System.out.println("---Remove XML encoding\n");
    String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
    System.out.println(email + "\n\n\n\n");

    // remove the concatenation with ' + '
    System.out.println("--- Remove concatenation (all: ' + ')");
    email = email.replaceAll("' \\+ '", "");
    System.out.println(email + "\n\n\n\n");

    // extract the email address variables
    System.out.println("--- Remove useless lines");
    Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
    matcher.find();
    email = matcher.group();
    System.out.println(email + "\n\n\n\n");

    // get the two strings enclosed by '' and concatenate them
    System.out.println("--- Extract the email address");
    matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
    matcher.find();
    email = matcher.group(1) + matcher.group(2);
    System.out.println(email);
}
If something is generated dynamically with JavaScript on the client side after the response from the server is complete, then there is no other way than:
Reverse engineering - figure out what the server-side script does, and try to implement the same behaviour
Downloading the JavaScript from the processed page and using Java's JavaScript engine to execute the script and get the result (yes, it is possible, and I was forced to do such a thing once). A basic sketch of evaluating JavaScript in Java follows below.
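Here is a minimal sketch of the second option, assuming a JavaScript engine (such as Nashorn on Java 8-14) is available on the JVM; the script text and class name are just illustrative stand-ins for whatever the page actually serves:
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class JsEvalSketch {
    public static void main(String[] args) throws ScriptException {
        // Obtain a JavaScript engine from the JVM (e.g. Nashorn on Java 8-14)
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");

        // Stand-in for the script downloaded from the page; the real script
        // would be the obfuscated email-building code
        String script = "var addy = 'info' + '@' + 'example.com'; addy;";

        // The value of the last expression is returned to Java
        Object result = engine.eval(script);
        System.out.println(result); // info@example.com
    }
}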
I need a pattern matcher to get the page id value in the text below, which is coming from an HTTP response body.
<meta name="ajs-page-id" content="262250">
What I'm after is the content value from this line, which will always be generated in the response body.
Pattern pat = Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");
That is obviously a very literal pattern... but group(1) should return the number as a string.
Haven't tested.
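A quick sketch of that idea using the usual Pattern/Matcher idiom (the sample line is the one from the question; the class name is just for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageIdExtractor {
    public static void main(String[] args) {
        String body = "<meta name=\"ajs-page-id\" content=\"262250\">";
        Pattern pat = Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");
        Matcher m = pat.matcher(body);
        if (m.find()) {
            System.out.println(m.group(1)); // 262250
        }
    }
}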
Use an HTML parser like Jsoup to parse the document and search for the part you need. You should not be using regular expressions for this.
e.g.,
String htmlStr = "<meta name=\"ajs-page-id\" content=\"262250\">";
Document doc = Jsoup.parse(htmlStr);
Element meta = doc.select("meta[name=ajs-page-id]").first();
if (meta != null)
{
    System.out.println(meta.attr("content"));
}
I am trying to replace a String containing some URL with a browser-compatible linked URL.
My initial String looks like this:
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String looking like:
"hello, i'm some text with an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and I need to have an hypertext link !"
I can catch the URL with this line of code:
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
Maybe the regexp needs some correction, but it's working fine so far; I need to test it further.
So the question is how to keep the text matched by the regexp and just add what's needed to create the link around the matched string.
Thanks in advance for your interest and responses!
Try to use:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "HereWasAnURL");
I didn't check your regex.
By using () you can create groups. The $1 refers to the first group.
In the replacement string, $1 will be replaced by the matched URL.
I asked a similar question before.
Some examples: Capturing Text in a Group in a regular expression
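For example, here is a minimal sketch of wrapping each matched URL in an anchor tag via $1 (the simplified URL pattern and class name here are just for illustration and are not the exact pattern from the question):
public class LinkifySketch {
    public static void main(String[] args) {
        String myString = "hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !";
        // $1 refers to the text captured by the (...) group
        String withLinks = myString.replaceAll("(https?://\\S+)", "<a href=\"$1\">$1</a>");
        System.out.println(withLinks);
    }
}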
public static String textToHtmlConvertingURLsToLinks(String text) {
    if (text == null) {
        return text;
    }
    String escapedText = HtmlUtils.htmlEscape(text);
    return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
            "$1<a href=\"$2\">$2</a>$4");
}
There may be better regexes out there, but this does the trick as long as there is whitespace after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.
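A small usage sketch, assuming the method above (the input string is made up for illustration):
// A made-up usage example; HTML in the input is escaped and the URL becomes a link
String input = "see <b>this</b>: http://example.com/page more text";
String html = textToHtmlConvertingURLsToLinks(input);
System.out.println(html);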
For anybody searching for a more robust solution, I can suggest the Twitter Text libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)
The code below replaces links starting with "http" or "https", links starting just with "www.", and finally also replaces email links.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^#]]+#([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
String textWithHttpLinksEnabled =
        "ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";

if (Objects.nonNull(textWithHttpLinksEnabled)) {
    Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("$0");

    final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("$0");

    final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("$0");

    System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda
Assuming your regex works to capture the correct info, you can use backreferences in your substitution. See the Java regexp tutorial.
In that case, you'd do
myString.replaceAll(....., "$1")
In case of multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is a full example from my code where I need to show users' posts with URLs in them:
private static final Pattern urlPattern = Pattern.compile(
        "(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");

String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1<a href='$2'>$2</a>$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);