Jsoup get hidden email - java

I am parsing pages for email data. How can I get a hidden email address that is generated using JavaScript?
If you look at the HTML source of the page I am parsing (using Firebug or something similar), you will see that the address is a link tag generated inside the div with the id sobi2Details_field_email, which is set to display:none.
This is my code so far; the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5=doc.getElementById("sobi2Details_field_email");
if(e5!=null)
{
emaildata=e5.child(1).absUrl("href").toString();
}
System.out.println (emaildata);

You need to do several steps because Jsoup doesn't allow you to execute JavaScript.
I reverse engineered it and this is what came out:
public static void main(final String[] args) throws IOException
{
final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
final Document doc = Jsoup.connect(url).get();
final Element e5 = doc.getElementById("sobi2Details_field_email");
System.out.println("--- this is how we start");
System.out.println(e5 + "\n\n\n\n");
// remove the xml encoding
System.out.println("---Remove XML encoding\n");
String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
System.out.println(email + "\n\n\n\n");
// remove the concatenation with ' + '
System.out.println("--- Remove concatenation (all: ' + ')");
email = email.replaceAll("' \\+ '", "");
System.out.println(email + "\n\n\n\n");
// extract the email address variables
System.out.println("--- Remove useless lines");
Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group();
System.out.println(email + "\n\n\n\n");
// get the two strings enclosed by '' and concatenate them
System.out.println("--- Extract the email address");
matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group(1) + matcher.group(2);
System.out.println(email);
}
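The same steps can be tried without a network call. The sketch below runs them on a made-up script string. The shape of that string (variables named addy..., literals split with ' + ') is an assumption modelled on what the regexes above expect, not a copy of the real page, and the names CloakedEmailSketch, SAMPLE and decode are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloakedEmailSketch {

    // Hypothetical stand-in for the script text after the entities were unescaped.
    static final String SAMPLE =
            "var prefix = 'ma' + 'il' + 'to';\n"
          + "var addy1 = 'info' + '@';\n"
          + "addy1 = addy1 + 'example' + '.' + 'org';\n"
          + "var addy_text1 = 'info' + '@' + 'example' + '.' + 'org';\n";

    // Rebuilds the address from a script shaped like SAMPLE, or returns null.
    static String decode(String script) {
        // Step 1: join string literals that were split with ' + '.
        String joined = script.replaceAll("' \\+ '", "");

        // Step 2: keep only the span from the first to the last "var addy".
        Matcher span = Pattern.compile("var addy.*var addy", Pattern.DOTALL).matcher(joined);
        if (!span.find()) {
            return null;
        }

        // Step 3: the two quoted literals left in that span are the
        // local part (with '@') and the domain.
        Matcher parts = Pattern.compile("'(.*)'.*'(.*)'", Pattern.DOTALL).matcher(span.group());
        if (!parts.find()) {
            return null;
        }
        return parts.group(1) + parts.group(2);
    }

    public static void main(String[] args) {
        System.out.println(decode(SAMPLE)); // prints info@example.org
    }
}
```

Checking the result of find() before calling group() avoids an IllegalStateException when a page does not contain the expected script.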

If something is generated dynamically with JavaScript on the client side after the response from the server is complete, then there are only two options:
1. Reverse engineering: figure out what the server-side script does and try to implement the same behaviour.
2. Download the JavaScript from the processed page and use Java's JavaScript processor to execute the script and get the result (yes, it is possible, and I was forced to do such a thing).

Related

GWT - extract text in between two characters

In GWT I have a servlet that returns an image from the database to the client. I need to extract part of the returned string to show the image properly. What is returned in Chrome, Firefox, and IE has a slash in the src part, e.g. String s = "src=\""; which is not visible in the string below. Maybe the slash is adding more quotation marks around the http string; I'm not sure.
What is returned in those three browsers is: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
The Edge browser doesn't have the slash in the src, so my method to extract the image doesn't work in Edge.
What Edge returns:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
Problem: I need to extract the string below,
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
whether it is preceded by src= or src=\
What I tried, and what works with the browsers that return "src=\":
String s = "src=\"";
int index = returned.indexOf(s) + s.length();
image.setUrl(returned.substring(index, returned.indexOf("\"", index + 1)));
But it fails in Edge because Edge doesn't return a slash.
I do not have access to Pattern and Matcher in GWT.
How can I extract
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
out of the returned string above, keeping in mind that the entityId number will change?
EDIT:
I need a generic way to extract http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
when the string might look either of these ways:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
The other three browsers: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
public static void main(String[] args) {
String toParse = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
String delimiter = "src=\"";
int index = toParse.indexOf(delimiter) + delimiter.length();
System.out.println(toParse.substring(index, toParse.length()).split("\"")[0]);
}
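A variant that does not depend on which quote character follows src= can be written with plain String methods only, which as far as I know are available in GWT's emulated JRE. This is a minimal sketch, not a full HTML parser: it assumes the first "src=" in the string is the attribute you want, and the names SrcExtractor, isQuote and extractSrc are made up for illustration.

```java
class SrcExtractor {

    // True for straight and typographic quote characters.
    static boolean isQuote(char c) {
        return c == '"' || c == '\'' || c == '\u201C' || c == '\u201D';
    }

    // Returns the value of the first src attribute, or null if there is none.
    static String extractSrc(String html) {
        String key = "src=";
        int attr = html.indexOf(key);
        if (attr < 0) {
            return null;
        }
        int start = attr + key.length();
        // Skip the opening quote, whatever kind it is.
        if (start < html.length() && isQuote(html.charAt(start))) {
            start++;
        }
        // The value ends at the next quote, space, or '>'.
        int end = start;
        while (end < html.length()) {
            char c = html.charAt(end);
            if (isQuote(c) || c == ' ' || c == '>') {
                break;
            }
            end++;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String chrome = "<img style=\"-webkit-user-select: none;\" "
                + "src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
        String edge = "<img src=\u201Dhttp://localhost:8080/dashboardmanager/downloadfile?entityId=4886\u201D>";
        System.out.println(extractSrc(chrome));
        System.out.println(extractSrc(edge));
    }
}
```

Both calls print the same URL, and nothing in the method depends on the entityId value.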

java / jsoup - retrieve language

I use jsoup to crawl content from specific websites.
For example, meta tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
I also need to crawl the language. This is what I do:
String meta_language = doc.select("http-equiv").first().attr("content");
But this is thrown:
java.lang.NullPointerException
Could anybody help me with this?
Greetings!
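Try this. doc.select("http-equiv") looks for an element named http-equiv, so first() returns null and calling attr on it throws the exception. Select the meta tag by its http-equiv attribute instead (this assumes the page declares the language as <meta http-equiv="content-language" content="...">):
String meta_language = doc.select("meta[http-equiv=content-language]").get(0).attr("content");
System.out.println("Meta language : " + meta_language);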
However, if your meta tag holds a list, such as keywords, you can use this:
//get meta keyword content
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

How can I adjust this regex to filter out "

I got the following regex working to search for video links in a page
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately it does not stop at the end of the link if there is another match right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after the regex, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!
Here is how you can do something similar with JSoup parser.
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to get the content of some page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]");//find all anchors with href attribute
for (Element el : elements) {
URL url = new URL(el.attr("href"));
if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
System.out.println("url: " + url);
//System.out.println("file: " + url.getPath());
System.out.println("file name: "
+ new File(url.getPath()).getName());
System.out.println("------");
}
}
I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href" followed by either a double-quote or single-quote character, then capture everything up to (and including) the next character which matches the starting quote character. Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
Your regex is greedy.
Limit its greediness by making the quantifier lazy:
(http(s?):/)(/[^/]+?)\\S+.\\.(?:avi|flv|mp4)
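To see the quote-excluding pattern from the earlier answer in action, here is a small sketch that runs it over a single anchor tag. The sample markup is an assumption reconstructed from the output the question reports, and the names VideoLinkSketch and findVideoUrls are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VideoLinkSketch {

    // Lazy match that may not cross a double quote, so it stops
    // at the end of the attribute value.
    static final Pattern VIDEO_URL =
            Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");

    // Collects every match of VIDEO_URL in the given text.
    static List<String> findVideoUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(findVideoUrls(html)); // prints [http://somevideo.flv]
    }
}
```

Because [^"] cannot consume the closing quote of the href value, the match can no longer run on into the link text.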

Get all HTTP url from a webpage

I am creating a simple utility to retrieve all HTTP URLs from a webpage.
Initially I had planned to use an HTML parsing library to parse out the href attributes, but I found that I also need to retrieve URLs contained inside scripts (example script below). So I started trying a regular expression to get all the HTTP URLs from the web page, but for some reason my regular expression is not working properly.
The URL can be inside JavaScript:
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
My program:
try {
BufferedReader in=new BufferedReader(new FileReader("c:\\sample\\sample.html"));
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
String pattern = "http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(inputLine.replaceAll("http://", "\nhttp://"));
while (!m.hitEnd()) {
if (m.find()) {
System.out.println("Found value: " + m.group(0));
} else {
//System.out.println("NO MATCH");
}
}
}
in.close();
} catch (Exception e) {
e.printStackTrace();
}
Can someone help me fix this issue or let me know the best way to retrieve all URL's from a web page?
Description
Your expression has a typo. It should make the s optional.
https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?
    ^
Also I recommend:
replacing the (...) capture groups with non-capture groups like (?:...)
not escaping a . inside a character class such as [.], since it is not needed there
adding a test to ensure you're not capturing the closing quote surrounding your URL
rewriting the part that looks for /folder/subfolder sections as a repeating non-capture group that matches the initial slash followed by the folder name
regex: https?:\/\/(?:[\w-]+.)+(?::\d+)?(?:\/[\w\/_.]*)*?(?:\?\S+)?(?=['"\s])
as a Java string: "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])"
Example
Sample Text
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
<a class="test" href="http://blablablablabla.com">Third Link</a>
Matches
[0] => http://test.com/test/test
[1] => http://blablablablabla.com
Try using this:
\A'http:\/\/[\w\W]+'\z
This checks that the input is a single-quoted string that starts with http://. Because almost any character can appear inside a URL nowadays (special characters such as ?:,-_/\ as well as digits), [\w\W]+ allows everything between the quotes.
Note that \A and \z anchor the match to the start and end of the whole input, so this validates one quoted URL rather than finding every URL in a file.
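For comparison, here is a minimal sketch of scanning a whole document with a loop over Matcher.find(). It deliberately uses a simpler pattern than the ones above (an assumption of mine: a URL runs until whitespace, a quote, or an angle bracket), and the names UrlScanSketch and findUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlScanSketch {

    // http or https, then everything up to whitespace, a quote, or an angle bracket.
    static final Pattern URL = Pattern.compile("https?://[^\\s'\"<>]+");

    // Returns every URL-like match in the text, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        // find() returns false once there are no more matches,
        // so no separate hitEnd() check is needed.
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "<script>\n"
                + "var v= 'http://test.com/test/test';\n"
                + "</script>\n"
                + "<a class=\"test\" href=\"http://blablablablabla.com\">Third Link</a>";
        System.out.println(findUrls(sample));
        // prints [http://test.com/test/test, http://blablablablabla.com]
    }
}
```

The pattern is compiled once and applied to the full text, so URLs inside script blocks and attributes are handled the same way.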

Java : replacing text URL with clickable HTML link

I am trying to replace a URL inside a String with a browser-compatible clickable link.
My initial String looks like this :
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String looking like:
"hello, i'm some text with an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and I need to have an hypertext link !"
I can catch the URL with this line of code:
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
Maybe the regular expression needs some correction, but it's working fine so far; I need to test it further.
So the question is how to keep the text caught by the regexp and just add what's needed to create the link: <a href="caught string">caught string</a>
Thanks in advance for your interest and responses!
Try to use:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "<a href=\"$1\">HereWasAnURL</a>");
I didn't check your regex.
By using () you can create groups. $1 refers to a group by its index, so $1 will be replaced by the URL.
I asked a similar question: my question
Some examples: Capturing Text in a Group in a regular expression
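As a small illustration of the group reference, the sketch below wraps each URL in an anchor tag. The pattern (https?://\S+) is a simplifying assumption of mine, not the regex from the question, and the names LinkifySketch and linkify are made up. It does not escape other HTML and would include trailing punctuation in the link.

```java
class LinkifySketch {

    // Wraps every http(s) URL in an anchor tag.
    // $1 in the replacement stands for the text captured by group 1.
    static String linkify(String text) {
        return text.replaceAll("(https?://\\S+)", "<a href=\"$1\">$1</a>");
    }

    public static void main(String[] args) {
        String input = "hello, i'm some text with an url like http://www.the-url.com/ "
                + "and I need to have an hypertext link !";
        System.out.println(linkify(input));
    }
}
```

The captured URL appears twice in the replacement, once as the href value and once as the visible link text.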
public static String textToHtmlConvertingURLsToLinks(String text) {
if (text == null) {
return text;
}
String escapedText = HtmlUtils.htmlEscape(text);
return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
"$1<a href=\"$2\">$2</a>$4");
}
There may be better REGEXs out there, but this does the trick as long as there is white space after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.
For anybody who is searching a more robust solution I can suggest the Twitter Text Libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)
The code below replaces links starting with "http" or "https", links starting with just "www.", and finally email links.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^#]]+#([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
String textWithHttpLinksEnabled =
"ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";
if (Objects.nonNull(textWithHttpLinksEnabled)) {
Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("$0");
final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("$0");
final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("$0");
System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda
Assuming your regex works to capture the correct info, you can use backreferences in your substitution. See the Java regexp tutorial.
In Java the replacement string refers to a group as $1 (not \1), so you'd do
myString.replaceAll(....., "$1")
In case of multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is full example of my code where I need to show user's posts with urls in it:
private static final Pattern urlPattern = Pattern.compile(
"(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");
String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1<a href='$2'>$2</a>$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);
