How to fetch a URL containing non-ASCII characters (ą, ś ...) with Jsoup?

I am using jsoup to parse some Polish sites, but I have a problem with special characters such as "ą" and "ś" in the URL itself (!). For example, example.com/kąt is read as example.com/k.
Every query without these special characters works perfectly.
I have tried Document doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url), but it does not work.
Any other tips?

You want to encode your URL before passing it to Jsoup.
SAMPLE CODE
String url = "http://sjp.pl/maść";
System.out.println("BEFORE " + url);
String encodedURL = URI.create(url).toASCIIString();
System.out.println("AFTER " + encodedURL);
System.out.println("Title: " + Jsoup.connect(encodedURL).get().title());
OUTPUT
BEFORE http://sjp.pl/maść
AFTER http://sjp.pl/ma%C5%9B%C4%87
Title: maść - Słownik SJP
Tested with a French locale and Jsoup 1.8.3.
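
As a minimal sketch of why this works: URI.toASCIIString() percent-encodes each non-ASCII character as its UTF-8 bytes, so the result contains only ASCII. The class and method names below are made up for illustration, and the strings use \u escapes so the example does not depend on the encoding of the source file.

```java
import java.net.URI;

public class UrlEncodingSketch {

    // Hypothetical helper: percent-encodes non-ASCII characters as UTF-8 bytes.
    static String toAsciiUrl(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // "\u0105" is ą, so this is the example.com/kąt case from the question.
        System.out.println(toAsciiUrl("http://example.com/k\u0105t"));
        // "\u015B\u0107" is ść, the sjp.pl URL used in the answer.
        System.out.println(toAsciiUrl("http://sjp.pl/ma\u015B\u0107"));
    }
}
```

Note that URI.create() throws an IllegalArgumentException for input that is not a legal URI (for example, one containing a raw space), so untrusted input may need extra handling.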

Related

How to keep link title attribute with jsoup?

Using Jsoup.clean(), jsoup turns the title attribute of a HTML link from:
TEST
into:
TEST
This is the demo application:
Whitelist whitelist = new Whitelist();
whitelist.addTags("a");
whitelist.addAttributes("a", "href", "title");
String input = "TEST";
System.out.println("input: " + input);
String output = Jsoup.clean(input, whitelist);
System.out.println("output: " + output);
which prints:
input: TEST
output: TEST
I tried to add OutputSettings with EscapeMode:
OutputSettings outputSettings = new OutputSettings();
outputSettings.escapeMode(EscapeMode.xhtml);
EscapeMode.base and EscapeMode.extend have no effect. EscapeMode.xhtml prints the following:
input: TEST
output: TEST
Any idea how to keep jsoup from altering the title attribute?

This is a known issue/behavior: https://github.com/jhy/jsoup/issues/684 (marked as "won't fix" by the jsoup team).
There's not a bug here.
When serializing (i.e. in your example when you're printing out XML/HTML), we escape as few characters as necessary. That is why the > is not escaped to &gt;: because it's in a quoted attribute, there's no ambiguity that it's closing a tag, so it doesn't get escaped.
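
To illustrate the reasoning, here is a hand-written sketch of minimal escaping for a double-quoted attribute value. It is not jsoup's implementation, and the class name, method name, and sample value are assumptions made for the example. Inside double quotes only & and " can be misread, so < and > pass through unchanged.

```java
public class AttributeEscapeSketch {

    // Illustration only, NOT jsoup's code: escape just the characters that
    // are ambiguous inside a double-quoted attribute value.
    static String escapeQuotedAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String title = "<b>TEST</b>"; // hypothetical attribute value
        System.out.println("<a title=\"" + escapeQuotedAttribute(title) + "\">TEST</a>");
    }
}
```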

Java Mail Embed URL

When I add a URL to an HTML email body, it does not lead to the intended location. This is the snippet; please tell me what I am doing wrong.
// the location variable contains the URL
StringBuffer body = new StringBuffer("<html><body>Hi, <br/><br/>");
body.append("<p>"+cmts+"</p>");
// neither of these two ways works; how do I construct a proper URL?
body.append("<br/><br/>" + location + "<br/>");
body.append("<br/><br/>" +location + "<br/>");
// this works as a link only in Outlook; other mail clients show it as plain text
body.append("<br/><br/>"+location);
URL:
http://host:port/weebApp/report/viewer.html#%2Fpublic%2FSamples%2FDashboards%2_FSample_report

It looks like a problem with the quotation marks. Try:
body.append("<br/><br/>" + location + "<br/>");
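
Assuming the intent is an anchor element whose href value is wrapped in quotation marks, a minimal sketch of building such a link could look like this. The class name, the link() helper, and the sample URL are made up for illustration.

```java
public class MailLinkSketch {

    // Hypothetical helper: wraps a URL in an anchor with a double-quoted href.
    static String link(String url) {
        // Escape the characters that could break the attribute or the link text.
        String safe = url.replace("&", "&amp;")
                         .replace("\"", "&quot;")
                         .replace("<", "&lt;");
        return "<a href=\"" + safe + "\">" + safe + "</a>";
    }

    public static void main(String[] args) {
        String location = "http://example.com/report/viewer.html#%2Fpublic%2FSamples"; // made-up URL
        StringBuilder body = new StringBuilder("<html><body>Hi, <br/><br/>");
        body.append("<br/><br/>").append(link(location)).append("<br/>");
        body.append("</body></html>");
        System.out.println(body);
    }
}
```

Most mail clients only render the anchor when the message part is sent as HTML rather than plain text.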

There are several ways to add an href in JavaMail, for example:
1) InternetHeaders headers = new InternetHeaders();
headers.addHeader("Content-type", "text/html; charset=UTF-8");
String aHref = "some text\n" + text +
"\n<a href='http://google.com'>google.com</a>";
2) String aHref = "some text\n" + text +
"\n<a href='http://google.com'>google.com</a>";
messageBodyPart.setText(aHref,"UTF-8","html");
UPDATE:
Make sure the content type is set to html or text/html; with text/plain the markup is displayed as plain text.

java / jsoup - retrieve language

I use jsoup to crawl content from specific websites.
For example, meta tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
I also need to crawl the language. This is what I do:
String meta_language = doc.select("http-equiv").first().attr("content");
But this is what gets thrown:
java.lang.NullPointerException
Could anybody help with this?
Greetings!

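doc.select("http-equiv") looks for an element named http-equiv, so it finds nothing and first() returns null. http-equiv is an attribute of the meta tag, so select on that attribute instead (this assumes the page declares its language with a Content-Language meta tag):
String meta_language = doc.select("meta[http-equiv=Content-Language]").attr("content");
System.out.println("Meta language : " + meta_language);
If your meta tag holds a list of values, such as keywords, you can read it the same way:
// get meta keyword content
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);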

How can I adjust this regex to filter out "

I got the following regex working to search for video links in a page:
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately, it does not stop at the end of the link if there is another match right behind it. For example, this video link:
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after applying the regex, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!

Here is how you can do something similar with the jsoup parser.
// read the whole file into a single string
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of a page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]"); // find all anchors with an href attribute
for (Element el : elements) {
    // note: new URL(...) throws MalformedURLException for relative hrefs
    URL url = new URL(el.attr("href"));
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: "
                + new File(url.getPath()).getName());
        System.out.println("------");
    }
}

I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
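
A small sketch of how this pattern could be applied with java.util.regex. The class name, the helper method, and the sample HTML are assumptions made for illustration. The lazy [^\"]+? can never cross a double quote, so the match stops at the end of the href value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkRegexSketch {

    private static final Pattern VIDEO_URL =
            Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");

    // Hypothetical helper: collects every video URL found in the given text.
    static List<String> findVideoUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample input, modelled on the link in the question.
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(findVideoUrls(html)); // prints [http://somevideo.flv]
    }
}
```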

If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href" followed by either a double-quote or single-quote character, then capture everything up to (and including) the next character which matches the starting quote character. Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
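
A minimal sketch of that extraction in Java, with the backslashes doubled for a string literal. The class name, the helper method, and the sample inputs are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexSketch {

    // \\1 refers back to whichever quote character opened the attribute value.
    private static final Pattern HREF =
            Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");

    // Hypothetical helper: returns the first video href, or null if none matches.
    static String firstVideoHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(2) + "." + m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstVideoHref("<a href=\"http://somevideo.flv\">somevideoname.avi</a>"));
        System.out.println(firstVideoHref("<a href='http://host/clip.mp4'>clip</a>"));
    }
}
```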

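Your regex is greedy. Limit its greediness by making the quantifiers lazy. Note that \\S+ has to become lazy as well, otherwise it still runs on to the last extension on the line:
(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)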

Jsoup get hidden email

I am parsing pages for email data. How would I get a hidden email that is generated using JavaScript? This is the page I am parsing.
If you take a look at the HTML source (using Firebug or something else), you will see that it is a link tag generated inside a div named sobi2Details_field_email that is set to display:none.
This is my code for now, but the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5 = doc.getElementById("sobi2Details_field_email");
if (e5 != null)
{
    emaildata = e5.child(1).absUrl("href");
}
System.out.println(emaildata);

You need to do several steps because jsoup doesn't allow you to execute JavaScript.
I reverse engineered it and this is what came out:
public static void main(final String[] args) throws IOException
{
    final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
    final Document doc = Jsoup.connect(url).get();
    final Element e5 = doc.getElementById("sobi2Details_field_email");
    System.out.println("--- this is how we start");
    System.out.println(e5 + "\n\n\n\n");
    // decode the HTML entities
    System.out.println("--- Remove XML encoding\n");
    String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
    System.out.println(email + "\n\n\n\n");
    // remove the concatenation with ' + '
    System.out.println("--- Remove concatenation (all: ' + ')");
    email = email.replaceAll("' \\+ '", "");
    System.out.println(email + "\n\n\n\n");
    // keep only the lines that define the email address variables
    System.out.println("--- Remove useless lines");
    Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
    matcher.find();
    email = matcher.group();
    System.out.println(email + "\n\n\n\n");
    // take the two strings enclosed in '' and concatenate them
    System.out.println("--- Extract the email address");
    matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
    matcher.find();
    email = matcher.group(1) + matcher.group(2);
    System.out.println(email);
}

If something is generated dynamically with JavaScript on the client side after the response from the server is complete, there are only two options:
1) Reverse engineering: figure out what the script does and implement the same behaviour yourself.
2) Download the JavaScript from the processed page and use Java's JavaScript processor to execute the script and get the result (yes, it is possible, and I was forced to do such a thing).
