java / jsoup - retrieve language - java

I use jsoup to crawl content from specific websites.
For example, meta tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
I also need to crawl the language. This is what I tried:
String meta_language = doc.select("http-equiv").first().attr("content");
But it throws:
java.lang.NullPointerException
Could anybody help me with this?
Greetings!

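The selector "http-equiv" looks for an element named http-equiv. No such element exists, so first() returns null and the call to attr() throws the NullPointerException. http-equiv is an attribute of the meta tag, so select on the attribute instead. Try this:
// assumes the page declares its language in a <meta http-equiv="Content-Language"> tag
// Elements.attr() returns an empty string instead of throwing when nothing matches
String meta_language = doc.select("meta[http-equiv=Content-Language]").attr("content");
System.out.println("Meta language : " + meta_language);
If your meta tag holds a list of values, such as keywords, you can read it the same way:
// get meta keyword content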
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

Related

How to fetch an URL containing non-ASCII characters (ą, ś ...) with Jsoup?

I am using jsoup to parse some Polish sites, but I have a problem with special characters like "ą" and "ś" in the URL(!). For example, example.com/kąt is read as example.com/k
Every query without these special characters works perfectly.
I have tried Document doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url) but it does not work.
Any other tips?
You want to encode your URL before passing it to Jsoup.
SAMPLE CODE
String url = "http://sjp.pl/maść";
System.out.println("BEFORE " + url);
String encodedURL = URI.create(url).toASCIIString();
System.out.println("AFTER " + encodedURL);
System.out.println("Title: " + Jsoup.connect(encodedURL).get().title());
OUTPUT
BEFORE http://sjp.pl/maść
AFTER http://sjp.pl/ma%C5%9B%C4%87
Title: maść - Słownik SJP
French locale
Jsoup 1.8.3
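The encoding step can be checked on its own, without a network call. Below is a minimal sketch that uses only the JDK's java.net.URI. The class name UrlEncodingDemo and the helper toAscii are made up for illustration, and the literal uses Unicode escapes so that it compiles regardless of the source file encoding.

```java
import java.net.URI;

class UrlEncodingDemo {
    // Percent-encodes non-ASCII characters as UTF-8 bytes; ASCII parts stay unchanged.
    static String toAscii(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // The two escapes spell the last two letters of the URL from the sample above.
        String url = "http://sjp.pl/ma\u015B\u0107";
        System.out.println(toAscii(url)); // http://sjp.pl/ma%C5%9B%C4%87
    }
}
```

Note that URI.create rejects characters that are illegal in a URI, such as spaces, with an IllegalArgumentException, so this approach only covers non-ASCII letters in an otherwise valid URL.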

Simple way templating multiline strings in java code

I often run into the following situation: I have long multiline strings into which properties must be injected, i.e. something like templating. But I don't want to include a complete templating engine (like Velocity or FreeMarker) in my projects.
How can this be done in a simple way?
String title = "Princess";
String name = "Luna";
String community = "Stackoverflow";
String text =
"Dear " + title + " " + name + "!\n" +
"This is a question to " + community + "-Community\n" +
"for simple approach how to code with Java multiline Strings?\n" +
"Like this one.\n" +
"But it must be simple approach without using of Template-Engine-Frameworks!\n" +
"\n" +
"Thx for ...";
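You can create your own small and simple template engine with a few lines of code:
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public static void main(String[] args) throws IOException {
    String title = "Princess";
    String name = "Luna";
    String community = "Stackoverflow";
    String text;
    // readAllBytes() (Java 9+) reads the whole resource; available() may report fewer bytes than the file holds
    try (InputStream stream = DemoMailCreater.class.getResourceAsStream("demo.mail")) {
        // assumes demo.mail is saved as UTF-8
        text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }
    // replace() substitutes literally; replaceAll() would interpret its arguments as regex
    text = text.replace("§TITLE§", title);
    text = text.replace("§NAME§", name);
    text = text.replace("§COMMUNITY§", community);
    System.out.println(text);
}
and a small text file, demo.mail, e.g. in the same folder (package):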
Dear §TITLE§ §NAME§!
This is a question to §COMMUNITY§-Community
for simple approach how to code with Java multiline Strings?
Like this one.
But it must be simple approach without using of Template-Engine-Frameworks!
Thx for ...
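The same idea can be generalised so that the placeholder names are not hard-coded. This is only a sketch under assumptions: the class TinyTemplate, its render method and the MARK constant are hypothetical names, and the § marker is written as the escape \u00A7 so the file compiles regardless of source encoding.

```java
import java.util.Map;

class TinyTemplate {
    // The section sign used as the placeholder delimiter in demo.mail
    static final String MARK = "\u00A7";

    // Replaces every MARK + key + MARK occurrence with the mapped value.
    static String render(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace is literal, so '$' or '\' in a value needs no escaping
            result = result.replace(MARK + entry.getKey() + MARK, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "Dear " + MARK + "TITLE" + MARK + " " + MARK + "NAME" + MARK + "!";
        System.out.println(render(template, Map.of("TITLE", "Princess", "NAME", "Luna")));
    }
}
```

Placeholders without a matching key are left in the text untouched, which makes missing values easy to spot in the output.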
With Java 15+:
String title = "Princess";
String name = "Luna";
String community = "Stackoverflow";
String text = """
Dear %s %s!
This is a question to %s-Community
for simple approach how to code with Java multiline Strings?
""".formatted(title, name, community);
One basic way of doing it would be to use String.format(...)
Example:
String title = "Princess";
String name = "Celestia";
String community = "Stackoverflow";
String text = String.format(
"Dear %s %s!%n" +
"This is a question to %s-Community%n" +
"for simple approach how to code with Java multiline Strings?%n" +
"Like this one.%n" +
"But it must be simple approach without using of Template-Engine-Frameworks!%n" +
"%n" +
"Thx for ...", title, name, community);
More info
You can use Java Resources in order to achieve it HERE
Or you can keep the current method you're using with different approach like HERE
You can use String#format():
String title = "Princess";
String name = "Luna";
String community = "Stackoverflow";
String text = String.format("Dear %s %s!\n" +
"This is a question to %s-Community\n" +
"for simple approach how to code with Java multiline Strings?\n" +
"Like this one.\n" +
"But it must be simple approach without using of Template-Engine-Frameworks!\n" +
"\n" +
"Thx for ...", title, name, community);
Java has no built-in support for templating. Your choices are:
use an existing templating framework / engine,
build your own templating framework / engine (or similar), or
write a lot of "string bashing" code ... like in your question.
You may be able to write the above code a bit more concisely using String.format(...), MessageFormat and similar, but they don't get you very far ... unless your templating is very simple.
By contrast, some languages have built-in support for string interpolation, "here" documents, or a concise structure building syntax that can be adapted to templating.
You can use java.text.MessageFormat for this:
String[] args = {"Princess", "Luna", "Stackoverflow"};
// MessageFormat placeholders are zero-based: {0} is the first argument
String text = MessageFormat.format("Bla bla, {0}, and {1} and {2}", args);
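A self-contained sketch of the MessageFormat approach follows. The class MessageFormatDemo and the method greet are illustrative names, and the pattern text is invented for the example.

```java
import java.text.MessageFormat;

class MessageFormatDemo {
    static String greet(String title, String name, String community) {
        // {0}, {1}, {2} refer to the arguments by position, starting at zero
        return MessageFormat.format(
                "Dear {0} {1}, welcome to the {2}-Community", title, name, community);
    }

    public static void main(String[] args) {
        // prints: Dear Princess Luna, welcome to the Stackoverflow-Community
        System.out.println(greet("Princess", "Luna", "Stackoverflow"));
    }
}
```

One thing to watch for: MessageFormat treats a single quote as the start of quoted text, so a literal apostrophe in the pattern has to be written as two single quotes.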

Jsoup get hidden email

I am parsing pages for email data. How would I get a hidden email that is generated using JavaScript? This is the page I am parsing: a page
If you take a look at the HTML source (using Firebug or something else), you will see that it is a link tag generated inside a div named sobi2Details_field_email and set to display:none.
This is my code for now, but the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5=doc.getElementById("sobi2Details_field_email");
if(e5!=null)
{
emaildata=e5.child(1).absUrl("href").toString();
}
System.out.println (emaildata);
You need to do several steps because Jsoup doesn't allow you to execute JavaScript.
I reverse engineered it and this is what came out:
public static void main(final String[] args) throws IOException
{
final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
final Document doc = Jsoup.connect(url).get();
final Element e5 = doc.getElementById("sobi2Details_field_email");
System.out.println("--- this is how we start");
System.out.println(e5 + "\n\n\n\n");
// remove the xml encoding
System.out.println("---Remove XML encoding\n");
String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
System.out.println(email + "\n\n\n\n");
// remove the concatenation with ' + '
System.out.println("--- Remove concatenation (all: ' + ')");
email = email.replaceAll("' \\+ '", "");
System.out.println(email + "\n\n\n\n");
// extract the email address variables
System.out.println("--- Remove useless lines");
Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group();
System.out.println(email + "\n\n\n\n");
// get the two strings enclosed by '' and concatenate them
System.out.println("--- Extract the email address");
matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
matcher.find();
email = matcher.group(1) + matcher.group(2);
System.out.println(email);
}
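The string-joining and extraction steps can be tried in isolation, without fetching the page. The sketch below runs on a made-up script fragment: the SAMPLE text, the class CloakedEmailDemo and the method extractEmail are all hypothetical, and the real page's script may be laid out differently, so the patterns would need adjusting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloakedEmailDemo {
    // Hypothetical cloaking script, already unescaped; not taken from the real page.
    static final String SAMPLE =
            "var addy123 = 'in' + 'fo' + '@';\n"
            + "addy123 = addy123 + 'exam' + 'ple' + '.' + 'com';\n";

    static String extractEmail(String script) {
        // Join split literals: 'in' + 'fo' becomes 'info'
        String joined = script.replaceAll("' \\+ '", "");
        // Collect every remaining single-quoted literal in order of appearance
        Matcher matcher = Pattern.compile("'([^']*)'").matcher(joined);
        StringBuilder email = new StringBuilder();
        while (matcher.find()) {
            email.append(matcher.group(1));
        }
        return email.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractEmail(SAMPLE)); // info@example.com
    }
}
```

Looping over find() instead of calling group() after a single unchecked find() avoids an IllegalStateException when the page contains no match; the method then simply returns an empty string.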
If something is generated dynamically with JavaScript on the client side after the response from the server is complete, then there is no other way than:
Reverse engineering: figure out what the server-side script does and try to implement the same behaviour.
Downloading the JavaScript from the processed page and using Java's JavaScript processor to execute the script and get the result (yes, it is possible, and I was forced to do such a thing). Here you have a basic example showing how to evaluate JavaScript in Java.

Passing parameters to YQL using Java

I am using the following code:
String zip = "75227";
String str = "http://query.yahooapis.com/v1/public/yql?q=select%20Title%2C%20Address%2C%20" +
"City%2C%20State%2C%20Phone%2C%20Distance%20from%20local.search%20where%20query%3D%22" +
"food%20pantries%22%20and%20zip%3D%22" + zip +"%22%20and%20(category%3D%2296927050%22%20or" +
"%20category%3D%2296934498%22)%20%7C%20sort(field%3D%22Distance%22)";
Document doc = Jsoup.connect(str).get();
This produces the results I want when I replace the zip code value. I would also like to change the location. I tried the same approach I used for the zip code:
String zip = "32207";
String service = "food pantry";
String testOne = "http://query.yahooapis.com/v1/public/yql?q=select%20Title%2C%20Address%2C%20" +
"City%2C%20State%2C%20Phone%2C%20Distance%20from%20local.search%20where%20query%3D%22" +
service + "%22%20and%20zip%3D%22" + zip +"%22%20and%20(category%3D%2296927050%22%20or" +
"%20category%3D%2296934498%22)%20%7C%20sort(field%3D%22Distance%22)";
When used this way the variable "service" gave me an error.
I initially tried to use the yql table like this:
String search = "http://query.yahooapis.com/v1/public/yql?q=";
String table = "select Title, Address, City, State, Phone, Distance from local.search where " +
"query=\"food pantries\" and zip=\"75227\" and (category=\"96927050\" or category=" +
"\"96934498\") | sort(field=\"Distance\")";
String searchText = search + table;
UPDATE:
Here is the error I am getting:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=505, URL=http://query.yahooapis.com/v1/public/yql?q=select%20Title%2C%20Address%2C%20City%2C%20State%2C%20Phone%2C%20Distance%20from%20local.search%20where%20query%3D%22food pantry%22%20and%20zip%3D%2232207%22%20and%20(category%3D%2296927050%22%20or%20category%3D%2296934498%22)%20%7C%20sort(field%3D%22Distance%22)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:418)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at org.jsoup.examples.HtmlToPlainText.main(HtmlToPlainText.java:86)
However, this did not work either. Any ideas on how I can do this search and provide the service and zip code as variables?
Have you tried replacing String service = "food pantry"; with String service = "food%20pantry"; ?
EDIT:
And is it "food pantry" or "food pantries"?
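The 505 status comes from the raw space in "food pantry" ending up in the URL. Rather than hand-encoding each variable, the whole readable YQL statement can be encoded in one step. This is a sketch only: the class YqlUrlBuilder and its methods are hypothetical names, it needs Java 10+ for the Charset overload of URLEncoder.encode, and it assumes service and zip contain no double quotes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class YqlUrlBuilder {
    static final String ENDPOINT = "http://query.yahooapis.com/v1/public/yql?q=";

    static String encode(String raw) {
        // URLEncoder writes spaces as '+'; switch to %20 to match the original URL.
        // A literal '+' in the input is already encoded as %2B, so this is safe.
        return URLEncoder.encode(raw, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String buildUrl(String service, String zip) {
        String yql = "select Title, Address, City, State, Phone, Distance from local.search where "
                + "query=\"" + service + "\" and zip=\"" + zip + "\" and (category=\"96927050\" or "
                + "category=\"96934498\") | sort(field=\"Distance\")";
        return ENDPOINT + encode(yql);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("food pantry", "32207"));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the hand-built URL, and the query stays readable in the source.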

using hyperlinks in java file

I am trying to add a hyperlink in a Java file.
TestHyperlink.java
class TestHyperlink {
String url = "stackoverflow.com/questions/ask";
String someVariable = "testUrl";
Html entryLink = new Html("<a target=\"_blank\" href=url>someVariable</a>");
}
I am trying to use the two string variables url and someVariable, but I am not sure how to do it. My hyperlink appears as 'someVariable' and leads to a broken page on click.
What I want is a hyperlink that appears as testUrl and on click leads to the desired page, stackoverflow.com/questions/ask in this case.
Thanks,
Sony
Java doesn't interpolate variables inside Strings. You need to concatenate them instead:
new Html("<a target=\"_blank\" href=\"" + url + "\">" + someVariable + "</a>");
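The concatenation can be tested as a plain string before it is handed to the Html widget. A minimal sketch, where LinkBuilder and buildLink are illustrative names and the values are assumed to contain no quotes or angle brackets (no HTML escaping is done):

```java
class LinkBuilder {
    // Builds the anchor markup from the two variables by concatenation.
    static String buildLink(String url, String label) {
        return "<a target=\"_blank\" href=\"" + url + "\">" + label + "</a>";
    }

    public static void main(String[] args) {
        // prints: <a target="_blank" href="http://stackoverflow.com/questions/ask">testUrl</a>
        System.out.println(buildLink("http://stackoverflow.com/questions/ask", "testUrl"));
    }
}
```

The example URL includes the http:// scheme on purpose: a browser treats an href without a scheme as a path relative to the current page, which would also lead to a broken link.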
