Check whether a web page exists on the web or not - Java

I am currently using Jsoup to parse HTML documents, and I use the following call to get the document first:
Document doc = Jsoup.connect(url).post();
If the URL is not a real or existing URL, an error occurs. So, is there any way to check for that and print an error message?
thanks,
Zhua

It will throw an exception.
Probably so. Why not put it in a try block and catch that kind of exception?
Something like:
try {
    Document doc = Jsoup.connect(url).post();
    // it gets here when the request succeeds
} catch (IOException e) {
    // handle the unreachable or invalid URL here
}
Note that Jsoup.connect() throws an IOException when the connection fails, and an IllegalArgumentException when the URL itself is malformed. It seems easy enough to do.
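As a complement to catching the connection exception, you can also rule out syntactically malformed URLs before ever opening a connection. A minimal stdlib sketch (no Jsoup needed; it checks syntax only, not whether the page actually exists on the network):

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlCheck {
    // Returns true if the string parses as a well-formed URL.
    // This is a syntax check only; reaching the page can still fail.
    static boolean isWellFormed(String s) {
        try {
            new URL(s).toURI();
            return true;
        } catch (MalformedURLException | URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("https://jsoup.org/"));
        System.out.println(isWellFormed("not a url"));
    }
}
```

Passing this check only means the string is a valid URL; you still need the try/catch around the actual connect() call for hosts that do not respond.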

Related

How do you use Jsoup to search for a string on a website?

I am trying to use Jsoup to search a website to see if it contains a string. Is this even possible, and if so, how is it done?
Yes, it is possible, and actually quite easy if you are using Jsoup. To simply see if a specific web page contains a specific string, you might do something like the following example:
Let's say we want to see if the following string exists within the Jsoup home page (https://jsoup.org/):
If you have any questions on how to use jsoup
Your code could look something like this:
String stringToFind = "If you have any questions on how to use jsoup";
try {
    Document doc = Jsoup.connect("https://jsoup.org/").get();
    if (doc.text().contains(stringToFind)) {
        System.out.println("Yes...String exists in web-page.");
    } else {
        System.out.println("No...String does not exist in web-page.");
    }
} catch (IOException ex) {
    // Do whatever you like to handle the exception...
}

Unable to fetch the text from the div, Although i getting that content in response

Unable to fetch the text from the div using the code below; not sure what the issue is.
try {
    connection = new URL("https://en-gb.facebook.com/Aeysunnna/videos/265359950651320/")
            .openConnection();
    Scanner scanner = new Scanner(connection.getInputStream(), "UTF-8");
    scanner.useDelimiter("\\A");
    content = scanner.next();
} catch (Exception ex) {
    ex.printStackTrace();
    System.out.println("error");
}
Document doc = Jsoup.parse(content);
Thread.sleep(10000);
Elements elements = doc.select("div._1t6k");
System.out.println(elements.text());
It is probably because your Java client is not logged in to Facebook. If you're not logged in, you will see nothing except the login page. Thus you cannot read the content of the div; you are still on the starting page.
Try printing the whole content you're getting and check whether it is actually the starting page.
You'll find an in-depth explanation here: Finding a word in a web page using java
As shown there, you will probably not find an easy fix for this issue.
Also note that scraping Facebook without using their own API violates their Terms of Service.
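The Scanner-with-"\\A" idiom in the question reads an entire stream into one String (the delimiter \A matches only the beginning of input, so next() returns everything). A self-contained sketch of that pattern, fed from an in-memory stream standing in for the HTTP response, which also shows the debugging step suggested above (inspect what you actually received before parsing):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ReadAll {
    // Reads an entire stream into one String via the "\\A" delimiter trick.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Stand-in for connection.getInputStream(); a real fetch of a
        // logged-out Facebook URL would return the login page instead
        // of the video page.
        InputStream in = new ByteArrayInputStream(
                "<html><body>login page</body></html>".getBytes(StandardCharsets.UTF_8));
        String content = readAll(in);
        // Check what you received: if it is the login page, the target
        // div will never appear in the parsed document.
        System.out.println(content.contains("login"));
    }
}
```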

jsoup to w3c-document: INVALID_CHARACTER_ERR

My usecase: Get html-pages by jsoup and returns a w3c-DOM for further processing by XML-transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
This works well for most documents, but for some it throws INVALID_CHARACTER_ERR without saying where.
It seems extremely difficult to find the error. I changed the code to first download the URL into a String and then check for bad characters with a regexp. But that does not help for bad attributes (e.g. attributes without a value) etc.
My current workaround is to minimize the risk by removing elements by tag in the Jsoup document (head, img, script ...).
Is there a more elegant solution?
Try setting the outputSettings to 'XML' for your document:
document.outputSettings().syntax(OutputSettings.Syntax.xml);
document.outputSettings().charset("UTF-8");
This should ensure that the resulting XML is valid.
Solution found by the OP in reply to nyname00:
Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"Relaxed" indeed means a relaxed developer...
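The underlying cause of INVALID_CHARACTER_ERR is that HTML tolerates element and attribute names that the W3C DOM rejects as invalid XML names, so the Jsoup-to-W3C conversion trips on them. This can be reproduced with the JDK's own DOM implementation, no Jsoup required (the name "data;x" below is a made-up example of an XML-illegal name):

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class InvalidNameDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.newDocument();
        dom.createElement("valid-name"); // a legal XML name: no exception
        try {
            // ';' is not allowed in an XML name, so the DOM rejects it,
            // exactly the error the W3CDom conversion surfaces.
            dom.createElement("data;x");
        } catch (DOMException e) {
            System.out.println("INVALID_CHARACTER_ERR: "
                    + (e.code == DOMException.INVALID_CHARACTER_ERR));
        }
    }
}
```

This is why cleaning the document first (Cleaner with a whitelist) works: it strips the elements and attributes whose names XML cannot represent.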

Using Jsoup to extract text from specific class

I'm trying to do a little Android work at the moment.
Before all the hate - yes, I have tried searching and found answers related to mine, but I simply couldn't get mine to work the way they did it :/
I have found that Jsoup is rather good for parsing data from HTML to use in an app.
So I have been trying to receive data from here: Krak
So when I enter a number as input, let's say "86202710",
the link will be: Number link
I have then tried to extract the name of the owner of the given number, which is "Jens Fisker Automobiler A/S". But I can't seem to get this text into my TextView.
Hope you guys can guide me a little...
I get an exception: "NetworkOnMainThreadException" - AndroidBlockGuardPolicy.onNetwork
Here is the code I have written for the method that extracts the owner of the number:
public void getData() throws IOException {
    URL url = new URL("http://mobil.krak.dk/h/#companyResult&searchWord=86202710");
    Document doc = Jsoup.parse(url, 3000);
    Element content = doc.select("p[header bold]").first();
    text = (TextView) findViewById(R.id.tv);
    text.setText(content.text());
}
You have to run your network code in an AsyncTask on Android. Everything else will fail.
See here: Error when parsing Html using Jsoup
Do you connect through a proxy? Please check that you have the settings like those posted here: Android NetworkOnMainThreadException inside of AsyncTask
Btw., it's better to use the connect() method than parse() for such connections:
public void getData() throws IOException {
    // You can use a simple string as the url
    final String url = "http://mobil.krak.dk/h/#companyResult&searchWord=86202710";
    // Connect to the url and parse its content
    Document doc = Jsoup.connect(url).get(); // Timeout is set to 3 sec. by default
    // Everything else stays the same
    Element content = doc.select("p[header bold]").first();
    text = (TextView) findViewById(R.id.tv);
    text.setText(content.text());
}
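The NetworkOnMainThreadException itself goes away only once the fetch runs off the UI thread. A plain-Java sketch of that shape (background thread plus a result callback, which is roughly what AsyncTask wraps on Android; the class, interface, and placeholder result below are all hypothetical, and in the real app the Jsoup call replaces the placeholder):

```java
public class BackgroundFetch {
    interface Callback {
        void onResult(String text);
    }

    // Run the blocking fetch off the calling thread, then hand the
    // result to a callback - the same shape AsyncTask gives you.
    static void fetchAsync(Callback cb) {
        Thread worker = new Thread(() -> {
            // In the real app, Jsoup.connect(url).get() and the
            // doc.select(...) extraction would run here.
            String owner = "Jens Fisker Automobiler A/S"; // placeholder result
            cb.onResult(owner);
        });
        worker.start();
        try {
            worker.join(); // only for this demo; Android would post back instead
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        fetchAsync(text -> System.out.println(text));
    }
}
```

On Android the callback body would be posted back to the UI thread (e.g. via runOnUiThread) before touching the TextView, since views may only be updated from the main thread.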

why can't I load this URL with JDOM? Browser spoofing?

I'm writing some code to load and parse HTML docs from the web.
I'm using JDOM like so:
SAXBuilder parser = new SAXBuilder();
Document document = (Document)parser.build("http://www.google.com");
Element rootNode = document.getRootElement();
/* and so on ...*/
It works fine like that. However, when I change the URL to some other web sites, like "http://www.kijiji.com", for example, the parser.build(...) line hangs.
Any idea why it hangs? I'm wondering if it might be because kijiji knows I'm not a "real" web browser -- perhaps I have to spoof my HTTP request so it looks like it's coming from IE or something like that?
Any ideas are useful, thanks!
Rob
I think a few things may be going on here. The first issue is that you cannot parse regular HTML with JDOM; HTML is not XML.
Secondly, when I run kijiji.com through JDOM I get an immediate HTTP_400 response.
When I parse google.com I get an immediate XML error about well-formedness.
If you happen to be parsing XHTML at some point, though, you will likely run into this problem: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/
XHTML has a doctype that references other doctypes, etc. These each take 30 seconds to load from w3c.org...
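On the browser-spoofing guess specifically: some sites do refuse or stall requests carrying Java's default user agent, and you can present a browser-like one before connecting. A minimal sketch with the stdlib (this only sets the header; it won't fix the fundamental HTML-is-not-XML problem with JDOM):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SpoofUserAgent {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.kijiji.com");
        // openConnection() does not touch the network yet; headers can
        // still be set before the request is actually sent.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

With Jsoup the equivalent is Jsoup.connect(url).userAgent("Mozilla/5.0 ..."), and Jsoup's parser also handles non-well-formed HTML, which JDOM's SAXBuilder cannot.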
