jsoup select() method not found - java

I want to make a web parser in Java using jsoup, but I get an error like this:
How do I fix it? Is it because of my classpath?
These are my imports:

Use the code below. Note that select() returns an Elements collection, while first() returns a single Element:
Document doc = Jsoup.connect(url).get();
Element link = doc.select("div#content > p").first();

You have to get the Document from the Connection:
Connection con = Jsoup.connect(url);
Document doc = con.get();

Related

Simplest example - how to parse html with saxon using java?

Could anyone say why the following code does not give any results?
The HTML is certainly valid and contains plenty of "div" elements.
// imports: net.sf.saxon.s9api.*, java.io.File
Processor proc = new Processor(false);
// Use TagSoup as the source parser so that (possibly malformed) HTML can be read
proc.setConfigurationProperty("http://saxon.sf.net/feature/sourceParserClass", "org.ccil.cowan.tagsoup.Parser");
XPathCompiler xpath = proc.newXPathCompiler();
DocumentBuilder builder = proc.newDocumentBuilder();
XdmNode doc = builder.build(new File("/tmp/test.html"));
XPathSelector selector = xpath.compile("//div").load();
selector.setContextItem(doc);
for (XdmItem item : selector) {
    System.out.println(((XdmNode) item).getNodeName());
}
I took that code from the Saxon samples and added the proc.setConfigurationProperty(...) line in order to parse HTML input.
All I want is to:
1) submit an HTML string
2) get a document node
3) make some queries with XPath 3.0
Thank you.
P.S. I don't want to use XSLT.
Changing "//div" to "//*[name()="div"]" solved the problem.

Plugin for xml api response

I am getting an HTML response from the server, but I want an XML response from my web server. Which plugin should I use to get an XML response from the server? Any help would be much appreciated.
Keep in mind that whether the XML comes from a web server, a file, or anywhere else, it makes no difference to the Java XPath API.
So you can apply the following:
// Connection part
URL url = new URL(addressToConnect);
HttpURLConnection connexion = (HttpURLConnection) url.openConnection();
connexion.connect();
InputSource geocoderResultInputSource = new InputSource(connexion.getInputStream());
// XML/XPath part
Document geocoderResultDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(geocoderResultInputSource);
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeListCodeResult = (NodeList) xpath.evaluate("//status", geocoderResultDocument, XPathConstants.NODESET);
etc...
More examples here, from a project I'm working on.
Hope it helps ;)

Working with Xpath in android

In the past I've used XPath to find the values of specific nodes in an XML document fetched from a URL. Now I want to use that same code, but with an XML document stored locally on the Android phone, at say sdcard/images/xml/newxml.xml.
Here is the old code that I would like to adapt; I just cannot figure out how to use the local XML file instead of a URL.
URL url = new URL("UrlWentHere");
InputSource xml = new InputSource(url.openStream());
XPath xpath = XPathFactory.newInstance().newXPath();
datafromxml = xpath.evaluate("//forecast_conditions[1]/high/#data", xml);
I don't quite understand the question. Why not just URL url = new URL("sdcard/images/xml/newxml.xml"); - or does the problem have to do with the app's restricted access to the file system?
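If reading from the SD card is the problem, one option is to open the file directly with a FileInputStream rather than a URL stream. A minimal sketch, assuming the path from the question and that the app is allowed to read external storage:
// imports: android.os.Environment, java.io.*, javax.xml.xpath.*, org.xml.sax.InputSource
File xmlFile = new File(Environment.getExternalStorageDirectory(), "images/xml/newxml.xml");
InputSource xml = new InputSource(new FileInputStream(xmlFile));
XPath xpath = XPathFactory.newInstance().newXPath();
String datafromxml = xpath.evaluate("//forecast_conditions[1]/high/@data", xml);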

Check whether web page exist on the web or not

I am currently using Jsoup to parse an HTML document, and I use the following command to get the document first:
Document doc = Jsoup.connect(url).post();
If the URL is not a real or existing URL, an error message appears. So, is there any way to check for that and print an error message?
Thanks,
Zhua
It will throw an exception?
Probably so. Why not put it in a try block and check for that kind of exception? jsoup's post() declares IOException, so that is what you catch.
Something like:
try {
    Document doc = Jsoup.connect(url).post();
    // it gets here when it works
} catch (IOException e) {
    // handle the failure, e.g. print an error message
}
It seems easy to do.
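If you want to distinguish a missing page from other failures, jsoup can also be told not to throw on HTTP error codes, so you can inspect the status yourself. A sketch using jsoup's Connection.execute() and ignoreHttpErrors() (note that an unreachable host will still throw an IOException):
Connection.Response res = Jsoup.connect(url)
        .ignoreHttpErrors(true)  // don't throw on 404, 500, ...
        .execute();
if (res.statusCode() == 200) {
    Document doc = res.parse();  // the page exists; parse it
} else {
    System.err.println("Page not available, HTTP status: " + res.statusCode());
}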

Some help scraping a page in Java

I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should load the page into a DOM Document to navigate it.
I've tried reading the documentation, but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
1) Transform the web page you are trying to scrape into an XHTML document. There are several options for doing this in Java, such as JTidy and HTMLCleaner. These tools also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2) Extract the required information using XPath expressions.
Here is a working example using JTidy and the web page you provided, which extracts all file names from the table.
public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
    // Parse an HTML page into a DOM document
    URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
    Document doc = tidy.parseDOM(url.openStream(), System.out);
    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
    NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    List<String> filenames = new ArrayList<String>();
    for (int i = 0; i < nodes.getLength(); i++) {
        filenames.add(nodes.item(i).getNodeValue());
    }
    System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
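For instance, a minimal regex sketch, assuming the page source is already in a String named html and the table is simply structured (a pattern like this will break on nested tables):
// imports: java.util.regex.*, java.util.*
Pattern cell = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher m = cell.matcher(html);
List<String> cells = new ArrayList<String>();
while (m.find()) {
    cells.add(m.group(1).replaceAll("<[^>]+>", "").trim());  // strip any nested tags
}
System.out.println(cells);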
If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. DOM documents use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
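A minimal SAX sketch of that idea, assuming the page has already been tidied into well-formed XHTML (saved here as the hypothetical table.xhtml), since SAX parsers reject malformed HTML:
// imports: javax.xml.parsers.*, org.xml.sax.Attributes, org.xml.sax.helpers.DefaultHandler, java.io.File, java.util.*
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
final List<String> cells = new ArrayList<String>();
parser.parse(new File("table.xhtml"), new DefaultHandler() {
    private boolean inCell;
    private final StringBuilder text = new StringBuilder();
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("td".equals(qName)) { inCell = true; text.setLength(0); }
    }
    public void characters(char[] ch, int start, int length) {
        if (inCell) text.append(ch, start, length);  // accumulate cell text
    }
    public void endElement(String uri, String local, String qName) {
        if ("td".equals(qName)) { inCell = false; cells.add(text.toString().trim()); }
    }
});
System.out.println(cells);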
