Extract the main part of a page in Java

Hello,
I have a Wikipedia page about a personality, and I want to extract, using Java, the HTML source code of the main part of the page.
Do you have any ideas?

Use Jsoup, specifically the selector syntax.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;
import java.net.URL;

Document doc = Jsoup.parse(new URL("http://en.wikipedia.org/"), 10000);
Elements interestingParts = doc.select("div.interestingClass");
// get the combined HTML fragments as a String
String selectedHtmlAsString = interestingParts.html();
// get all the links
Elements links = interestingParts.select("a[href]");
// filter the document to include certain tags only
Whitelist allowedTags = Whitelist.simpleText().addTags("blockquote", "code", "p");
Cleaner cleaner = new Cleaner(allowedTags);
Document filteredDoc = cleaner.clean(doc);
It's a very useful API for parsing HTML pages and extracting the desired data.
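For the Wikipedia case in the question, here is a minimal sketch; it assumes the article body sits in the element with id mw-content-text, which holds for current Wikipedia skins but is worth verifying against the page source:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URL;

public class WikipediaMainPart {
    public static void main(String[] args) throws Exception {
        // parse the article page with a 10 second timeout
        Document doc = Jsoup.parse(
                new URL("https://en.wikipedia.org/wiki/Alan_Turing"), 10000);

        // Wikipedia usually wraps the article text in #mw-content-text;
        // confirm against the live markup before relying on it
        Element mainPart = doc.getElementById("mw-content-text");
        if (mainPart != null) {
            System.out.println(mainPart.html()); // HTML of the main part only
        }
    }
}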

For Wikipedia there is also an API: http://www.mediawiki.org/wiki/API:Main_page
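If you go the API route instead of scraping, here is a minimal sketch using the standard action=parse endpoint; the page title is illustrative, and a real program would decode the JSON with a proper JSON library:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class MediaWikiApiExample {
    public static void main(String[] args) throws Exception {
        // action=parse returns the rendered HTML of a page, wrapped in JSON
        String title = URLEncoder.encode("Alan Turing", "UTF-8");
        URL api = new URL("https://en.wikipedia.org/w/api.php"
                + "?action=parse&format=json&page=" + title);

        BufferedReader in = new BufferedReader(
                new InputStreamReader(api.openStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line);
        }
        in.close();

        // the article HTML sits under parse.text in the response
        System.out.println(json);
    }
}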

Analyze the web page's structure, then use Jsoup to parse the HTML.

Note that this returns a String (a blob of sorts) of the HTML source code, not a nicely formatted content item.
I use this myself: a little snippet I keep around for whatever I need. Pass in the URL, any start and stop text, or the boolean to get everything.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static String getPage(String url,
                             String startText,
                             String stopText,
                             boolean getAll) throws Exception {
    StringBuilder page = new StringBuilder();
    URL target = new URL(url);
    URLConnection conn = target.openConnection();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
    String inputLine;
    if (getAll) {
        // keep every line of the page
        while ((inputLine = in.readLine()) != null) {
            page.append(inputLine);
        }
    } else {
        // keep only the lines between the start and stop markers
        boolean save = false;
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains(startText))
                save = true;
            if (save)
                page.append(inputLine);
            if (save && inputLine.contains(stopText))
                break;
        }
    }
    in.close();
    return page.toString();
}
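For example (the marker strings below are illustrative; pick text that actually brackets the part of the page you want):

// everything between the first line containing "bodyContent" and
// the next line containing "printfooter" (markers are illustrative)
String fragment = getPage("https://en.wikipedia.org/wiki/Alan_Turing",
        "bodyContent", "printfooter", false);

// or fetch the whole page
String whole = getPage("https://en.wikipedia.org/wiki/Alan_Turing", null, null, true);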

Related

How can I get specific words from a URL in Java

How can I get specific words from a URL in Java? For example, I want to take data from a class whose name is something like blablabla.
Here is my code:
URL url = new URL("https://www.doviz.com/");
URLConnection connect = url.openConnection();
InputStream is = connect.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line = null;
while((line = br.readLine()) != null)
{
System.out.println(line);
}
Take a look at Jsoup; it will allow you to get the content of a web page rather than the raw HTML code. Let's say it plays the role of the browser: it parses the HTML tags into human-readable text.
Once you have the content of your page in a String, you can count the occurrences of your word using any occurrence-counting algorithm.
Simple example to use it:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
/* ........ */
String URL = "https://www.doviz.com/";
Document doc = Jsoup.connect(URL).get();
String text = doc.body().text();
System.out.println(text);
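To do the occurrence count mentioned above, here is a minimal sketch that builds on the text variable from the example (the target word "dolar" is just an illustration):

// count how many times a word appears in the extracted page text
String word = "dolar"; // illustrative target word
int count = 0;
int idx = text.indexOf(word);
while (idx != -1) {
    count++;
    idx = text.indexOf(word, idx + word.length());
}
System.out.println(word + " appears " + count + " times");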
EDIT
If you don't want to use a parser (you mentioned in a comment that you don't want external libraries), you will get the whole HTML code of the page. This is how you can do it:
try {
    URL url = new URL("https://www.doviz.com/");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
        /* str holds one line at a time; if you want to keep the whole page,
           append each line to a StringBuilder instead of printing it */
    }
    in.close();
} catch (Exception e) {
    e.printStackTrace(); // don't swallow the exception silently
}

How do I create a Regular Expression to find valid URL's in a webpage?

I'm writing a program that caches every webpage it can find. It works by caching a website into a file and then looking for all of the valid URLs in that file. Then it scans all of those valid URLs recursively. The problem is, I can't find a regex, or any other way, to find the valid URLs. So far, this is my code:
public static void findAllPages(String baseURL) throws Exception {
    URL url = new URL(baseURL);
    BufferedReader bf = new BufferedReader(new InputStreamReader(url.openStream()));
    String cnt = ""; // HTML content read from URL
    String ln;       // line
    while ((ln = bf.readLine()) != null) { // read content
        cnt += (ln + "\n");
    }
    ArrayList<String> val = findUrlsInString(baseURL);
    int count = val.size();
    for (int i = 0; i < count; i++) { // find content of links on page
        try {
            findAllPages(val.get(i));
        } catch (Exception e) {
            // invalid URL
        }
    }
}

public static ArrayList<String> findUrlsInString(String url) {
    // Need to filter out URLs here and put them in an ArrayList
    return new ArrayList<>();
}
Note: There is no reading/writing files in the code above
You should use an HTML parser instead of a regexp. One example of such a parser is jsoup.
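For example, the findUrlsInString stub from the question could be implemented with jsoup instead of a regex. A minimal sketch, assuming jsoup is on the classpath:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;

public static ArrayList<String> findUrlsInString(String url) throws Exception {
    ArrayList<String> found = new ArrayList<>();
    Document doc = Jsoup.connect(url).get();
    // a[href] selects every anchor that has an href attribute;
    // absUrl("href") resolves relative links against the page URL
    for (Element link : doc.select("a[href]")) {
        found.add(link.absUrl("href"));
    }
    return found;
}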

Get Prices from html page

Hello, I want to understand how to get all the prices from this example page: http://www.ebay.com/sch/i.html?_sacat=0&_nkw=iphone+5&_frs=1
and return the values so I can, let's say, add them into a database along with the product name.
String weburl = "http://www.ebay.com/sch/i.html?_sacat=0&_nkw=iphone+5&_frs=1";
URL oracle = new URL(weburl);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String line;
while ((line = in.readLine()) != null) {
    if (line.contains("EUR</b>")) {
        String command = line.split("EUR</b>").toString();
        final String value = command.substring(8);
        final StringTokenizer s = new StringTokenizer(value, " ");
        final String DurationString = s.nextToken();
        System.out.println("Timh: " + DurationString);
    }
}
in.close();
This does not work for me so far.
How should I change it?
You can use jsoup for this.
Note that eBay also offers various APIs that will suit your purposes:
https://go.developer.ebay.com/
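A minimal jsoup sketch of the scraping route; the span.price selector is hypothetical, so inspect the result page and substitute the class eBay actually uses for prices:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EbayPrices {
    public static void main(String[] args) throws Exception {
        String weburl = "http://www.ebay.com/sch/i.html?_sacat=0&_nkw=iphone+5&_frs=1";
        Document doc = Jsoup.connect(weburl)
                .userAgent("Mozilla/5.0") // some sites serve bots differently
                .get();

        // HYPOTHETICAL selector: replace "span.price" with the real class
        // that wraps each listing's price in the current eBay markup
        for (Element price : doc.select("span.price")) {
            System.out.println("Timh: " + price.text());
        }
    }
}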

How can I get specific text from a webpage

I've looked for answers to this question on Stack Overflow and Google but couldn't really find what I was looking for.
When I retrieve data from a page, like this one, with this code:
public class ConsoleSearch {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.stackoverflow.com");
        URLConnection cnt = url.openConnection();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(cnt.getInputStream()));
        String content;
        while ((content = br.readLine()) != null) {
            System.out.println(content);
        }
        br.close();
    }
}
I obviously get the HTML tags and everything else that comes with them.
I can easily filter the HTML using HtmlCleaner.
The challenging part, and where I find myself stuck, is when I want to retrieve specific text from all the retrieved data.
For example, if I wanted to retrieve only the text "Nova Scotia" and/or "Europe", how would I do that?
Pattern p = Pattern.compile("Nova Scotia");
Matcher m = p.matcher(content);
boolean b = m.find(); // find() locates the phrase anywhere in the input; matches() would require the entire string to match
Have a look at the java.util.regex package; it will be helpful to you.
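A slightly fuller sketch, assuming the page has already been accumulated into a single String named page (both phrases from the question are included in the pattern):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// 'page' is assumed to hold the whole document read from the BufferedReader
Pattern p = Pattern.compile("Nova Scotia|Europe");
Matcher m = p.matcher(page);
while (m.find()) {
    System.out.println("Found \"" + m.group() + "\" at index " + m.start());
}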

How to get HTML links from a URL

I'm just starting out on my networking assignment and I'm already stuck.
The assignment asks me to check a user-provided website for links and to determine whether they are active or inactive by reading the header info.
So far, after googling, I just have this code, which retrieves the website. I don't understand how to go over this information and look for HTML links.
Here's the code:
import java.net.*;
import java.io.*;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(yc.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
Please help.
Thanks!
You can also try jsoup, an HTML retriever and parser:
Document doc = Jsoup.parse(new URL("<url>"), 2000);
Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
String href = link.attr("href");
System.out.println("title: " + link.text());
System.out.println("href: " + href);
}
With this code you can list and analyze all the anchors inside a div with class "post-title" from the URL.
You can try this:
URL url = new URL(link);
Reader reader = new InputStreamReader(url.openStream());
new ParserDelegator().parse(reader, new Page(), true);
Then create a class called Page:
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class Page extends HTMLEditorKit.ParserCallback {
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // look the href attribute up directly instead of assuming
            // it is the first attribute in the set
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}
"I don't get how to go over this information and look for HTML links"
"I cannot use any external library on my assignment"
You have a couple of options:
1) You can read the web page into an HTMLDocument. Then you can get an iterator from the document to find all the HTML.Tag.A tags. Once you find an A tag, you can get the HTML.Attribute.HREF from its attribute set (a sketch of this appears after the option 2 code below).
2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then whenever you find an A tag, you can get the href attribute, which will again contain the link. The basic code for invoking the parser callback is:
MyParserCallback parser = new MyParserCallback();

// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);

// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());

try {
    new ParserDelegator().parse(reader, parser, true);
} catch (IOException e) {
    System.out.println(e);
}
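And here is the option 1 sketch promised above: it reads the page into an HTMLDocument and walks the A tags with the document iterator. Only standard javax.swing.text classes are used, so no external library is needed:

import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // without this the parser aborts on pages that declare a charset
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        Reader reader = new InputStreamReader(
                new URL("http://yahoo.com").openStream());
        kit.read(reader, doc, 0);

        // walk every <a> tag and pull out its href attribute
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                System.out.println(href);
            }
        }
    }
}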
HtmlParser is what you need here. A lot of things can be done with it.
You need to get the HTTP status code that the server returned with the response. A server will return a 404 if the page does not exist.
Check out HttpURLConnection:
http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html
specifically its getResponseCode method.
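A minimal sketch of that check, using a HEAD request so the body is not downloaded; treating any status code below 400 as active is a common convention, not something the assignment dictates:

import java.net.HttpURLConnection;
import java.net.URL;

public static boolean isActive(String link) {
    try {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(link).openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body
        int code = conn.getResponseCode();
        conn.disconnect();
        return code < 400; // 404 and friends count as inactive
    } catch (Exception e) {
        return false; // unreachable host, malformed URL, etc.
    }
}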
I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and allows you to access it like XML. Then you can process the link elements and try to follow them like you did for the original page.
You can check out some sample code that does this.
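A minimal NekoHTML sketch, assuming its DOMParser is on the classpath; note that NekoHTML reports element names in upper case by default:

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NekoLinks {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse("http://yahoo.com"); // Neko repairs the malformed HTML
        Document doc = parser.getDocument();

        // element names are upper-cased by NekoHTML's default configuration
        NodeList anchors = doc.getElementsByTagName("A");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            System.out.println(a.getAttribute("href"));
        }
    }
}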
