I'm using Boilerpipe to extract text from a URL, using this code:
URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);
The String text contains just the text of the HTML page, but I need to extract the whole HTML code from it.
Has anyone used this library and knows how to extract the HTML code?
You can check the demo page for more info on the library.
For something as simple as this you don't really need an external library:
URL url = new URL("http://www.google.com");
InputStream is = url.openStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line).append("\n"); // readLine() strips line breaks, so add them back
}
br.close();
String htmlContent = sb.toString();
Just use the KeepEverythingExtractor instead of the ArticleExtractor.
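For reference, a minimal sketch of that, assuming the same Boilerpipe setup as in the question (note that KeepEverythingExtractor still returns extracted text; it just keeps all text blocks instead of only the main article):

URL url = new URL("http://www.example.com/some-location/index.html");
// Keeps every text block instead of discarding boilerplate
String text = KeepEverythingExtractor.INSTANCE.getText(url);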
But this is using the wrong tool for the job. What you want is just to download the HTML content of a URL (right?), not to extract content. So why use a content extractor?
With Java 7 and a Scanner trick, you can do the following:
public static String toHtmlString(URL url) throws IOException {
    Objects.requireNonNull(url, "The url cannot be null.");
    try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
        sc.useDelimiter("\\A");
        if (sc.hasNext()) {
            return sc.next();
        } else {
            return null; // or empty
        }
    }
}
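The trick is the "\\A" delimiter: \A matches the beginning of input, so a single next() call returns the entire stream as one token. Usage is then a one-liner (the URL here is just a placeholder):

String html = toHtmlString(new URL("http://www.example.com/"));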
So I run the code below and it prints "!DOCTYPE html". How do I get the content of the URL, like the HTML, for instance?
public static void main(String[] args) throws IOException {
    URL u = new URL("https://www.whitehouse.gov/");
    InputStream ins = u.openStream();
    InputStreamReader isr = new InputStreamReader(ins);
    BufferedReader websiteText = new BufferedReader(isr);
    System.out.println(websiteText.readLine());
}
According to the Java tutorial (https://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html): "When you run the program, you should see, scrolling by in your command window, the HTML commands and textual content from the HTML file located at..." Why am I not getting that?
In your program, you did not put a while loop.
URL u = new URL("https://www.whitehouse.gov/");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader websiteText = new BufferedReader(isr);
String inputLine;
while ((inputLine = websiteText.readLine()) != null) {
    System.out.println(inputLine);
}
websiteText.close();
You are only reading one line of the text.
Try this and you will see that you get two lines:
System.out.println(websiteText.readLine());
System.out.println(websiteText.readLine());
Try reading it in a loop to get all the text.
Since Java 8, BufferedReader has a lines() method, whose return type is Stream<String>. To read an entire site you could do something like this:
String htmlText = websiteText.lines()
        .reduce((text, nextLine) -> text + "\n" + nextLine)
        .orElse(null);
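A note on this approach: string concatenation inside reduce copies the accumulated text on every line. A sketch of the same thing with Collectors.joining (from java.util.stream), which appends into a single buffer instead:

String htmlText = websiteText.lines()
        .collect(Collectors.joining("\n"));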
I can open this image in my browser, but it won't load in my Java application. Why? It is supposed to be a free-to-use database; I can't see why I can't use it.
I'm using this piece of code:
public static String getContentsFromURL(String address){
String contents = "";
try{
URL url = new URL(address);
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while((line = bufferedReader.readLine()) != null){
contents += line;
}
bufferedReader.close();
}catch(IOException e){
e.printStackTrace();
}
return contents;
}
And I'm getting an IIOException "Can't find input file!"
Try this code:
URL url = new URL("http://ddragon.leagueoflegends.com/cdn/9.20.1/img/champion/Gragas.png");
Image image1 = ImageIO.read(url);
(Screenshot from my debugger showing the loaded image.)
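As a usage follow-up, a minimal sketch of saving the fetched image to disk; the output filename is just an example:

URL url = new URL("http://ddragon.leagueoflegends.com/cdn/9.20.1/img/champion/Gragas.png");
BufferedImage image = ImageIO.read(url);              // decode the PNG over HTTP (null if the format is unsupported)
ImageIO.write(image, "png", new File("Gragas.png"));  // write it back out locally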
I am trying to download the latest HTML code from this website; until recently the URL displayed all the information I needed. Recently the web designer changed the format so that only a portion of the data is displayed, and the user must hit the 'next' button to display the next portion.
The URL doesn't change, though.
Does anyone know how I can download all the information using Java?
Thanks. This is my current code:
URL url = null;
InputStream is = null;
BufferedReader br;
String line;
try {
    url = new URL("HTTP://...../..../...");
    is = url.openStream();
    br = new BufferedReader(new InputStreamReader(is));
    while ((line = br.readLine()) != null)
        System.out.println(line);
} catch (IOException e) {
    e.printStackTrace(); // don't swallow the failure silently
}
....
My objective is to download an XML feed into an InputStream, then convert it to a String so that it may be used with XmlPullParser.
I convert the InputStream into a String like this:
InputStream input_stream = connection.getInputStream();
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new InputStreamReader(input_stream, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
Here's the problem: some XML feeds define a specific encoding. Take this one, for example:
http://voxinox.ch/podcasts/valdo/feed.xml
If I use a default of "UTF-8" encoding, some characters from the feed look like a black rhombus with a question mark in it. If I use the encoding specified in the XML header (iso-8859-1), it works, unsurprisingly.
The thing is, how do I decide which encoding to use before I start reading the input stream that contains the encoding specification? Is there a better way of doing this?
Here is an example of how I get the encoding from an XML InputStream:
FileInputStream finput = new FileInputStream(myFile);
String encoding = getInputEncoding(finput);
Log.d("Encoding: ", "> " + encoding);

public String getInputEncoding(FileInputStream finput) {
    String encoding = "";
    if (finput != null) {
        try {
            BufferedReader myReader = new BufferedReader(new InputStreamReader(finput));
            // The XML declaration sits on the first line, e.g.
            // <?xml version="1.0" encoding="iso-8859-1"?>
            String getline = myReader.readLine();
            myReader.close();
            Log.d("Line: ", "> " + getline);
            // Take whatever sits between encoding=" and the next quote
            String[] separated = getline.split("encoding=\"");
            String encoding1 = separated[1];
            String[] separated2 = encoding1.split("\"");
            encoding = separated2[0];
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return encoding;
}
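A possibly cleaner alternative, sketched under the assumption that the StAX API (javax.xml.stream) is available: an XMLStreamReader reports the encoding declared in the XML prolog itself, so there is no need to split the first line by hand:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

// Returns the encoding from the <?xml ... encoding="..."?> declaration, or null if absent.
public static String getDeclaredEncoding(InputStream in) throws Exception {
    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    try {
        return reader.getCharacterEncodingScheme(); // e.g. "iso-8859-1"
    } finally {
        reader.close();
    }
}

Like the snippet above, this consumes the start of the stream, so reopen the connection (or mark/reset a buffered stream) before the actual parse.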
I am unable to get the main image and name for products on Amazon or Flipkart using Jsoup.
My Java/Jsoup code for this is:
// For amazon
Connection connection = Jsoup.connect(url).timeout(5000).maxBodySize(1024*1024*10);
Document doc = connection.get();
Elements imgs = doc.select("img#landingImage");
Elements names = doc.select("span#productTitle");
// For Flipkart
Connection connection = Jsoup.connect(url).timeout(5000).maxBodySize(1024*1024*10);
Document doc = connection.get();
Elements imgs = doc.select("img.productImage.current");
Elements names = doc.select("h1.title");
Can someone please point out what I am missing here?
URLs I have used are:
http://www.flipkart.com/lenovo-yoga-2-tablet-android-10-inch/p/itmeyqkznqa2zjf5?pid=TABEYQKXWAXMSGER&srno=b_2&offer=ExchangeOffer_LenovoYoga.&ref=9ea008ab-ae95-4f52-8ef7-3ef1a54947ae
and
http://www.amazon.com/gp/product/B00LZGBU3Y/ref=s9_psimh_gw_p504_d0_i5?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=0ESK1KNE31TBRVC8115Q&pf_rd_t=36701&pf_rd_p=1970559082&pf_rd_i=desktop
Also, I would like to do this parsing on the front end if possible, using JavaScript and jQuery.
Is there a way to do the same?
Found out the issue.
Jsoup on GAE works when we fetch the page ourselves through the URL Fetch service, using java.net.URL:
private String read(String url) throws IOException {
    URL urlObj = new URL(url);
    BufferedReader reader = new BufferedReader(new InputStreamReader(urlObj.openStream()));
    String line;
    StringBuilder sbuf = new StringBuilder();
    while ((line = reader.readLine()) != null) {
        if (line.trim().length() > 0)
            sbuf.append(line).append("\n");
    }
    reader.close();
    return sbuf.toString();
}
And then you use regular Jsoup as:
String html = read(url);
Document doc = Jsoup.parse(html);
Doing the above works very well.
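For completeness, a minimal sketch combining the read() helper above with the selectors from the original question (assuming the Amazon page still uses these element IDs):

Document doc = Jsoup.parse(read(url));
String name = doc.select("span#productTitle").text();      // product title
String img  = doc.select("img#landingImage").attr("src");  // main image URL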