How can I get data from a website in Java?

I want to get the value of "Yield" on "http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319".
How can I do this with Java?
I have tried Jsoup, and my code looks like this:
public static void main(String[] args) throws IOException {
    String url = "http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319";
    Document document = Jsoup.connect(url).get();
    Elements answerers = document.select(".c3 .floatR ");
    for (Element answerer : answerers) {
        System.out.println("Answerer: " + answerer.data());
    }
    // TODO code application logic here
}
But it returns nothing. How can I fix this?

Your code is fine. I tested it myself. The problem is the URL you're using. If I open the URL in a browser, the value fields (e.g. Yield) are empty. Using the browser development tools (Network tab), you should find a URL that looks like:
http://www.aastocks.com/en/ltp/RTQuoteContent.aspx?symbol=01319&process=y
Using this URL gives you the wanted results.
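For example, a minimal sketch using that URL. The ".c3 .floatR" selector is taken from the question and assumed to still match the page; note it uses text() rather than data(), since data() only returns the contents of nodes such as script and style and is empty for ordinary elements:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YieldScraper {
    public static void main(String[] args) throws IOException {
        // The content URL found via the browser's Network tab
        String url = "http://www.aastocks.com/en/ltp/RTQuoteContent.aspx?symbol=01319&process=y";
        Document document = Jsoup.connect(url).get();
        // Selector taken from the question; adjust if the page layout has changed
        Elements values = document.select(".c3 .floatR");
        for (Element value : values) {
            // text() returns the visible text of the element
            System.out.println("Value: " + value.text());
        }
    }
}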

The simplest solution is to create a URL instance pointing to the web page whose content you want to get, and read it using streams. For example:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static void main(String[] args) throws IOException {
    URL url = new URL("http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319");
    // Get the input stream through the URL connection
    URLConnection con = url.openConnection();
    InputStream is = con.getInputStream();
    // Once you have the input stream, it's just plain old Java IO.
    // Since you are interested in a plain-text web page, I'll use a reader
    // and print the text content to System.out. For binary content, read
    // the bytes from the stream directly and write them to the target file.
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    // read each line and write it to System.out
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}

I think Jsoup is essential for this purpose: you should not assume the page is a valid HTML document, and Jsoup copes with malformed markup.
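As a quick illustration (a made-up fragment, just to show the behaviour), Jsoup normalizes broken markup into a well-formed tree:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MalformedHtmlDemo {
    public static void main(String[] args) {
        // Unclosed tags and a missing <html>/<body> wrapper
        Document doc = Jsoup.parse("<p>first<p>second <b>bold");
        // Jsoup closes the tags and builds a normal document tree
        System.out.println(doc.body().html());
    }
}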

Related

How to format webpage source code in Java?

The code below gets me the source code from the provided URL without any errors. What I am looking for is a way to format the source code I receive.
My manual task earlier was to go to http://www.freeformatter.com/html-formatter.html, paste my source code, and format it with the 3-spaces-per-indent option. How do I get my Java code to do the same formatting for me?
The reason I want it formatted is that I have another script which reads it line by line, saves the data that is required, and ignores the rest.
private static String getUrlSource(String url) throws IOException {
    URL x = new URL(url);
    URLConnection yc = x.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
            yc.getInputStream(), "UTF-8"));
    String inputLine;
    StringBuilder a = new StringBuilder();
    while ((inputLine = in.readLine()) != null) {
        a.append(inputLine);
        a.append("\n");
    }
    in.close();
    return a.toString();
}

public static void main(String[] args) {
    System.out.println("Hello");
    String url = "http://www.bctransit.com/regions/cfv/schedules/schedule.cfm?p=day.text&route=1%3A0&day=1&";
    try {
        String value = getUrlSource(url);
        System.out.println(value);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
If you are scraping a web page, I suggest using a real HTML parser instead. Your method is bound to fail sooner or later.
I would recommend having a look at jsoup. While I have never used it, I have had great results with its Python counterpart, BeautifulSoup.
Using a library such as jsoup will get you a nice object model to traverse instead of relying on string manipulation.
As a bonus, jsoup will actually format the HTML string for you, should you want that anyway.
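As a rough sketch (assuming jsoup is on the classpath; indentAmount and prettyPrint are part of jsoup's Document.OutputSettings):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlFormatter {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(
                "http://www.bctransit.com/regions/cfv/schedules/schedule.cfm?p=day.text&route=1%3A0&day=1&").get();
        // Pretty-print with 3 spaces per indent, matching the freeformatter.com setting
        doc.outputSettings().prettyPrint(true).indentAmount(3);
        System.out.println(doc.html());
    }
}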

How can I get specific text from a webpage

I've looked for answers to this question on Stack Overflow and Google and couldn't really find what I was looking for.
I want to retrieve data from a page, like this one, with this code:
public class ConsoleSearch {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.stackoverflow.com");
        URLConnection cnt = url.openConnection();
        BufferedReader br = new BufferedReader(new InputStreamReader(
                cnt.getInputStream()));
        String content;
        while ((content = br.readLine()) != null) {
            System.out.println(content);
        }
        br.close();
    }
}
I obviously get the HTML tags and everything else that comes with them.
I can easily filter the HTML using HtmlCleaner.
The challenging part, where I find myself stuck, is retrieving specific text from all the retrieved data.
For example, if I wanted to retrieve only the text "Nova Scotia" and/or "Europe", how would I do that?
Pattern p = Pattern.compile("Nova Scotia");
Matcher m = p.matcher(content);
boolean b = m.find(); // find() looks for the text anywhere in content;
                      // matches() would require the entire string to match
Have a look at the java.util.regex package; it will be helpful to you.
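Putting it together, a minimal sketch (the target strings are the ones from the question; any page URL would do):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextSearch {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.stackoverflow.com");
        Pattern p = Pattern.compile("Nova Scotia|Europe");
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // Print every occurrence found on this line
                    System.out.println("Found: " + m.group());
                }
            }
        }
    }
}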

Error while reading data from a webpage in Java?

I am using this code to read data from a webpage :
public class ReadLatex {
    public static void main(String[] args) throws IOException {
        String urltext = "http://chart.apis.google.com/chart?cht=tx&chl=1+2%20\frac{3}{4}";
        URL url = new URL(urltext);
        BufferedReader in = new BufferedReader(new InputStreamReader(
                url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            System.out.println(inputLine);
        }
        in.close();
    }
}
The webpage returns the image for the LaTeX code in the URL.
I am getting this exception:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 400 for URL: http://chart.apis.google.com/chart?
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at ReadLatex.main(ReadLatex.java:11)
Can anyone tell why I am having this problem and what should be the solution for this?
Try escaping with something like org.apache.commons.lang.StringEscapeUtils.
Your problem is that you are using a \ (backslash) in a string, and in Java the backslash is an escape character. To get an actual \ you need to write two of them in your string. So for the wanted text part1\part2 you need to write:
String theString = "part1\\part2";
So you actually want:
String urltext = "http://chart.apis.google.com/chart?cht=tx&chl=1+2%20\\frac{3}{4}";
Also, when you succeed with your request you get back an image (png) which should not be read with a reader which will try to interpret the bytes as characters using some encoding and this will break the image data. Instead, use the input stream and write the content (bytes) to a file.
A simple example without error handling
public static void main(String[] args) throws IOException {
    String urltext = "http://chart.apis.google.com/chart?cht=tx&chl=1+2%20\\frac{3}{4}";
    URL url = new URL(urltext);
    InputStream in = url.openStream();
    FileOutputStream out = new FileOutputStream("TheImage.png");
    byte[] buffer = new byte[8 * 1024];
    int readSize;
    while ((readSize = in.read(buffer)) != -1) {
        out.write(buffer, 0, readSize);
    }
    out.close();
    in.close();
}
I think you should consider escaping the backslash in the URL. In Java, the backslash must be escaped in a String, so it should become:
String urltext =
    "http://chart.apis.google.com/chart?cht=tx&chl=1+2%20\\frac{3}{4}";
That covers the pure Java part.
The URL seems to work in my browser but, as suggested in the other answers, it would be better to also escape all the special characters such as backslashes and braces.
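For instance, a small sketch that URL-encodes the chl parameter with the standard java.net.URLEncoder, rather than hand-escaping each character (the latex string here is the expression from the question):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ChartUrlBuilder {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Note the doubled backslash: the Java string contains a single '\'
        String latex = "1+2 \\frac{3}{4}";
        // URLEncoder percent-encodes spaces, braces, backslashes, etc.
        String urltext = "http://chart.apis.google.com/chart?cht=tx&chl="
                + URLEncoder.encode(latex, "UTF-8");
        System.out.println(urltext);
    }
}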

Getting web page content into a String is very slow

I downloaded a web page with HttpURLConnection.getInputStream(), and to get the content into a String I do the following:
String content = "";
isr = new InputStreamReader(pageContent);
br = new BufferedReader(isr);
try {
    do {
        line = br.readLine();
        content += line;
    } while (line != null);
    return content;
} catch (Exception e) {
    System.out.println("Error: " + e);
    return null;
}
Downloading the page is fast, but getting the content into a String is very slow. Is there a faster way to get the content into a String?
I convert it to a String to insert it into the database.
Read into a buffer a fixed number of chars at a time, not something arbitrary like lines. That alone should be a good start to speeding this up, as the reader will not have to scan for line ends.
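A minimal sketch of that idea, reading fixed-size chunks into a StringBuilder (pageContent is assumed to be the stream from the question):
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class PageToString {
    // pageContent is the stream from HttpURLConnection.getInputStream()
    static String readAll(InputStream pageContent) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(pageContent);
        char[] chunk = new char[8 * 1024];
        int read;
        // Append fixed-size chunks; no line scanning, no String concatenation
        while ((read = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, read);
        }
        return sb.toString();
    }
}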
Use a StringBuffer instead.
Edit for an example:
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < 20; ++i)
    buffer.append(i); // append(int) works; a primitive int has no toString()
String result = buffer.toString();
Use a BLOB/CLOB to put the content directly into the database.
Is there any specific reason for building the string line by line before putting it in the database?
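For instance, a sketch that streams the reader straight into a CLOB column over JDBC (the table and column names here are made up for illustration):
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PageToDb {
    // Hypothetical table "pages" with a CLOB column "content"
    static void storePage(Connection conn, InputStream pageContent) throws SQLException {
        Reader reader = new InputStreamReader(pageContent);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO pages (content) VALUES (?)")) {
            // setCharacterStream lets the driver stream the data,
            // avoiding one big String in memory
            ps.setCharacterStream(1, reader);
            ps.executeUpdate();
        }
    }
}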
I'm using jsoup to get specified content from a page, and here is a web demo based on jQuery and jsoup to grab any content of a web page; you specify the ID or class of the content you need to catch: http://www.gbin1.com/technology/democenter/20120720jsoupjquerysnatchpage/index.html

How to get HTML links from a URL

I'm just starting out on my networking assignment and I'm already stuck.
The assignment asks me to check a user-provided website for links and to determine whether they are active or inactive by reading the header info.
So far, after googling, I just have this code which retrieves the website. I don't understand how to go over this information and look for HTML links.
Here's the code:
import java.net.*;
import java.io.*;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        yc.getInputStream()));
        String inputLine;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
Please help.
Thanks!
You can also try jsoup, an HTML retriever and parser.
Document doc = Jsoup.parse(new URL("<url>"), 2000);
Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
    String href = link.attr("href");
    System.out.println("title: " + link.text());
    System.out.println("href: " + href);
}
With this code you can list and analyze all anchor elements inside a div with class "post-title" from the URL.
You can try this:
URL url = new URL(link);
Reader reader = new InputStreamReader((InputStream) url.getContent());
new ParserDelegator().parse(reader, new Page(), true);
Then create a class called Page:
class Page extends HTMLEditorKit.ParserCallback {
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Look up the href attribute directly instead of checking only
            // the first attribute name in the enumeration
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save the link somewhere
            }
        }
    }
}
"I don't get how to go over this information and look for HTML links"
"I cannot use any external library on my Assignment"
You have a couple of options:
1) You can read the web page into an HTMLDocument. Then you can get an iterator from the document to find all the HTML.Tag.A tags. Once you find an A tag, you can get the HTML.Attribute.HREF from its attribute set (see the sketch after the option 2 code below).
2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then whenever you find an A tag, you can get the href attribute, which will again contain the link. The basic code for invoking the parser callback is:
MyParserCallback parser = new MyParserCallback();

// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);

// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());

try {
    new ParserDelegator().parse(reader, parser, true);
} catch (IOException e) {
    System.out.println(e);
}
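For option 1, a rough sketch (the IgnoreCharsetDirective property is set so the editor kit does not abort on pages that declare their own charset):
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new InputStreamReader(
                new URL("http://yahoo.com").openStream())) {
            kit.read(reader, doc, 0);
        }
        // Walk every <a> tag and pull out its href attribute
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                it.isValid(); it.next()) {
            Object href = it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                System.out.println(href);
            }
        }
    }
}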
HtmlParser is what you need here. A lot of things can be done with it.
You need to get the HTTP status code that the server returned with the response. A server will return a 404 if the page does not exist.
Check out this:
http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html
specifically the getResponseCode method.
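A small sketch of that idea, checking whether a link is active by sending a HEAD request and reading the status code (treating anything below 400 as active is a simple heuristic):
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://yahoo.com");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // HEAD fetches only the headers, not the body
        conn.setRequestMethod("HEAD");
        int code = conn.getResponseCode();
        // 2xx/3xx -> active; 4xx/5xx (e.g. 404) -> inactive
        System.out.println(url + " -> " + code
                + (code < 400 ? " (active)" : " (inactive)"));
        conn.disconnect();
    }
}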
I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and lets you access it like XML. Then you can process the link elements and try to follow them as you did for the original page.
You can check out some sample code that does this.
