How can I get specific text from a webpage - Java

I've looked for answers to this question on Stack Overflow and Google, but couldn't really find what I was looking for.
When I want to retrieve data from a page, like this one, I use this code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class ConsoleSearch {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.stackoverflow.com");
        URLConnection cnt = url.openConnection();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(cnt.getInputStream()));
        String content;
        while ((content = br.readLine()) != null) {
            System.out.println(content);
        }
        br.close();
    }
}
I obviously get the HTML tags, and everything else that comes with it.
I can easily filter HTML using HtmlCleaner
The challenging part, and where I find myself stuck, is retrieving specific text from all the retrieved data.
For example, if I wanted to only retrieve text "Nova Scotia" and/or "Europe"... how would I do that?

Pattern p = Pattern.compile("Nova Scotia");
Matcher m = p.matcher(content);
boolean b = m.find(); // find() locates the phrase anywhere in the string; matches() would require the whole string to equal the pattern
Have a look at the java.util.regex package; it will be helpful to you.
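For instance, here is a minimal sketch that scans each fetched line for either place name (the URL comes from the question; the alternation pattern and class name are mine for illustration):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFinder {
    public static void main(String[] args) throws IOException {
        // One pattern with alternation matches either phrase
        Pattern p = Pattern.compile("Nova Scotia|Europe");
        URL url = new URL("http://www.stackoverflow.com");
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openConnection().getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // group() is the exact phrase that matched on this line
                    System.out.println(m.group());
                }
            }
        }
    }
}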

Related

Trying to get a string from a website which only has one line

Basically I'm trying to get a string from an API, and this API is just a blank page with one line that has all the information needed; I'm only trying to get a part of that.
That part is an ID which has the same number of characters for every person.
The API has this line for each person:
{"id":"anExampleUniqueWhichHas32Charact","name": "Player"}
I changed the code a bit so you'll understand; I'm using a library dedicated to that, but I'm just trying to get the web scraping right.
So what I tried to do was web-scrape and get the string of that length.
But it doesn't work.
I know I can also use regex for patterns, but I don't really know how to use it. Regex would honestly be more helpful in this situation.
public void checkAPI() throws IOException {
    String person = userInput.nextLine(); // It's just any name. (userInput: a Scanner over System.in, declared elsewhere)
    URL url = new URL("https://api.mojang.com/users/profiles/minecraft/" + person);
    URLConnection con = url.openConnection();
    InputStream isr = con.getInputStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(isr));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.length() == 32) {
            System.out.println(line);
        }
    }
}
For now I just expect the line to print; once it works, I'll use it for other things.
No errors are being thrown.
The API returns JSON: https://de.wikipedia.org/wiki/JavaScript_Object_Notation
You can use a standard JSON parser like Jackson (https://en.wikipedia.org/wiki/Jackson_(API)) to parse and query the result:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
JsonNode node = mapper.readTree(new URL("https://api.mojang.com/users/profiles/minecraft/KrisJelbring"));
System.out.println("Name: " + node.get("name"));
System.out.println("Id: " + node.get("id"));
If you don't want to use Jackson, you can do it by hand, but that's fragile and not very stable:
while ((line = br.readLine()) != null) {
    int startOfId = line.indexOf("\"id\"") + 4;           // skip past the "id" key
    int startOfValue = line.indexOf("\"", startOfId) + 1;  // opening quote of the value
    int endOfValue = line.indexOf("\"", startOfValue);     // closing quote of the value
    System.out.println("id: " + line.substring(startOfValue, endOfValue));
}
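Since the question also asked about regex: a hedged sketch that pulls the 32-character id out of the JSON line with a capture group (it assumes the id is always exactly 32 characters, as the question states; the class name is mine):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtractor {
    // Matches "id":"<exactly 32 characters>", allowing whitespace around the colon
    private static final Pattern ID = Pattern.compile("\"id\"\\s*:\\s*\"(.{32})\"");

    public static String extractId(String line) {
        Matcher m = ID.matcher(line);
        return m.find() ? m.group(1) : null; // group(1) is just the id, without quotes
    }

    public static void main(String[] args) {
        String line = "{\"id\":\"anExampleUniqueWhichHas32Charact\",\"name\": \"Player\"}";
        System.out.println(extractId(line)); // prints anExampleUniqueWhichHas32Charact
    }
}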

How can I get data from a website in Java?

I want to get the value of "Yield" from http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319. How can I do this with Java?
I have tried Jsoup, and my code looks like this:
public static void main(String[] args) throws IOException {
    String url = "http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319";
    Document document = Jsoup.connect(url).get();
    Elements answerers = document.select(".c3 .floatR");
    for (Element answerer : answerers) {
        System.out.println("Answerer: " + answerer.data());
    }
    // TODO code application logic here
}
But it returns nothing. How can I do this?
Your code is fine; I tested it myself. The problem is the URL you're using. If I open the URL in a browser, the value fields (e.g. Yield) are empty. Using the browser development tools (Network tab), you should find a URL that looks like:
http://www.aastocks.com/en/ltp/RTQuoteContent.aspx?symbol=01319&process=y
Using this URL gives you the wanted results.
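For example, a minimal sketch combining the corrected URL with the Jsoup selector from the question (I haven't verified the selector against the page's current markup, so treat it as an assumption; note that text() returns an element's visible text, whereas data() only returns script/style data and can come back empty):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class YieldScraper {
    public static void main(String[] args) throws IOException {
        // The content URL found via the browser's Network tab
        String url = "http://www.aastocks.com/en/ltp/RTQuoteContent.aspx?symbol=01319&process=y";
        Document document = Jsoup.connect(url).get();
        // Selector taken from the question; adjust if the markup differs
        for (Element element : document.select(".c3 .floatR")) {
            System.out.println(element.text());
        }
    }
}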
The simplest solution is to create a URL instance pointing to the web page you want and read its content using streams. For example:
public static void main(String[] args) throws IOException {
    URL url = new URL("http://www.aastocks.com/en/ltp/rtquote.aspx?symbol=01319");
    // Get the input stream through URL Connection
    URLConnection con = url.openConnection();
    InputStream is = con.getInputStream();
    // Once you have the Input Stream, it's just plain old Java IO stuff.
    // For this case, since you are interested in getting a plain-text web page,
    // I'll use a reader and output the text content to System.out.
    // For binary content, it's better to directly read the bytes from the stream
    // and write to the target file.
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line = null;
    // read each line and write to System.out
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}
I think Jsoup is the right tool for this purpose, since you cannot count on the page being a valid HTML document.

Parsing a text file using a Java Scanner

I am trying to create a method that parses a text file and returns a string that is the URL after the colon. The text file looks as follows (it is for a bot):
keyword:url
keyword,keyword:url
so each line consists of a keyword and a URL, or multiple keywords and a URL.
Could anyone give me a bit of direction as to how to do this?
I believe I need to use a Scanner, but I couldn't find anything on anyone trying to do something similar. Thank you.
Edit: my attempt using the suggestions below; it doesn't quite work. Any help would be appreciated.
public static void main(String[] args) throws IOException {
    String sCurrentLine = "";
    String key = "hello";
    BufferedReader reader = new BufferedReader(
            new FileReader("sites.txt"));
    Scanner s = new Scanner(sCurrentLine);
    while ((sCurrentLine = reader.readLine()) != null) {
        System.out.println(sCurrentLine);
        if (sCurrentLine.contains(key)) {
            System.out.println(s.findInLine("http"));
        }
    }
}
output:
hello,there:http://www.facebook.com
null
whats,up:http:/google.com
sites.txt:
hello,there:http://www.facebook.com
whats,up:http:/google.com
You should read the file line by line with a BufferedReader, as you are doing; I would then recommend parsing each line using regex.
The pattern
(?<=:)http://[^\\s]++
will do the trick. This pattern says:
http://
followed by one or more non-space characters: [^\\s]++
and preceded by a colon: (?<=:)
Here is a simple example using a String to proxy your file:
public static void main(String[] args) throws Exception {
    final String file = "hello,there:http://www.facebook.com\n"
            + "whats,up:http://google.com";
    final Pattern pattern = Pattern.compile("(?<=:)http://[^\\s]++");
    final Matcher m = pattern.matcher("");
    try (final BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(file.getBytes("UTF-8"))))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            m.reset(line);
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }
}
Output:
http://www.facebook.com
http://google.com
Use a BufferedReader; for the text parsing you can use regular expressions.
You should use the split method:
String[] strCollection = yourScannedStr.split(":", 2); // limit 2 splits only on the first colon, keeping "http://" intact
String extractedUrl = strCollection[1];
Reading a .txt file using Scanner class in Java
http://www.tutorialspoint.com/java/java_string_substring.htm
That should help you.
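Putting the split approach together with line-by-line reading, a minimal sketch (it assumes the sites.txt format shown in the question; the class name is mine):
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;

public class UrlExtractor {
    public static void main(String[] args) throws IOException {
        // Read sites.txt line by line with a Scanner, as the question intends
        try (Scanner scanner = new Scanner(Paths.get("sites.txt"))) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                // Split on the first colon only, so the colon inside "http://" stays intact
                String[] parts = line.split(":", 2);
                if (parts.length == 2) {
                    System.out.println(parts[1]); // the URL after the keyword(s)
                }
            }
        }
    }
}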

Extract the main part of a page in Java

Hello,
I have the Wikipedia page of a personality and I want to extract, in Java, the HTML source code of the main content section.
Do you have any ideas?
Use Jsoup, specifically the selector syntax.
Document doc = Jsoup.parse(new URL("http://en.wikipedia.org/"), 10000);
Elements interestingParts = doc.select("div.interestingClass");
//get the combined HTML fragments as a String
String selectedHtmlAsString = interestingParts.html();
//get all the links
Elements links = interestingParts.select("a[href]");
//filter the document to include certain tags only
Whitelist allowedTags = Whitelist.simpleText().addTags("blockquote","code", "p");
Cleaner cleaner = new Cleaner(allowedTags);
Document filteredDoc = cleaner.clean(doc);
It's a very useful API for parsing HTML pages and extracting the desired data.
For Wikipedia there is an API: http://www.mediawiki.org/wiki/API:Main_page
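As a hedged illustration of that API, action=parse can return just the rendered content of a page; a minimal sketch (the page title and class name are mine; check the parameters against the API documentation):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class WikiApiFetch {
    public static void main(String[] args) throws IOException {
        // action=parse returns the rendered HTML of the page body, wrapped in JSON
        URL api = new URL("https://en.wikipedia.org/w/api.php"
                + "?action=parse&page=Albert_Einstein&prop=text&format=json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(api.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON; feed it to a parser such as Jackson
            }
        }
    }
}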
Analyze the web page's structure, then use Jsoup to parse the HTML.
Note that this returns a STRING (a blob of sorts) of the HTML source code, not a nicely formatted content item.
I use this myself; it's a little snippet I have for whatever I need. Pass in the URL, any start and stop text, or the boolean to get everything.
public static String getPage(
        String url,
        String booleanStart,
        String booleanStop,
        boolean getAll) throws Exception {
    StringBuilder page = new StringBuilder();
    URL iso3 = new URL(url);
    URLConnection iso3conn = iso3.openConnection();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    iso3conn.getInputStream()));
    String inputLine;
    if (getAll) {
        while ((inputLine = in.readLine()) != null) {
            page.append(inputLine);
        }
    } else {
        boolean save = false;
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains(booleanStart))
                save = true;
            if (save)
                page.append(inputLine);
            if (save && inputLine.contains(booleanStop)) {
                break;
            }
        }
    }
    in.close();
    return page.toString();
}
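For instance, a hypothetical call (the start and stop markers here are illustrative, not taken from any real page):
// Grab everything from the first line containing the start marker
// through the first subsequent line containing the stop marker
String mainPart = getPage(
        "http://en.wikipedia.org/wiki/Ada_Lovelace",
        "<div id=\"content\"", // hypothetical start marker
        "</div>",              // hypothetical stop marker
        false);
System.out.println(mainPart);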

How to get HTML links from a URL

I'm just starting out on my networking assignment and I'm already stuck.
The assignment asks me to check a user-provided website for links and to determine whether they are active or inactive by reading the header info.
So far, after googling, I just have this code, which retrieves the website. I don't get how to go over this information and look for HTML links.
Here's the code:
import java.net.*;
import java.io.*;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        yc.getInputStream()));
        String inputLine;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
Please help.
Thanks!
You can also try Jsoup, an HTML retriever and parser.
Document doc = Jsoup.parse(new URL("<url>"), 2000);
Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
    String href = link.attr("href");
    System.out.println("title: " + link.text());
    System.out.println("href: " + href);
}
With this code you can list and analyze all anchors inside a div with class "post-title" from the URL.
You can try this:
URL url = new URL(link);
Reader reader= new InputStreamReader((InputStream) url.getContent());
new ParserDelegator().parse(reader, new Page(), true);
Then create a class called Page:
class Page extends HTMLEditorKit.ParserCallback {
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Ask for the href attribute directly rather than relying on
            // it being the first attribute in the enumeration
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}
"I don't get how to go over this information and look for HTML links"
"I cannot use any external library on my Assignment"
You have a couple of options:
1) You can read the web page into an HTMLDocument. Then you can get an iterator from the document to find all the HTML.Tag.A tags. Once you find the anchor tags, you can get the HTML.Attribute.HREF from the attribute set of each tag (see the sketch after the code below).
2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then whenever you find an A tag, you can get the href attribute, which will again contain the link. The basic code for invoking the parser callback is:
MyParserCallback parser = new MyParserCallback();
// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);
// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());
try {
    new ParserDelegator().parse(reader, parser, true);
} catch (IOException e) {
    System.out.println(e);
}
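For option 1, a minimal sketch of the HTMLDocument route, using only the standard library (which fits the no-external-library constraint; the class name is mine):
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Without this the parser aborts on pages that declare their own charset
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        URL url = new URL("http://yahoo.com");
        try (Reader reader = new InputStreamReader(url.openConnection().getInputStream())) {
            kit.read(reader, doc, 0);
        }
        // Walk every <a> tag and pull out its href attribute
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            Object href = it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                System.out.println(href);
            }
        }
    }
}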
HtmlParser is what you need here. A lot of things can be done with it.
You need to get the HTTP status code that the server returned with the response. A server will return a 404 if the page does not exist.
Check out this:
http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html
specifically the getResponseCode method.
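A hedged sketch of the status check (HEAD fetches only the headers; some servers reject it, in which case fall back to GET; the class name is mine):
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    // Returns true if the link responds with a status below 400 (2xx/3xx)
    public static boolean isActive(String link) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD"); // headers only, no body download
            return conn.getResponseCode() < 400; // 404, 500, ... count as inactive
        } catch (Exception e) {
            return false; // unresolvable host, timeout, malformed URL, ...
        }
    }

    public static void main(String[] args) {
        System.out.println(isActive("http://yahoo.com"));
    }
}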
I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and allows you to access it like XML. Then you can process the link elements and try to follow them as you did for the original page.
You can check out some sample code that does this.
