Grabbing text from websites - Java

I have this small chunk of code that will grab the HTML code from a website. I'm interested in parsing a certain section of the code, though, several times. More specifically, I'm making a Pokédex and would like to parse certain descriptions from, say, a Bulbapedia page, http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon) for example. How would I make this parser take just the description of Bulbasaur? How would I create a boundary to start and stop at?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebCrawler {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://pokemondb.net/pokedex/bulbasaur");
            URLConnection yc = url.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You can use Jsoup. With this code you can get the description of Bulbasaur:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup
                .connect("http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)")
                .get();
        // Select every paragraph inside the main content div.
        Elements paragraphs = doc.select("#mw-content-text p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.toString());
        }
    }
}
Here mw-content-text is the id of the page's main content div.
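If you only want the opening description rather than every paragraph, you can narrow the selection to the first match. A minimal sketch, assuming the description is still the first paragraph under #mw-content-text (the page structure may have changed since):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BulbasaurDescription {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(
                "http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)").get();
        // first() returns the first matching paragraph, or null if nothing matched.
        Element firstParagraph = doc.select("#mw-content-text p").first();
        if (firstParagraph != null) {
            System.out.println(firstParagraph.text());
        }
    }
}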

Try Jsoup. Its selector syntax is similar to jQuery's, as the sketch below illustrates.
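For illustration, a self-contained example of a few selector forms Jsoup accepts (the HTML fragment here is made up):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        // Parse a small in-memory fragment to demonstrate selector forms.
        Document doc = Jsoup.parse(
                "<div id=\"main\"><p class=\"intro\">Hi</p><a href=\"/x\">link</a></div>");
        System.out.println(doc.select("p").text());              // by tag
        System.out.println(doc.select("#main").size());          // by id
        System.out.println(doc.select(".intro").text());         // by class
        System.out.println(doc.select("a[href]").attr("href"));  // by attribute
    }
}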

Related

How to convert a website to a .txt file to search it for a word?

How can I convert a website to a .txt file so I can search the file for a word (e.g. "Абрамов Николай Викторович")? My code reads only the HTML. In other words, I want to re-check the website every second; if my word appears (added by the author of the website), then my code should print "Yes".
And how can I turn this into an application that can test any other word?
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class web {
    public static void main(String[] args) {
        for (;;) {
            try {
                // Create a URL for the desired page
                URL url = new URL("http://abit.itmo.ru/page/195");
                // Read all the text returned by the server
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    // line is one line of text; readLine() strips the newline character(s)
                    System.out.println(line);
                    page.append(line).append('\n');
                }
                in.close();
                // Search the whole page, not just the last line read
                Pattern p = Pattern.compile("Абрамов Николай Викторович");
                Matcher m = p.matcher(page);
                if (m.find()) {
                    System.out.println("Yes");
                    System.exit(0);
                }
            } catch (IOException ignored) {
            }
        }
    }
}
You don't need to convert it to TXT.
If you just want to search for the word, you can check the page directly. But be careful: if the polling period is too short, your requests can look like a DoS attack and you may get blocked.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Main {
    public static String wordToFind = "30 Day";
    public static String siteURL = "https://stackoverflow.com/";

    public static void checkSite() {
        try {
            URL url = new URL(siteURL);
            // try-with-resources closes the reader even when the word is found early
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) { // process each line
                    if (inputLine.contains(wordToFind)) {
                        System.out.println("Yes");
                        return;
                    }
                }
            }
        } catch (MalformedURLException me) {
            System.out.println(me);
        } catch (IOException ioe) {
            System.out.println(ioe);
        }
    }

    public static void main(String[] args) {
        int initialDelay = 0;
        int period = 10; // number of seconds between checks
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                checkSite();
            }
        }, initialDelay, period, TimeUnit.SECONDS);
    }
}

JSON sorting large data sets

http://openlibrary.org/search.json?q=prolog
I have the above API, which I am going to implement in an Android application.
Is there a way to grab a specific field from the JSON on the fly? For instance, if I run the above search I would need the Author, Language, suggested_title and ISBN for each result (so in the above case there would be 100 results),
which I then plan on storing in an array in the format
Title|author|lang|ISBN
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class main {
    public static void main(String[] args) throws IOException {
        URL testing = new URL("http://openlibrary.org/search.json?q=prolog");
        BufferedReader in = new BufferedReader(new InputStreamReader(testing.openStream()));
        String inputLine;
        String test = null;
        while ((inputLine = in.readLine()) != null) {
            // == compares references; string content needs contains()/equals()
            if (inputLine.contains("\"title_suggest\":")) {
                test = inputLine;
            }
        }
        in.close();
        System.out.println(test);
    }
}
This appears to work, as I could then add test to array location [x][y].
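Rather than scanning the raw response line by line, it is more robust to hand the body to a JSON parser. Below is a sketch using the org.json library; the docs array and the author_name, language, title_suggest and isbn field names are assumptions based on the question, so check them against the actual payload:

import java.net.URL;
import java.util.Scanner;

import org.json.JSONArray;
import org.json.JSONObject;

public class OpenLibrarySearch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://openlibrary.org/search.json?q=prolog");
        // Read the whole response body into one string.
        String body;
        try (Scanner s = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A")) {
            body = s.hasNext() ? s.next() : "";
        }
        JSONObject root = new JSONObject(body);
        JSONArray docs = root.getJSONArray("docs");
        for (int i = 0; i < docs.length(); i++) {
            JSONObject doc = docs.getJSONObject(i);
            // optString/optJSONArray tolerate missing fields.
            String title = doc.optString("title_suggest", "?");
            JSONArray authors = doc.optJSONArray("author_name");
            JSONArray langs = doc.optJSONArray("language");
            JSONArray isbns = doc.optJSONArray("isbn");
            System.out.println(title + "|"
                    + (authors != null ? authors.optString(0) : "?") + "|"
                    + (langs != null ? langs.optString(0) : "?") + "|"
                    + (isbns != null ? isbns.optString(0) : "?"));
        }
    }
}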

My HTML fetcher program in Java returns incomplete results

My Java code is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class celebGrepper {
    static class CelebData {
        URL link;
        String name;

        CelebData(URL link, String name) {
            this.link = link;
            this.name = name;
        }
    }

    public static String grepper(String url) {
        URL source;
        String data = null;
        try {
            source = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) source.openConnection();
            connection.connect();
            InputStream is = connection.getInputStream();
            // Attempting to fetch an entire line at a time instead of just a character each time!
            StringBuilder str = new StringBuilder();
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            while ((data = br.readLine()) != null)
                str.append(data);
            data = str.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }

    public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
        ArrayList<CelebData> list = new ArrayList<CelebData>();
        Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            CelebData current = new CelebData(new URL(m.group(1)), m.group(2));
            list.add(current);
        }
        return list;
    }

    public static void main(String... args) throws MalformedURLException {
        String html = grepper("https://www.forbes.com/celebrities/list/");
        System.out.println("RAW Input: " + html);
        System.out.println("Start Grepping...");
        ArrayList<CelebData> celebList = parser(html);
        for (CelebData item : celebList) {
            System.out.println("Name:\t\t " + item.name);
            System.out.println("Image URL:\t " + item.link + "\n");
        }
        System.out.println("Grepping Done!");
    }
}
It's supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. However, when I compare the actual result below to the original page, I find that the entire table I need is missing! Is it because the page isn't completely loaded when I start reading bytes from the input stream? Please help me understand.
The Output of the page:
https://jsfiddle.net/e0771aLz/
What can I do to just extract the Image link and the names of the celebs?
I know it's extremely bad practice to try to parse HTML using regex, and the stuff of nightmares, but in a certain video training course for Android that's exactly what the instructor did, and I just want to follow along, since it's only this one lesson.
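For comparison, here is a Jsoup-based sketch of the same extraction. The td.image / td.name selectors simply mirror the classes the regex above targets, so they are an assumption about the markup. And if the table is built client-side by JavaScript, neither the regex nor Jsoup will ever see it in the raw HTML; in that case you would need a tool that executes scripts (e.g. Selenium):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CelebJsoup {
    public static void main(String[] args) throws IOException {
        // A browser-like User-Agent sometimes gets fuller markup from
        // sites that serve reduced pages to unknown clients.
        Document doc = Jsoup.connect("https://www.forbes.com/celebrities/list/")
                .userAgent("Mozilla/5.0")
                .get();
        // Assumed selectors, mirroring the regex: an image cell and a name cell per row.
        for (Element row : doc.select("tr")) {
            Element img = row.selectFirst("td.image img");
            Element name = row.selectFirst("td.name a");
            if (img != null && name != null) {
                System.out.println("Name:\t\t " + name.text());
                System.out.println("Image URL:\t " + img.attr("abs:src") + "\n");
            }
        }
    }
}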

How to download and read a file from a web site without having web browser?

If a user doesn't have any web browser, what Java code should he write (and which classes does he need) to download and read a file? Let's say this is the URL the file will be downloaded from:
http://www.thewebsource.serv/dir1/myfile.txt
So far I have managed to open a URL, but what procedure should I follow to actually download the file?
package filedownload;

import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class FileDownload {
    public static void main(String[] args) throws URISyntaxException, IOException {
        Desktop d = Desktop.getDesktop();
        d.browse(new URI("http://www.thewebsource.serv/dir1/myfile.txt"));
    }
}
You could use something like this using the URL class:
import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
From the official Java tutorials.
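If the goal is to save the file to disk rather than just print it, a minimal sketch using java.nio (the URL is the placeholder from the question and the target file name is arbitrary):

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class FileSaver {
    public static void main(String[] args) throws Exception {
        // Placeholder URL from the question; any direct file URL works the same way.
        URL url = new URL("http://www.thewebsource.serv/dir1/myfile.txt");
        Path target = Paths.get("myfile.txt");
        try (InputStream in = url.openStream()) {
            // Stream the remote file straight to disk.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        // Read it back as text.
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}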

display console result on java GUI for each class (java newbie programmer)

Guys, I have a simple question: is it possible to display each class's console output in a Java GUI?
Each class produces different console output.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SimpleWebCrawler {
    public static void main(String[] args) throws IOException {
        try {
            URL my_url = new URL("http://theworldaccordingtothisgirl.blogspot.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        System.out.println("\n");
        System.out.println("\n");
        System.out.println("\n");
        Validate.isTrue(args.length == 0, "usage: supply url to crawl");
        String url = "http://theworldaccordingtothisgirl.blogspot.com/";
        print("Fetching %s...", url);
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        System.out.println("\n");
        BufferedWriter bw = new BufferedWriter(new FileWriter("abc.txt"));
        for (Element link : links) {
            print(" %s ", link.attr("abs:href"), trim(link.text(), 35));
            bw.write(link.attr("abs:href"));
            bw.write(System.getProperty("line.separator"));
        }
        bw.flush();
        bw.close();
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
Example output :
Fetching http://theworldaccordingtothisgirl.blogspot.com/...
http://theworldaccordingtothisgirl.blogspot.com/2011/03/in-time-like-this.html
https://lh5.googleusercontent.com/-yz2ql0o45Aw/TYBNhyFVpMI/AAAAAAAAAGU/OrPZrBjwWi8/s1600/Malaysian-Newspaper-Apologises-For-Tsunami-Cartoon.jpg
http://ireport.cnn.com/docs/DOC-571892
https://lh3.googleusercontent.com/-nXOxDT4ZyWA/TX-HaKoHE3I/AAAAAAAAAGQ/xwXJ-8hNt1M/s1600/ScrnShotsDesktop-1213678160_large.png
http://theworldaccordingtothisgirl.blogspot.com/2011/03/in-time-like-this.html#comments
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=email
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=blog
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=twitter
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=facebook
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=buzz
If you want to separate the standard output (System.out.println) based on which class it comes from, the answer is no (not easily, at least).
I suggest you give each class that wants to produce output a PrintWriter as a constructor argument, and have it write to that PrintWriter instead of System.out.
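A minimal sketch of that idea; the Crawler class and its run() method are made-up stand-ins for your own classes:

import java.io.PrintWriter;
import java.io.StringWriter;

import javax.swing.JTextArea;

// Hypothetical class that writes to an injected PrintWriter instead of System.out.
class Crawler {
    private final PrintWriter out;

    Crawler(PrintWriter out) {
        this.out = out;
    }

    void run() {
        out.println("Fetching http://example.com ...");
        out.flush();
    }
}

public class GuiOutputDemo {
    public static void main(String[] args) {
        // Console behaviour: wrap System.out.
        new Crawler(new PrintWriter(System.out, true)).run();

        // GUI behaviour: capture the same output in a buffer,
        // then hand the text to a Swing component.
        StringWriter buffer = new StringWriter();
        new Crawler(new PrintWriter(buffer, true)).run();
        JTextArea area = new JTextArea();
        area.setText(buffer.toString()); // each class can get its own buffer/text area
    }
}

Because each class writes to its own PrintWriter, you can route one class's output to the console and another's to a text area without the classes knowing the difference.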
