Specific data mining using scanner - java

I'm trying to build a program that would take the page source from a website and only store a snippet of code.
package Program;

import java.net.*;
import java.util.*;

public class Program {
    public static void main(String[] args) {
        String site = "http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294";
        try {
            URL url = new URL(site);
            URLConnection connection = url.openConnection();
            connection.connect();
            Scanner in = new Scanner(connection.getInputStream());
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
So far this only displays the page source in the output. I would like the program to search for a specific string and display only the price. For example, given:
<tr id="actualPriceRow">
<td id="actualPriceLabel" class="priceBlockLabelPrice">Price:</td>
<td id="actualPriceContent"><span id="actualPriceValue"><b class="priceLarge">£599.99</b></span>
<span id="actualPriceExtraMessaging">
I want to search for class="priceLarge"> and display/store only 599.99.
I know there are similar questions on the site, but I don't really understand any PHP, and I would prefer a Java solution, although any solution is welcome :)
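The string search the question describes can be sketched with plain java.lang.String methods, no parser needed (the marker and sample line below are taken from the snippet above; this approach is fragile compared with a real HTML parser, which the answers below use):

```java
public class PriceSnippet {
    // Pull out the text between the priceLarge marker and the next closing tag;
    // returns null when the marker is not present in the input.
    static String extractPrice(String html) {
        String marker = "class=\"priceLarge\">";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = html.indexOf("<", start);
        // skip one leading character: the currency symbol in £599.99
        return html.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String line = "<b class=\"priceLarge\">£599.99</b>";
        System.out.println(extractPrice(line)); // prints 599.99
    }
}
```

Each line read by the Scanner could be passed through extractPrice, printing only the non-null result.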

You can use a parsing library, e.g. Jsoup:
Document document = Jsoup.connect("http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294").get();
Then you can search for the concrete element:
Elements el = document.select("b.priceLarge");
and then you can get the text content of this element (note: val() reads a form field's value; for the text inside the <b> tag use text()):
String content = el.text();

The OP wrote in a question edit:
Thank you all for the responses, it was really helpful. Here is the answer:
package Project;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Project {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("url of link").get();
            String title = doc.title();
            System.out.println("title : " + title);
            String pricing = doc.getElementsByClass("priceLarge").text();
            // drop the leading currency symbol
            String str = pricing.substring(1);
            System.out.println("price : " + str);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

Related

How can I do web scraping in this case?

I am trying to scrape text from https://in-the-sky.org/data/object.php?id=A216&day=17&month=6&year=2022 so I wrote code like this:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String args[]) {
        int num = 216;
        int day = 17;
        int month = 6;
        int year = 2022;
        String url = "https://in-the-sky.org/data/object.php?id=A" + Integer.toString(num)
                + "&day=" + Integer.toString(day)
                + "&month=" + Integer.toString(month)
                + "&year=" + Integer.toString(year);
        System.out.println(url);
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("=======================================================");
        Elements element = doc.select("div.col-md-6 col-md-pull-6");
        String output = element.select("p").text();
        System.out.println(output);
        System.out.println("=======================================================");
    }
}
but it doesn't work well. I would like someone to help me, please.
I believe that you can use Elements element = doc.select("div.col-md-6 > p"); to get your desired output. (Note that "div.col-md-6 col-md-pull-6" is a descendant selector: it looks for a col-md-pull-6 element inside the div. Multiple classes on the same element are chained with dots, as in div.col-md-6.col-md-pull-6.)

My HTML fetcher program in java returns incomplete results

My Java code is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class celebGrepper {

    static class CelebData {
        URL link;
        String name;

        CelebData(URL link, String name) {
            this.link = link;
            this.name = name;
        }
    }

    public static String grepper(String url) {
        URL source;
        String data = null;
        try {
            source = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) source.openConnection();
            connection.connect();
            InputStream is = connection.getInputStream();
            /**
             * Attempting to fetch an entire line at a time instead of just a character each time!
             */
            StringBuilder str = new StringBuilder();
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            while ((data = br.readLine()) != null)
                str.append(data);
            data = str.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }

    public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
        ArrayList<CelebData> list = new ArrayList<CelebData>();
        Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            CelebData current = new CelebData(new URL(m.group(1)), m.group(2));
            list.add(current);
        }
        return list;
    }

    public static void main(String... args) throws MalformedURLException {
        String html = grepper("https://www.forbes.com/celebrities/list/");
        System.out.println("RAW Input: " + html);
        System.out.println("Start Grepping...");
        ArrayList<CelebData> celebList = parser(html);
        for (CelebData item : celebList) {
            System.out.println("Name:\t\t " + item.name);
            System.out.println("Image URL:\t " + item.link + "\n");
        }
        System.out.println("Grepping Done!");
    }
}
It's supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. However, when I compare the actual result below to the original page, I find the entire table that I need is missing! Is it because the page isn't completely loaded when I start getting the bytes from the page via the input stream? Please help me understand.
The Output of the page:
https://jsfiddle.net/e0771aLz/
What can I do to just extract the Image link and the names of the celebs?
I know it's extremely bad practice to try to parse HTML using regex, and that it is the stuff of nightmares, but on a certain video training course for Android that's exactly what the guy did, and I just want to follow along since it's just in this one lesson.

Using Jsoup to extract single value from page source

I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("http://www.mywebsite.com").get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);
Everything is described in the following snippet. I created an HTML file with a similar SPAN tag inside. Use Document.select() to select the elements with the specific class name that you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        String sourceDir = "C:/Users/admin/Desktop/test.html";
        test(sourceDir);
    }

    private static void test(String htmlFile) {
        File input = null;
        Document doc = null;
        Elements classEles = null;
        try {
            input = new File(htmlFile);
            doc = Jsoup.parse(input, "ASCII", "");
            doc.outputSettings().charset("ASCII");
            doc.outputSettings().escapeMode(EscapeMode.base);
            /** Find all SPAN elements with the matching CLASS name **/
            classEles = doc.select("span.label.label-info.pull-right");
            if (classEles.size() > 0) {
                String number = classEles.get(0).text();
                System.out.println("number: " + number);
            } else {
                System.out.println("No SPAN element found with class label label-info pull-right.");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Can you not use JavaScript regular-expression syntax? If you know the element you are interested in, extract it as a string $stuff from Jsoup, then just do:
$stuff.match( /Expecting (\d*)/ )[1]
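In Java the same extraction can be sketched with java.util.regex, applied to whatever string you pull out of Jsoup (the sample span text below is the one from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpectingExtractor {
    // Capture the digits following "Expecting "; returns null when no match is found.
    static String extract(String text) {
        Matcher m = Pattern.compile("Expecting (\\d+)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String span = "<span class=\"label label-info pull-right\">Expecting 937117</span>";
        System.out.println(extract(span)); // prints 937117
    }
}
```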
public void yourMethod() {
    try {
        Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").get();
        // class names are chained with dots in a CSS selector
        Elements value = doc.select("span.label.label-info.pull-right");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Protocol get Java URL

I'm trying to get, in JSON format, all the websites found when querying Google.
Code:
import java.io.InputStreamReader;
import java.net.URL;

/**
 * Created by Vlad on 19/03/14.
 */
public class Query {
    public static void main(String[] args) {
        try {
            String arg = "random";
            URL url = new URL("GET https://www.googleapis.com/customsearch/v1?key=&cx=017576662512468239146:omuauf_lfve&q=" + arg);
            InputStreamReader reader = new InputStreamReader(url.openStream(), "UTF-8");
            int ch;
            while ((ch = reader.read()) != -1) {
                System.out.print((char) ch); // cast needed: print(int) would print character codes
            }
        } catch (Exception e) {
            System.out.println("This ain't good");
            System.out.println(e);
        }
    }
}
Exception:
java.net.MalformedURLException: no protocol: GET https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random
You should delete the GET at the beginning ;)
You should replace your code with:
URL url = new URL("https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=" + arg);
A URL never starts with GET or POST or anything like that ;)
URLs are supposed to start with a transfer protocol, and GET https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random starts with GET; that is why the exception is thrown.
Change it to https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random
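The failure mode can be demonstrated with only the standard library: java.net.URL throws MalformedURLException whenever the string does not begin with a known protocol, which is exactly what happens when an HTTP method name is prepended (the query string below is shortened for illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {
    public static void main(String[] args) {
        // An HTTP method prefix is not part of a URL, so parsing fails.
        try {
            new URL("GET https://www.googleapis.com/customsearch/v1?q=random");
            System.out.println("parsed");
        } catch (MalformedURLException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // The same string without the method prefix parses fine.
        try {
            URL ok = new URL("https://www.googleapis.com/customsearch/v1?q=random");
            System.out.println(ok.getProtocol()); // prints https
        } catch (MalformedURLException e) {
            System.out.println("unexpected");
        }
    }
}
```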

Grabbing text from websites

I have this small chunk of code that will grab the HTML code from a website. I'm interested in parsing a certain section of the code, though, several times. More specifically, I'm making a Pokédex and would like to parse certain descriptions from, say, a Bulbapedia page, http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon) for example. How would I make this parser take just the description of Bulbasaur? How would I create any boundary to stop and start?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebCrawler {
    public static void main(String[] args) {
        try {
            URL google = new URL("http://pokemondb.net/pokedex/bulbasaur");
            URLConnection yc = google.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You can use Jsoup; with this code you can get the description of Bulbasaur:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup
                .connect("http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)")
                .get();
        Elements newsHeadlines = doc.select("#mw-content-text p");
        for (Element o : newsHeadlines) {
            System.out.println(o.toString());
        }
    }
}
Here #mw-content-text is the main content div.
Try Jsoup. The selector syntax is jQuery-like.
