Specific data mining using scanner - java

I'm trying to build a program that would take the page source from a website and only store a snippet of code.
package Program;

import java.net.*;
import java.util.*;

public class Program {
    public static void main(String[] args) {
        String site = "http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294";
        try {
            URL url = new URL(site);
            URLConnection connection = url.openConnection();
            connection.connect();
            Scanner in = new Scanner(connection.getInputStream());
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
So far this only displays the page source in the output. I would like the program to search for a specific string and display only the price. For example, given:
<tr id="actualPriceRow">
<td id="actualPriceLabel" class="priceBlockLabelPrice">Price:</td>
<td id="actualPriceContent"><span id="actualPriceValue"><b class="priceLarge">£599.99</b></span>
<span id="actualPriceExtraMessaging">
I want to search for class="priceLarge"> and display/store only 599.99.
I know there are similar questions on the site, but I don't really understand any PHP, and I would prefer a Java solution, although any solution is welcome :)
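The string search the question describes can be sketched with plain java.lang.String methods, no parser needed (the marker and sample line below are taken from the snippet above; this approach is fragile compared with a real HTML parser, which the answers below use):

```java
public class PriceSnippet {
    // Pull out the text between the priceLarge marker and the next closing tag;
    // returns null when the marker is not present in the input.
    static String extractPrice(String html) {
        String marker = "class=\"priceLarge\">";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = html.indexOf("<", start);
        // skip one leading character: the currency symbol in £599.99
        return html.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String line = "<b class=\"priceLarge\">£599.99</b>";
        System.out.println(extractPrice(line)); // prints 599.99
    }
}
```

Each line read by the Scanner could be passed through extractPrice, printing only the non-null result.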

You can use a parsing library, e.g. Jsoup:
Document document = Jsoup.connect("http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294").get();
Then you can search for the concrete element:
Elements el = document.select("b.priceLarge");
and then you can get the text content of this element (note: val() reads a form field's value; for the text inside the <b> tag use text()):
String content = el.text();

The OP wrote in a question edit:
Thank you all for the responses, it was really helpful. Here is the answer:
package Project;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Project {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("url of link").get();
            String title = doc.title();
            System.out.println("title : " + title);
            String pricing = doc.getElementsByClass("priceLarge").text();
            // drop the leading currency symbol
            String str = pricing.substring(1);
            System.out.println("price : " + str);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

Related

How can I do web scraping in this case?

I am trying to scrape text from https://in-the-sky.org/data/object.php?id=A216&day=17&month=6&year=2022 so I wrote code like this:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String args[]) {
        int num = 216;
        int day = 17;
        int month = 6;
        int year = 2022;
        String url = "https://in-the-sky.org/data/object.php?id=A" + Integer.toString(num)
                + "&day=" + Integer.toString(day)
                + "&month=" + Integer.toString(month)
                + "&year=" + Integer.toString(year);
        System.out.println(url);
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("=======================================================");
        Elements element = doc.select("div.col-md-6 col-md-pull-6");
        String output = element.select("p").text();
        System.out.println(output);
        System.out.println("=======================================================");
    }
}
but it doesn't work well. I would like someone to help me, please.
I believe that you can use Elements element = doc.select("div.col-md-6 > p"); to get your desired output. (Note that "div.col-md-6 col-md-pull-6" is a descendant selector: it looks for a col-md-pull-6 element inside the div. Multiple classes on the same element are chained with dots, as in div.col-md-6.col-md-pull-6.)

My HTML fetcher program in java returns incomplete results

My Java code is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class celebGrepper {

    static class CelebData {
        URL link;
        String name;

        CelebData(URL link, String name) {
            this.link = link;
            this.name = name;
        }
    }

    public static String grepper(String url) {
        URL source;
        String data = null;
        try {
            source = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) source.openConnection();
            connection.connect();
            InputStream is = connection.getInputStream();
            /**
             * Attempting to fetch an entire line at a time instead of just a character each time!
             */
            StringBuilder str = new StringBuilder();
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            while ((data = br.readLine()) != null)
                str.append(data);
            data = str.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }

    public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
        ArrayList<CelebData> list = new ArrayList<CelebData>();
        Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            CelebData current = new CelebData(new URL(m.group(1)), m.group(2));
            list.add(current);
        }
        return list;
    }

    public static void main(String... args) throws MalformedURLException {
        String html = grepper("https://www.forbes.com/celebrities/list/");
        System.out.println("RAW Input: " + html);
        System.out.println("Start Grepping...");
        ArrayList<CelebData> celebList = parser(html);
        for (CelebData item : celebList) {
            System.out.println("Name:\t\t " + item.name);
            System.out.println("Image URL:\t " + item.link + "\n");
        }
        System.out.println("Grepping Done!");
    }
}
It's supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. However, when I compare the actual result below to the original page, I find the entire table that I need is missing! Is it because the page isn't completely loaded when I start getting the bytes from the page via the input stream? Please help me understand.
The Output of the page:
https://jsfiddle.net/e0771aLz/
What can I do to just extract the Image link and the names of the celebs?
I know it's extremely bad practice to try to parse HTML using regex, and that it is the stuff of nightmares, but on a certain video training course for Android that's exactly what the guy did, and I just want to follow along since it's just in this one lesson.

Using Jsoup to extract single value from page source

I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("http://www.mywebsite.com").get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);
Everything is described in the following snippet. I created an HTML file with a similar SPAN tag inside. Use Document.select() to select the elements with the specific class name that you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        String sourceDir = "C:/Users/admin/Desktop/test.html";
        test(sourceDir);
    }

    private static void test(String htmlFile) {
        File input = null;
        Document doc = null;
        Elements classEles = null;
        try {
            input = new File(htmlFile);
            doc = Jsoup.parse(input, "ASCII", "");
            doc.outputSettings().charset("ASCII");
            doc.outputSettings().escapeMode(EscapeMode.base);
            /** Find all SPAN elements with the matching CLASS name **/
            classEles = doc.select("span.label.label-info.pull-right");
            if (classEles.size() > 0) {
                String number = classEles.get(0).text();
                System.out.println("number: " + number);
            } else {
                System.out.println("No SPAN element found with class label label-info pull-right.");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Can you not use JavaScript regular-expression syntax? If you know the element you are interested in, extract it as a string $stuff from Jsoup, then just do:
$stuff.match( /Expecting (\d*)/ )[1]
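In Java the same extraction can be sketched with java.util.regex, applied to whatever string you pull out of Jsoup (the sample span text below is the one from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpectingExtractor {
    // Capture the digits following "Expecting "; returns null when no match is found.
    static String extract(String text) {
        Matcher m = Pattern.compile("Expecting (\\d+)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String span = "<span class=\"label label-info pull-right\">Expecting 937117</span>";
        System.out.println(extract(span)); // prints 937117
    }
}
```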
public void yourMethod() {
    try {
        Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").get();
        // class names are chained with dots in a CSS selector
        Elements value = doc.select("span.label.label-info.pull-right");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Protocol get Java URL

I'm trying to get, in JSON format, all the websites found when querying Google.
Code:
import java.io.InputStreamReader;
import java.net.URL;

/**
 * Created by Vlad on 19/03/14.
 */
public class Query {
    public static void main(String[] args) {
        try {
            String arg = "random";
            URL url = new URL("GET https://www.googleapis.com/customsearch/v1?key=&cx=017576662512468239146:omuauf_lfve&q=" + arg);
            InputStreamReader reader = new InputStreamReader(url.openStream(), "UTF-8");
            int ch;
            while ((ch = reader.read()) != -1) {
                System.out.print((char) ch); // cast needed: print(int) would print character codes
            }
        } catch (Exception e) {
            System.out.println("This ain't good");
            System.out.println(e);
        }
    }
}
Exception:
java.net.MalformedURLException: no protocol: GET https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random
You should delete the GET at the beginning ;)
You should replace your code with:
URL url = new URL("https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=" + arg);
A URL never starts with GET or POST or anything like that ;)
URLs are supposed to start with a transfer protocol, and GET https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random starts with GET; that is why the exception is thrown.
Change it to https://www.googleapis.com/customsearch/v1?key=AIzaSyCS26VtzuCs7bEpC821X_l0io_PHc4-8tY&cx=017576662512468239146:omuauf_lfve&q=random
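The failure mode can be demonstrated with only the standard library: java.net.URL throws MalformedURLException whenever the string does not begin with a known protocol, which is exactly what happens when an HTTP method name is prepended (the query string below is shortened for illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {
    public static void main(String[] args) {
        // An HTTP method prefix is not part of a URL, so parsing fails.
        try {
            new URL("GET https://www.googleapis.com/customsearch/v1?q=random");
            System.out.println("parsed");
        } catch (MalformedURLException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // The same string without the method prefix parses fine.
        try {
            URL ok = new URL("https://www.googleapis.com/customsearch/v1?q=random");
            System.out.println(ok.getProtocol()); // prints https
        } catch (MalformedURLException e) {
            System.out.println("unexpected");
        }
    }
}
```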

Grabbing text from websites

I have this small chunk of code that will grab the HTML code from a website. I'm interested in parsing a certain section of the code, though, several times. More specifically, I'm making a Pokédex and would like to parse certain descriptions from, say, a Bulbapedia page, http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon) for example. How would I make this parser take just the description of Bulbasaur? How would I create any boundary to stop and start?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebCrawler {
    public static void main(String[] args) {
        try {
            URL google = new URL("http://pokemondb.net/pokedex/bulbasaur");
            URLConnection yc = google.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You can use Jsoup; with this code you can get the description of Bulbasaur:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup
                .connect("http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)")
                .get();
        Elements newsHeadlines = doc.select("#mw-content-text p");
        for (Element o : newsHeadlines) {
            System.out.println(o.toString());
        }
    }
}
Here #mw-content-text is the main content div.
Try Jsoup. The selector syntax is jQuery-like.
