Using Jsoup to extract single value from page source - java

I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("http://www.mywebsite.com").get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);

The following snippet shows how. I created an HTML file with a similar SPAN tag inside. Use Document.select() to select the elements that have the specific class names you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;
// Class name chosen for illustration; the original snippet had no enclosing class.
public class JsoupSpanExample {
    public static void main(String[] args) {
        String sourceDir = "C:/Users/admin/Desktop/test.html";
        test(sourceDir);
    }
    private static void test(String htmlFile) {
        try {
            File input = new File(htmlFile);
            Document doc = Jsoup.parse(input, "ASCII", "");
            doc.outputSettings().charset("ASCII");
            doc.outputSettings().escapeMode(EscapeMode.base);
            // Find all SPAN elements that carry all three class names
            Elements classEles = doc.select("span.label.label-info.pull-right");
            if (classEles.size() > 0) {
                // text() returns "Expecting 937117"; strip the prefix to keep only the number
                String number = classEles.get(0).text().replace("Expecting", "").trim();
                System.out.println("number: " + number);
            } else {
                System.out.println("No SPAN element found with class label label-info pull-right.");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Could you use a regular expression? If you know the element you are interested in, extract its text from Jsoup as a string (called $stuff here), then apply a pattern like this one, written in JavaScript regular expression syntax:
$stuff.match( /Expecting (\d*)/ )[1]
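
The line above is JavaScript syntax, so it will not compile in Java. The same idea can be written with java.util.regex. This is a minimal sketch, assuming the span text has already been pulled out as a string; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpectingExtractor {
    // The literal word "Expecting", some whitespace, then one or more captured digits.
    private static final Pattern EXPECTING = Pattern.compile("Expecting\\s+(\\d+)");

    /** Returns the digits that follow "Expecting", or null when the text does not match. */
    static String extractNumber(String text) {
        Matcher m = EXPECTING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Text as it would come back from the span in the question.
        String spanText = "Expecting 937117";
        System.out.println(extractNumber(spanText));
    }
}
```

Using \d+ rather than \d* means a span that contains "Expecting" with no digits yields null instead of an empty string.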

public void yourMethod() {
try {
Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").get();
// Chain the class names with dots; spaces would be read as descendant selectors
Elements value = doc.select("span.label.label-info.pull-right");
} catch (IOException e) {
e.printStackTrace();
}
}

Related

Java PDFBox list all named destinations of a page

For my Java project I need to list all named destinations of a PDF page.
The PDF and its named destination are created with LaTeX (using the hypertarget command), e.g. as follows:
\documentclass[12pt]{article}
\usepackage{hyperref}
\begin{document}
\hypertarget{myImportantString}{} % the anchor/named destination to be extracted "myImportantString"
Empty example page
\end{document}
How do I extract all named destinations of a specific page of this PDF document with the PDFBox library version 2.0.11?
I could not find any working code for this problem on the internet or in the PDFBox examples. This is my current (minimal) code:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import java.io.File;
import java.util.List;
public class ExtractNamedDests {
public static void main(String[] args) {
try {
int c = 1;
PDDocument document = PDDocument.load(new File("<path to PDF file>"));
for (PDPage page : document.getPages()) {
System.out.println("Page " + c + ":");
// named destinations seem to be no type of annotations since the list is always empty:
List<PDAnnotation> annotations = page.getAnnotations();
System.out.println(" Count annotations: " + annotations.size());
// How to extract named destinations??
}
}catch(Exception e){
e.printStackTrace();
}
}
}
In this example I want to extract the String "myImportantString" from the page in Java.
EDIT: Here is the example PDF file. I use PDFBox version 2.0.11.
I found a solution with the great help of Tilman Hausherr. It uses the code he suggested in his comments.
The method getAllNamedDestinations() returns a map of all named destinations in the document (not annotations) with name and destination. Named destinations can be deeply nested in the document. Therefore, the method traverseKids() recursively finds all nested named destinations.
public static Map<String, PDPageDestination> getAllNamedDestinations(PDDocument document){
Map<String, PDPageDestination> namedDestinations = new HashMap<>(10);
// get catalog
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDDocumentNameDictionary names = documentCatalog.getNames();
if(names == null)
return namedDestinations;
PDDestinationNameTreeNode dests = names.getDests();
// The name dictionary may exist without a Dests entry
if(dests == null)
return namedDestinations;
try {
if (dests.getNames() != null)
namedDestinations.putAll(dests.getNames());
} catch (Exception e){ e.printStackTrace(); }
List<PDNameTreeNode<PDPageDestination>> kids = dests.getKids();
traverseKids(kids, namedDestinations);
return namedDestinations;
}
private static void traverseKids(List<PDNameTreeNode<PDPageDestination>> kids, Map<String, PDPageDestination> namedDestinations){
if(kids == null)
return;
try {
for(PDNameTreeNode<PDPageDestination> kid : kids){
if(kid.getNames() != null){
try {
namedDestinations.putAll(kid.getNames());
} catch (Exception e){ System.out.println("INFO: Duplicate named destinations in document."); e.printStackTrace(); }
}
if (kid.getKids() != null)
traverseKids(kid.getKids(), namedDestinations);
}
} catch (Exception e){
e.printStackTrace();
}
}

Java extracting every HTML content in a String list

I'm currently making my first RSS reader for Android in Java, and I have a problem.
I can fetch the RSS data, but it is full of HTML tags.
What I need is to extract the content inside these tags and put it in a list of strings, but I don't know how to do that.
Can you help me with this?
Thanks in advance
If your RSS feed is in XML format, you can use dom4j (dom4j.jar):
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
public class test {
public static void main(String[] args) throws Exception {
String rssUrl = ""; // paste url here
List<RssDocument> docList = new ArrayList<RssDocument>();
try
{
SAXReader saxReader = new SAXReader();
Document document = saxReader.read(rssUrl);
Element channel = (Element) document.getRootElement().element("channel");
for (Iterator i = channel.elementIterator("item"); i.hasNext();)
{
Element element = (Element) i.next();
String title = element.elementText("title");
String pubDate = element.elementText("pubDate");
String description = element.elementText("description");
RssDocument doc = new RssDocument(title, pubDate, description);
docList.add(doc);
}
}
catch (Exception e)
{
e.printStackTrace();
}
// do something with docList
}
public static class RssDocument {
String title;
String pubDate;
String description;
RssDocument(String title, String pubDate, String description) {
this.title = title;
this.pubDate = pubDate;
this.description = description;
}
}
}
Paste your RSS URL into the variable "rssUrl" and run this main method. You will get a list of RssDocument objects, each containing a title, published date and description.
If you only need the title and description of every RSS item, use the following code.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
public class test {
public static void main(String[] args) throws Exception {
String rssUrl = ""; // paste url here
List<String> strList = new ArrayList<String>();
try
{
SAXReader saxReader = new SAXReader();
Document document = saxReader.read(rssUrl);
Element channel = (Element) document.getRootElement().element("channel");
for (Iterator i = channel.elementIterator("item"); i.hasNext();)
{
Element element = (Element) i.next();
String title = element.elementText("title").replaceAll("\\<.*?>","");
String description = element.elementText("description").replaceAll("\\<.*?>","");
strList.add(title + " " + description);
}
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
Then strList will be a list of strings, each containing a title and description with the HTML tags removed.
For example:
{
"title1 description1"
"title2 description2"
"title3 description3"
}
Assuming you have HTML content in a string called htmlString, you can clean it with a regular expression.
String htmlString = "<tr><td>12345</td></tr>";
String noHTMLString = htmlString.replaceAll("\\<.*?>","");
This should extract a list of all contents between html tags into the list called matches. You should modify the regex in brackets to match your content. The current version only matches text containing digits, letters, dots, commas, brackets, minuses and spaces.
Pattern pattern = Pattern.compile("<\\w+>([\\w\\s\\.,\\-\\(\\)]+)</\\w+>");
Matcher matcher = pattern.matcher(content);
List<String> matches = new ArrayList<String>();
while(matcher.find()){
matches.add(matcher.group(1));
}
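
The two regex approaches above can be put side by side in one runnable class. This is only a sketch: the class and method names are invented for illustration, and regular expressions are fragile on real-world HTML (for example, the second pattern does not match tags that carry attributes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTextExtractor {
    // Same pattern as above: an attribute-free opening tag, simple text, then a closing tag.
    private static final Pattern TAG_CONTENT =
            Pattern.compile("<\\w+>([\\w\\s\\.,\\-\\(\\)]+)</\\w+>");

    /** Removes everything that looks like a tag and keeps the remaining text. */
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    /** Collects the text found directly between an opening and a closing tag. */
    static List<String> contentsBetweenTags(String html) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = TAG_CONTENT.matcher(html);
        while (matcher.find()) {
            matches.add(matcher.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<p>First item</p><p>Second item</p>";
        System.out.println(stripTags(html));
        System.out.println(contentsBetweenTags(html));
    }
}
```

stripTags returns one flattened string, while contentsBetweenTags keeps each piece of text as a separate list entry, which is closer to what the question asks for.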

Trying jsoup: getting "cannot find symbol" when I just declared it the line before

I need to find a value in an html table, so right now I'm trying to get the hang of JSoup.
I'm trying to implement this code: http://jsoup.org/cookbook/extracting-data/dom-navigation
I've implemented the first two lines, but the third one (Element content = doc.getElementById("content");) causes an error: "cannot find symbol. symbol: variable doc. location: class testparse".
Here is my code:
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
public class testparse {
public static void main(String[] args){
try
{
File input = new File("abc.htm");
Document doc = Jsoup.parse(input, "UTF-8", "");
}
catch(IOException exc){
System.out.println(exc);
}
Element content = doc.getElementById("content");
}
}
All help is greatly appreciated!
The scope of the variables is within the try block:
try
{
File input = new File("abc.htm");
Document doc = Jsoup.parse(input, "UTF-8", "");
}
To fix, declare the variables outside:
File input = null;
Document doc = null;
try
{
input = new File("abc.htm");
doc = Jsoup.parse(input, "UTF-8", "");
}
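
One consequence of declaring the variable outside the try block is that it stays null when the code inside the block throws, so it is worth checking before use. Here is a minimal sketch of the same pattern using only standard-library classes. Integer.valueOf stands in for Jsoup.parse, and the class and method names are made up for illustration.

```java
class TryScopeDemo {
    /** Returns the parsed value, or null when the input is not a number. */
    static Integer parseOrNull(String raw) {
        Integer value = null; // declared outside the try block so it is visible afterwards
        try {
            value = Integer.valueOf(raw);
        } catch (NumberFormatException exc) {
            System.out.println(exc);
        }
        // Still in scope here; it is null if the try block failed.
        return value;
    }

    public static void main(String[] args) {
        Integer parsed = parseOrNull("abc");
        if (parsed == null) {
            System.out.println("nothing to work with");
        } else {
            System.out.println(parsed);
        }
    }
}
```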

Specific data mining using scanner

I'm trying to build a program that would take the page source from a website and only store a snippet of code.
package Program;
import java.net.*;
import java.util.*;
public class Program {
public static void main(String[] args) {
String site = "http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294";
try {
URL url = new URL(site);
URLConnection connection = url.openConnection();
connection.connect();
Scanner in = new Scanner(connection.getInputStream());
while (in.hasNextLine()) {
System.out.println(in.nextLine());
}
} catch (Exception e) {
System.out.println(e);
}
}
}
So far this will only display the code in the output. I would like the program to search for a specific string and display only the price.
e.g.
<tr id="actualPriceRow">
<td id="actualPriceLabel" class="priceBlockLabelPrice">Price:</td>
<td id="actualPriceContent"><span id="actualPriceValue"><b class="priceLarge">£599.99</b></span>
<span id="actualPriceExtraMessaging">
search for class="priceLarge"> and only display/store 599.99
I know that there are similar questions on the website however I don't really understand any php and would like a java solution although any solution is welcome :)
You can use a parsing library, e.g. Jsoup:
Document document = Jsoup.connect("http://www.amazon.co.uk/gp/product/B00BE4OUBG/ref=s9_ri_gw_g63_ir01?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-5&pf_rd_r=0GJRXWMKNC5559M5W2GB&pf_rd_t=101&pf_rd_p=394918607&pf_rd_i=468294").get();
Then you can search for a specific element:
Elements el = document.select("b.priceLarge");
and then you can get the text content of this element with text() (val() only reads the value of form elements):
String content = el.text();
The OP wrote in a question edit:
Thank you all for responses it was really helpful and here is the answer:
package Project;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Project {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
Document doc;
try {
doc = Jsoup.connect("url of link").get();
String title = doc.title();
System.out.println("title : " + title);
String pricing = doc.getElementsByClass("priceLarge").text();
String str = pricing;
str = str.substring(1);
System.out.println("price : " + str);
} catch (Exception e) {
System.out.println(e);
}
}
}
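
The substring(1) call above assumes the price text starts with exactly one currency character. A slightly more tolerant sketch, assuming the text contains a single decimal number, drops everything except digits and the decimal point before parsing. The class and method names are invented for illustration.

```java
import java.math.BigDecimal;

class PriceParser {
    /** Keeps only digits and the decimal point, then parses what is left. */
    static BigDecimal parsePrice(String priceText) {
        String numeric = priceText.replaceAll("[^0-9.]", "");
        // Throws NumberFormatException if no number is left, e.g. for an empty string.
        return new BigDecimal(numeric);
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign, written as an escape to avoid source-encoding problems.
        System.out.println(parsePrice("\u00a3599.99"));
    }
}
```

BigDecimal is used instead of double so that the price keeps its exact decimal value.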

How can I parse a HTML string in Java?

Given the string "<table><tr><td>Hello World!</td></tr></table>", what is the (easiest) way to get a DOM Element representing it?
If you have a string which contains HTML you can use Jsoup library like this to get HTML elements:
String htmlTable= "<table><tr><td>Hello World!</td></tr></table>";
Document doc = Jsoup.parse(htmlTable);
// then use something like this to get your element:
Elements tds = doc.getElementsByTag("td");
// tds will contain this one element: <td>Hello World!</td>
Good luck!
Here's a way:
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(reader, new HTMLTableParser(), true);
reader.close();
}
}
class HTMLTableParser extends HTMLEditorKit.ParserCallback {
private boolean encounteredATableRow = false;
public void handleText(char[] data, int pos) {
if(encounteredATableRow) System.out.println(new String(data));
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = true;
}
public void handleEndTag(HTML.Tag t, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = false;
}
}
You could use HTML Parser, a Java library used to parse HTML in either a linear or nested fashion.
It is an open source tool and can be found on SourceForge.
You could use Swing:
How do you make use of the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how.
I've used Jericho HTML Parser. It's open source, tolerates badly formatted tags, and is lightweight.
I found this somewhere (don't remember where):
public static DocumentFragment parseXml(Document doc, String fragment)
{
// Wrap the fragment in an arbitrary element.
fragment = "<fragment>"+fragment+"</fragment>";
try
{
// Create a DOM builder and parse the fragment.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document d = factory.newDocumentBuilder().parse(
new InputSource(new StringReader(fragment)));
// Import the nodes of the new document into doc so that they
// will be compatible with doc.
Node node = doc.importNode(d.getDocumentElement(), true);
// Create the document fragment node to hold the new nodes.
DocumentFragment docfrag = doc.createDocumentFragment();
// Move the nodes into the fragment.
while (node.hasChildNodes())
{
docfrag.appendChild(node.removeChild(node.getFirstChild()));
}
// Return the fragment.
return docfrag;
}
catch (SAXException e)
{
// A parsing error occurred; the XML input is not valid.
}
catch (ParserConfigurationException e)
{
}
catch (IOException e)
{
}
return null;
}
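
For the exact string in the question, the same DocumentBuilderFactory approach can return the root DOM Element directly. This is a minimal sketch that only works when the markup is well-formed XML, which typical real-world HTML often is not. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlStringToDom {
    /** Parses a well-formed markup string and returns its root element. */
    static Element parseRoot(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document d = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return d.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Report the cause instead of silently returning null.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element table = parseRoot("<table><tr><td>Hello World!</td></tr></table>");
        System.out.println(table.getTagName());
        System.out.println(table.getElementsByTagName("td").item(0).getTextContent());
    }
}
```

Unlike the empty catch blocks above, this version surfaces parse failures as an exception, so a caller can tell malformed input apart from an empty result.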
One can use some of the javax.swing.text.html utility classes for parsing HTML.
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
//...
try {
String htmlString = "<html><head><title>Example Title</title></head><body>Some text...</body></html>";
HTMLEditorKit htmlEditKit = new HTMLEditorKit();
HTMLDocument htmlDocument = (HTMLDocument) htmlEditKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(new StringReader(htmlString),
htmlDocument.getReader(0), true);
// Use HTMLDocument here
System.out.println(htmlDocument.getProperty("title")); // Example Title
} catch(IOException e){
//Handle
e.printStackTrace();
}
See:
HTMLDocument
HTMLEditorKit
