How can I parse a HTML string in Java?

Given the string "<table><tr><td>Hello World!</td></tr></table>", what is the (easiest) way to get a DOM Element representing it?

If you have a string that contains HTML, you can use the Jsoup library to get the HTML elements like this:
String htmlTable= "<table><tr><td>Hello World!</td></tr></table>";
Document doc = Jsoup.parse(htmlTable);
// then use something like this to get your element:
Elements tds = doc.getElementsByTag("td");
// tds will contain this one element: <td>Hello World!</td>
Good luck!

Here's a way:
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(reader, new HTMLTableParser(), true);
reader.close();
}
}
class HTMLTableParser extends HTMLEditorKit.ParserCallback {
private boolean encounteredATableRow = false;
public void handleText(char[] data, int pos) {
if(encounteredATableRow) System.out.println(new String(data));
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = true;
}
public void handleEndTag(HTML.Tag t, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = false;
}
}

You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion.
It is an open source tool and can be found on SourceForge.

You could use Swing:
How do you make use of the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how.

I've used Jericho HTML Parser. It's open source, tolerates badly formatted tags, and is lightweight.

I found this somewhere (don't remember where):
public static DocumentFragment parseXml(Document doc, String fragment)
{
// Wrap the fragment in an arbitrary element.
fragment = "<fragment>"+fragment+"</fragment>";
try
{
// Create a DOM builder and parse the fragment.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document d = factory.newDocumentBuilder().parse(
new InputSource(new StringReader(fragment)));
// Import the nodes of the new document into doc so that they
// will be compatible with doc.
Node node = doc.importNode(d.getDocumentElement(), true);
// Create the document fragment node to hold the new nodes.
DocumentFragment docfrag = doc.createDocumentFragment();
// Move the nodes into the fragment.
while (node.hasChildNodes())
{
docfrag.appendChild(node.removeChild(node.getFirstChild()));
}
// Return the fragment.
return docfrag;
}
catch (SAXException e)
{
// A parsing error occurred; the XML input is not valid.
}
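catch (ParserConfigurationException e)
{
// The parser could not be configured.
}
catch (IOException e)
{
// Reading the fragment failed.
}
// Parsing failed; the errors above are swallowed and null is returned.
return null;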
}
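For the string in the question, which happens to be well-formed XML, the same JDK parser can return a DOM Element directly. The following is a minimal sketch under that assumption; the class name WellFormedHtmlToDom and the method parseRoot are made up for illustration, and real-world HTML (unclosed tags, unquoted attributes) will be rejected by this parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: only works when the markup is well-formed XML.
public class WellFormedHtmlToDom {

    // Parses the markup with the JDK XML parser and returns its root element.
    public static Element parseRoot(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed markup ends up here; surface it instead of returning null.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element table = parseRoot("<table><tr><td>Hello World!</td></tr></table>");
        Element td = (Element) table.getElementsByTagName("td").item(0);
        System.out.println(table.getTagName() + " -> " + td.getTextContent());
    }
}
```

If the input may be sloppy HTML, one of the tolerant parsers mentioned in the other answers is probably the safer choice.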

One can use some of the javax.swing.text.html utility classes for parsing HTML.
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
//...
try {
String htmlString = "<html><head><title>Example Title</title></head><body>Some text...</body></html>";
HTMLEditorKit htmlEditKit = new HTMLEditorKit();
HTMLDocument htmlDocument = (HTMLDocument) htmlEditKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(new StringReader(htmlString),
htmlDocument.getReader(0), true);
// Use HTMLDocument here
System.out.println(htmlDocument.getProperty("title")); // Example Title
} catch(IOException e){
//Handle
e.printStackTrace();
}
See:
HTMLDocument
HTMLEditorKit

Related

How to only load text contents into JTextPane that are within the <body> tags?

Right now, I have a JTextPane in Java Swing that loads contents from a file into the pane. However, it loads everything including all the tags. I would like it to only load the contents. Is there a way to get to the tag and load the portion in between <body> and </body>?
Here is the code
public class LoadContent {
String path = "../WordProcessor_MadeInSwing/backups/testDir/cool_COPY3.rtf";
public void load(JTextPane jTextPane){
try {
FileReader fr = new FileReader(path);
BufferedReader reader = new BufferedReader(fr);
jTextPane.read(reader, path);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
catch(IOException e){
}
}
}
If my .rtf file contains the word "Here is a test", it will load as:
<html>
<head>
<style>
<!--
p.default {
family:Dialog;
size:3;
bold:normal;
italic:;
foreground:#333333;
}
-->
</style>
</head>
<body>
<p class=default>
<span style="color: #333333; font-size: 12pt; font-family: Dialog">
Here is a test
</span>
</p>
</body>
</html>
I only want it to load "Here is a test"

I would like it to only load the contents
Then you need to parse out the contents first before displaying the text.
Here is a simple example to display the text between the Span tags:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
class GetSpan
{
public static void main(String[] args)
throws Exception
{
// Create a reader on the HTML content
Reader reader = getReader( args[0] );
// Parse the HTML
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(reader, doc, 0);
// Find all the Span elements in the HTML document
HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.SPAN);
while (it.isValid())
{
int start = it.getStartOffset();
int end = it.getEndOffset();
String text = doc.getText(start, end - start);
System.out.println( text );
it.next();
}
}
// If 'uri' begins with "http:" treat as a URL,
// otherwise, treat as a local file.
static Reader getReader(String uri)
throws IOException
{
// Retrieve from Internet.
if (uri.startsWith("http"))
{
URLConnection conn = new URL(uri).openConnection();
return new InputStreamReader(conn.getInputStream());
}
// Retrieve from file.
else
{
return new FileReader(uri);
}
}
}
Just run the class with your file as the parameter.
Edit:
Just noticed the question has been changed to look for text in the <body> tag instead of the <span> tag. For some reason an iterator is not returned for the <body> tag.
So another option is to use a ParserCallback. The callback will notify you every time a starting tag (or ending tag) is found, or when text of any tag is found.
A basic example would be:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
private boolean isBody = false;
public void handleText(char[] data, int pos)
{
if (isBody)
System.out.println( data );
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.BODY))
{
isBody = true;
}
}
public static void main(String[] args)
throws Exception
{
Reader reader = getReader(args[0]);
ParserCallbackText parser = new ParserCallbackText();
new ParserDelegator().parse(reader, parser, true);
}
static Reader getReader(String uri)
throws IOException
{
// Retrieve from Internet.
if (uri.startsWith("http"))
{
URLConnection conn = new URL(uri).openConnection();
return new InputStreamReader(conn.getInputStream());
}
// Retrieve from file.
else
{
return new FileReader(uri);
}
}
}
The above example will ignore any text found in the <head> tag.

Try an HTML parser. jsoup is a nice one and very easy to use.
public static String extractText(Reader reader) throws IOException {
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(reader);
String line;
while ( (line=br.readLine()) != null) {
sb.append(line);
}
String textOnly = Jsoup.parse(sb.toString()).text();
return textOnly;
}

Create xml Elements and sub elements in Java using dom [duplicate]

I have to read and write to and from an XML file. What is the easiest way to read and write XML files using Java?

Here is a quick DOM example that shows how to read and write a simple xml file with its dtd:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE roles SYSTEM "roles.dtd">
<roles>
<role1>User</role1>
<role2>Author</role2>
<role3>Admin</role3>
<role4/>
</roles>
and the dtd:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT roles (role1,role2,role3,role4)>
<!ELEMENT role1 (#PCDATA)>
<!ELEMENT role2 (#PCDATA)>
<!ELEMENT role3 (#PCDATA)>
<!ELEMENT role4 (#PCDATA)>
First import these:
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.xml.sax.*;
import org.w3c.dom.*;
Here are a few variables you will need:
private String role1 = null;
private String role2 = null;
private String role3 = null;
private String role4 = null;
private ArrayList<String> rolev;
Here is a reader (String xml is the name of your xml file):
public boolean readXML(String xml) {
rolev = new ArrayList<String>();
Document dom;
// Make an instance of the DocumentBuilderFactory
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
// use the factory to take an instance of the document builder
DocumentBuilder db = dbf.newDocumentBuilder();
// parse using the builder to get the DOM mapping of the
// XML file
dom = db.parse(xml);
Element doc = dom.getDocumentElement();
role1 = getTextValue(role1, doc, "role1");
if (role1 != null) {
if (!role1.isEmpty())
rolev.add(role1);
}
role2 = getTextValue(role2, doc, "role2");
if (role2 != null) {
if (!role2.isEmpty())
rolev.add(role2);
}
role3 = getTextValue(role3, doc, "role3");
if (role3 != null) {
if (!role3.isEmpty())
rolev.add(role3);
}
role4 = getTextValue(role4, doc, "role4");
if ( role4 != null) {
if (!role4.isEmpty())
rolev.add(role4);
}
return true;
} catch (ParserConfigurationException pce) {
System.out.println(pce.getMessage());
} catch (SAXException se) {
System.out.println(se.getMessage());
} catch (IOException ioe) {
System.err.println(ioe.getMessage());
}
return false;
}
And here a writer:
public void saveToXML(String xml) {
Document dom;
Element e = null;
// instance of a DocumentBuilderFactory
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
// use factory to get an instance of document builder
DocumentBuilder db = dbf.newDocumentBuilder();
// create instance of DOM
dom = db.newDocument();
// create the root element
Element rootEle = dom.createElement("roles");
// create data elements and place them under root
e = dom.createElement("role1");
e.appendChild(dom.createTextNode(role1));
rootEle.appendChild(e);
e = dom.createElement("role2");
e.appendChild(dom.createTextNode(role2));
rootEle.appendChild(e);
e = dom.createElement("role3");
e.appendChild(dom.createTextNode(role3));
rootEle.appendChild(e);
e = dom.createElement("role4");
e.appendChild(dom.createTextNode(role4));
rootEle.appendChild(e);
dom.appendChild(rootEle);
try {
Transformer tr = TransformerFactory.newInstance().newTransformer();
tr.setOutputProperty(OutputKeys.INDENT, "yes");
tr.setOutputProperty(OutputKeys.METHOD, "xml");
tr.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tr.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "roles.dtd");
tr.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
// send DOM to file
tr.transform(new DOMSource(dom),
new StreamResult(new FileOutputStream(xml)));
} catch (TransformerException te) {
System.out.println(te.getMessage());
} catch (IOException ioe) {
System.out.println(ioe.getMessage());
}
} catch (ParserConfigurationException pce) {
System.out.println("UsersXML: Error trying to instantiate DocumentBuilder " + pce);
}
}
getTextValue is here:
private String getTextValue(String def, Element doc, String tag) {
String value = def;
NodeList nl;
nl = doc.getElementsByTagName(tag);
if (nl.getLength() > 0 && nl.item(0).hasChildNodes()) {
value = nl.item(0).getFirstChild().getNodeValue();
}
return value;
}
Add a few accessors and mutators and you are done!

Writing XML using JAXB (Java Architecture for XML Binding):
http://www.mkyong.com/java/jaxb-hello-world-example/
package com.mkyong.core;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
@XmlRootElement
public class Customer {
String name;
int age;
int id;
public String getName() {
return name;
}
@XmlElement
public void setName(String name) {
this.name = name;
}
public int getAge() {
return age;
}
@XmlElement
public void setAge(int age) {
this.age = age;
}
public int getId() {
return id;
}
@XmlAttribute
public void setId(int id) {
this.id = id;
}
}
package com.mkyong.core;
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;
public class JAXBExample {
public static void main(String[] args) {
Customer customer = new Customer();
customer.setId(100);
customer.setName("mkyong");
customer.setAge(29);
try {
File file = new File("C:\\file.xml");
JAXBContext jaxbContext = JAXBContext.newInstance(Customer.class);
Marshaller jaxbMarshaller = jaxbContext.createMarshaller();
// output pretty printed
jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
jaxbMarshaller.marshal(customer, file);
jaxbMarshaller.marshal(customer, System.out);
} catch (JAXBException e) {
e.printStackTrace();
}
}
}

The above answer only deals with the DOM parser, which normally reads the entire file into memory before parsing it; for a big file that is a problem. You could use a SAX parser instead, which uses less memory and is faster (although that depends on your code).
A SAX parser calls back into your code when it finds the start of an element, the end of an element, an attribute, text between elements, and so on, so you can pick out what you need while the document is being parsed.
Some example code:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
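As a rough illustration of that callback style, here is a minimal sketch that collects the text of every element with a given name while the document streams past. The class name SaxTextCollector and the method collect are made up for this example, and it assumes the matching elements are not nested inside each other.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper showing the SAX callback flow.
public class SaxTextCollector {

    // Returns the text content of every <tagName> element, in document order.
    public static List<String> collect(String xml, String tagName) {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a matching element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(tagName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(tagName) && current != null) {
                    values.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<roles><role1>User</role1><role2>Author</role2></roles>";
        System.out.println(collect(xml, "role2"));
    }
}
```

Nothing is kept in memory apart from the values you choose to store, which is where the memory advantage over DOM comes from.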

The answers only cover DOM / SAX and a copy-paste implementation of a JAXB example.
However, one big area of XML usage is missing. In many projects / programs there is a need to store / retrieve some basic data structures. Your program already has classes for your nice and shiny business objects / data structures, and you just want a comfortable way to convert this data to an XML structure so you can do more magic on it (store, load, send, manipulate with XSLT).
This is where XStream shines. You simply annotate the classes holding your data, or if you do not want to change those classes, you configure an XStream instance for marshalling (objects -> xml) or unmarshalling (xml -> objects).
Internally XStream uses reflection, the readObject and readResolve methods of standard Java object serialization.
You get a good and speedy tutorial here:
To give a short overview of how it works, I also provide some sample code which marshalls and unmarshalls a data structure.
The marshalling / unmarshalling happens all in the main method, the rest is just code to generate some test objects and populate some data to them.
It is super simple to configure the xStream instance and marshalling / unmarshalling is done with one line of code each.
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import com.thoughtworks.xstream.XStream;
public class XStreamIsGreat {
public static void main(String[] args) {
XStream xStream = new XStream();
xStream.alias("good", Good.class);
xStream.alias("pRoDuCeR", Producer.class);
xStream.alias("customer", Customer.class);
Producer a = new Producer("Apple");
Producer s = new Producer("Samsung");
Customer c = new Customer("Someone").add(new Good("S4", 10, new BigDecimal(600), s))
.add(new Good("S4 mini", 5, new BigDecimal(450), s)).add(new Good("I5S", 3, new BigDecimal(875), a));
String xml = xStream.toXML(c); // objects -> xml
System.out.println("Marshalled:\n" + xml);
Customer unmarshalledCustomer = (Customer)xStream.fromXML(xml); // xml -> objects
}
static class Good {
Producer producer;
String name;
int quantity;
BigDecimal price;
Good(String name, int quantity, BigDecimal price, Producer p) {
this.producer = p;
this.name = name;
this.quantity = quantity;
this.price = price;
}
}
static class Producer {
String name;
public Producer(String name) {
this.name = name;
}
}
static class Customer {
String name;
public Customer(String name) {
this.name = name;
}
List<Good> stock = new ArrayList<Good>();
Customer add(Good g) {
stock.add(g);
return this;
}
}
}

OK, with DOM, JAXB and XStream already in the list of answers, there is still a completely different way to read and write XML: data projection. You can decouple the XML structure and the Java structure by using a library that provides readable and writable views of the XML data as Java interfaces. From the tutorials:
Given some real world XML:
<weatherdata>
<weather
...
degreetype="F"
lat="50.5520210266113" lon="6.24060010910034"
searchlocation="Monschau, Stadt Aachen, NW, Germany"
... >
<current ... skytext="Clear" temperature="46"/>
</weather>
</weatherdata>
With data projection you can define a projection interface:
public interface WeatherData {
@XBRead("/weatherdata/weather/@searchlocation")
String getLocation();
@XBRead("/weatherdata/weather/current/@temperature")
int getTemperature();
@XBRead("/weatherdata/weather/@degreetype")
String getDegreeType();
@XBRead("/weatherdata/weather/current/@skytext")
String getSkytext();
/**
* This would be our "sub projection". A structure grouping two attribute
* values in one object.
*/
interface Coordinates {
@XBRead("@lon")
double getLongitude();
@XBRead("@lat")
double getLatitude();
}
@XBRead("/weatherdata/weather")
Coordinates getCoordinates();
}
And use instances of this interface just like POJOs:
private void printWeatherData(String location) throws IOException {
final String BaseURL = "http://weather.service.msn.com/find.aspx?outputview=search&weasearchstr=";
// We let the projector fetch the data for us
WeatherData weatherData = new XBProjector().io().url(BaseURL + location).read(WeatherData.class);
// Print some values
System.out.println("The weather in " + weatherData.getLocation() + ":");
System.out.println(weatherData.getSkytext());
System.out.println("Temperature: " + weatherData.getTemperature() + "°"
+ weatherData.getDegreeType());
// Access our sub projection
Coordinates coordinates = weatherData.getCoordinates();
System.out.println("The place is located at " + coordinates.getLatitude() + ","
+ coordinates.getLongitude());
}
This works even for creating XML, the XPath expressions can be writable.

A SAX parser works differently from a DOM parser: it neither loads the whole XML document into memory nor creates an object representation of it. Instead, the SAX parser uses callback methods (see org.xml.sax.helpers.DefaultHandler) to inform clients of the XML document structure.
A SAX parser is generally faster and uses less memory than a DOM parser.
See the following SAX callback methods:
startDocument() and endDocument() – Method called at the start and end of an XML document.
startElement() and endElement() – Method called at the start and end of a document element.
characters() – Method called with the text contents in between the start and end tags of an XML document element.
XML file
Create a simple XML file.
<?xml version="1.0"?>
<company>
<staff>
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff>
<firstname>low</firstname>
<lastname>yin fong</lastname>
<nickname>fong fong</nickname>
<salary>200000</salary>
</staff>
</company>
Java file
Use a SAX parser to parse the XML file.
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class ReadXMLFile {
public static void main(String argv[]) {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
boolean bfname = false;
boolean blname = false;
boolean bnname = false;
boolean bsalary = false;
public void startElement(String uri, String localName,String qName,
Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equalsIgnoreCase("FIRSTNAME")) {
bfname = true;
}
if (qName.equalsIgnoreCase("LASTNAME")) {
blname = true;
}
if (qName.equalsIgnoreCase("NICKNAME")) {
bnname = true;
}
if (qName.equalsIgnoreCase("SALARY")) {
bsalary = true;
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
System.out.println("End Element :" + qName);
}
public void characters(char ch[], int start, int length) throws SAXException {
if (bfname) {
System.out.println("First Name : " + new String(ch, start, length));
bfname = false;
}
if (blname) {
System.out.println("Last Name : " + new String(ch, start, length));
blname = false;
}
if (bnname) {
System.out.println("Nick Name : " + new String(ch, start, length));
bnname = false;
}
if (bsalary) {
System.out.println("Salary : " + new String(ch, start, length));
bsalary = false;
}
}
};
saxParser.parse("c:\\file.xml", handler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Result
Start Element :company
Start Element :staff
Start Element :firstname
First Name : yong
End Element :firstname
Start Element :lastname
Last Name : mook kim
End Element :lastname
Start Element :nickname
Nick Name : mkyong
End Element :nickname
and so on...
Source (mkyong): http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/

Java PDFBox list all named destinations of a page

For my Java project I need to list all named destinations of a PDF page.
The PDF and its named destination are created with LaTeX (using the hypertarget command), e.g. as follows:
\documentclass[12pt]{article}
\usepackage{hyperref}
\begin{document}
\hypertarget{myImportantString}{} % the anchor/named destination to be extracted "myImportantString"
Empty example page
\end{document}
How do I extract all named destinations of a specific page of this PDF document with the PDFBox library version 2.0.11?
I could not find any working code for this problem on the internet or in the PDFBox examples. This is my current (minified) code:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import java.io.File;
import java.util.List;
public class ExtractNamedDests {
public static void main(String[] args) {
try {
int c = 1;
PDDocument document = PDDocument.load(new File("<path to PDF file>"));
for (PDPage page : document.getPages()) {
System.out.println("Page " + c + ":");
// named destinations seem to be no type of annotations since the list is always empty:
List<PDAnnotation> annotations = page.getAnnotations();
System.out.println(" Count annotations: " + annotations.size());
// How to extract named destinations??
}
}catch(Exception e){
e.printStackTrace();
}
}
}
In this example I want to extract the String "myImportantString" from the page in Java.
EDIT: Here is the example PDF file. I use PDFBox version 2.0.11.

I found a solution with the great help of Tilman Hausherr. It uses the code he suggested in his comments.
The method getAllNamedDestinations() returns a map of all named destinations in the document (not annotations) with name and destination. Named destinations can be deeply nested in the document. Therefore, the method traverseKids() recursively finds all nested named destinations.
public static Map<String, PDPageDestination> getAllNamedDestinations(PDDocument document){
Map<String, PDPageDestination> namedDestinations = new HashMap<>(10);
// get catalog
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDDocumentNameDictionary names = documentCatalog.getNames();
if(names == null)
return namedDestinations;
PDDestinationNameTreeNode dests = names.getDests();
try {
if (dests.getNames() != null)
namedDestinations.putAll(dests.getNames());
} catch (Exception e){ e.printStackTrace(); }
List<PDNameTreeNode<PDPageDestination>> kids = dests.getKids();
traverseKids(kids, namedDestinations);
return namedDestinations;
}
private static void traverseKids(List<PDNameTreeNode<PDPageDestination>> kids, Map<String, PDPageDestination> namedDestinations){
if(kids == null)
return;
try {
for(PDNameTreeNode<PDPageDestination> kid : kids){
if(kid.getNames() != null){
try {
namedDestinations.putAll(kid.getNames());
} catch (Exception e){ System.out.println("INFO: Duplicate named destinations in document."); e.printStackTrace(); }
}
if (kid.getKids() != null)
traverseKids(kid.getKids(), namedDestinations);
}
} catch (Exception e){
e.printStackTrace();
}
}

How can I parse String containing XML tags , so that I can get value of all the sub tags

I have to parse a String containing XML tags, like the one hard-coded below, so that I can get the value of each tag separately. When I use
NodeList node = doc.getElementsByTagName("event");
it returns the value "ajain1AnkitJain24-04-199223:09.08".
I want to retrieve the value of each tag and store it in a separate variable.
For example, in this scenario I want to store the values as:
String UID = ajain1
String FirstName = Ankit
String LastName = Jain
Date date = "24-04-1992 23:09.08"
Here is the Sample code I am working on.
package test;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class Demo {
public static void main(String[] args) {
String xmldata = "<event><class></class><data><UID><![CDATA[ajain1]]></UID><FIRSTNAME><![CDATA[Ankit]]></FIRSTNAME><LASTNAME><![CDATA[Jain]]></LASTNAME><DATE><![CDATA[24-04-1992]]></DATE><TIME><![CDATA[23:09.08]]></TIME></data></event>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = null;
try {
db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmldata));
try {
Document doc = db.parse(is);
//String message = doc.getDocumentElement().getTextContent();
//System.out.println(message);
NodeList node = doc.getElementsByTagName("event");
} catch (SAXException e) {
// handle SAXException
} catch (IOException e) {
// handle IOException
}
} catch (ParserConfigurationException e1) {
// handle ParserConfigurationException
}
// TODO Auto-generated method stub
}
}
Thanks and let me know if you require any more information.

A NodeList already is a list containing all requested nodes, but I have to admit, I find its implementation highly questionable. It's basically a node containing the requested nodes as children. Its implementation has almost nothing in common with other list implementations; it doesn't even implement the List interface. I don't know exactly how to handle <![CDATA[...]]>, but to loop through all event tags you'd have to do something like this:
NodeList eventList = doc.getElementsByTagName("event");
for(int i = 0; i < eventList.getLength(); i++) {
Element eventElement = (Element) eventList.item(i);
// do some stuff with it
}
From this element, you can also use getElementsByTagName to get the information needed about first name and so on. And yes, it's likely to end up with many nested loops...
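For the XML in the question, the nested lookups can stay fairly small, and Node.getTextContent() returns the text of a CDATA section as well. The following is a minimal sketch using only the JDK; the class name EventFields and the method parse are invented for illustration, and it assumes the values of interest are direct children of the first data element.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: maps each child tag of <data> to its text.
public class EventFields {

    // Returns tag name -> text for every child element of the first <data> element.
    public static Map<String, String> parse(String xml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList dataNodes = doc.getElementsByTagName("data");
            if (dataNodes.getLength() == 0) {
                return fields;
            }
            NodeList children = dataNodes.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // getTextContent() also covers text wrapped in CDATA sections.
                    fields.put(child.getNodeName(), child.getTextContent());
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xmldata = "<event><class></class><data><UID><![CDATA[ajain1]]></UID>"
                + "<FIRSTNAME><![CDATA[Ankit]]></FIRSTNAME><LASTNAME><![CDATA[Jain]]></LASTNAME>"
                + "<DATE><![CDATA[24-04-1992]]></DATE><TIME><![CDATA[23:09.08]]></TIME></data></event>";
        Map<String, String> fields = parse(xmldata);
        String uid = fields.get("UID");
        String firstName = fields.get("FIRSTNAME");
        String lastName = fields.get("LASTNAME");
        String dateTime = fields.get("DATE") + " " + fields.get("TIME");
        System.out.println(uid + ", " + firstName + ", " + lastName + ", " + dateTime);
    }
}
```

Turning the combined date and time string into a Date would need a format pattern chosen to match the input, which is left out here.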

Using Jsoup to extract single value from page source

I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("http://www.mywebsite.com").get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);

Everything is described in the following snippet. I created an HTML file with a similar SPAN tag inside. Use Document.select() to select the elements with the class names you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;
public static void main(String[] args) {
String sourceDir = "C:/Users/admin/Desktop/test.html";
test(sourceDir);
}
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Elements classEles = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Find all SPAN element with matched CLASS name **/
classEles = doc.select("span.label.label-info.pull-right");
if (classEles.size() > 0) {
String number = classEles.get(0).text();
System.out.println("number: " + number);
}
else {
System.out.println("No SPAN element found with class label label-info pull-right.");
}
} catch (Exception e) {
e.printStackTrace();
}
}

Can you not use JavaScript regular expression syntax? If you know the element you are interested in, extract it as a string $stuff from jsoup, then just do
$stuff.match( /Expecting (\d*)/ )[1]
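Since the rest of the thread is in Java, the same idea can be sketched with java.util.regex. This is only an illustration: the class name ExpectingNumber and the method extract are made up, and the input is assumed to be the text of the span, however it was obtained. It uses \d+ rather than \d* so that an empty match is not reported as a number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the digits that follow the word "Expecting".
public class ExpectingNumber {

    private static final Pattern EXPECTING = Pattern.compile("Expecting (\\d+)");

    // Returns the captured digits, or null when the text does not match.
    public static String extract(String text) {
        Matcher matcher = EXPECTING.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Expecting 937117")); // prints 937117
    }
}
```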

public void yourMethod() {
try {
Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").get();
// Chain the class names with dots; spaces would select descendant elements instead.
Elements value = doc.select("span.label.label-info.pull-right");
} catch (IOException e) {
e.printStackTrace();
}
}
