Problem querying an HTML file using HTMLEditorKit in Java


My HTML contains tags of the following form:
<div class="author"><a href="/user/1" title="View user profile.">Apple</a> - October 22, 2009 - 01:07</div>
I'd like to extract the date, "October 22, 2009 - 01:07" in this example, from each tag.
I've implemented javax.swing.text.html.HTMLEditorKit.ParserCallback as follows:
class HTMLParseListerInner extends HTMLEditorKit.ParserCallback {
private ArrayList<String> foundDates = new ArrayList<String>();
private boolean isDivLink = false;
public void handleText(char[] data, int pos) {
if(isDivLink)
foundDates.add(new String(data)); // Extracts "Apple" instead of the date.
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if (t == HTML.Tag.DIV && divValue != null && divValue.equals("author"))
isDivLink = true;
}
}
However, the above parser returns "Apple" which is inside a hyperlink within the tag. How can I fix the parser to extract the date?

Override handleEndTag and check for "a"?
Note, however, that this HTML parser dates from the late 1990s and its callback methods are not well specified.

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class ParserCallbackDiv extends HTMLEditorKit.ParserCallback
{
private boolean isDivLink = false;
private String divText;
public void handleEndTag(HTML.Tag tag, int pos)
{
if (tag.equals(HTML.Tag.DIV) && isDivLink) // only report the div we flagged in handleStartTag
{
System.out.println( divText );
isDivLink = false;
}
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.DIV))
{
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if ("author".equals(divValue))
isDivLink = true;
}
}
public void handleText(char[] data, int pos)
{
divText = new String(data);
}
public static void main(String[] args)
throws IOException
{
String file = "<div class=\"author\"><a href=\"/user/1\" " +
"title=\"View user profile.\">Apple</a> - October 22, 2009 - 01:07</div>";
StringReader reader = new StringReader(file);
ParserCallbackDiv parser = new ParserCallbackDiv();
try
{
new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
System.out.println(e);
}
}
}
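The suggestion to check for "a" can be sketched as follows. This is a minimal sketch, not the original answer's code: the class name AuthorDateExtractor and its extract method are hypothetical, and it assumes that the date is the only text inside div.author that is not wrapped in a link.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps text found inside <div class="author"> but outside <a>.
final class AuthorDateExtractor extends HTMLEditorKit.ParserCallback {
    private final List<String> dates = new ArrayList<>();
    private boolean inAuthorDiv = false; // currently inside <div class="author">
    private boolean inLink = false;      // currently inside an <a> element

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.DIV) {
            inAuthorDiv = "author".equals(attrs.getAttribute(HTML.Attribute.CLASS));
        } else if (tag == HTML.Tag.A) {
            inLink = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.A) {
            inLink = false;
        } else if (tag == HTML.Tag.DIV) {
            inAuthorDiv = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inAuthorDiv && !inLink) {
            // Drop the "- " separator that follows the link text.
            String text = new String(data).replaceFirst("^[\\s-]+", "").trim();
            if (!text.isEmpty()) {
                dates.add(text);
            }
        }
    }

    // Parses the given HTML and returns the collected date strings.
    static List<String> extract(String html) {
        AuthorDateExtractor callback = new AuthorDateExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.dates;
    }

    public static void main(String[] args) {
        String html = "<div class=\"author\"><a href=\"/user/1\" "
                + "title=\"View user profile.\">Apple</a> - October 22, 2009 - 01:07</div>";
        System.out.println(extract(html));
    }
}
```

Tracking two flags keeps the link text ("Apple") out of the result without relying on the order in which handleText is called.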

Related

arraylist.isEmpty works only in a loop and not outside

In this class I store a number and content belonging to this number. In the method deleteContentIfOlderThen() I check if the ArrayList(opexContent) is empty. But this works only if the check occurs inside the for loop. If I check it outside it doesn't work. I use a workStealingExecutorService to execute this method. But the list is only accessed by this class while the threads do the work.
So what am I doing wrong that it doesn't work outside the loop?
version with list.removeIf():
package org.excelAnalyser;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.function.Predicate;
public class OpexLine implements Runnable{
public static final int DELETE_CONTENT_TASK=0;
public final int SAPNumber;
public final ArrayList<String[]> opexContent;
private final DeletionRequestHandler m_deletionRequestHandler;
private String[] m_taskParameters=null;
private int m_taskid=-1;
public OpexLine(int SAPNumber, ArrayList<String[]> opexContent, DeletionRequestHandler deletionRequestHandler) {
this.SAPNumber=SAPNumber;
this.opexContent=opexContent;
m_deletionRequestHandler=deletionRequestHandler;
}
public void setupTask(int taskID, String[] parameters) {
m_taskid=taskID;
m_taskParameters=parameters;
}
public void run() {
if(m_taskid<0) {
System.out.println("aborted task because no taskID was given");
return;
}
if(m_taskParameters==null) {
System.out.println("no taskparameters were specified");
return;
}
switch (m_taskid) {
case DELETE_CONTENT_TASK:
deleteContentIfOlderThen();
break;
}
}
private void deleteContentIfOlderThen() {
final long unixTime=Long.parseLong(m_taskParameters[0]);
final int indexOfTime=Integer.parseInt(m_taskParameters[1]);
final SimpleDateFormat sdf = new SimpleDateFormat("dd.MM.yyyy");
final Date checkDate=new Date(System.currentTimeMillis()-unixTime);
opexContent.removeIf(new Predicate<String[]>() {
public boolean test(String[] content) {
final String timeString=content[indexOfTime];
if(timeString==null || timeString.isEmpty()) {
return false;
}
Date date = null;
try {
date = sdf.parse(timeString);
} catch (Exception e) {
return false;
//date = new Date();
}
return date.before(checkDate);
}
});
if(opexContent.isEmpty()) {
m_deletionRequestHandler.requestDeletionFor(this);
}
}
public String toString() {
final StringBuilder stringBuilder=new StringBuilder();
stringBuilder.append(SAPNumber);
stringBuilder.append(": ");
final int length=String.valueOf(SAPNumber).length();
//gen offset
final StringBuilder offsetBuilder=new StringBuilder();
for(int i=0; i<length+2; i++) {
offsetBuilder.append(' ');
}
final String offset=offsetBuilder.toString();
for (String[] strings : opexContent) {
for (String string : strings) {
stringBuilder.append(string+"; ");
}
stringBuilder.append("\n"+offset);
}
return stringBuilder.toString();
}
}
version with iterator:
package org.excelAnalyser;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
public class OpexLine implements Runnable{
public static final int DELETE_CONTENT_TASK=0;
public final int SAPNumber;
public final ArrayList<String[]> opexContent;
private final DeletionRequestHandler m_deletionRequestHandler;
private String[] m_taskParameters=null;
private int m_taskid=-1;
public OpexLine(int SAPNumber, ArrayList<String[]> opexContent, DeletionRequestHandler deletionRequestHandler) {
this.SAPNumber=SAPNumber;
this.opexContent=opexContent;
m_deletionRequestHandler=deletionRequestHandler;
}
public void setupTask(int taskID, String[] parameters) {
m_taskid=taskID;
m_taskParameters=parameters;
}
public void run() {
if(m_taskid<0) {
System.out.println("aborted task because no taskID was given");
return;
}
if(m_taskParameters==null) {
System.out.println("no taskparameters were specified");
return;
}
switch (m_taskid) {
case DELETE_CONTENT_TASK:
deleteContentIfOlderThen();
break;
}
}
private void deleteContentIfOlderThen() {
final long unixTime=Long.parseLong(m_taskParameters[0]);
final int indexOfTime=Integer.parseInt(m_taskParameters[1]);
final SimpleDateFormat sdf = new SimpleDateFormat("dd.MM.yyyy");
Iterator<String[]> opexContentIterator=opexContent.iterator();
contentLoop: while(opexContentIterator.hasNext()) {
String[] content=(String[])opexContentIterator.next();
final String timeString=content[indexOfTime];
if(timeString==null || timeString.isEmpty()) {
continue contentLoop;
}
Date date = null;
try {
date = sdf.parse(timeString);
} catch (Exception e) {
continue contentLoop;
//date = new Date();
}
Date checkDate=new Date(System.currentTimeMillis()-unixTime);
if(date.before(checkDate)) {
opexContentIterator.remove();
}
}
if(opexContent.isEmpty()) {
m_deletionRequestHandler.requestDeletionFor(this);
}
}
public String toString() {
final StringBuilder stringBuilder=new StringBuilder();
stringBuilder.append(SAPNumber);
stringBuilder.append(": ");
final int length=String.valueOf(SAPNumber).length();
//gen offset
final StringBuilder offsetBuilder=new StringBuilder();
for(int i=0; i<length+2; i++) {
offsetBuilder.append(' ');
}
final String offset=offsetBuilder.toString();
for (String[] strings : opexContent) {
for (String string : strings) {
stringBuilder.append(string+"; ");
}
stringBuilder.append("\n"+offset);
}
return stringBuilder.toString();
}
}
version without iterator:
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
public class OpexLine implements Runnable{
public static final int DELETE_CONTENT_TASK=0;
public final int SAPNumber;
public final ArrayList<String[]> opexContent;
private final DeletionRequestHandler m_deletionRequestHandler;
private String[] m_taskParameters=null;
private int m_taskid=-1;
public OpexLine(int SAPNumber, ArrayList<String[]> opexContent, DeletionRequestHandler deletionRequestHandler) {
this.SAPNumber=SAPNumber;
this.opexContent=opexContent;
m_deletionRequestHandler=deletionRequestHandler;
}
public void setupTask(int taskID, String[] parameters) {
m_taskid=taskID;
m_taskParameters=parameters;
}
public void run() {
if(m_taskid<0) {
System.out.println("aborted task because no taskID was given");
return;
}
if(m_taskParameters==null) {
System.out.println("no taskparameters were specified");
return;
}
switch (m_taskid) {
case DELETE_CONTENT_TASK:
deleteContentIfOlderThen();
break;
}
}
private void deleteContentIfOlderThen() {
final long unixTime=Long.parseLong(m_taskParameters[0]);
final int indexOfTime=Integer.parseInt(m_taskParameters[1]);
final SimpleDateFormat sdf = new SimpleDateFormat("dd.MM.yyyy");
contentLoop: for(final String[] content : opexContent) {
//final String[] content=opexContent.get(i);
final String timeString=content[indexOfTime];
if(timeString==null || timeString.isEmpty()) {
continue contentLoop;
}
Date date = null;
try {
date = sdf.parse(timeString);
} catch (Exception e) {
continue contentLoop;
//date = new Date();
}
Date checkDate=new Date(System.currentTimeMillis()-unixTime);
//System.out.println(date+" before "+checkDate+": "+date.before(checkDate));
if(date.before(checkDate)) {
System.out.println("removed");
opexContent.remove(content);
if(opexContent.isEmpty()) {
System.out.println("isEmptyAfterRemoval");
m_deletionRequestHandler.requestDeletionFor(this);
}
}
}
//System.out.println("isexecuted");
if(opexContent.isEmpty()) {
System.out.println("isEmpty");
m_deletionRequestHandler.requestDeletionFor(this);
}
}
public String toString() {
final StringBuilder stringBuilder=new StringBuilder();
stringBuilder.append(SAPNumber);
stringBuilder.append(": ");
final int length=String.valueOf(SAPNumber).length();
//gen offset
final StringBuilder offsetBuilder=new StringBuilder();
for(int i=0; i<length+2; i++) {
offsetBuilder.append(' ');
}
final String offset=offsetBuilder.toString();
for (String[] strings : opexContent) {
for (String string : strings) {
stringBuilder.append(string+"; ");
}
stringBuilder.append("\n"+offset);
}
return stringBuilder.toString();
}
}

You should never mutate a list while you are iterating it. This type of code:
final ArrayList<String[]> opexContent;
private void deleteContentIfOlderThen() {
for(final String[] content : opexContent) {
Date date = /* some date */;
Date checkDate=new Date(System.currentTimeMillis()-unixTime);
if(date.before(checkDate)) {
opexContent.remove(content); //CME!
}
}
}
Will likely cause a ConcurrentModificationException, as the list was mutated while an Iterator was in use (behind the scenes). Instead, you should use the Iterator directly, such that you have the ability to use Iterator#remove:
final ArrayList<String[]> opexContent;
private void deleteContentIfOlderThen() {
Iterator<String[]> itr = opexContent.iterator();
while (itr.hasNext()) {
String[] content = itr.next();
Date date = /* some date */;
Date checkDate=new Date(System.currentTimeMillis()-unixTime);
if(date.before(checkDate)) {
itr.remove(); //No more CME!
}
}
}
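The difference between the two loops can be reproduced in isolation. The following is a small sketch with a hypothetical class name and a toy list of strings instead of the question's String[] rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not taken from the question.
final class RemoveWhileIterating {
    // Removes an element through the list while a for-each loop is running.
    // Returns true if that triggers a ConcurrentModificationException.
    static boolean forEachRemovalThrows() {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String item : items) {
                if (item.equals("a")) {
                    items.remove(item); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removes the same element through the iterator, which is allowed.
    static List<String> removeWithIterator() {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));
        Iterator<String> it = items.iterator();
        while (it.hasNext()) {
            if (it.next().equals("a")) {
                it.remove();
            }
        }
        return items;
    }
}
```

If the task in the question was handed to the executor with submit(), the exception would be captured in the returned Future rather than printed, which may be why the code after the loop appeared never to run. That is an assumption about the asker's setup, not something the question states.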

Whilst Rogue's answer is correct in saying that you shouldn't remove elements from the list while iterating it, there's a neater way of removing the elements since Java 8 that avoids using an iterator explicitly:
// Doesn't have to be moved outside, but at least doing this means
// you're comparing against a fixed check date, rather than potentially
// changing as you iterate.
Date checkDate=new Date(System.currentTimeMillis()-unixTime);
opexContent.removeIf(content -> {
Date date = /* some date */;
return date.before(checkDate);
});
This is better because:
- it's clearer, because you're not having to deal with the iterator;
- it's safer, as it leaves the list untouched if an exception is thrown for some element (i.e. it's failure atomic);
- it's more efficient, because it avoids resizing the list repeatedly.
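Filling in the /* some date */ placeholder with the parsing logic from the question gives roughly the following. It is a sketch under assumptions: the class and method names are hypothetical, the date format "dd.MM.yyyy" is taken from the question, and rows with a missing or unparsable date are kept, as in the question's code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

// Hypothetical helper modelled on the question's deleteContentIfOlderThen().
final class RemoveOldRows {
    // Removes every row whose date column lies before the cutoff.
    // Returns true if the list is empty afterwards.
    static boolean removeOlderThan(List<String[]> rows, int dateIndex, Date cutoff) {
        SimpleDateFormat format = new SimpleDateFormat("dd.MM.yyyy");
        rows.removeIf(row -> {
            String timeString = row[dateIndex];
            if (timeString == null || timeString.isEmpty()) {
                return false; // keep rows without a date
            }
            try {
                return format.parse(timeString).before(cutoff);
            } catch (ParseException e) {
                return false; // keep rows whose date cannot be parsed
            }
        });
        // Safe to check here: removeIf has finished before isEmpty() is called.
        return rows.isEmpty();
    }
}
```

The caller can then request deletion when the method returns true, which corresponds to the isEmpty() check after the loop in the question.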

Compilation message: unchecked method invocation: <T>sort(java.util.List<T>)

I am writing code that creates an appointment book, so I have several different classes. I can't seem to get rid of this warning, which says I have an unchecked method invocation. This is what I get:
java:25: warning: [unchecked] unchecked method invocation: <T>sort(java.util.List<T>) in java.util.Collections is applied to (java.util.ArrayList<Appointment>)
My code
import java.io.*;
import java.util.*;
import java.util.ArrayList;
/*is a collection of Appointment objects. As such, the class must include
* a data structure to store an arbitrary number of Appointments*/
public class ApptBook implements Iterable {
private ArrayList<Appointment> list;
private Date startRange,endRange;
public ApptBook(Date _startRange, Date _endRange)
{
ArrayList ls =new ArrayList();
endRange=_endRange;
startRange=_startRange;
}
public void printAppointments(Date start, Date end)
{
startRange=start;
endRange=end;
Collections.sort(list);
System.out.println("Result list:");
for(Appointment counter: list){
System.out.println(counter.toString());
}
}
public void saveToFile() throws FileNotFoundException, IOException
{
OutputStream f = new FileOutputStream("apptbook.dat");
OutputStreamWriter writer = new OutputStreamWriter(f);
BufferedWriter out = new BufferedWriter(writer);
int i;
for(i=0;i<list.size();i++)
{
out.write("##\n");
out.write(list.get(i).forFile());
out.write("#\n");
}
out.close();
}
public void LoadFromFile() throws FileNotFoundException, IOException
{
InputStream f = new FileInputStream("apptbook.dat");
InputStreamReader reader = new InputStreamReader(f);
BufferedReader in = new BufferedReader(reader);
String str;
while ((str = in.readLine()) != null) {
if(str.equals("##"))//start read new object
{
//read date:
str = in.readLine();
String []param=str.split(" ");
//public Date(int _month, int _day, int _year)
// <year> <month> <day>
Date start=new Date(Integer.parseInt(param[1]),Integer.parseInt(param[2]),Integer.parseInt(param[0]));
//read time:
str = in.readLine();
String []paramTime=str.split(" ");
Time time=new Time(Integer.parseInt(paramTime[0]),Integer.parseInt(paramTime[1]));
int duration;
str=in.readLine();
duration=Integer.parseInt(str);
str=in.readLine();
Appointment newApp=new Appointment(start, time, duration, str);
addAppt(newApp);
str=in.readLine();//read #
}
}
in.close();
}
//should add a to this ApptBook, provided that a does not overlap
//with an Appointment that is already stored.
public boolean addAppt(Appointment a)
{
//check for overlap:
int cursor;
boolean isOverlap=false;
for(cursor = 0;cursor<list.size();cursor++)
if(a.overlaps(list.get(cursor)))
{
isOverlap=true;
break;
}
if(!isOverlap)
{
list.add(a);
}
return isOverlap;
}
public boolean removeAppt(Date d, Time t)
{
throw new UnsupportedOperationException("removal not implemented");
}
//@Override
public Iterator iterator() {
return new ApptBookIterator(list,startRange,endRange);
}
// Inner class example
private class ApptBookIterator implements
Iterator {
/*
ApptBookIterator
*/
private int cursor;
private Date startRange;
private Date endRange;
ArrayList<Appointment> list;
public ApptBookIterator(ArrayList<Appointment> _list,Date _startRange,Date _EndRange) {
list=_list;
startRange=_startRange;
endRange=_EndRange;
//find first in range:
boolean isFind=false;
for(cursor = 0;cursor<list.size();cursor++)
{
Appointment temp=list.get(cursor);
if(temp.isInDateRange(startRange,endRange))
{
isFind=true;
break;
}
}
if(!isFind)
cursor=-1;
}
public boolean hasNext() {
if(cursor==-1)
return false;
boolean isFind=false;
for(int i=cursor;i<list.size();i++)
{
Appointment temp=list.get(i);
if(temp.isInDateRange(startRange,endRange))
{
isFind=true;
break;
}
}
if(!isFind)
return false;
return true;
}
public Integer next() {
if(this.hasNext()) {
for(;cursor<list.size();cursor++)
{
Appointment temp=list.get(cursor);
if(temp.isInDateRange(startRange,endRange))
break;
}
}
throw new NoSuchElementException();
}
@Override
public void remove() {
throw new UnsupportedOperationException("Not supported yet.");
}
}
}

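Collections.sort(list) doesn't know how to compare the elements of that list. The warning most likely means that Appointment implements the raw Comparable type instead of Comparable<Appointment>. Unless you fix that declaration, you will have to use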
Collections.sort(list, new Comparator<Appointment>() {
public int compare(Appointment app1, Appointment app2) {
// Compare your items here
}
});
and specify how to compare two items.
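Both options can be sketched as follows. The Appointment class here is a hypothetical stand-in with made-up fields (startMinute, label), since the question does not show the real one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortAppointments {
    // Hypothetical stand-in for the question's Appointment class.
    // Implementing Comparable<Appointment> (not raw Comparable) avoids the unchecked warning.
    static final class Appointment implements Comparable<Appointment> {
        final int startMinute;
        final String label;

        Appointment(int startMinute, String label) {
            this.startMinute = startMinute;
            this.label = label;
        }

        @Override
        public int compareTo(Appointment other) {
            return Integer.compare(startMinute, other.startMinute);
        }
    }

    // Option 1: rely on the natural ordering defined by compareTo.
    static List<String> labelsByStart(List<Appointment> appointments) {
        List<Appointment> sorted = new ArrayList<>(appointments);
        Collections.sort(sorted);
        return labels(sorted);
    }

    // Option 2: pass an explicit Comparator, here ordering by label.
    static List<String> labelsByName(List<Appointment> appointments) {
        List<Appointment> sorted = new ArrayList<>(appointments);
        Collections.sort(sorted, Comparator.comparing((Appointment a) -> a.label));
        return labels(sorted);
    }

    private static List<String> labels(List<Appointment> appointments) {
        List<String> result = new ArrayList<>();
        for (Appointment a : appointments) {
            result.add(a.label);
        }
        return result;
    }
}
```

The natural ordering is convenient when there is one obvious sort key; the Comparator form leaves Appointment untouched and allows several orderings.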

How to get a specific value from HTML in Java?

I am developing an application that shows the gold rate and draws a graph of it.
I found a website that publishes the gold rate regularly. My question is how to extract this specific value from the HTML page.
The page I need to extract from is http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/ and it contains the following tag and content:
<p><em>10 gram gold Rate in pune = Rs.31150.00</em></p>
Here is the code I am using for extraction, but I haven't found a way to extract the specific content.
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
@Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
@Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/";
//Here i need to extract this content by tag wise or content wise....
Thanks in advance.

You can use a library such as Jsoup; see its download page and its API reference.
Parsing HTML content with Jsoup is really very easy.
Below is some sample code that might be helpful to you:
public class GetPTags {
public static void main(String[] args){
Document doc = Jsoup.parse(readURL("http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/"));
Elements p_tags = doc.select("p");
for(Element p : p_tags)
{
System.out.println("P tag is "+p.text());
}
}
public static String readURL(String url) {
String fileContents = "";
String currentLine;
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
// Stop as soon as readLine() returns null, so the text "null" is never appended.
while ((currentLine = reader.readLine()) != null) {
fileContents += currentLine + "\n";
}
reader.close();
} catch (Exception e) {
JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.OK_OPTION);
e.printStackTrace();
}
return fileContents;
}
}
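Once the paragraph text is available (for example from p.text() above), the number itself still has to be pulled out of it. One way is a regular expression; the following is a sketch that assumes the text keeps the "Rs.<number>" shape shown in the question, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the amount from text such as
// "10 gram gold Rate in pune = Rs.31150.00".
final class GoldRateText {
    // "Rs." followed by optional whitespace and a decimal number.
    private static final Pattern RATE = Pattern.compile("Rs\\.\\s*(\\d+(?:\\.\\d+)?)");

    static Optional<BigDecimal> parseRate(String text) {
        Matcher matcher = RATE.matcher(text);
        if (matcher.find()) {
            return Optional.of(new BigDecimal(matcher.group(1)));
        }
        return Optional.empty(); // no rate found in this text
    }
}
```

Returning an Optional makes the "paragraph does not contain a rate" case explicit, which is useful when looping over every p tag on the page.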

http://java-source.net/open-source/crawlers
You can use any of those APIs, but don't parse the HTML with the plain JDK, because it's too painful.

How do I trace a ChangedCharSetException in Java when parsing HTML?

I'm using the following code with the javax.swing.text.html.parser.ParserDelegator in order to parse hyperlinks from a website.
InputStream inputStream;
InputStreamReader inputStreamReader;
inputStream = rsc.getUrl().openStream();
inputStreamReader = new InputStreamReader(inputStream);
ParserDelegator parserDelegator = new ParserDelegator();
ParserCallback parserCallback = new ParserCallback() {
public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
if (tag == Tag.A) {
String address = (String) attribute.getAttribute(Attribute.HREF);
if ((address != null) && !address.equalsIgnoreCase("null"))
links.add(address);
}
}
public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
public void handleEndTag(Tag t, final int pos) { }
public void handleComment(final char[] data, final int pos) { }
public void handleText(final char[] data, final int pos) { }
public void handleError(final java.lang.String errMsg, final int pos) { }
};
parserDelegator.parse(inputStreamReader, parserCallback, false);
This worked fine for most sites, but for example, when I'm trying to open http://www.univie.ac.at, I receive the following exception:
javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:172)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:413)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1943)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:2061)
at javax.swing.text.html.parser.Parser.parse(Parser.java:2228)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:105)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:84)
How would I go about catching this exception, but still keep parsing my remote document (e.g. my InputStream)?

The easiest way I found was just to ignore the charset completely:
Change
parserDelegator.parse(inputStreamReader, parserCallback, false);
to:
parserDelegator.parse(inputStreamReader, parserCallback, true);
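The third parameter is the boolean ignoreCharSet; when it is true, the parser ignores charset directives in the document instead of throwing ChangedCharSetException.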
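A self-contained version of the link collector with ignoreCharSet set to true might look like this. It is a sketch: the class name LinkCollector and the sample HTML are made up, and a StringReader stands in for the remote stream.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values while ignoring charset directives.
final class LinkCollector {
    static List<String> links(String html) {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignoreCharSet: a <meta> charset declaration no longer aborts parsing
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body><a href=\"http://example.com/a\">A</a> <a href=\"/b\">B</a></body></html>";
        System.out.println(links(html));
    }
}
```

Ignoring the directive means the text is decoded with whatever charset the Reader was created with, so for non-ASCII pages it is worth constructing the InputStreamReader with an explicit charset.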

Convert HTML to plain text in Java

I need to convert HTML to plain text. My only requirement of formatting is to retain new lines in the plain text. New lines should be displayed not only in the case of <br> but other tags, e.g. <tr/>, </p> leads to a new line too.
Sample HTML pages for testing are:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried out various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}

Have your parser append text content and newlines to a StringBuilder.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
@Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
@Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
@Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

I would guess you could use the ParserCallback.
You would need to add code to support the tags that require special handling. The handleStartTag, handleEndTag and handleSimpleTag callbacks should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
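As a rough sketch of that idea, the callback below appends a newline on <br> and on the end tags of p, div and tr. The class name HtmlToText and the chosen set of tags are assumptions; extend the set to whatever tags should break lines in your documents.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical converter: text content plus newlines for selected tags.
final class HtmlToText extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        out.append(data);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            newline();
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P || tag == HTML.Tag.DIV || tag == HTML.Tag.TR) {
            newline();
        }
    }

    // Appends a newline unless the output is empty or already ends with one.
    private void newline() {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }

    static String convert(String html) {
        HtmlToText callback = new HtmlToText();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }
}
```

Calling HtmlToText.convert(html) returns the collected text; the newline() guard keeps consecutive block tags from producing runs of blank lines.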

Building on your example, with a hint from the "html to plain text?" message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.
Hope it is helpful.

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.

JSoup is not compatible with FreeMarker (or any other custom/non-HTML tags). Consider the following the purest solution for converting HTML to plain text:
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
