How do I trace a ChangedCharSetException in Java when parsing HTML?

I'm using the following code with the javax.swing.text.html.parser.ParserDelegator in order to parse hyperlinks from a website.
InputStream inputStream;
InputStreamReader inputStreamReader;
inputStream = rsc.getUrl().openStream();
inputStreamReader = new InputStreamReader(inputStream);
ParserDelegator parserDelegator = new ParserDelegator();
ParserCallback parserCallback = new ParserCallback() {
public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
if (tag == Tag.A) {
String address = (String) attribute.getAttribute(Attribute.HREF);
if ((address != null) && !address.equalsIgnoreCase("null"))
links.add(address);
}
}
public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
public void handleEndTag(Tag t, final int pos) { }
public void handleComment(final char[] data, final int pos) { }
public void handleText(final char[] data, final int pos) { }
public void handleError(final java.lang.String errMsg, final int pos) { }
};
parserDelegator.parse(inputStreamReader, parserCallback, false);
This worked fine for most sites, but for example, when I'm trying to open http://www.univie.ac.at, I receive the following exception:
javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:172)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:413)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1943)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:2061)
at javax.swing.text.html.parser.Parser.parse(Parser.java:2228)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:105)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:84)
How can I catch this exception and still keep parsing my remote document (i.e. my InputStream)?

The easiest way I found was just to ignore the charset completely:
Change
parserDelegator.parse(inputStreamReader, parserCallback, false);
to:
parserDelegator.parse(inputStreamReader, parserCallback, true);
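This works because the third parameter is boolean ignoreCharSet: when it is true, the parser ignores charset declarations in the document instead of throwing ChangedCharSetException.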
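If you would rather honor the declared charset than ignore it, one option is to catch the exception and parse a second time with the charset it reports. The following is only a sketch under assumptions: the class name CharsetRetry and its helper methods are hypothetical, the page bytes are assumed to be already downloaded into a byte[], and ISO-8859-1 is an arbitrary fallback. getCharSetSpec() and keyEqualsCharSet() are the accessors on javax.swing.text.ChangedCharSetException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CharsetRetry {
    // Hypothetical helper: pulls the charset name out of the exception's spec.
    // The spec is either the bare charset name (keyEqualsCharSet == true) or a
    // Content-Type value such as "text/html; charset=utf-8".
    static String charsetFromSpec(String spec, boolean keyEqualsCharSet) {
        if (keyEqualsCharSet) {
            return spec.trim();
        }
        for (String part : spec.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return null; // no charset parameter present
    }

    // Falls back when the name is null, malformed, or unsupported on this JVM.
    static Charset resolve(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    // First pass respects charset directives; on ChangedCharSetException the
    // bytes are decoded again with the reported charset and directives are ignored.
    static void parse(byte[] html, HTMLEditorKit.ParserCallback callback) throws IOException {
        Charset fallback = StandardCharsets.ISO_8859_1; // assumption, pick your own default
        try {
            Reader first = new InputStreamReader(new ByteArrayInputStream(html), fallback);
            new ParserDelegator().parse(first, callback, false);
        } catch (ChangedCharSetException e) {
            String name = charsetFromSpec(e.getCharSetSpec(), e.keyEqualsCharSet());
            Reader second = new InputStreamReader(
                    new ByteArrayInputStream(html), resolve(name, fallback));
            new ParserDelegator().parse(second, callback, true);
        }
    }
}
```

Note that the callback may already have received events for the markup that precedes the meta tag during the first pass, so reset any collected state (for example the links list) before the second pass.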

Related

Java I/O FileStream issue

I have an input file stream method that will load a file, but I can't figure out how to then use the file's contents in another method. The file has one UTF string and two integers. How can I use each of these ints or the string in a main method? Let's say I want to print the three variables to the console; how would I go about doing that? Here are a few things I've tried with the method:
public static dataStreams() throws IOException {
int i = 0;
char c;
try (DataInputStream input = new DataInputStream(
new FileInputStream("input.dat"));
) {
while((i=input.read())!=-1){
// converts integer to character
c=(char)i;
}
return c;
return i;
/*
String stringUTF = input.readUTF();
int firstInt = input.readInt();
int secondInt = input.readInt();
*/
}
}

Maybe one container for those properties, like this:
public static void main(String [] args) {
DataContainer dContainer = null;
try {
dContainer = dataStreams();
} catch (IOException e) {
e.printStackTrace();
}
//do some logging with properties
System.out.println(dContainer.getFirst());
System.out.println(dContainer.getSecond());
System.out.println(dContainer.getUtf());
}
public static DataContainer dataStreams() throws IOException {
try (DataInputStream input = new DataInputStream(
new FileInputStream("input.dat"))) {
// Read the fields in the order they were written. Do not drain the stream
// with read() first, or readUTF() will fail with an EOFException.
String stringUTF = input.readUTF();
int firstInt = input.readInt();
int secondInt = input.readInt();
return new DataContainer(stringUTF, firstInt, secondInt);
}
}
static class DataContainer {
String utf;
int first;
int second;
DataContainer(String utf, int first, int second) {
this.utf = utf;
this.first = first;
this.second = second;
}
public String getUtf() {
return utf;
}
public int getFirst() {
return first;
}
public int getSecond() {
return second;
}
}
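For readUTF() and readInt() to return sensible values, the file has to have been written with the matching DataOutputStream calls in the same order. Here is a small round-trip sketch that uses in-memory streams instead of input.dat; the class name DataRoundTrip and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataRoundTrip {
    // Writes one UTF string followed by two ints, mirroring the file layout
    // described in the question.
    static byte[] write(String text, int first, int second) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
            out.writeInt(first);
            out.writeInt(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the fields back in exactly the order they were written.
    static Object[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String text = in.readUTF();
            int first = in.readInt();
            int second = in.readInt();
            return new Object[] { text, first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] fields = read(write("hello", 7, 42));
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]); // prints "hello 7 42"
    }
}
```

Swapping the order of the read calls, or consuming bytes with read() beforehand, would misinterpret the data or run into end-of-file.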

How to get a specific value from HTML in Java?

I am developing an application that shows the gold rate and plots it on a graph.
I found a website that publishes this gold rate regularly. My question is how to extract this specific value from the HTML page.
Here is the page I need to extract it from: http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/ The HTML page contains the following tag and content:
<p><em>10 gram gold Rate in pune = Rs.31150.00</em></p>
Here is the code I use for extracting, but I haven't found a way to extract this specific content.
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
@Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
@Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/";
//Here i need to extract this content by tag wise or content wise....
Thanks in Advance.......

You can use a library such as Jsoup.
You can get it from here --> Download Jsoup
Here is its API reference --> Jsoup API Reference
It's really easy to parse HTML content using Jsoup.
Below is some sample code which might be helpful to you:
public class GetPTags {
public static void main(String[] args){
Document doc = Jsoup.parse(readURL("http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/"));
Elements p_tags = doc.select("p");
for(Element p : p_tags)
{
System.out.println("P tag is "+p.text());
}
}
public static String readURL(String url) {
StringBuilder fileContents = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
String currentLine;
// Stop at end of stream so the literal "null" is never appended.
while ((currentLine = reader.readLine()) != null) {
fileContents.append(currentLine).append("\n");
}
} catch (Exception e) {
JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.ERROR_MESSAGE);
e.printStackTrace();
}
return fileContents.toString();
}
}
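Once the text of the p tag is available, the numeric rate still has to be pulled out of the sentence. A minimal sketch using a regular expression follows; the class name GoldRateExtractor is hypothetical, and the pattern assumes the amount always follows "Rs." as in the sample line from the question.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GoldRateExtractor {
    // Matches "Rs." followed by a decimal amount, e.g. "Rs.31150.00".
    private static final Pattern RATE = Pattern.compile("Rs\\.\\s*(\\d+(?:\\.\\d+)?)");

    // Returns the first amount found in the text, or empty if there is none.
    static Optional<BigDecimal> extractRate(String text) {
        Matcher m = RATE.matcher(text);
        return m.find() ? Optional.of(new BigDecimal(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        String pText = "10 gram gold Rate in pune = Rs.31150.00";
        System.out.println(extractRate(pText).orElse(null)); // prints 31150.00
    }
}
```

BigDecimal is used rather than double so the value keeps its two decimal places exactly, which is convenient when plotting or storing prices.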

http://java-source.net/open-source/crawlers
You can use any of those APIs, but don't parse the HTML with the pure JDK, because it's too painful.

Java: accessing a List of Strings as an InputStream

Is there any way to get an InputStream that wraps a list of UTF-8 Strings? I'd like to do something like:
InputStream in = new XyzInputStream( List<String> lines )

You can write the strings to a ByteArrayOutputStream to create your source byte[] array, and then read from that array using a ByteArrayInputStream.
So create the array as follows:
List<String> source = new ArrayList<String>();
source.add("one");
source.add("two");
source.add("three");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
for (String line : source) {
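// Explicit charset (java.nio.charset.StandardCharsets); a bare getBytes() uses the platform default.
baos.write(line.getBytes(StandardCharsets.UTF_8));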
}
byte[] bytes = baos.toByteArray();
And reading from it is as simple as:
InputStream in = new ByteArrayInputStream(bytes);
Alternatively, depending on what you're trying to do, a StringReader might be better.

You can concatenate all the lines together to create a String then convert it to a byte array using String#getBytes and pass it into ByteArrayInputStream. However this is not the most efficient way of doing it.

In short, no, there is no way of doing this using existing JDK classes. You could, however, implement your own InputStream that reads from a List of Strings.
EDIT: Dave Web has an answer above, which I think is the way to go. If you need a reusable class, then something like this might do:
public class StringsInputStream<T extends Iterable<String>> extends InputStream {
private ByteArrayInputStream bais = null;
public StringsInputStream(final T strings) throws IOException {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
for (String line : strings) {
outputStream.write(line.getBytes());
}
bais = new ByteArrayInputStream(outputStream.toByteArray());
}
@Override
public int read() throws IOException {
return bais.read();
}
@Override
public int read(byte[] b) throws IOException {
return bais.read(b);
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
return bais.read(b, off, len);
}
@Override
public long skip(long n) throws IOException {
return bais.skip(n);
}
@Override
public int available() throws IOException {
return bais.available();
}
@Override
public void close() throws IOException {
bais.close();
}
@Override
public synchronized void mark(int readlimit) {
bais.mark(readlimit);
}
@Override
public synchronized void reset() throws IOException {
bais.reset();
}
@Override
public boolean markSupported() {
return bais.markSupported();
}
public static void main(String[] args) throws Exception {
List<String> source = new ArrayList<String>();
source.add("foo ");
source.add("bar ");
source.add("baz");
StringsInputStream<List<String>> in = new StringsInputStream<List<String>>(source);
int read = in.read();
while (read != -1) {
System.out.print((char) read);
read = in.read();
}
}
}
This is basically an adapter for ByteArrayInputStream.

You can create some kind of IterableInputStream
public class IterableInputStream<T> extends InputStream {
public static final int EOF = -1;
private static final InputStream EOF_IS = new InputStream() {
@Override public int read() throws IOException {
return EOF;
}
};
private final Iterator<T> iterator;
private final Function<T, byte[]> mapper;
private InputStream current;
public IterableInputStream(Iterable<T> iterable, Function<T, byte[]> mapper) {
this.iterator = iterable.iterator();
this.mapper = mapper;
next();
}
@Override
public int read() throws IOException {
int n = current.read();
while (n == EOF && current != EOF_IS) {
next();
n = current.read();
}
return n;
}
private void next() {
current = iterator.hasNext()
? new ByteArrayInputStream(mapper.apply(iterator.next()))
: EOF_IS;
}
}
To use it
public static void main(String[] args) throws IOException {
Iterable<String> strings = Arrays.asList("1", "22", "333", "4444");
try (InputStream is = new IterableInputStream<String>(strings, String::getBytes)) {
for (int b = is.read(); b != -1; b = is.read()) {
System.out.print((char) b);
}
}
}

In my case I had to convert a list of strings into the equivalent file content (with a line feed after each line).
This was my solution:
List<String> inputList = Arrays.asList("line1", "line2", "line3");
byte[] bytes = inputList.stream().collect(Collectors.joining("\n", "", "\n")).getBytes();
InputStream inputStream = new ByteArrayInputStream(bytes);

You can do something similar to this:
https://commons.apache.org/sandbox/flatfile/xref/org/apache/commons/flatfile/util/ConcatenatedInputStream.html
It just implements the read() method of InputStream and has a list of InputStreams it is concatenating. Once it reads an EOF it starts reading from the next InputStream. Just convert the Strings to ByteArrayInputStreams.
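The JDK's own java.io.SequenceInputStream behaves much like what is described here: it reads each underlying stream to its end and then moves on to the next. A sketch, assuming UTF-8 encoding and no separator between the strings; the class LinesAsStream and its methods are illustrative names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class LinesAsStream {
    // Wraps each string in its own ByteArrayInputStream and chains them together.
    static InputStream open(List<String> lines) {
        List<InputStream> parts = new ArrayList<>();
        for (String line : lines) {
            parts.add(new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8)));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Reads the whole stream back into a String, for demonstration only.
    static String drain(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (int b = in.read(); b != -1; b = in.read()) {
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(drain(open(Arrays.asList("one", "two", "three")))); // prints onetwothree
    }
}
```

Each string is still converted to bytes up front here; a lazier variant would create the ByteArrayInputStream inside a custom Enumeration.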

You can also do it this way: create a Serializable List and write it with an ObjectOutputStream.
List<String> quarks = Arrays.asList(
"up", "down", "strange", "charm", "top", "bottom"
);
//serialize the List
//note the use of abstract base class references
try{
//use buffering
OutputStream file = new FileOutputStream( "quarks.ser" );
OutputStream buffer = new BufferedOutputStream( file );
ObjectOutput output = new ObjectOutputStream( buffer );
try{
output.writeObject(quarks);
}
finally{
output.close();
}
}
catch(IOException ex){
fLogger.log(Level.SEVERE, "Cannot perform output.", ex);
}
//deserialize the quarks.ser file
//note the use of abstract base class references
try{
//use buffering
InputStream file = new FileInputStream( "quarks.ser" );
InputStream buffer = new BufferedInputStream( file );
ObjectInput input = new ObjectInputStream ( buffer );
try{
//deserialize the List
List<String> recoveredQuarks = (List<String>)input.readObject();
//display its data
for(String quark: recoveredQuarks){
System.out.println("Recovered Quark: " + quark);
}
}
finally{
input.close();
}
}
catch(ClassNotFoundException ex){
fLogger.log(Level.SEVERE, "Cannot perform input. Class not found.", ex);
}
catch(IOException ex){
fLogger.log(Level.SEVERE, "Cannot perform input.", ex);
}

I'd like to propose my simple solution:
public class StringListInputStream extends InputStream {
private final List<String> strings;
private int pos = 0;
private byte[] bytes = null;
private int i = 0;
public StringListInputStream(List<String> strings) {
this.strings = strings;
// Guard against an empty list, where get(0) would throw.
this.bytes = strings.isEmpty() ? new byte[0] : strings.get(0).getBytes();
}
@Override
public int read() throws IOException {
if (pos >= bytes.length) {
if (!next()) return -1;
else return read();
}
// Mask to 0..255: read() must not return a negative value for a data byte.
return bytes[pos++] & 0xFF;
}
private boolean next() {
if (i + 1 >= strings.size()) return false;
pos = 0;
bytes = strings.get(++i).getBytes();
return true;
}
}

Convert HTML to plain text in Java

I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be produced not only for <br> but also for other tags, e.g. <tr/> and </p> should lead to a new line too.
Sample HTML pages for testing are:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried out various libraries (JSoup, Javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}

Have your parser append text content and newlines to a StringBuilder.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
@Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
@Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
@Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

I would guess you could use the ParserCallback.
You would need to add code to support the tags that require special handling. There are:
handleStartTag
handleEndTag
handleSimpleTag
callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
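As a rough sketch of that idea: the callback below appends a newline when a paragraph, table row, or div closes and when a br simple tag is seen. The class name HtmlLines, the breaksLine helper, and the particular set of tags are assumptions chosen for illustration; the exact output also depends on how the Swing parser reports implied tags and whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlLines {
    // Hypothetical helper: tags whose closing should end the current line.
    private static boolean breaksLine(HTML.Tag t) {
        return t == HTML.Tag.P || t == HTML.Tag.TR || t == HTML.Tag.DIV;
    }

    static String toText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (breaksLine(t)) {
                    sb.append('\n');
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // <br> has no end tag, so it arrives as a simple tag.
                if (t == HTML.Tag.BR) {
                    sb.append('\n');
                }
            }
        };
        try {
            // true: ignore charset directives, since the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toText("<p>one</p><p>two</p>"));
    }
}
```

Handling the end tag rather than the start tag means the newline lands after the block's text, which matches the requirement that </p> should lead to a new line.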

Building on your example, with a hint from html to plain text? message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.
Hope it is helpful.

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.

JSoup is not compatible with FreeMarker (or any other custom/non-HTML tags). Consider Jericho as the cleanest solution for converting HTML to plain text.
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

Problem querying an HTML file using HTMLEditorKit in Java

My HTML contains tags of the following form:
<div class="author">Apple - October 22, 2009 - 01:07</div>
I'd like to extract the date, "October 22, 2009 - 01:07" in this example, from each tag.
I've implemented javax.swing.text.html.HTMLEditorKit.ParserCallback as follows:
class HTMLParseListerInner extends HTMLEditorKit.ParserCallback {
private ArrayList<String> foundDates = new ArrayList<String>();
private boolean isDivLink = false;
public void handleText(char[] data, int pos) {
if(isDivLink)
foundDates.add(new String(data)); // Extracts "Apple" instead of the date.
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if (t.toString() == "div" && divValue != null && divValue.equals("author"))
isDivLink = true;
}
}
However, the above parser returns "Apple" which is inside a hyperlink within the tag. How can I fix the parser to extract the date?

Override handleEndTag and check for "a"?
However, this HTML parser dates from the 1990s and these methods are not well specified.

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class ParserCallbackDiv extends HTMLEditorKit.ParserCallback
{
private boolean isDivLink = false;
private String divText;
public void handleEndTag(HTML.Tag tag, int pos)
{
if (tag.equals(HTML.Tag.DIV) && isDivLink)
{
// Only report the text of a div that matched class="author".
System.out.println( divText );
isDivLink = false;
}
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.DIV))
{
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if ("author".equals(divValue))
isDivLink = true;
}
}
public void handleText(char[] data, int pos)
{
divText = new String(data);
}
public static void main(String[] args)
throws IOException
{
String file = "<div class=\"author\"><a href=\"/user/1\" " +
"title=\"View user profile.\">Apple</a> - October 22, 2009 - 01:07</div>";
StringReader reader = new StringReader(file);
ParserCallbackDiv parser = new ParserCallbackDiv();
try
{
new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
System.out.println(e);
}
}
}
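The text that follows the link still carries the leading " - " separator. If an actual date value is needed, a sketch along these lines could strip the separator and parse the remainder with java.time. The class name AuthorDate is hypothetical, and the pattern is an assumption based on the single sample "October 22, 2009 - 01:07".

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class AuthorDate {
    // Pattern inferred from the sample: full English month name, day, year,
    // a literal " - ", then 24-hour time.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MMMM d, yyyy - HH:mm", Locale.ENGLISH);

    // Accepts the text that follows the link, e.g. " - October 22, 2009 - 01:07".
    static LocalDateTime parse(String trailingText) {
        // Drop the leading separator and surrounding whitespace before parsing.
        String cleaned = trailingText.replaceFirst("^\\s*-\\s*", "").trim();
        return LocalDateTime.parse(cleaned, FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse(" - October 22, 2009 - 01:07")); // prints 2009-10-22T01:07
    }
}
```

LocalDateTime.parse throws a DateTimeParseException for text that does not match, so callers may want to catch it when pages vary in format.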
