Convert HTML to plain text in Java


I need to convert HTML to plain text. My only formatting requirement is to retain line breaks in the plain text. A new line should appear not only for <br> but also for other tags, e.g. <tr/> and </p>.
Sample HTML pages for testing are:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}

Have your parser append text content and newlines to a StringBuilder.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
@Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
@Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
@Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

I would guess you could use the ParserCallback.
You would need to add code to support the tags that require special handling. There are:
handleStartTag
handleEndTag
handleSimpleTag
callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.

Building on your example, with a hint from html to plain text? message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.
Hope it is helpful.

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.

JSoup is not compatible with FreeMarker (or any other custom/non-HTML tag). Consider this the purest solution for converting HTML to plain text.
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

Related

Skipping even-numbered lines when using readLine() in a while loop

I am new to Java and I am having a little trouble:
I am trying to read chemical sample data to plot it on an X-Y graph.
The input file looks like this:
La 0.85678
Ce 0.473
Pr 62.839
...
...
My code stores only the values from the odd-numbered lines (0.85678, then it skips a line, then 62.839 in the example), and I cannot figure out what the problem is:
public class Procces {
public void readREE() throws IOException {
try{
rEE = new BufferedReader (new FileReader ("src/files/test.txt"));
while ( (currentLine = rEE.readLine() ) != null) {
try {
for (int size = 3;size<10;size++) {
String valueDec=(currentLine.substring(3,size));
//char letra =(char)c;
if ((c=rEE.read()) != -1) {
System.out.println("Max size");
} else
valueD = Double.parseDouble(valueDec);
System.out.println(valueDec);
}
}
catch (Exception excUncertainDecimals) {
}
}
}finally {
try { rEE.close();
} catch (Exception exc) {
}
}
}
String line;
int c = 0;
int counter = 0;
String valueS = null;
String valueSimb = null;
Double valueD = null;
Double logValue = null;
Double YFin=450.0;
String currentLine;
BufferedReader rEE;
}
Thank you in advance, as I can't see why the program skips the even-numbered lines.

Use the Java Scanner class.
import java.io.*;
import java.util.Scanner;
public class MyClass {
public static void main(String[] args) throws IOException {
try (Scanner s = new Scanner(new BufferedReader(new FileReader("file.txt")))) {
while (s.hasNext()) {
System.out.println(s.next());
}
}
}
}

Please have a look at Scanner.
In general, Java is a well-established language, and in most cases you do not have to re-implement common tasks (e.g. reading custom text files) at a low level.
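The skipped lines in the question most likely come from the rEE.read() call inside the for loop: each call consumes one more character from the same reader, so part of the following line is already gone before the next readLine(). Below is a minimal sketch of the Scanner approach, assuming every entry is an element symbol followed by a decimal number, separated by whitespace. The class and method names (ElementValueReader, read) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

final class ElementValueReader {
    // Reads whitespace-separated "symbol value" pairs, keeping the input order.
    static Map<String, Double> read(Readable source) {
        Map<String, Double> values = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                String symbol = scanner.next();
                if (!scanner.hasNext()) {
                    break; // a trailing symbol without a value is ignored
                }
                // Double.parseDouble always expects '.' as the decimal separator,
                // whereas Scanner.nextDouble() depends on the scanner's locale.
                values.put(symbol, Double.parseDouble(scanner.next()));
            }
        }
        return values;
    }
}
```

Because only the Scanner touches the input, no token is consumed twice or dropped. A non-numeric value would raise a NumberFormatException here instead of being silently swallowed.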

I got it. Thank you.
Here is the code:
import java.io.*;
import java.util.Scanner;
public class Process implements Samples{
public void readREE() throws IOException {
try
(Scanner rEE = new Scanner(new BufferedReader(new FileReader("src/files/test.txt")))){
while (rEE.hasNext()) {
element = rEE.next();
if (element.equals("La")) {
String elementValue = rEE.next();
Double value = Double.parseDouble(elementValue);
Double valueChond = 0.237;
Double valueNorm= value/valueChond;
Double logValue = (Math.log(valueNorm)/Math.log(10));
Double yLog = yOrd - logValue*133.33333333;
Sample NormedSampleLa=new Sample("La",yLog);
sampleREE.add(NormedSampleLa);
}
}
} finally {
}
}
public String LaS, CeS, PrS, NdS, PmS, SmS, EuS, GdS, TbS, DyS, HoS, ErS, TmS, YbS, LuS;
public String element, elementValue;
public Double yOrd=450.0;
}

Reading a file and splitting its content at a delimiter

The content of my file is the following:
nellkb:company_dc
rdfs:label "dC" "WASHINGTON" , "Washington" ;
skos:prefLabel "www.wikipedia.com" .
nellkb:politicsblog_quami_ekta
rdfs:label "Quami Ekta" ;
skos:prefLabel "Quami Ekta" .
nellkb:female_ramendra_kumar
rdfs:label "Ramendra Kumar" ;
skos:prefLabel "Ramendra Kumar" .
I need to split my file at the delimiter '.' and save what comes before it in a string. How can I do that? I tried the following, but it does not work:
try {
String sCurrentLine = null;
int i = 0;
br = new BufferedReader(new FileReader(rdfInstanceFile));
while ((sCurrentLine = br.readLine()) != null) {
splitted = sCurrentLine.split(".");
}
} catch (IOException e) {
e.printStackTrace();
}

USE the Scanner class. This scenario is perfect for it. All you need to do is specify the '\\.' delimiter.
There is no need to build a string AND THEN split it...
import java.io.InputStream;
import java.util.Scanner;
public class ScanFile {
public static void main(String[] args) {
try {
InputStream is = ScanFile.class.getClassLoader().getResourceAsStream("resources/foo.txt");
Scanner scan = new Scanner(is);
scan.useDelimiter("\\.[\r\n]+"); // Tokenize at dots (.) followed by CR/LF.
int i = 1;
while (scan.hasNext()) {
String line = scan.next().trim();
System.out.printf("Line #%d%n-------%n%n%s%n%n", i++, line);
}
scan.close();
is.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Output
Line #1
-------
nellkb:company_dc
rdfs:label "dC" "WASHINGTON" , "Washington" ;
skos:prefLabel "WASHINGTON"
Line #2
-------
nellkb:politicsblog_quami_ekta
rdfs:label "Quami Ekta" ;
skos:prefLabel "Quami Ekta"
Line #3
-------
nellkb:female_ramendra_kumar
rdfs:label "Ramendra Kumar" ;
skos:prefLabel "Ramendra Kumar"
Additional Information
useDelimiter
public Scanner useDelimiter(String pattern)
Sets this scanner's delimiting pattern to a pattern constructed from the specified String.
An invocation of this method of the form useDelimiter(pattern) behaves in exactly the same way as the invocation useDelimiter(Pattern.compile(pattern)).
Invoking the reset() method will set the scanner's delimiter to the default.
Parameters:
   pattern - A string specifying a delimiting pattern
Returns:
   this scanner
The Scanner constructor takes six (6) different types of objects: File, InputStream, Path, Readable, ReadableByteChannel, and String.
// Constructs a new Scanner that produces values scanned from the specified file.
Scanner(File source)
// Constructs a new Scanner that produces values scanned from the specified file.
Scanner(File source, String charsetName)
// Constructs a new Scanner that produces values scanned from the specified input stream.
Scanner(InputStream source)
// Constructs a new Scanner that produces values scanned from the specified input stream.
Scanner(InputStream source, String charsetName)
// Constructs a new Scanner that produces values scanned from the specified file.
Scanner(Path source)
// Constructs a new Scanner that produces values scanned from the specified file.
Scanner(Path source, String charsetName)
// Constructs a new Scanner that produces values scanned from the specified source.
Scanner(Readable source)
// Constructs a new Scanner that produces values scanned from the specified channel.
Scanner(ReadableByteChannel source)
// Constructs a new Scanner that produces values scanned from the specified channel.
Scanner(ReadableByteChannel source, String charsetName)
// Constructs a new Scanner that produces values scanned from the specified string.
Scanner(String source)
Advanced Solution
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class ScanFile {
private static ClassLoader loader = ScanFile.class.getClassLoader();
private static interface LineProcessor {
void process(String line);
}
private static interface Reader<T> {
T read(String resource, String delimiter) throws IOException;
void flush();
}
private abstract static class FileScanner<T> implements Reader<T> {
private LineProcessor processor;
public void setProcessor(LineProcessor processor) {
this.processor = processor;
}
public T read(Scanner scan, String delimiter, boolean close) throws IOException {
scan.useDelimiter(delimiter);
while (scan.hasNext()) {
processor.process(scan.next().trim());
}
if (close) {
scan.close();
}
return null;
}
public T read(InputStream is, String delimiter, boolean close) throws IOException {
T t = read(new Scanner(is), delimiter, true);
if (close) {
is.close();
}
return t;
}
public T read(String resource, String delimiter) throws IOException {
return read(loader.getResourceAsStream("resources/" + resource), delimiter, true);
}
}
public static class FileTokenizer extends FileScanner<List<String>> {
private List<String> tokens;
public List<String> getTokens() {
return tokens;
}
public FileTokenizer() {
super();
tokens = new ArrayList<String>();
setProcessor(new LineProcessor() {
@Override
public void process(String token) {
tokens.add(token);
}
});
}
public List<String> read(Scanner scan, String delimiter, boolean close) throws IOException {
super.read(scan, delimiter, close);
return tokens;
}
@Override
public void flush() {
tokens.clear();
}
}
public static void main(String[] args) {
try {
FileTokenizer scanner = new FileTokenizer();
List<String> items = scanner.read("foo.txt", "\\.[\r\n]+");
for (int i = 0; i < items.size(); i++) {
System.out.printf("Line #%d%n-------%n%n%s%n%n", i + 1, items.get(i));
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

First read the contents of the file into a string, split the string and save it in a string array.
try {
String sCurrentLine = "";
StringBuilder content = new StringBuilder();
String splitted[]= null;
int i = 0;
br = new BufferedReader(new FileReader(rdfInstanceFile));
while ((sCurrentLine = br.readLine()) != null) {
content.append(sCurrentLine) ;
}
splitted = content.toString().split("\\.");
} catch (IOException e) {
e.printStackTrace();
}

Replace
splitted = sCurrentLine.split(".");
with
splitted = sCurrentLine.split("\\.");
EDIT
String sCurrentLine = null;
int i = 0;
br = new BufferedReader(new FileReader(rdfInstanceFile));
StringBuilder content = new StringBuilder();
while ((sCurrentLine = br.readLine()) != null) {
content.append(sCurrentLine);
}
splitted = content.toString().split("\\.");
It'll work.
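The reason the escaped version works: String.split takes a regular expression, and an unescaped "." matches any character, so every character becomes a delimiter and nothing is left over. A small sketch of the difference follows; the class and method names (SplitOnDot, splitOnLiteralDot) are hypothetical, and Pattern.quote is used as an alternative to writing "\\." by hand.

```java
import java.util.regex.Pattern;

final class SplitOnDot {
    // Splits on literal dots; Pattern.quote turns the delimiter into a literal pattern.
    static String[] splitOnLiteralDot(String text) {
        return text.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String text = "a.b.c";
        // Unescaped dot: every character is a delimiter, so only empty strings remain
        // and split() drops them all.
        System.out.println(text.split(".").length);
        // Literal dot: the three parts a, b and c.
        System.out.println(splitOnLiteralDot(text).length);
    }
}
```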

How to get a specific value from HTML in Java?

I am developing an application that shows the gold rate and creates a graph of it.
I found a website that provides this gold rate regularly. My question is how to extract this specific value from the HTML page.
Here is the link I need to extract from: http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/ The HTML page has the following tag and content:
<p><em>10 gram gold Rate in pune = Rs.31150.00</em></p>
Here is the code I use for extraction, but I could not find a way to extract the specific content.
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
@Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
@Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/";
//Here i need to extract this content by tag wise or content wise....
Thanks in Advance.......

You can use a library like Jsoup.
You can get it from here --> Download Jsoup
Here is its API reference --> Jsoup API Reference
It is really very easy to parse HTML content using Jsoup.
Below is some sample code which might be helpful to you:
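import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GetPTags {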
public static void main(String[] args){
Document doc = Jsoup.parse(readURL("http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/"));
Elements p_tags = doc.select("p");
for(Element p : p_tags)
{
System.out.println("P tag is "+p.text());
}
}
public static String readURL(String url) {
StringBuilder fileContents = new StringBuilder();
String currentLine;
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
// Read until end of stream; checking for null before appending avoids adding the text "null" after the last line.
while ((currentLine = reader.readLine()) != null) {
fileContents.append(currentLine).append("\n");
}
reader.close();
} catch (Exception e) {
JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.ERROR_MESSAGE);
e.printStackTrace();
}
return fileContents.toString();
}
}

http://java-source.net/open-source/crawlers
You can use any of those APIs, but don't parse the HTML with the plain JDK, because it's too painful.

Reading multiple xml documents from a socket in java

I'm writing a client which needs to read multiple consecutive small XML documents over a socket. I can assume that the encoding is always UTF-8 and that there is optionally delimiting whitespace between documents. The documents should ultimately go into DOM objects. What is the best way to accomplish this?
The essence of the problem is that the parsers expect a single document in the stream and consider the rest of the content junk. I thought that I could artificially end the document by tracking the element depth, and creating a new reader using the existing input stream. E.g. something like:
// Broken
public void parseInputStream(InputStream inputStream) throws Exception
{
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLOutputFactory xof = XMLOutputFactory.newInstance();
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.newDocument();
XMLEventWriter domWriter = xof.createXMLEventWriter(new DOMResult(doc));
XMLStreamReader xmlStreamReader = factory.createXMLStreamReader(inputStream);
XMLEventReader reader = factory.createXMLEventReader(xmlStreamReader);
int depth = 0;
while (reader.hasNext()) {
XMLEvent evt = reader.nextEvent();
domWriter.add(evt);
switch (evt.getEventType()) {
case XMLEvent.START_ELEMENT:
depth++;
break;
case XMLEvent.END_ELEMENT:
depth--;
if (depth == 0)
{
domWriter.add(eventFactory.createEndDocument());
System.out.println(doc);
reader.close();
xmlStreamReader.close();
xmlStreamReader = factory.createXMLStreamReader(inputStream);
reader = factory.createXMLEventReader(xmlStreamReader);
doc = documentBuilder.newDocument();
domWriter = xof.createXMLEventWriter(new DOMResult(doc));
domWriter.add(eventFactory.createStartDocument());
}
break;
}
}
}
However, running this on input such as <a></a><b></b><c></c> prints the first document and throws an XMLStreamException. What's the right way to do this?
Clarification: Unfortunately the protocol is fixed by the server and cannot be changed, so prepending a length or wrapping the contents would not work.

Length-prefix each document (in bytes).
Read the length of the first document from the socket
Read that much data from the socket, dumping it into a ByteArrayOutputStream
Create a ByteArrayInputStream from the results
Parse that ByteArrayInputStream to get the first document
Repeat for the second document etc
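The steps above could be sketched as follows. This assumes the sender writes a 4-byte big-endian length before each document, which is a framing choice made up for this example, not something the question's protocol provides. The class name LengthPrefixedXmlReader is hypothetical, and readFully into a byte array stands in for the ByteArrayOutputStream step.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class LengthPrefixedXmlReader {
    // Reads frames of the form [4-byte big-endian length][XML bytes] until the stream ends.
    static List<Document> readAll(InputStream socketStream) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        DataInputStream in = new DataInputStream(socketStream);
        List<Document> documents = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = in.readInt(); // length of the next document in bytes
            } catch (EOFException endOfStream) {
                break; // no further length prefix: the sender is done
            }
            byte[] payload = new byte[length];
            in.readFully(payload); // blocks until the whole document has arrived
            // The parser only ever sees one complete document.
            documents.add(builder.parse(new ByteArrayInputStream(payload)));
        }
        return documents;
    }
}
```

Each document is buffered completely before parsing, which is acceptable for the small documents described in the question.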

IIRC, XML documents can have comments and processing-instructions at the end, so there's no real way of telling exactly when you have come to the end of the file.
A couple of ways of handling the situation have already been mentioned. Another alternative is to put in an illegal character or byte into the stream, such as NUL or zero. This has the advantage that you don't need to alter the documents and you never need to buffer an entire file.
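A rough sketch of the delimiter idea, assuming the sender puts a single NUL byte (0) after each document. The class NulDelimitedInputStream and its nextDocument method are invented for this illustration. The stream reports end-of-file at each NUL, so a parser reading from it stops at the document boundary without the whole document being buffered first.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class NulDelimitedInputStream extends InputStream {
    private final PushbackInputStream source;
    private boolean atDocumentEnd;

    NulDelimitedInputStream(InputStream source) {
        // One byte of pushback is enough to peek for more data.
        this.source = new PushbackInputStream(source, 1);
    }

    @Override
    public int read() throws IOException {
        if (atDocumentEnd) {
            return -1; // keep reporting end-of-file until nextDocument() is called
        }
        int b = source.read();
        if (b == -1 || b == 0) {
            atDocumentEnd = true; // NUL delimiter or real end of the source
            return -1;
        }
        return b;
    }

    // Re-arms the stream for the next document; returns false once the source is exhausted.
    boolean nextDocument() throws IOException {
        int b = source.read();
        if (b == -1) {
            return false;
        }
        source.unread(b);
        atDocumentEnd = false;
        return true;
    }
}
```

A caller would loop with while (in.nextDocument()) and hand the stream to the parser once per iteration. Only read() is overridden, so bulk reads fall back to the byte-by-byte default in InputStream; that keeps the sketch short at the cost of speed.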

Just change the input to whatever stream you need:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
public class LogParser {
private XMLInputFactory inputFactory = null;
private XMLStreamReader xmlReader = null;
InputStream is;
private int depth;
private QName rootElement;
private static class XMLStream extends InputStream
{
InputStream delegate;
StringReader startroot = new StringReader("<root>");
StringReader endroot = new StringReader("</root>");
XMLStream(InputStream delegate)
{
this.delegate = delegate;
}
public int read() throws IOException {
int c = startroot.read();
if(c==-1)
{
c = delegate.read();
}
if(c==-1)
{
c = endroot.read();
}
return c;
}
}
public LogParser() {
inputFactory = XMLInputFactory.newInstance();
}
public void read() throws Exception {
is = new XMLStream(new FileInputStream(new File(
"./myfile.log")));
xmlReader = inputFactory.createXMLStreamReader(is);
while (xmlReader.hasNext()) {
printEvent(xmlReader);
xmlReader.next();
}
xmlReader.close();
}
public void printEvent(XMLStreamReader xmlr) throws Exception {
switch (xmlr.getEventType()) {
case XMLStreamConstants.END_DOCUMENT:
System.out.println("finished");
break;
case XMLStreamConstants.START_ELEMENT:
System.out.print("<");
printName(xmlr);
printNamespaces(xmlr);
printAttributes(xmlr);
System.out.print(">");
if(rootElement==null && depth==1)
{
rootElement = xmlr.getName();
}
depth++;
break;
case XMLStreamConstants.END_ELEMENT:
System.out.print("</");
printName(xmlr);
System.out.print(">");
depth--;
if(depth==1 && rootElement.equals(xmlr.getName()))
{
rootElement=null;
System.out.println("finished element");
}
break;
case XMLStreamConstants.SPACE:
case XMLStreamConstants.CHARACTERS:
int start = xmlr.getTextStart();
int length = xmlr.getTextLength();
System.out
.print(new String(xmlr.getTextCharacters(), start, length));
break;
case XMLStreamConstants.PROCESSING_INSTRUCTION:
System.out.print("<?");
if (xmlr.hasText())
System.out.print(xmlr.getText());
System.out.print("?>");
break;
case XMLStreamConstants.CDATA:
System.out.print("<![CDATA[");
start = xmlr.getTextStart();
length = xmlr.getTextLength();
System.out
.print(new String(xmlr.getTextCharacters(), start, length));
System.out.print("]]>");
break;
case XMLStreamConstants.COMMENT:
System.out.print("<!--");
if (xmlr.hasText())
System.out.print(xmlr.getText());
System.out.print("-->");
break;
case XMLStreamConstants.ENTITY_REFERENCE:
System.out.print(xmlr.getLocalName() + "=");
if (xmlr.hasText())
System.out.print("[" + xmlr.getText() + "]");
break;
case XMLStreamConstants.START_DOCUMENT:
System.out.print("<?xml");
System.out.print(" version='" + xmlr.getVersion() + "'");
System.out.print(" encoding='" + xmlr.getCharacterEncodingScheme()
+ "'");
if (xmlr.isStandalone())
System.out.print(" standalone='yes'");
else
System.out.print(" standalone='no'");
System.out.print("?>");
break;
}
}
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
new LogParser().read();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private static void printName(XMLStreamReader xmlr) {
if (xmlr.hasName()) {
System.out.print(getName(xmlr));
}
}
private static String getName(XMLStreamReader xmlr) {
if (xmlr.hasName()) {
String prefix = xmlr.getPrefix();
String uri = xmlr.getNamespaceURI();
String localName = xmlr.getLocalName();
return getName(prefix, uri, localName);
}
return null;
}
private static String getName(String prefix, String uri, String localName) {
String name = "";
if (uri != null && !("".equals(uri)))
name += "['" + uri + "']:";
if (prefix != null)
name += prefix + ":";
if (localName != null)
name += localName;
return name;
}
private static void printAttributes(XMLStreamReader xmlr) {
for (int i = 0; i < xmlr.getAttributeCount(); i++) {
printAttribute(xmlr, i);
}
}
private static void printAttribute(XMLStreamReader xmlr, int index) {
String prefix = xmlr.getAttributePrefix(index);
String namespace = xmlr.getAttributeNamespace(index);
String localName = xmlr.getAttributeLocalName(index);
String value = xmlr.getAttributeValue(index);
System.out.print(" ");
System.out.print(getName(prefix, namespace, localName));
System.out.print("='" + value + "'");
}
private static void printNamespaces(XMLStreamReader xmlr) {
for (int i = 0; i < xmlr.getNamespaceCount(); i++) {
printNamespace(xmlr, i);
}
}
private static void printNamespace(XMLStreamReader xmlr, int index) {
String prefix = xmlr.getNamespacePrefix(index);
String uri = xmlr.getNamespaceURI(index);
System.out.print(" ");
if (prefix == null)
System.out.print("xmlns='" + uri + "'");
else
System.out.print("xmlns:" + prefix + "='" + uri + "'");
}
}

A simple solution is to wrap the documents on the sending side in a new root element:
<?xml version="1.0"?>
<documents>
... document 1 ...
... document 2 ...
</documents>
You must make sure that you don't include the XML header (<?xml ...?>), though. If all documents use the same encoding, this can be accomplished with a simple filter which just ignores the first line of each document if it starts with <?xml

Found this forum message (which you probably already saw), which has a solution by wrapping the input stream and testing for one of two ascii characters (see post).
You could try an adaptation on this by first converting to use a reader (for proper character encoding) and then doing element counting until you reach the closing element, at which point you trigger the EOM.

Hi
I also had this problem at work (so I won't post the resulting code). The most elegant solution that I could think of, and which works pretty nicely in my opinion, is as follows:
Create a class for example DocumentSplittingInputStream which extends InputStream and takes the underlying inputstream in its constructor (or gets set after construction...).
Add a field with a byte array closeTag containing the bytes of the closing root node you are looking for.
Add a field int called matchCount or something, initialised to zero.
Add a field boolean called underlyingInputStreamNotFinished, initialised to true
On the read() implementation:
Check if matchCount == closeTag.length, if it does, set matchCount to -1, return -1
If matchCount == -1, set matchCount = 0, call read() on the underlying inputstream until you get -1 or '<' (the xml declaration of the next document on the stream) and return it. Note that for all I know the xml spec allows comments after the document element, but I knew I was not going to get that from the source so did not bother handling it - if you can not be sure you'll need to change the "gobble" slightly.
Otherwise read an int from the underlying inputstream (if it equals closeTag[matchCount] then increment matchCount, if it doesn't then reset matchCount to zero) and return the newly read byte
Add a method which returns the boolean on whether the underlying stream has closed.
All reads on the underlying input stream should go through a separate method where it checks if the value read is -1 and if so, sets the field "underlyingInputStreamNotFinished" to false.
I may have missed some minor points, but I'm sure you get the picture.
Then in the using code you do something like, if you are using xstream:
DocumentSplittingInputStream dsis = new DocumentSplittingInputStream(underlyingInputStream);
while (dsis.underlyingInputStreamNotFinished()) {
MyObject mo = xstream.fromXML(dsis);
mo.doSomething(); // or something.doSomething(mo);
}
David

I had to do something like this, and during my research on how to approach it I found this thread. Even though it is quite old, I just replied (to myself) here, wrapping everything in its own Reader for simpler use.

I was faced with a similar problem. A web service I'm consuming will (in some cases) return multiple xml documents in response to a single HTTP GET request. I could read the entire response into a String and split it, but instead I implemented a splitting input stream based on user467257's post above. Here is the code:
public class AnotherSplittingInputStream extends InputStream {
private final InputStream realStream;
private final byte[] closeTag;
private int matchCount;
private boolean realStreamFinished;
private boolean reachedCloseTag;
public AnotherSplittingInputStream(InputStream realStream, String closeTag) {
this.realStream = realStream;
this.closeTag = closeTag.getBytes();
}
@Override
public int read() throws IOException {
if (reachedCloseTag) {
return -1;
}
if (matchCount == closeTag.length) {
matchCount = 0;
reachedCloseTag = true;
return -1;
}
int ch = realStream.read();
if (ch == -1) {
realStreamFinished = true;
}
else if (ch == closeTag[matchCount]) {
matchCount++;
} else {
matchCount = 0;
}
return ch;
}
public boolean hasMoreData() {
if (realStreamFinished == true) {
return false;
} else {
reachedCloseTag = false;
return true;
}
}
}
And to use it:
String xml =
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<root>first root</root>" +
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<root>second root</root>";
ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());
AnotherSplittingInputStream splitter = new AnotherSplittingInputStream(is, "</root>");
BufferedReader reader = new BufferedReader(new InputStreamReader(splitter));
while (splitter.hasMoreData()) {
System.out.println("Starting next stream");
String line = null;
while ((line = reader.readLine()) != null) {
System.out.println("line ["+line+"]");
}
}

I use a JAXB approach to unmarshal multiple messages from one stream:
MultiInputStream.java
public class MultiInputStream extends InputStream {
private final Reader source;
private final StringReader startRoot = new StringReader("<root>");
private final StringReader endRoot = new StringReader("</root>");
public MultiInputStream(Reader source) {
this.source = source;
}
@Override
public int read() throws IOException {
int count = startRoot.read();
if (count == -1) {
count = source.read();
}
if (count == -1) {
count = endRoot.read();
}
return count;
}
}
MultiEventReader.java
public class MultiEventReader implements XMLEventReader {
private final XMLEventReader reader;
private boolean isXMLEvent = false;
private int level = 0;
public MultiEventReader(XMLEventReader reader) throws XMLStreamException {
this.reader = reader;
startXML();
}
private void startXML() throws XMLStreamException {
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
return;
}
}
}
public boolean hasNextXML() {
return reader.hasNext();
}
public void nextXML() throws XMLStreamException {
while (reader.hasNext()) {
XMLEvent event = reader.peek();
if (event.isStartElement()) {
isXMLEvent = true;
return;
}
reader.nextEvent();
}
}
@Override
public XMLEvent nextEvent() throws XMLStreamException {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
level++;
}
if (event.isEndElement()) {
level--;
if (level == 0) {
isXMLEvent = false;
}
}
return event;
}
@Override
public boolean hasNext() {
return isXMLEvent;
}
@Override
public XMLEvent peek() throws XMLStreamException {
XMLEvent event = reader.peek();
if (level == 0) {
while (event != null && !event.isStartElement() && reader.hasNext()) {
reader.nextEvent();
event = reader.peek();
}
}
return event;
}
@Override
public String getElementText() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public XMLEvent nextTag() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public Object getProperty(String name) throws IllegalArgumentException {
throw new NotImplementedException();
}
@Override
public void close() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public Object next() {
throw new NotImplementedException();
}
@Override
public void remove() {
throw new NotImplementedException();
}
}
Message.java
@XmlAccessorType(XmlAccessType.FIELD)
@XmlRootElement(name = "Message")
public class Message {
public Message() {
}
@XmlAttribute(name = "ID", required = true)
protected long id;
public long getId() {
return id;
}
public void setId(long id) {
this.id = id;
}
@Override
public String toString() {
return "Message{id=" + id + '}';
}
}
Read multiple messages:
public static void main(String[] args) throws Exception{
StringReader stringReader = new StringReader(
"<Message ID=\"123\" />\n" +
"<Message ID=\"321\" />"
);
JAXBContext context = JAXBContext.newInstance(Message.class);
Unmarshaller unmarshaller = context.createUnmarshaller();
XMLInputFactory inputFactory = XMLInputFactory.newFactory();
MultiInputStream multiInputStream = new MultiInputStream(stringReader);
XMLEventReader xmlEventReader = inputFactory.createXMLEventReader(multiInputStream);
MultiEventReader multiEventReader = new MultiEventReader(xmlEventReader);
while (multiEventReader.hasNextXML()) {
Object message = unmarshaller.unmarshal(multiEventReader);
System.out.println(message);
multiEventReader.nextXML();
}
}
results:
Message{id=123}
Message{id=321}

Remove HTML tags from a String

Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>", "")
will work, but some things like &amp; won't be converted correctly, and non-HTML between the two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
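
Both problems can be reproduced with a few lines. This is only a sketch using the JDK; the class and method names (RegexStripDemo, stripTags) are made up for illustration.

```java
class RegexStripDemo {
    // The naive approach from the question: drop anything between '<' and '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity is left as "&amp;" instead of "&".
        System.out.println(stripTags("<p>Fish &amp; chips</p>"));
        // Plain text that happens to sit between '<' and '>' is removed as well.
        System.out.println(stripTags("if a < b and c > d"));
    }
}
```

The second call loses "< b and c >" even though it was never markup.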

Use an HTML parser instead of regex. This is dead simple with Jsoup.
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
See also:
RegEx match open tags except XHTML self-contained tags
What are the pros and cons of the leading Java HTML parsers?
XSS prevention in JSP/Servlet web application

If you're writing for Android you can do this...
androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:
replaceAll("\\<[^>]*>","")
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
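
As a rough illustration of that last step, here is a minimal sketch that encodes the usual special characters by hand. The names HtmlEscapeSketch and escapeHtml are hypothetical; in real code a maintained library routine is probably the safer choice.

```java
class HtmlEscapeSketch {
    // Encode the characters that are significant in HTML text and attribute values.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Malformed input is rendered harmless instead of being parsed as markup.
        System.out.println(escapeHtml("<bhey!</b> & more"));
    }
}
```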

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {
}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main(String[] args) {
try {
// the HTML to convert
FileReader in = new FileReader("java-new.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
ref : Remove HTML tags from a file to extract only the TEXT

I think that the simplest way to filter the html tags is:
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
public static String removeTags(String string) {
if (string == null || string.length() == 0) {
return string;
}
Matcher m = REMOVE_TAGS.matcher(string);
return m.replaceAll("");
}

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).
Source htmlSource = new Source(htmlText);
Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRend = new Renderer(htmlSeg);
System.out.println(htmlRend.toString());

On Android, try this:
String result = Html.fromHtml(html).toString();

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):
It removes line breaks from the text
It converts text &lt;script&gt; into <script>
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);
Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.
And here is a bunch of test cases (input to output):
{"regular string", "regular string"},
{"A link", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
If you find a way to make it better, please let me know.

HTML escaping is really hard to do right. I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

This should work. Use
text.replaceAll("<.*?>", " ")
to replace all HTML tags with a space, and
text.replaceAll("&.*?;", "")
to remove everything that starts with "&" and ends with ";", such as &nbsp;, &amp; and &gt;.

You can simply use the Android's default HTML filter
public String htmlToStringFilter(String textToFilter){
return Html.fromHtml(textToFilter).toString();
}
The above method will return the HTML filtered string for your input.

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.
The only way I can think of removing HTML tags but leaving non-HTML between angle brackets would be check against a list of HTML tags. Something along these lines...
replaceAll("\\<[\\s]*tag[^>]*>","")
Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
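
The steps above can be sketched roughly as follows. The tag list is a short, made-up sample rather than a complete list of HTML tags, and the class and method names are hypothetical. Unlike the pattern above, this sketch does not allow whitespace right after "<", so that text such as "a < b" is left alone.

```java
import java.util.regex.Pattern;

class TagListStripSketch {
    // Hypothetical, deliberately short list of tag names; extend as needed.
    private static final String TAGS = "p|br|b|i|u|div|span|tr|td|a";

    // <br>, <br/> and </p> become line breaks before anything is stripped.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");

    // Opening or closing form of a listed tag, with optional attributes.
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("(?i)</?(?:" + TAGS + ")\\b[^>]*>");

    static String toText(String html) {
        String text = LINE_BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // Decode a few common entities; &amp; goes last so "&amp;lt;" ends up as "&lt;".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.print(toText("<p>a < c and d > e</p><p>Fish &amp; chips<br/>done</p>"));
    }
}
```

As noted above, the output is plain text for display only and should not be treated as sanitized.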

One more way can be to use com.google.gdata.util.common.html.HtmlToText class
like
MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));
This is not bullet proof code though and when I run it on wikipedia entries I am getting style info also. However I believe for small/simple jobs this would be effective.

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".
So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see TagSoup
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
sb.append(ch, start, length);
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
// append only the reported range, not the whole buffer
sb.append(ch, start, length);
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}

Here's a slightly more fleshed out update to try to handle some formatting for breaks and lists. I used Amaya's output as a guide.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HTML2Text extends HTMLEditorKit.ParserCallback {
private static final Logger log = Logger
.getLogger(Logger.GLOBAL_LOGGER_NAME);
private StringBuffer stringBuffer;
private Stack<IndexType> indentStack;
public static class IndexType {
public String type;
public int counter; // used for ordered lists
public IndexType(String type) {
this.type = type;
counter = 0;
}
}
public HTML2Text() {
stringBuffer = new StringBuffer();
indentStack = new Stack<IndexType>();
}
public static String convert(String html) {
HTML2Text parser = new HTML2Text();
Reader in = new StringReader(html);
try {
// the HTML to convert
parser.parse(in);
} catch (Exception e) {
log.severe(e.getMessage());
} finally {
try {
in.close();
} catch (IOException ioe) {
// this should never happen
}
}
return parser.getText();
}
public void parse(Reader in) throws IOException {
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("StartTag:" + t.toString());
if (t.toString().equals("p")) {
if (stringBuffer.length() > 0
&& !stringBuffer.substring(stringBuffer.length() - 1)
.equals("\n")) {
newLine();
}
newLine();
} else if (t.toString().equals("ol")) {
indentStack.push(new IndexType("ol"));
newLine();
} else if (t.toString().equals("ul")) {
indentStack.push(new IndexType("ul"));
newLine();
} else if (t.toString().equals("li")) {
IndexType parent = indentStack.peek();
if (parent.type.equals("ol")) {
String numberString = "" + (++parent.counter) + ".";
stringBuffer.append(numberString);
for (int i = 0; i < (4 - numberString.length()); i++) {
stringBuffer.append(" ");
}
} else {
stringBuffer.append("* ");
}
indentStack.push(new IndexType("li"));
} else if (t.toString().equals("dl")) {
newLine();
} else if (t.toString().equals("dt")) {
newLine();
} else if (t.toString().equals("dd")) {
indentStack.push(new IndexType("dd"));
newLine();
}
}
private void newLine() {
stringBuffer.append("\n");
for (int i = 0; i < indentStack.size(); i++) {
stringBuffer.append(" ");
}
}
public void handleEndTag(HTML.Tag t, int pos) {
log.info("EndTag:" + t.toString());
if (t.toString().equals("p")) {
newLine();
} else if (t.toString().equals("ol")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("ul")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("li")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("dd")) {
indentStack.pop();
}
}
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("SimpleTag:" + t.toString());
if (t.toString().equals("br")) {
newLine();
}
}
public void handleText(char[] text, int pos) {
log.info("Text:" + new String(text));
stringBuffer.append(text);
}
public String getText() {
return stringBuffer.toString();
}
public static void main(String args[]) {
String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list <p>with</p> <ul> <li>another</li> <li>list <dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd> <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl> <dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd> <dt>cvbcv</dt> <dt></dt> </dl> <dl> <dt></dt> </dl></li> <li>cool</li> </ul> <p>stuff</p> </li> <li>cool</li></ol><p></p></body></html>";
System.out.println(convert(html));
}
}

Alternatively, one can use HtmlCleaner:
private CharSequence removeHtmlFrom(String html) {
return new HtmlCleaner().clean(html).getText();
}

Use Html.fromHtml
Supported HTML tags are:
<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>
According to Android's official documentation, any <img> tags in the HTML are displayed as a generic replacement image, which your program can then go through and replace with real images.
An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments, as well as the text to parse.
Example
String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
Then
Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
Output
This is about me text that the user can put into their profile

Here is one more variant that removes HTML tags, HTML entities, and runs of two or more spaces in HTML content:
content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:
noHTMLString.replaceAll("\\&.*?\\;", "");
instead of this:
html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");

It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.
static public String getUrlContentsAsText(String url) {
String content = "";
StringBean stringBean = new StringBean();
stringBean.setURL(url);
content = stringBean.getStrings();
return content;
}

Here is another way to do it:
public static String removeHTML(String input) {
// copy every character that is not inside a <...> tag
StringBuilder s = new StringBuilder();
boolean inTag = false;
for (int i = 0; i < input.length(); i++) {
char c = input.charAt(i);
if (c == '<') {
inTag = true;
} else if (c == '>' && inTag) {
inTag = false;
} else if (!inTag) {
s.append(c);
}
}
return s.toString();
}

One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:
InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".
String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
html = html.replace(tag, NEW_LINE_MARK+tag);
}
String text = Jsoup.parse(html).text();
text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()

Sometimes the HTML string comes from XML with escaped entities such as &lt;. When using Jsoup, we need to parse it first and then clean it.
Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
Using only Jsoup.parse(htmlstrl).text() does not remove the tags in that case.

Try this for javascript:
const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);

You can use this method to remove the HTML tags from the String,
public static String stripHtmlTags(String html) {
return html.replaceAll("<.*?>", "");
}

My 5 cents:
String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {
for (int i = 0; i < temp.length; i++) {
tmp += temp[i] + "&";
}
yourString = tmp.substring(0, tmp.length() - 1);
}

To get formatted plain HTML text you can do this:
String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
To get formatted plain text, replace <br/> with \n and change the last line to:
nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

I know it has been a while since this question was asked, but I found another solution. This is what worked for me:
Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source= new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml= m.replaceAll("");


Convert HTML to plain text in Java - java

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem. Hope it is helpful.
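
A minimal sketch of the XSLT route, using only the JDK's javax.xml.transform API. It assumes the input is well-formed XHTML without a namespace or DOCTYPE; the stylesheet, the set of tags that produce a line break, and the names XsltToText and toText are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToText {
    // Emit text only, with a line break for each <br> and after each <p>, <tr> and <li>.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
        + "<xsl:template match='p|tr|li'>"
        + "<xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String toText(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not transform input", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(toText(
            "<html><body><p>Line one</p><p>Line two<br/>Line three</p></body></html>"));
    }
}
```

Depending on the XSLT implementation, the line breaks in text output may be written with the platform line separator.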

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
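
A minimal sketch of the SAX approach with the JDK parser, assuming the input is already well-formed XHTML (for example after a JTidy pass, which is not shown). The class name SaxTextExtractor and the set of tags treated as line breaks are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Treat the end of these elements as a line break (an assumed, minimal set).
        switch (qName.toLowerCase(Locale.ROOT)) {
            case "br":
            case "p":
            case "tr":
            case "li":
                text.append('\n');
                break;
            default:
                break;
        }
    }

    static String toText(String xhtml) {
        SaxTextExtractor handler = new SaxTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("<html><body><p>Line one</p>"
            + "<table><tr><td>cell</td></tr></table>Line two<br/>Line three</body></html>"));
    }
}
```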
