I've been parsing XML like this for years, and I have to admit that as the number of different elements grows, I find it tedious and exhausting. Here is what I mean, with a sample dummy XML:
<?xml version="1.0"?>
<Order>
<Date>2003/07/04</Date>
<CustomerId>123</CustomerId>
<CustomerName>Acme Alpha</CustomerName>
<Item>
<ItemId> 987</ItemId>
<ItemName>Coupler</ItemName>
<Quantity>5</Quantity>
</Item>
<Item>
<ItemId>654</ItemId>
<ItemName>Connector</ItemName>
<Quantity unit="12">3</Quantity>
</Item>
<Item>
<ItemId>579</ItemId>
<ItemName>Clasp</ItemName>
<Quantity>1</Quantity>
</Item>
</Order>
This is the relevant part (using SAX):
public class SaxParser extends DefaultHandler {
boolean isItem = false;
boolean isOrder = false;
boolean isDate = false;
boolean isCustomerId = false;
private Order order;
private Item item;
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
if (localName.equalsIgnoreCase("ORDER")) {
order = new Order();
}
if (localName.equalsIgnoreCase("DATE")) {
isDate = true;
}
if (localName.equalsIgnoreCase("CUSTOMERID")) {
isCustomerId = true;
}
if (localName.equalsIgnoreCase("ITEM")) {
isItem = true;
}
}
public void characters(char ch[], int start, int length) throws SAXException {
if (isDate){
SimpleDateFormat formatter = new SimpleDateFormat("yyyy/MM/dd");
String value = new String(ch, start, length);
try {
order.setDate(formatter.parse(value));
} catch (ParseException e) {
e.printStackTrace();
}
}
if(isCustomerId){
order.setCustomerId(Integer.valueOf(new String(ch, start, length)));
}
if (isItem) {
item = new Item();
isItem = false;
}
}
}
I'm wondering whether there is a way to get rid of these hideous booleans, which keep growing with the number of elements. There must be a better way to parse this relatively simple XML. The amount of code necessary for this task just looks ugly.
Currently I'm using a SAX parser, but I'm open to any other suggestions (other than DOM; I can't afford in-memory parsers because I have huge XML files).
If you control the definition of the XML, you could use an XML binding tool, for example JAXB (Java Architecture for XML Binding.) In JAXB you can define a schema for the XML structure (XSD and others are supported) or annotate your Java classes in order to define the serialization rules. Once you have a clear declarative mapping between XML and Java, marshalling and unmarshalling to/from XML becomes trivial.
Using JAXB does require more memory than SAX handlers, but there exist methods to process the XML documents by parts: Dealing with large documents.
JAXB page from Oracle
Here's an example of using JAXB with StAX.
Input document:
<?xml version="1.0" encoding="UTF-8"?>
<Personlist xmlns="http://example.org">
<Person>
<Name>Name 1</Name>
<Address>
<StreetAddress>Somestreet</StreetAddress>
<PostalCode>00001</PostalCode>
<CountryName>Finland</CountryName>
</Address>
</Person>
<Person>
<Name>Name 2</Name>
<Address>
<StreetAddress>Someotherstreet</StreetAddress>
<PostalCode>43400</PostalCode>
<CountryName>Sweden</CountryName>
</Address>
</Person>
</Personlist>
Person.java:
@XmlRootElement(name = "Person", namespace = "http://example.org")
public class Person {
@XmlElement(name = "Name", namespace = "http://example.org")
private String name;
@XmlElement(name = "Address", namespace = "http://example.org")
private Address address;
public String getName() {
return name;
}
public Address getAddress() {
return address;
}
}
Address.java:
public class Address {
@XmlElement(name = "StreetAddress", namespace = "http://example.org")
private String streetAddress;
@XmlElement(name = "PostalCode", namespace = "http://example.org")
private String postalCode;
@XmlElement(name = "CountryName", namespace = "http://example.org")
private String countryName;
public String getStreetAddress() {
return streetAddress;
}
public String getPostalCode() {
return postalCode;
}
public String getCountryName() {
return countryName;
}
}
PersonlistProcessor.java:
public class PersonlistProcessor {
public static void main(String[] args) throws Exception {
new PersonlistProcessor().processPersonlist(PersonlistProcessor.class
.getResourceAsStream("personlist.xml"));
}
// TODO: Instead of throws Exception, all exceptions should be wrapped
// inside runtime exception
public void processPersonlist(InputStream inputStream) throws Exception {
JAXBContext jaxbContext = JAXBContext.newInstance(Person.class);
XMLStreamReader xss = XMLInputFactory.newFactory().createXMLStreamReader(inputStream);
// Create unmarshaller
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
// Go to next tag
xss.nextTag();
// Require Personlist
xss.require(XMLStreamReader.START_ELEMENT, "http://example.org", "Personlist");
// Go to next tag
while (xss.nextTag() == XMLStreamReader.START_ELEMENT) {
// Require Person
xss.require(XMLStreamReader.START_ELEMENT, "http://example.org", "Person");
// Unmarshall person
Person person = (Person)unmarshaller.unmarshal(xss);
// Process person
processPerson(person);
}
// Require Personlist
xss.require(XMLStreamReader.END_ELEMENT, "http://example.org", "Personlist");
}
private void processPerson(Person person) {
System.out.println(person.getName());
System.out.println(person.getAddress().getCountryName());
}
}
I've been using XStream to serialize my own objects to XML and then load them back as Java objects. If you can represent everything as POJOs and you properly annotate the POJOs to match the types in your XML file, you might find it much easier to use.
When a String represents an object in XML, you can just write:
Order theOrder = (Order)xstream.fromXML(xmlString);
I have always used it to load an object into memory in a single line, but if you need to stream it and process as you go you should be able to use a HierarchicalStreamReader to iterate through the document. This might be very similar to Simple, suggested by @Dave.
In SAX the parser "pushes" events at your handler, so you have to do all the housekeeping as you are used to here. An alternative would be StAX (the javax.xml.stream package), which is still streaming but your code is responsible for "pulling" events from the parser. This way the logic of what elements are expected in what order is encoded in the control flow of your program rather than having to be explicitly represented in booleans.
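To make the "pull" idea concrete, here is a minimal sketch using only the JDK's javax.xml.stream API against the Order document from the question. The class and method names (StaxOrderSketch, readItemNames, readItem) are made up for illustration, and the code assumes the elements appear without mixed content, as in the sample. Note that no boolean flags are needed: being inside readItem is what tells the code it is inside an Item.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class: collects the ItemName of every Item in an Order.
class StaxOrderSketch {

    static List<String> readItemNames(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            reader.nextTag(); // move to the root element
            reader.require(XMLStreamConstants.START_ELEMENT, null, "Order");
            // Each iteration starts on a direct child of <Order>.
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                if ("Item".equals(reader.getLocalName())) {
                    names.add(readItem(reader));
                } else {
                    reader.getElementText(); // Date, CustomerId, ...: read and ignore
                }
            }
            reader.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed or unexpected XML", e);
        }
    }

    // Called while positioned on <Item>; returns when </Item> has been reached.
    private static String readItem(XMLStreamReader reader) throws XMLStreamException {
        String name = null;
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String tag = reader.getLocalName();
            String text = reader.getElementText().trim(); // leaves reader on the end tag
            if ("ItemName".equals(tag)) {
                name = text;
            }
        }
        return name;
    }

    public static void main(String[] args) {
        String xml = "<Order><CustomerId>123</CustomerId>"
                + "<Item><ItemId>987</ItemId><ItemName>Coupler</ItemName></Item></Order>";
        System.out.println(readItemNames(xml));
    }
}
```

Only one element's worth of data is held at a time, so this should scale to large files in the same way a SAX handler does.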
Depending on the precise structure of the XML there may be a "middle way" using a toolkit like XOM, which has a mode of operation where you parse a subtree of the document into a DOM-like object model, process that twig, then throw it away and parse the next one. This is good for repetitive documents with many similar elements that can each be processed in isolation - you get the ease of programming to a tree-based API within each twig but still have the streaming behaviour that lets you parse huge documents efficiently.
public class ItemProcessor extends NodeFactory {
private Nodes emptyNodes = new Nodes();
public Nodes finishMakingElement(Element elt) {
if("Item".equals(elt.getLocalName())) {
// process the Item element here
System.out.println(elt.getFirstChildElement("ItemId").getValue()
+ ": " + elt.getFirstChildElement("ItemName").getValue());
// then throw it away
return emptyNodes;
} else {
return super.finishMakingElement(elt);
}
}
}
You can achieve a similar thing with a combination of StAX and JAXB - define JAXB annotated classes that represent your repeating element (Item in this example) and then create a StAX parser, navigate to the first Item start tag, and then you can unmarshal one complete Item at a time from the XMLStreamReader.
As others suggested, a StAX model would be a better approach to minimize the memory footprint, since it is a pull-based streaming model. I have personally used Axiom (which is used in Apache Axis2) and parsed elements using XPath expressions, which is less verbose than going through node elements as you have done in the code snippet provided.
I've been using this library. It sits on top of the standard Java library and makes things easier for me. In particular, you can ask for a specific element or attribute by name, rather than using the big "if" statement you've described.
http://marketmovers.blogspot.com/2014/02/the-easy-way-to-read-xml-in-java.html
There is another library which supports more compact XML parsing, RTXML. The library and its documentation are on rasmustorkel.com. I implemented the parsing of the file in the original question and I am including the complete program here:
package for_so;
import java.io.File;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import rasmus_torkel.xml_basic.read.TagNode;
import rasmus_torkel.xml_basic.read.XmlReadOptions;
import rasmus_torkel.xml_basic.read.impl.XmlReader;
public class Q15626686_ReadOrder
{
public static class Order
{
public final Date _date;
public final int _customerId;
public final String _customerName;
public final ArrayList<Item> _itemAl;
public
Order(TagNode node)
{
_date = (Date)node.nextStringMappedFieldE("Date", Date.class);
_customerId = (int)node.nextIntFieldE("CustomerId");
_customerName = node.nextTextFieldE("CustomerName");
_itemAl = new ArrayList<Item>();
boolean finished = false;
while (!finished)
{
TagNode itemNode = node.nextChildN("Item");
if (itemNode != null)
{
Item item = new Item(itemNode);
_itemAl.add(item);
}
else
{
finished = true;
}
}
node.verifyNoMoreChildren();
}
}
public static final Pattern DATE_PATTERN = Pattern.compile("^(\\d\\d\\d\\d)\\/(\\d\\d)\\/(\\d\\d)$");
public static class Date
{
public final String _dateString;
public final int _year;
public final int _month;
public final int _day;
public
Date(String dateString)
{
_dateString = dateString;
Matcher matcher = DATE_PATTERN.matcher(dateString);
if (!matcher.matches())
{
throw new RuntimeException(dateString + " does not match pattern " + DATE_PATTERN.pattern());
}
_year = Integer.parseInt(matcher.group(1));
_month = Integer.parseInt(matcher.group(2));
_day = Integer.parseInt(matcher.group(3));
}
}
public static class Item
{
public final int _itemId;
public final String _itemName;
public final Quantity _quantity;
public
Item(TagNode node)
{
_itemId = node.nextIntFieldE("ItemId");
_itemName = node.nextTextFieldE("ItemName");
_quantity = new Quantity(node.nextChildE("Quantity"));
node.verifyNoMoreChildren();
}
}
public static class Quantity
{
public final int _unitSize;
public final int _unitQuantity;
public
Quantity(TagNode node)
{
_unitSize = node.attributeIntD("unit", 1);
_unitQuantity = node.onlyInt();
}
}
public static void
main(String[] args)
{
File xmlFile = new File(args[0]);
TagNode orderNode = XmlReader.xmlFileToRoot(xmlFile, "Order", XmlReadOptions.DEFAULT);
Order order = new Order(orderNode);
System.out.println("Read order for " + order._customerName + " which has " + order._itemAl.size() + " items");
}
}
You will notice that the retrieval functions end in N, E or D. They refer to what to do when the desired data item is not there. N stands for return Null, E stands for throw Exception and D stands for use Default.
A solution without using an outside package, or even XPath: use an enum "PARSE_MODE", probably in combination with a Stack<PARSE_MODE>:
1) The basic solution:
a) fields
private PARSE_MODE parseMode = PARSE_MODE.__UNDEFINED__;
// NB: essential that all these enum values are upper case, but this is the convention anyway
private enum PARSE_MODE {
__UNDEFINED__, ORDER, DATE, CUSTOMERID, ITEM };
private List<String> parseModeStrings = new ArrayList<String>();
private Stack<PARSE_MODE> modeBreadcrumbs = new Stack<PARSE_MODE>();
b) make your List<String>, maybe in the constructor:
for( PARSE_MODE pm : PARSE_MODE.values() ){
// might want to check here that these are indeed upper case
parseModeStrings.add( pm.name() );
}
c) startElement and endElement:
@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
String localNameUC = localName.toUpperCase();
// pushing "__UNDEFINED__" would mess things up! But unlikely name for an XML element
assert ! localNameUC.equals( "__UNDEFINED__" );
if( parseModeStrings.contains( localNameUC )){
parseMode = PARSE_MODE.valueOf( localNameUC );
// any "policing" to do with which modes are allowed to switch into
// other modes could be put here...
// in your case, go `new Order()` here when parseMode == ORDER
modeBreadcrumbs.push( parseMode );
}
else {
// typically ignore the start of this element...
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
String localNameUC = localName.toUpperCase();
if( parseModeStrings.contains( localNameUC )){
// will not fail unless XML structure which is malformed in some way
// or coding error in use of the Stack, etc.:
// pop outside the assert so the stack is still updated when assertions are disabled
PARSE_MODE popped = modeBreadcrumbs.pop();
assert popped == parseMode;
if( modeBreadcrumbs.empty() ){
parseMode = PARSE_MODE.__UNDEFINED__;
}
else {
parseMode = modeBreadcrumbs.peek();
}
}
else {
// typically ignore the end of this element...
}
}
... so what does this all mean? At any one time you have knowledge of the "parse mode" you're in ... and you can also look at the Stack<PARSE_MODE> modeBreadcrumbs if you need to find out what other parse modes you passed through to get here...
Your characters method then becomes substantially cleaner:
public void characters(char[] ch, int start, int length) throws SAXException {
switch( parseMode ){
case DATE:
// PS - this SimpleDateFormat object can be a field: it doesn't need to be created hundreds of times
SimpleDateFormat formatter. ...
String value = ...
...
break;
case CUSTOMERID:
order.setCustomerId( ...
break;
case ITEM:
item = new Item();
// this next line probably won't be needed: when you get to endElement, if
// parseMode is ITEM, the previous mode will be restored automatically
// isItem = false ;
}
}
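For reference, the basic mode-stack idea can be condensed into one small runnable handler. This is only a sketch under a few assumptions of mine: the names (ModeStackSketch, Mode, modeFor) are hypothetical, every element is pushed (unknown ones as OTHER) so the stack always stays balanced, and text is buffered and handled in endElement because a SAX parser is allowed to deliver one element's text in several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one enum value per element of interest, no booleans.
class ModeStackSketch extends DefaultHandler {

    enum Mode { ORDER, DATE, CUSTOMERID, ITEM, ITEMNAME, OTHER }

    private final Deque<Mode> modes = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    final List<String> itemNames = new ArrayList<>();
    String customerId;

    // Map an element name to a mode; anything unknown becomes OTHER.
    private static Mode modeFor(String elementName) {
        try {
            return Mode.valueOf(elementName.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Mode.OTHER;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default SAXParserFactory is not namespace-aware
        modes.push(modeFor(qName));
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (modes.pop()) {
            case CUSTOMERID:
                customerId = text.toString().trim();
                break;
            case ITEMNAME:
                itemNames.add(text.toString().trim());
                break;
            default:
                break; // nothing to record for the other modes
        }
        text.setLength(0);
    }

    static ModeStackSketch parse(String xml) {
        try {
            ModeStackSketch handler = new ModeStackSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ModeStackSketch h = parse("<Order><CustomerId>123</CustomerId>"
                + "<Item><ItemName>Coupler</ItemName></Item></Order>");
        System.out.println(h.customerId + " " + h.itemNames);
    }
}
```

Adding a new element then means adding one enum value and one case, rather than a new boolean plus two if blocks.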
2) The more "professional" solution:
An abstract class that concrete classes have to extend, and which gives them no ability to modify the Stack, etc. NB this examines qName rather than localName. Thus:
public abstract class AbstractSAXHandler extends DefaultHandler {
protected enum PARSE_MODE implements SAXHandlerParseMode {
__UNDEFINED__
};
// abstract: the concrete subclasses must populate...
abstract protected Collection<Enum<?>> getPossibleModes();
//
private Stack<SAXHandlerParseMode> modeBreadcrumbs = new Stack<SAXHandlerParseMode>();
private Collection<Enum<?>> possibleModes;
private Map<String, Enum<?>> nameToEnumMap;
private Map<String, Enum<?>> getNameToEnumMap(){
// lazy creation and population of map
if( nameToEnumMap == null ){
if( possibleModes == null ){
possibleModes = getPossibleModes();
}
nameToEnumMap = new HashMap<String, Enum<?>>();
for( Enum<?> possibleMode : possibleModes ){
nameToEnumMap.put( possibleMode.name(), possibleMode );
}
}
return nameToEnumMap;
}
protected boolean isLegitimateModeName( String name ){
return getNameToEnumMap().containsKey( name );
}
protected SAXHandlerParseMode getParseMode() {
return modeBreadcrumbs.isEmpty()? PARSE_MODE.__UNDEFINED__ : modeBreadcrumbs.peek();
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes)
throws SAXException {
try {
_startElement(uri, localName, qName, attributes);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
// override in subclasses (NB I think checked Exceptions are not a brilliant design choice in Java)
protected void _startElement(String uri, String localName, String qName, Attributes attributes)
throws Exception {
String qNameUC = qName.toUpperCase();
// very undesirable ever to push "UNDEFINED"! But unlikely name for an XML element
assert !qNameUC.equals("__UNDEFINED__") : "Encountered XML element with qName \"__UNDEFINED__\"!";
if( getNameToEnumMap().containsKey( qNameUC )){
Enum<?> newMode = getNameToEnumMap().get( qNameUC );
modeBreadcrumbs.push( (SAXHandlerParseMode)newMode );
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
try {
_endElement(uri, localName, qName);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
// override in subclasses
protected void _endElement(String uri, String localName, String qName) throws Exception {
String qNameUC = qName.toUpperCase();
if( getNameToEnumMap().containsKey( qNameUC )){
modeBreadcrumbs.pop();
}
}
public List<?> showModeBreadcrumbs(){
return org.apache.commons.collections4.ListUtils.unmodifiableList( modeBreadcrumbs );
}
}
interface SAXHandlerParseMode {
}
Then, salient part of concrete subclass:
private enum PARSE_MODE implements SAXHandlerParseMode {
ORDER, DATE, CUSTOMERID, ITEM
};
private Collection<Enum<?>> possibleModes;
@Override
protected Collection<Enum<?>> getPossibleModes() {
// lazy initiation
if (possibleModes == null) {
List<SAXHandlerParseMode> parseModes = new ArrayList<SAXHandlerParseMode>( Arrays.asList(PARSE_MODE.values()) );
possibleModes = new ArrayList<Enum<?>>();
for( SAXHandlerParseMode parseMode : parseModes ){
possibleModes.add( PARSE_MODE.valueOf( parseMode.toString() ));
}
// __UNDEFINED__ mode (from abstract superclass) must be added afterwards
possibleModes.add( AbstractSAXHandler.PARSE_MODE.__UNDEFINED__ );
}
return possibleModes;
}
PS this is a starting point for more sophisticated stuff: for example, you might set up a List<Object> which is kept synchronised with the Stack<PARSE_MODE>: the Objects could then be anything you want, enabling you to "reach back" into the ancestor "XML nodes" of the one you're dealing with. Don't use a Map, though: the Stack can potentially contain the same PARSE_MODE object more than once. This in fact illustrates a fundamental characteristic of all tree-like structures: no individual node (here: parse mode) exists in isolation: its identity is always defined by the entire path leading to it.
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class JXML {
private DocumentBuilder builder;
private Document doc = null;
private DocumentBuilderFactory factory ;
private XPathExpression expr = null;
private XPathFactory xFactory;
private XPath xpath;
private String xmlFile;
public static ArrayList<String> XMLVALUE ;
public JXML(String xmlFile){
this.xmlFile = xmlFile;
}
private void xmlFileSettings(){
try {
factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
xFactory = XPathFactory.newInstance();
xpath = xFactory.newXPath();
builder = factory.newDocumentBuilder();
doc = builder.parse(xmlFile);
}
catch (Exception e){
System.out.println(e);
}
}
public String[] selectQuery(String query){
xmlFileSettings();
ArrayList<String> records = new ArrayList<String>();
try {
expr = xpath.compile(query);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
records.add(nodes.item(i).getNodeValue());
}
return records.toArray(new String[records.size()]);
}
catch (Exception e) {
System.out.println("There is error in query string");
return records.toArray(new String[records.size()]);
}
}
public boolean updateQuery(String query,String value){
xmlFileSettings();
try{
NodeList nodes = (NodeList) xpath.evaluate(query, doc, XPathConstants.NODESET);
for (int idx = 0; idx < nodes.getLength(); idx++) {
nodes.item(idx).setTextContent(value);
}
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(new DOMSource(doc), new StreamResult(new File(this.xmlFile)));
return true;
}catch(Exception e){
System.out.println(e);
return false;
}
}
public static void main(String args[]){
JXML jxml = new JXML("c://user.xml");
jxml.updateQuery("//Order/CustomerId/text()","222");
String result[]=jxml.selectQuery("//Order/Item/*/text()");
for(int i=0;i<result.length;i++){
System.out.println(result[i]);
}
}
}
I am a novice in Java and have written code with which I am struggling to fetch the attribute value inside a tag. For example, in the XML below, id = bk001 doesn't appear in the output:
<book id="bk001">
<author>Hightower, Kim</author>
<title>The First Book</title>
<genre>Fiction</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<date>
<auth_date>
2000-10-01
</auth_date>
<auth_date>
2000-10-05
</auth_date>
</date>
<review>An amazing story of nothing.</review>
</book>
We can expect XML of any type, and we have to convert it into a flat structure, e.g. CSV.
The code I have written:
public class SAX
{
Map<String, String> list = new HashMap<String,String>();
public static void main(String[] args) throws IOException {
new SAX().printElementNames("input/books_1.xml");
}
public void printElementNames(String fileName) throws IOException
{
try {
SAXParserFactory parserFact = SAXParserFactory.newInstance();
SAXParser parser = parserFact.newSAXParser();
DefaultHandler handler = new DefaultHandler()
{
public void startElement(String uri, String lName, String ele, Attributes attributes) throws SAXException {
System.out.print(ele + " ");
if((attributes.getValue("TagValue"))==null)
{
return;
}
else
{
System.out.println(attributes.getValue("TagValue"));
}
}
public void characters(char ch[], int start, int length) throws SAXException {
String value = new String(ch, start, length).trim();
if(value.length() == 0) return;
System.out.println(value);
}
};
parser.parse(new File(fileName), handler);
}catch(Exception e){
e.printStackTrace();
}
}
}
Kindly help me with this. I have searched Stack Overflow but couldn't find anything concrete.
The goal is that the code should work for any valid XML.
Note - We are not allowed to use external libraries like gson etc.
The only attribute that your code is attempting to read is "TagValue", so why would you expect your code to display the value of an "id" attribute?
Replace your startElement with:
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
System.out.print(qName + " ");
for(int i=0; i<attributes.getLength();i++) {
System.out.println(attributes.getQName(i) + " " + attributes.getValue(i));
}
}
Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>", "")
will work, but some things like &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
Use an HTML parser instead of regex. This is dead simple with Jsoup.
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
See also:
RegEx match open tags except XHTML self-contained tags
What are the pros and cons of the leading Java HTML parsers?
XSS prevention in JSP/Servlet web application
If you're writing for Android you can do this...
androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()
If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:
replaceAll("\\<[^>]*>","")
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
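As a rough illustration of that last encoding step, here is a minimal sketch that escapes the five characters that are special in HTML text and attribute values. The class and method names (HtmlEscapeSketch, escapeHtml) are made up for this example; in real code a maintained library routine is usually the safer choice.

```java
// Hypothetical helper: encodes HTML special characters in plain text.
class HtmlEscapeSketch {

    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Whatever survived tag stripping is rendered as text, not markup.
        System.out.println(escapeHtml("<bhey!</b> & more"));
    }
}
```

Run this on the output of whichever stripping method you chose, so that leftovers from malformed input such as <bhey!</b> are displayed literally instead of being interpreted by the browser.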
Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {
}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main(String[] args) {
try {
// the HTML to convert
FileReader in = new FileReader("java-new.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
ref : Remove HTML tags from a file to extract only the TEXT
I think that the simplest way to filter the html tags is:
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
public static String removeTags(String string) {
if (string == null || string.length() == 0) {
return string;
}
Matcher m = REMOVE_TAGS.matcher(string);
return m.replaceAll("");
}
Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).
Source htmlSource = new Source(htmlText);
Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRend = new Renderer(htmlSeg);
System.out.println(htmlRend.toString());
On Android, try this:
String result = Html.fromHtml(html).toString();
The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):
It removes line breaks from the text
It converts the text &lt;script&gt; into <script>
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);
Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.
And here is a bunch of test cases (input to output):
{"regular string", "regular string"},
{"A link", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
If you find a way to make it better, please let me know.
HTML Escaping is really hard to do right- I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.
This should work -
use this
text.replaceAll("<.*?>", " ") -> This will replace all the HTML tags with a space.
and this
text.replaceAll("&.*?;", "") -> This will replace all the character entities which start with "&" and end with ";", like &nbsp;, &amp;, &gt; etc.
You can simply use the Android's default HTML filter
public String htmlToStringFilter(String textToFilter){
return Html.fromHtml(textToFilter).toString();
}
The above method will return the HTML filtered string for your input.
You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.
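A minimal sketch of that two-pass idea, using only java.util.regex. The names (LineBreakSketch, stripKeepingBreaks) are hypothetical, and the tag-stripping pass is the same naive regex discussed elsewhere in this thread, so all of its caveats about malformed input still apply.

```java
import java.util.regex.Pattern;

// Hypothetical helper: turn <br> and </p> into newlines, then strip other tags.
class LineBreakSketch {

    // <br>, <br/>, <br /> and </p>, in any letter case
    private static final Pattern BREAKS = Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");
    // anything that looks like a tag
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    static String stripKeepingBreaks(String html) {
        String withNewlines = BREAKS.matcher(html).replaceAll("\n");
        return TAGS.matcher(withNewlines).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripKeepingBreaks("<p>one</p><p>two<br/>three</p>"));
    }
}
```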
The only way I can think of removing HTML tags but leaving non-HTML between angle brackets would be check against a list of HTML tags. Something along these lines...
replaceAll("\\<[\\s]*tag[^>]*>","")
Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
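Here is one way the "check against a list of tags" idea might look when several tag names are folded into a single pattern. The tag list is a deliberately short, made-up sample, and the names (KnownTagStripSketch, stripKnownTags) are hypothetical. Unlike the one-liner above, this variant does not allow whitespace right after the opening bracket, which is my own choice to avoid treating text like "a < b" as the start of a tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes only tags whose name is on a known list.
class KnownTagStripSketch {

    // Sample list only; a real one would cover every tag you expect to see.
    private static final String TAG_NAMES = "a|b|i|u|p|br|div|span|script";

    // "<" or "</", a known name, a word boundary, then anything up to ">"
    private static final Pattern KNOWN_TAGS = Pattern.compile(
            "</?(?:" + TAG_NAMES + ")\\b[^<>]*>", Pattern.CASE_INSENSITIVE);

    static String stripKnownTags(String text) {
        return KNOWN_TAGS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripKnownTags("x < y and <b>bold</b> <notatag>"));
    }
}
```

As noted above, the result is still not sanitized; it only narrows what gets removed.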
One more way can be to use com.google.gdata.util.common.html.HtmlToText class
like
MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));
This is not bullet proof code though and when I run it on wikipedia entries I am getting style info also. However I believe for small/simple jobs this would be effective.
The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".
So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see TagSoup
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
for (int idx = 0; idx < length; idx++) {
sb.append(ch[idx+start]);
}
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
// append only the reported range, not the parser's whole buffer
sb.append(ch, start, length);
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}
Here's a slightly more fleshed out update to try to handle some formatting for breaks and lists. I used Amaya's output as a guide.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HTML2Text extends HTMLEditorKit.ParserCallback {
private static final Logger log = Logger
.getLogger(Logger.GLOBAL_LOGGER_NAME);
private StringBuffer stringBuffer;
private Stack<IndexType> indentStack;
public static class IndexType {
public String type;
public int counter; // used for ordered lists
public IndexType(String type) {
this.type = type;
counter = 0;
}
}
public HTML2Text() {
stringBuffer = new StringBuffer();
indentStack = new Stack<IndexType>();
}
public static String convert(String html) {
HTML2Text parser = new HTML2Text();
Reader in = new StringReader(html);
try {
// the HTML to convert
parser.parse(in);
} catch (Exception e) {
log.severe(e.getMessage());
} finally {
try {
in.close();
} catch (IOException ioe) {
// this should never happen
}
}
return parser.getText();
}
public void parse(Reader in) throws IOException {
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("StartTag:" + t.toString());
if (t.toString().equals("p")) {
if (stringBuffer.length() > 0
&& !stringBuffer.substring(stringBuffer.length() - 1)
.equals("\n")) {
newLine();
}
newLine();
} else if (t.toString().equals("ol")) {
indentStack.push(new IndexType("ol"));
newLine();
} else if (t.toString().equals("ul")) {
indentStack.push(new IndexType("ul"));
newLine();
} else if (t.toString().equals("li")) {
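// Guard against an <li> that appears outside any list
IndexType parent = indentStack.isEmpty() ? null : indentStack.peek();
if (parent != null && parent.type.equals("ol")) {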
String numberString = "" + (++parent.counter) + ".";
stringBuffer.append(numberString);
for (int i = 0; i < (4 - numberString.length()); i++) {
stringBuffer.append(" ");
}
} else {
stringBuffer.append("* ");
}
indentStack.push(new IndexType("li"));
} else if (t.toString().equals("dl")) {
newLine();
} else if (t.toString().equals("dt")) {
newLine();
} else if (t.toString().equals("dd")) {
indentStack.push(new IndexType("dd"));
newLine();
}
}
private void newLine() {
stringBuffer.append("\n");
for (int i = 0; i < indentStack.size(); i++) {
stringBuffer.append(" ");
}
}
public void handleEndTag(HTML.Tag t, int pos) {
log.info("EndTag:" + t.toString());
if (t.toString().equals("p")) {
newLine();
} else if (t.toString().equals("ol")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("ul")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("li")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("dd")) {
indentStack.pop();
}
}
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("SimpleTag:" + t.toString());
if (t.toString().equals("br")) {
newLine();
}
}
public void handleText(char[] text, int pos) {
log.info("Text:" + new String(text));
stringBuffer.append(text);
}
public String getText() {
return stringBuffer.toString();
}
public static void main(String args[]) {
String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list <p>with</p> <ul> <li>another</li> <li>list <dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd> <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl> <dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd> <dt>cvbcv</dt> <dt></dt> </dl> <dl> <dt></dt> </dl></li> <li>cool</li> </ul> <p>stuff</p> </li> <li>cool</li></ol><p></p></body></html>";
System.out.println(convert(html));
}
}
Alternatively, one can use HtmlCleaner:
private CharSequence removeHtmlFrom(String html) {
return new HtmlCleaner().clean(html).getText();
}
Use Html.fromHtml
The supported HTML tags are:
<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>
<div align="…">, <em>, <font size="…" color="…" face="…">
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>
<i>, <p>, <small>
<strike>, <strong>, <sub>, <sup>, <tt>, <u>
As per Android's official documentation, any <img> tags in the HTML will display as a generic replacement image which your program can then go through and replace with real images.
An overload of the Html.fromHtml method takes an Html.ImageGetter and an Html.TagHandler as arguments in addition to the text to parse.
Example
String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
Then
Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
Output
This is about me text that the user can put into their profile
Here is one more variant that removes HTML tags, HTML entities, and runs of two or more spaces from the content:
content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.
I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:
noHTMLString.replaceAll("\\&.*?\\;", "");
instead of this:
html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;","");
It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.
static public String getUrlContentsAsText(String url) {
String content = "";
StringBean stringBean = new StringBean();
stringBean.setURL(url);
content = stringBean.getStrings();
return content;
}
Here is another way to do it:
public static String removeHTML(String input) {
StringBuilder s = new StringBuilder();
boolean inTag = false;
// Copy every character that is not between '<' and '>'
for (int i = 0; i < input.length(); i++) {
char c = input.charAt(i);
if (c == '<') {
inTag = true;
} else if (c == '>' && inTag) {
inTag = false;
} else if (!inTag) {
s.append(c);
}
}
return s.toString();
}
One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:
InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());
One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".
String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
html = html.replace(tag, NEW_LINE_MARK+tag);
}
String text = Jsoup.parse(html).text();
text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");
classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()
Sometimes the HTML string comes from XML with escaped entities such as &lt;. When using Jsoup we need to parse it first and then clean it.
Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
Using only Jsoup.parse(htmlstrl).text() does not remove the tags in this case.
Try this for javascript:
const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
You can use this method to remove the HTML tags from the String,
public static String stripHtmlTags(String html) {
return html.replaceAll("<.*?>", "");
}
My 5 cents:
String[] temp = yourString.split("&");
String tmp = "";
if (temp.length > 1) {
for (int i = 0; i < temp.length; i++) {
tmp += temp[i] + "&";
}
yourString = tmp.substring(0, tmp.length() - 1);
}
To get formatted plain HTML text you can do this:
String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
To get formatted plain text, replace <br/> with \n and change the last line to:
nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");
I know it has been a while since this question was asked, but I found another solution; this is what worked for me:
Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source= new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml= m.replaceAll("");
I would appreciate any help on this.
This is the first handler I have written.
I have a REST web service returning XML of links. It has quite a simple structure and is not deep.
I wrote a handler for this:
public class SAXHandlerLnk extends DefaultHandler {
public List<Link> lnkList = new ArrayList<>();
Link lnk = null;
private StringBuilder content = new StringBuilder();
@Override
//Triggered when the start of tag is found.
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equals("link")) {
lnk = new Link();
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equals("link")) {
lnkList.add(lnk);
}
else if (qName.equals("applicationCode")) {
lnk.applicationCode = content.toString();
}
else if (qName.equals("moduleCode")) {
lnk.moduleCode = content.toString();
}
else if (qName.equals("linkCode")) {
lnk.linkCode = content.toString();
}
else if (qName.equals("languageCode")) {
lnk.languageCode = content.toString();
}
else if (qName.equals("value")) {
lnk.value = content.toString();
}
else if (qName.equals("illustrationUrl")) {
lnk.illustrationUrl = content.toString();
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
content.append(ch, start, length);
}
}
Some elements in the returned XML can be empty. When this happens, my handler unfortunately assigns the previous value to the lnk object. So when an element such as illustrationUrl is empty in the XML, lnk.illustrationUrl ends up equal to lnk.value.
Link{applicationCode='onedownload', moduleCode='onedownload',...}
In the above example, I would like moduleCode to be empty or null, because in XML it is an empty tag.
Here is the calling class:
public class XMLRepositoryRestLinksFilterSAXParser {
public static void main(String[] args) throws Exception {
SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandlerLnk handler = new SAXHandlerLnk();
parser.parse({URL}, handler);
for ( Link lnk : handler.lnkList){
System.out.println(lnk);
}
}
}
As stated in my comment, you'd do the following. The callbacks are usually called in startElement, characters, (nested?), characters, endElement order, where (nested?) represents an optional repeat of the entire sequence.
@Override
//Triggered when the start of tag is found.
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
content = null;
if (qName.equals("link")) {
lnk = new Link();
}
}
Note that characters may be called multiple times for a single XML element in your document, so your current code might fail to capture all content. You'd be better off using a StringBuilder instead of a String object to hold your character content and append to it. See this answer for an example.
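The advice above can be sketched with a StringBuilder that is cleared at every start tag. This is only a minimal illustration: the class name ResettingHandler and the map used in place of the Link class are hypothetical, and it assumes the child elements of link contain text only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: stores the text of each child of <link> in a map.
class ResettingHandler extends DefaultHandler {
    final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder content = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Clear the buffer so an empty element cannot inherit earlier text.
        content.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so always append.
        content.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!qName.equals("link")) {
            fields.put(qName, content.toString());
        }
    }

    // Convenience entry point for the sketch: parse a string and return the fields.
    static Map<String, String> parse(String xml) throws Exception {
        ResettingHandler handler = new ResettingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }
}
```

With this reset in place, an empty element such as <moduleCode/> yields an empty string instead of the previous element's text.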
I'm writing a client which needs to read multiple consecutive small XML documents over a socket. I can assume that the encoding is always UTF-8 and that there is optionally delimiting whitespace between documents. The documents should ultimately go into DOM objects. What is the best way to accomplish this?
The essence of the problem is that the parsers expect a single document in the stream and consider the rest of the content junk. I thought that I could artificially end the document by tracking the element depth, and creating a new reader using the existing input stream. E.g. something like:
// Broken
public void parseInputStream(InputStream inputStream) throws Exception
{
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLOutputFactory xof = XMLOutputFactory.newInstance();
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.newDocument();
XMLEventWriter domWriter = xof.createXMLEventWriter(new DOMResult(doc));
XMLStreamReader xmlStreamReader = factory.createXMLStreamReader(inputStream);
XMLEventReader reader = factory.createXMLEventReader(xmlStreamReader);
int depth = 0;
while (reader.hasNext()) {
XMLEvent evt = reader.nextEvent();
domWriter.add(evt);
switch (evt.getEventType()) {
case XMLEvent.START_ELEMENT:
depth++;
break;
case XMLEvent.END_ELEMENT:
depth--;
if (depth == 0)
{
domWriter.add(eventFactory.createEndDocument());
System.out.println(doc);
reader.close();
xmlStreamReader.close();
xmlStreamReader = factory.createXMLStreamReader(inputStream);
reader = factory.createXMLEventReader(xmlStreamReader);
doc = documentBuilder.newDocument();
domWriter = xof.createXMLEventWriter(new DOMResult(doc));
domWriter.add(eventFactory.createStartDocument());
}
break;
}
}
}
However, running this on input such as <a></a><b></b><c></c> prints the first document and then throws an XMLStreamException. What's the right way to do this?
Clarification: Unfortunately the protocol is fixed by the server and cannot be changed, so prepending a length or wrapping the contents would not work.
Length-prefix each document (in bytes).
Read the length of the first document from the socket
Read that much data from the socket, dumping it into a ByteArrayOutputStream
Create a ByteArrayInputStream from the results
Parse that ByteArrayInputStream to get the first document
Repeat for the second document etc
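The steps above could look roughly like the sketch below. It assumes a framing that the answer does not specify, namely a 4-byte big-endian length before each document, and the class name LengthPrefixedReader is hypothetical. It reads straight into a byte array with readFully rather than going through a ByteArrayOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class LengthPrefixedReader {
    // Assumed framing: 4-byte big-endian length, then that many bytes of XML.
    static List<Document> readAll(InputStream socketStream) throws Exception {
        DataInputStream in = new DataInputStream(socketStream);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<Document> docs = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = in.readInt();
            } catch (EOFException endOfStream) {
                break; // no more documents
            }
            byte[] payload = new byte[length];
            in.readFully(payload); // blocks until the whole document has arrived
            docs.add(builder.parse(new ByteArrayInputStream(payload)));
        }
        return docs;
    }
}
```

Because each parser only ever sees one complete document, nothing after the root element can confuse it.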
IIRC, XML documents can have comments and processing-instructions at the end, so there's no real way of telling exactly when you have come to the end of the file.
A couple of ways of handling the situation have already been mentioned. Another alternative is to put in an illegal character or byte into the stream, such as NUL or zero. This has the advantage that you don't need to alter the documents and you never need to buffer an entire file.
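As a rough sketch of the delimiter idea, a wrapper stream can report end-of-stream whenever it meets a NUL byte, so each parser sees exactly one document. The class and method names here are hypothetical, and the sketch assumes the sender really does write a single 0 byte between documents.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: reports end-of-stream at each NUL byte.
class NulDelimitedStream extends InputStream {
    private final InputStream source;
    private boolean atDelimiter;
    private boolean sourceExhausted;

    NulDelimitedStream(InputStream source) {
        this.source = source;
    }

    @Override
    public int read() throws IOException {
        if (atDelimiter || sourceExhausted) {
            return -1;
        }
        int b = source.read();
        if (b == -1) {
            sourceExhausted = true;
        } else if (b == 0) {
            atDelimiter = true;
            return -1; // pretend the stream ended at the delimiter
        }
        return b;
    }

    // Moves past the delimiter; returns false once the source has ended.
    boolean nextDocument() {
        if (sourceExhausted) {
            return false;
        }
        atDelimiter = false;
        return true;
    }
}
```

A caller would parse from the wrapper, call nextDocument(), and parse again while it returns true. A trailing delimiter after the last document would still need separate handling.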
just change to whatever stream
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
public class LogParser {
private XMLInputFactory inputFactory = null;
private XMLStreamReader xmlReader = null;
InputStream is;
private int depth;
private QName rootElement;
private static class XMLStream extends InputStream
{
InputStream delegate;
StringReader startroot = new StringReader("<root>");
StringReader endroot = new StringReader("</root>");
XMLStream(InputStream delegate)
{
this.delegate = delegate;
}
public int read() throws IOException {
int c = startroot.read();
if(c==-1)
{
c = delegate.read();
}
if(c==-1)
{
c = endroot.read();
}
return c;
}
}
public LogParser() {
inputFactory = XMLInputFactory.newInstance();
}
public void read() throws Exception {
is = new XMLStream(new FileInputStream(new File(
"./myfile.log")));
xmlReader = inputFactory.createXMLStreamReader(is);
while (xmlReader.hasNext()) {
printEvent(xmlReader);
xmlReader.next();
}
xmlReader.close();
}
public void printEvent(XMLStreamReader xmlr) throws Exception {
switch (xmlr.getEventType()) {
case XMLStreamConstants.END_DOCUMENT:
System.out.println("finished");
break;
case XMLStreamConstants.START_ELEMENT:
System.out.print("<");
printName(xmlr);
printNamespaces(xmlr);
printAttributes(xmlr);
System.out.print(">");
if(rootElement==null && depth==1)
{
rootElement = xmlr.getName();
}
depth++;
break;
case XMLStreamConstants.END_ELEMENT:
System.out.print("</");
printName(xmlr);
System.out.print(">");
depth--;
if(depth==1 && rootElement.equals(xmlr.getName()))
{
rootElement=null;
System.out.println("finished element");
}
break;
case XMLStreamConstants.SPACE:
case XMLStreamConstants.CHARACTERS:
int start = xmlr.getTextStart();
int length = xmlr.getTextLength();
System.out
.print(new String(xmlr.getTextCharacters(), start, length));
break;
case XMLStreamConstants.PROCESSING_INSTRUCTION:
System.out.print("<?");
if (xmlr.hasText())
System.out.print(xmlr.getText());
System.out.print("?>");
break;
case XMLStreamConstants.CDATA:
System.out.print("<![CDATA[");
start = xmlr.getTextStart();
length = xmlr.getTextLength();
System.out
.print(new String(xmlr.getTextCharacters(), start, length));
System.out.print("]]>");
break;
case XMLStreamConstants.COMMENT:
System.out.print("<!--");
if (xmlr.hasText())
System.out.print(xmlr.getText());
System.out.print("-->");
break;
case XMLStreamConstants.ENTITY_REFERENCE:
System.out.print(xmlr.getLocalName() + "=");
if (xmlr.hasText())
System.out.print("[" + xmlr.getText() + "]");
break;
case XMLStreamConstants.START_DOCUMENT:
System.out.print("<?xml");
System.out.print(" version='" + xmlr.getVersion() + "'");
System.out.print(" encoding='" + xmlr.getCharacterEncodingScheme()
+ "'");
if (xmlr.isStandalone())
System.out.print(" standalone='yes'");
else
System.out.print(" standalone='no'");
System.out.print("?>");
break;
}
}
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
new LogParser().read();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private static void printName(XMLStreamReader xmlr) {
if (xmlr.hasName()) {
System.out.print(getName(xmlr));
}
}
private static String getName(XMLStreamReader xmlr) {
if (xmlr.hasName()) {
String prefix = xmlr.getPrefix();
String uri = xmlr.getNamespaceURI();
String localName = xmlr.getLocalName();
return getName(prefix, uri, localName);
}
return null;
}
private static String getName(String prefix, String uri, String localName) {
String name = "";
if (uri != null && !("".equals(uri)))
name += "['" + uri + "']:";
if (prefix != null)
name += prefix + ":";
if (localName != null)
name += localName;
return name;
}
private static void printAttributes(XMLStreamReader xmlr) {
for (int i = 0; i < xmlr.getAttributeCount(); i++) {
printAttribute(xmlr, i);
}
}
private static void printAttribute(XMLStreamReader xmlr, int index) {
String prefix = xmlr.getAttributePrefix(index);
String namespace = xmlr.getAttributeNamespace(index);
String localName = xmlr.getAttributeLocalName(index);
String value = xmlr.getAttributeValue(index);
System.out.print(" ");
System.out.print(getName(prefix, namespace, localName));
System.out.print("='" + value + "'");
}
private static void printNamespaces(XMLStreamReader xmlr) {
for (int i = 0; i < xmlr.getNamespaceCount(); i++) {
printNamespace(xmlr, i);
}
}
private static void printNamespace(XMLStreamReader xmlr, int index) {
String prefix = xmlr.getNamespacePrefix(index);
String uri = xmlr.getNamespaceURI(index);
System.out.print(" ");
if (prefix == null)
System.out.print("xmlns='" + uri + "'");
else
System.out.print("xmlns:" + prefix + "='" + uri + "'");
}
}
A simple solution is to wrap the documents on the sending side in a new root element:
<?xml version="1.0"?>
<documents>
... document 1 ...
... document 2 ...
</documents>
You must make sure that you don't include the XML header (<?xml ...?>), though. If all documents use the same encoding, this can be accomplished with a simple filter which just ignores the first line of each document if it starts with <?xml
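A minimal sketch of this filter, applied on the receiving side to text that has already been read into a String (an assumption made for brevity; a streaming version would need more care). The class name WrappedDocuments is hypothetical, and the pattern does not handle DOCTYPE declarations or a declaration-like string inside CDATA.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WrappedDocuments {
    // Drops every XML declaration and wraps what is left in one root element.
    static Document parseAll(String concatenated) throws Exception {
        // "\\s" after "xml" keeps processing instructions such as <?xml-stylesheet ...?>.
        String body = concatenated.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        String wrapped = "<documents>" + body + "</documents>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
    }
}
```

Each original document then appears as a child element of the documents root.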
Found this forum message (which you probably already saw), which has a solution by wrapping the input stream and testing for one of two ascii characters (see post).
You could try an adaptation on this by first converting to use a reader (for proper character encoding) and then doing element counting until you reach the closing element, at which point you trigger the EOM.
Hi
I also had this problem at work (so I won't post the resulting code). The most elegant solution that I could think of, and which works pretty nicely imo, is as follows:
Create a class for example DocumentSplittingInputStream which extends InputStream and takes the underlying inputstream in its constructor (or gets set after construction...).
Add a field with a byte array closeTag containing the bytes of the closing root node you are looking for.
Add a field int called matchCount or something, initialised to zero.
Add a field boolean called underlyingInputStreamNotFinished, initialised to true
On the read() implementation:
Check if matchCount == closeTag.length, if it does, set matchCount to -1, return -1
If matchCount == -1, set matchCount = 0, call read() on the underlying inputstream until you get -1 or '<' (the xml declaration of the next document on the stream) and return it. Note that for all I know the xml spec allows comments after the document element, but I knew I was not going to get that from the source so did not bother handling it - if you can not be sure you'll need to change the "gobble" slightly.
Otherwise read an int from the underlying inputstream (if it equals closeTag[matchCount] then increment matchCount, if it doesn't then reset matchCount to zero) and return the newly read byte
Add a method which returns the boolean on whether the underlying stream has closed.
All reads on the underlying input stream should go through a separate method where it checks if the value read is -1 and if so, sets the field "underlyingInputStreamNotFinished" to false.
I may have missed some minor points but i'm sure you get the picture.
Then in the using code you do something like, if you are using xstream:
DocumentSplittingInputStream dsis = new DocumentSplittingInputStream(underlyingInputStream);
while (dsis.underlyingInputStreamNotFinished()) {
MyObject mo = xstream.fromXML(dsis);
mo.doSomething(); // or something.doSomething(mo);
}
David
I had to do something like this, and during my research on how to approach it I found this thread. Even though it is quite old, I just replied (to myself) here, wrapping everything in its own Reader for simpler use.
I was faced with a similar problem. A web service I'm consuming will (in some cases) return multiple xml documents in response to a single HTTP GET request. I could read the entire response into a String and split it, but instead I implemented a splitting input stream based on user467257's post above. Here is the code:
public class AnotherSplittingInputStream extends InputStream {
private final InputStream realStream;
private final byte[] closeTag;
private int matchCount;
private boolean realStreamFinished;
private boolean reachedCloseTag;
public AnotherSplittingInputStream(InputStream realStream, String closeTag) {
this.realStream = realStream;
this.closeTag = closeTag.getBytes();
}
@Override
public int read() throws IOException {
if (reachedCloseTag) {
return -1;
}
if (matchCount == closeTag.length) {
matchCount = 0;
reachedCloseTag = true;
return -1;
}
int ch = realStream.read();
if (ch == -1) {
realStreamFinished = true;
}
else if (ch == closeTag[matchCount]) {
matchCount++;
} else {
matchCount = 0;
}
return ch;
}
public boolean hasMoreData() {
if (realStreamFinished == true) {
return false;
} else {
reachedCloseTag = false;
return true;
}
}
}
And to use it:
String xml =
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<root>first root</root>" +
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<root>second root</root>";
ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());
AnotherSplittingInputStream splitter = new AnotherSplittingInputStream(is, "</root>");
BufferedReader reader = new BufferedReader(new InputStreamReader(splitter));
while (splitter.hasMoreData()) {
System.out.println("Starting next stream");
String line = null;
while ((line = reader.readLine()) != null) {
System.out.println("line ["+line+"]");
}
}
I use a JAXB approach to unmarshal multiple messages from a single stream:
MultiInputStream.java
public class MultiInputStream extends InputStream {
private final Reader source;
private final StringReader startRoot = new StringReader("<root>");
private final StringReader endRoot = new StringReader("</root>");
public MultiInputStream(Reader source) {
this.source = source;
}
@Override
public int read() throws IOException {
int count = startRoot.read();
if (count == -1) {
count = source.read();
}
if (count == -1) {
count = endRoot.read();
}
return count;
}
}
MultiEventReader.java
public class MultiEventReader implements XMLEventReader {
private final XMLEventReader reader;
private boolean isXMLEvent = false;
private int level = 0;
public MultiEventReader(XMLEventReader reader) throws XMLStreamException {
this.reader = reader;
startXML();
}
private void startXML() throws XMLStreamException {
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
return;
}
}
}
public boolean hasNextXML() {
return reader.hasNext();
}
public void nextXML() throws XMLStreamException {
while (reader.hasNext()) {
XMLEvent event = reader.peek();
if (event.isStartElement()) {
isXMLEvent = true;
return;
}
reader.nextEvent();
}
}
@Override
public XMLEvent nextEvent() throws XMLStreamException {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
level++;
}
if (event.isEndElement()) {
level--;
if (level == 0) {
isXMLEvent = false;
}
}
return event;
}
@Override
public boolean hasNext() {
return isXMLEvent;
}
@Override
public XMLEvent peek() throws XMLStreamException {
XMLEvent event = reader.peek();
if (level == 0) {
while (event != null && !event.isStartElement() && reader.hasNext()) {
reader.nextEvent();
event = reader.peek();
}
}
return event;
}
@Override
public String getElementText() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public XMLEvent nextTag() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public Object getProperty(String name) throws IllegalArgumentException {
throw new NotImplementedException();
}
@Override
public void close() throws XMLStreamException {
throw new NotImplementedException();
}
@Override
public Object next() {
throw new NotImplementedException();
}
@Override
public void remove() {
throw new NotImplementedException();
}
}
Message.java
@XmlAccessorType(XmlAccessType.FIELD)
@XmlRootElement(name = "Message")
public class Message {
public Message() {
}
@XmlAttribute(name = "ID", required = true)
protected long id;
public long getId() {
return id;
}
public void setId(long id) {
this.id = id;
}
@Override
public String toString() {
return "Message{id=" + id + '}';
}
}
Read multiple messages:
public static void main(String[] args) throws Exception{
StringReader stringReader = new StringReader(
"<Message ID=\"123\" />\n" +
"<Message ID=\"321\" />"
);
JAXBContext context = JAXBContext.newInstance(Message.class);
Unmarshaller unmarshaller = context.createUnmarshaller();
XMLInputFactory inputFactory = XMLInputFactory.newFactory();
MultiInputStream multiInputStream = new MultiInputStream(stringReader);
XMLEventReader xmlEventReader = inputFactory.createXMLEventReader(multiInputStream);
MultiEventReader multiEventReader = new MultiEventReader(xmlEventReader);
while (multiEventReader.hasNextXML()) {
Object message = unmarshaller.unmarshal(multiEventReader);
System.out.println(message);
multiEventReader.nextXML();
}
}
results:
Message{id=123}
Message{id=321}
Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>", "")
will work, but some things like &amp; won't be converted correctly and non-HTML between the two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
Use a HTML parser instead of regex. This is dead simple with Jsoup.
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
See also:
RegEx match open tags except XHTML self-contained tags
What are the pros and cons of the leading Java HTML parsers?
XSS prevention in JSP/Servlet web application
If you're writing for Android you can do this...
androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()
If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:
replaceAll("\\<[^>]*>","")
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
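For the encoding step, a small hand-rolled escaper might look like the sketch below. The class name HtmlEscaper is hypothetical and the character set covers only the five characters that matter in element content and quoted attribute values; a maintained library is usually a safer choice in production.

```java
class HtmlEscaper {
    // Replaces the HTML special characters with their entity forms.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied to the malformed input mentioned above, <bhey!</b> comes out as inert text rather than markup.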
Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {
}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main(String[] args) {
try {
// the HTML to convert
FileReader in = new FileReader("java-new.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
ref : Remove HTML tags from a file to extract only the TEXT
I think that the simplest way to filter the HTML tags is:
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
public static String removeTags(String string) {
if (string == null || string.length() == 0) {
return string;
}
Matcher m = REMOVE_TAGS.matcher(string);
return m.replaceAll("");
}
Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).
Source htmlSource = new Source(htmlText);
Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRend = new Renderer(htmlSeg);
System.out.println(htmlRend.toString());
On Android, try this:
String result = Html.fromHtml(html).toString();
The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):
It removes line breaks from the text
It converts the text &lt;script&gt; into <script>
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
// breaks multi-level escaping, preventing &amp;lt;script&amp;gt; from being rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; from being rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);
Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.
And here is a bunch of test cases (input to output):
{"regular string", "regular string"},
{"A link", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
If you find a way to make it better, please let me know.
HTML Escaping is really hard to do right- I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.
This should work -
use this
text.replaceAll("<.*?>", " ") -> this will replace all the HTML tags with a space.
and this
text.replaceAll("&.*?;", "") -> this will replace all the entities which start with "&" and end with ";", like &nbsp;, &amp;, &gt; etc.
You can simply use the Android's default HTML filter
public String htmlToStringFilter(String textToFilter){
return Html.fromHtml(textToFilter).toString();
}
The above method will return the HTML filtered string for your input.
You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.
The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets would be to check against a list of HTML tags. Something along these lines...
replaceAll("\\<[\\s]*tag[^>]*>","")
Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
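The "list of HTML tags" idea can be sketched as a single alternation. This is only an illustration under my own assumptions: the class name and the short tag list are hypothetical, and unlike the pattern above it does not allow whitespace between "<" and the tag name (HTML does not either), which keeps text like "a < b" from being mistaken for a <b> tag.

```java
import java.util.regex.Pattern;

class KnownTagStripper {
    // Hypothetical, deliberately short list of tag names; extend as needed.
    private static final String NAMES = "p|br|b|i|em|strong|a|div|span|ul|ol|li";

    // "<" or "</", a known name, a word boundary so that "b" does not match
    // "<body>", then anything up to the closing ">" that is not another bracket.
    private static final Pattern KNOWN_TAG = Pattern.compile(
            "</?(?:" + NAMES + ")\\b[^<>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return KNOWN_TAG.matcher(html).replaceAll("");
    }
}
```

With this list, KnownTagStripper.strip("if a < b then <b>bold</b> <notatag> stays") gives "if a < b then bold <notatag> stays". As with any regex approach, the output should not be considered sanitized.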
One more way can be to use com.google.gdata.util.common.html.HtmlToText class
like
MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));
This is not bulletproof code though, and when I run it on Wikipedia entries I also get style info. However, I believe for small/simple jobs this would be effective.
The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".
So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see TagSoup
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
// append only the slice of the buffer reported by the parser
sb.append(ch, start, length);
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
// append only the reported slice, not the whole buffer
sb.append(ch, start, length);
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}
Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HTML2Text extends HTMLEditorKit.ParserCallback {
private static final Logger log = Logger
.getLogger(Logger.GLOBAL_LOGGER_NAME);
private StringBuffer stringBuffer;
private Stack<IndexType> indentStack;
public static class IndexType {
public String type;
public int counter; // used for ordered lists
public IndexType(String type) {
this.type = type;
counter = 0;
}
}
public HTML2Text() {
stringBuffer = new StringBuffer();
indentStack = new Stack<IndexType>();
}
public static String convert(String html) {
HTML2Text parser = new HTML2Text();
Reader in = new StringReader(html);
try {
// the HTML to convert
parser.parse(in);
} catch (Exception e) {
log.severe(e.getMessage());
} finally {
try {
in.close();
} catch (IOException ioe) {
// this should never happen
}
}
return parser.getText();
}
public void parse(Reader in) throws IOException {
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("StartTag:" + t.toString());
if (t.toString().equals("p")) {
if (stringBuffer.length() > 0
&& !stringBuffer.substring(stringBuffer.length() - 1)
.equals("\n")) {
newLine();
}
newLine();
} else if (t.toString().equals("ol")) {
indentStack.push(new IndexType("ol"));
newLine();
} else if (t.toString().equals("ul")) {
indentStack.push(new IndexType("ul"));
newLine();
} else if (t.toString().equals("li")) {
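// guard against an <li> that appears outside any list (peek() would throw on an empty stack)
IndexType parent = indentStack.isEmpty() ? null : indentStack.peek();
if (parent != null && parent.type.equals("ol")) {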
String numberString = "" + (++parent.counter) + ".";
stringBuffer.append(numberString);
for (int i = 0; i < (4 - numberString.length()); i++) {
stringBuffer.append(" ");
}
} else {
stringBuffer.append("* ");
}
indentStack.push(new IndexType("li"));
} else if (t.toString().equals("dl")) {
newLine();
} else if (t.toString().equals("dt")) {
newLine();
} else if (t.toString().equals("dd")) {
indentStack.push(new IndexType("dd"));
newLine();
}
}
private void newLine() {
stringBuffer.append("\n");
for (int i = 0; i < indentStack.size(); i++) {
stringBuffer.append(" ");
}
}
public void handleEndTag(HTML.Tag t, int pos) {
log.info("EndTag:" + t.toString());
if (t.toString().equals("p")) {
newLine();
} else if (t.toString().equals("ol")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("ul")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("li")) {
indentStack.pop();
newLine();
} else if (t.toString().equals("dd")) {
indentStack.pop();
}
}
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
log.info("SimpleTag:" + t.toString());
if (t.toString().equals("br")) {
newLine();
}
}
public void handleText(char[] text, int pos) {
log.info("Text:" + new String(text));
stringBuffer.append(text);
}
public String getText() {
return stringBuffer.toString();
}
public static void main(String args[]) {
String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list <p>with</p> <ul> <li>another</li> <li>list <dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd> <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl> <dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd> <dt>cvbcv</dt> <dt></dt> </dl> <dl> <dt></dt> </dl></li> <li>cool</li> </ul> <p>stuff</p> </li> <li>cool</li></ol><p></p></body></html>";
System.out.println(convert(html));
}
}
Alternatively, one can use HtmlCleaner:
private CharSequence removeHtmlFrom(String html) {
return new HtmlCleaner().clean(html).getText();
}
Use Html.fromHtml
The HTML tags it handles are
<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>
<div align="…">, <em>, <font size="…" color="…" face="…">
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>
<i>, <p>, <small>
<strike>, <strong>, <sub>, <sup>, <tt>, <u>
According to Android's official documentation, any <img> tags in the HTML will display as a generic replacement image, which your program can then go through and replace with real images.
An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments as well as the text to parse.
Example
String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
Then
Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
Output
This is about me text that the user can put into their profile
Here is one more variant that removes all HTML tags, HTML entities, and runs of two or more spaces in the content:
content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.
I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:
noHTMLString.replaceAll("\\&.*?\\;", "");
instead of this:
html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");
It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.
static public String getUrlContentsAsText(String url) {
String content = "";
StringBean stringBean = new StringBean();
stringBean.setURL(url);
content = stringBean.getStrings();
return content;
}
Here is another way to do it:
public static String removeHTML(String input) {
StringBuilder s = new StringBuilder();
boolean inTag = false;
for (int i = 0; i < input.length(); i++) {
char c = input.charAt(i);
if (c == '<') {
// start skipping at the opening bracket
inTag = true;
} else if (c == '>' && inTag) {
// resume copying after the closing bracket
inTag = false;
} else if (!inTag) {
s.append(c);
}
}
return s.toString();
}
One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desired in certain situations:
InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());
One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".
String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
html = html.replace(tag, NEW_LINE_MARK+tag);
}
String text = Jsoup.parse(html).text();
text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");
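The same marker idea can be sketched without JSoup, which may make the mechanics easier to follow. This is an illustration only: it swaps Jsoup.parse(html).text() for a naive regex strip plus whitespace collapsing (my stand-in for what JSoup does, not its actual behavior), it emits a single newline per marker rather than two, and the class name is hypothetical.

```java
class NewlineMarkerSketch {
    // Must be a string that cannot occur in real input.
    private static final String MARK = "NEWLINESTART1234567890NEWLINEEND";
    private static final String[] BREAKING_TAGS = {
        "</p>", "<br/>", "<br>", "</li>", "</h1>", "</h2>", "</h3>"
    };

    static String toText(String html) {
        // 1. Put the marker in front of every tag that should end a line.
        for (String tag : BREAKING_TAGS) {
            html = html.replace(tag, MARK + tag);
        }
        // 2. Naive stand-in for Jsoup.parse(html).text():
        //    drop the tags, then collapse all whitespace to single spaces.
        String text = html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        // 3. Turn each marker, and the space that may follow it, into a newline.
        text = text.replaceAll(MARK + " ?", "\n");
        return text.trim();
    }
}
```

For the sample input above, NewlineMarkerSketch.toText("<p>Line one</p><p>Line two</p>Line three<br/>etc.") gives the four lines "Line one", "Line two", "Line three" and "etc.".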
classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()
Sometimes the HTML string comes from XML with escapes such as &lt;. When using Jsoup, we need to parse it and then clean it.
Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
Using only Jsoup.parse(htmlstrl).text() can't remove the tags in that case.
Try this for javascript:
const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
You can use this method to remove the HTML tags from the String,
public static String stripHtmlTags(String html) {
return html.replaceAll("<.*?>", "");
}
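One detail worth knowing about this regex: by default "." does not match line terminators, so a tag that is split across lines is left in place. A small sketch, assuming that matters for your input; the class name is hypothetical, and the pattern is precompiled with Pattern.DOTALL so it is not recompiled on every call.

```java
import java.util.regex.Pattern;

class MultiLineTagStripper {
    // DOTALL lets "." match newlines, so tags spanning several lines are removed too.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripHtmlTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

For the input "<a\n href=\"x\">link</a>" this returns "link", whereas the plain replaceAll("<.*?>", "") version leaves the opening tag untouched.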
My 5 cents:
String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {
for (int i = 0; i < temp.length; i++) {
tmp += temp[i] + "&";
}
yourString = tmp.substring(0, tmp.length() - 1);
}
To get formatted plain HTML text you can do this:
String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
To get formatted plain text, replace <br/> with \n and change the last line to:
nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");
I know it has been a while since this question was asked, but I found another solution; this is what worked for me:
Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source= new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml= m.replaceAll("");