getting text that will be displayed to user from html - java

Bit of a random one, i am wanting to have a play with some NLP stuff and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only have fullstops (and any other punctuation used) and new line characters, though i can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in output).
If there was a way of inserting a newline or full stop in situations where the content was likely not to continue on then that would be considered an added bonus. e.g:
items in an ul or option tag could be separated by full stops (or to be honest just ignored).
I am working Java, but would be interested in seeing any code that does this.
I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).
An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span or strong etc tags, and add a full stop if I hit a div or another p without having had a fullstop.
Any pointers or suggestions very welcome.

Hmmm ... almost any HTML parser could be used to create the effect you want -- just run through all of the tags and emit only the text elements, and emit a LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straight-forward.

I would just strip everything out that has <> tags and if you want to have a full stop at the end of every sentence you check for closing tags and place a full stop.
If you have
<strong> test </strong>
(and other tags that change the look of the test) you could place in conditions to not place a full stop here.

HTML parsers seem to be a reasonable starting point for this.
there are a number of them for example: HTMLCleaner and Nekohtml seem to work fine.
They are good as they fix the tags to allow you to more consistently process them, even if you are just removing them.
But as it turns out you probably want to get rid of script tags meta data etc. And in that case you are better working with well formed XML which these guy get for you from "wild" html.
there are many SO questions relating to this (like this one) you should search for "HTML parsing" though ;-)

Related

How can I convert HTML text, including unordered and ordererd lists, to plain text with Jsoup?

I need to convert HTML to plain text for sending it per mail.
Currently I'm using
Jsoup.parse(html).wholeText();
This preserves line breaks, but not lists. Something like
- List item
- List item 2
- Nested list item
gets converted to List itemList item2Nested list item
How can I keep most of the text formatting, but remove all HTML tags with images, links etc.?
What you're asking for is to render HTML (not parse it; though parsing it is, naturally, part of any HTML rendering engine). Not render it the way e.g. Chromium would render it (as an image to a screen), but to render it into a string.
This is highly complicated, and involves CSS support as well. In basis, what you are asking for is multiple personyears of effort, and as far as I know no library exists that did it. You can have a look at text-based HTML renderers such as Lynx or w3m - you can probably install them, execute these with ProcessBuilder (this does, of course, make your app entirely arch+OS dependent, and you'll have to ship a w3m or lynx binary for each and every platform you want to support, or ask the one who installs your app to take care of also installing a lynx and/or w3m and telling your app where it is). Note that lynx/w3m tend to assume full terminal support, meaning: Bold, colours, etc.
Imagine an HTML page that doesn't use <ul> and <li> to create a bulleted list, but instead uses some CSS to make something that looks a lot like a bulleted list. Or what if inline CSS is used to align something to the right. Presumably then you would expect the string to also do this right alignment, except that is completely impossible unless either [A] you know the size of the 'window' the string will be rendered into or [B] the output is not basic text strings but some sort of markup language that supports right aligning (be it HTML or RTF or similar), or [C] terminal command sequences are available to move the cursor around.
This should highlight how your question is in essence 'weird' - it's either incredibly complicated, or a seemingly arbitrary tiny subselection of what HTML can do.
If the latter piques your interest, it isn't too difficult to just write a simplistic tree walker that specifically inserts newlines and "\n * " any time a <li> element inside a <ul> is visited, and a String.format("\n%2d. ") anytime a <li> is visited inside an <ol>.
In other words, given that what you ask for is either impossible or is an arbitrary choice of HTML and CSS stylings that you do and don't want to support, write it yourself. If truly you are only interested specifically in <ol>/<ul> based lists and nothing else, this will be about a page full of code and no more.

JSOUP Deleting closing and/or opening divs only

Hello Im googling for hours now and can't find answer...(or smt close to it)
What i am trying to do is, lets say i have this code(very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
And what i want to do is delete specific amount of this elements , lets say 2 of them. So the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or i want to delete this opening elements (again specific amount of them, lets say 2 again) but without knowing their full name (so we can assume if real name is id="one_54486464" i know its one_ ... )
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest way to achieve this results? It does not have to Include JSOUP, any better. more simple or more efficient way is welcomed :) (But i am using JSOUP to parse document to get to the point where i am left with )
I hope i explain myself clearly if you have any question please do ask... Thanks :)
EDIT: Those elements that i want to delete are on very end of the HTML document(so nothing, nothing is behind them not body tag html tag nothing...)
Please keep that HTML document would have many across whole code and i want to delete only specific amount at the end of the document...
For the opening divs THOSE are on very beginning of my HTML document and nothing is before them... So i need to remove specific amount from the beginning without knowing their specific ID only start of it. Also this div has closing too somewhere in the document and that closing i want to keep there.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This will remove the last 2 closing div tags, to change the number of tags to be remove, just change the number between the curly brackets {n}
(this is just an example and is probably unreliable, you should use some other String methods to decide which parts to discard)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element that has an ID that starts with some String you can use this e.select("div[id^=two]")
You can find more details on how to select elements here
After Titus suggested regular expressions I decided to write regex for deleting opening divs too.
So I convert Jsoup Document to String then did the parsing on a string and then convert back to Jsoup Document so I can use Jsoup functions.
ADD: What I was doing is that I was parsing and connecting two pages to one seamless. So there was no missing opening div or closing... so my HTML code stay with no errors therefore I was able to convert it back to Jsoup Document without complications.

How do I truncate HTML strings to remove broken invalid HTML fragments?

In my Java webapp, I create summary text of long HTML text. In the process of truncation, the HTML fragments in the string often break, producing HTML string with invalid & broken fragments. Like this example HTML string:
Visit this link <img src="htt
Is there any Java library to deal with this better so that such broken fragments as above are avoided ?
Or could I let this be included in the HTML pages & somehow deal with this using client side code ?
Since browsers will usually be able to deal with almost any garbage you feed into it (if it ain't XHTML...), if the only thing that actually happens with the input (assuming it's valid HTML of any kind) is being sliced, then the only thing you have to worry about is to actually get rid of invalid opening tags; you won't be able to distinguish broken 'endings' of tags, since they, in themselves, ain't special in any way. I'd just take a slice I've generated and parse it from the end; if I encounter a stray '<', I'd get rid of everything after it. Likewise, I'd keep track of the last opened tag - if the next close after it wasn't closing that exact tag, it's likely the closing tag got out, so I'd insert it.
This would still generate a lot of garbage, but would at least fix some rudimentary problems.
A better way would be to manage a stack of opened/closed tags and generate/remove the needed/broken/unnecessary ones as they emerge. A stack is a proper solution since HTML tags musn't 'cross' [by the spec, AFAIR it's this way from HTML 4], i.e. <span><div></span></div> isn't valid.
A much better way would be to splice the document after first parsing it as SGML/HTML/XML (depends on the exact HTML doctype) - then you could just remove the nodes, without damaging the structure.
Note that you can't actually know if a tag is correct without providing an exact algorithm you use to generate this 'garbled' content.
I used owasp-java-html-sanitizer to fix those broken fragments to generate safe HTML markup from Java.
PolicyFactory html_sanitize_policy = Sanitizers.LINKS.and(Sanitizers.IMAGES);
String safeHTML = html_sanitize_policy.sanitize(htmlString);
This seemed to be easiest of all solutions I came across.

Extracting webpage information based on a template in Java

Right now I use Jsoup to extract certain information (not all the text) from some third party webpages, I do it periodically. This works fine until the HTML of certain webpage changes, this change leads to a change in the existing Java code, this is a tedious task, because these webpage change very frequently. Also it requires a programmer to fix the Java code. Here is an example of HTML code of my interest on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do, I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages these HTML templates will be the input to the Java program which will find out the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes, I will just save the webpage's HTML and replace the data with these keywords and rest will be taken care by the program. For example in future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also creating these kind of templates can be done by a non-programmer, anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.
The approach you gave is pretty much similar to the Gilbert's except
the regex part. I don't want to step into the ugly regex world, I am
planning to use template approach for many other areas apart from
movie info e.g. prices, product specs extraction etc.
The template you describe is not actually a "template" in the normal sense of the word: a set static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template - it is a parsing pattern that is slurped up & discarded, leaving the desired parameters to be found.
Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its' essential features, making the minimum of assumptions. i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as"<b/>" in a much more flexible manner - ignoring it and allowing it to change without breaking.
When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions. i.e. the template approach IS the parsing approach using a regular expression - they are one and the same. The question is: what form should the regular expression take?
3A. If you use java hand-coding to do the parsing then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden and is "non-standard" and will be hard to maintain.
3B. If you use want to use an html-aware parser, then jsoup is a good solution. Problem is you need more text/regular expression handling and flexibility than jsoup seems to provide. It seems too locked into specific html tags and structures and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR - a form of backus-naur inspired grammar is used to control the parsing and generator code is inserted to process parsed data. Here, the parsing grammar expressions can be very powerful indeed with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip - such as markup tags etc. And wrestling with ANTLR for the first time involves educational investment before you get productivity payback.
3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
"Score:"(.)*[0-9]\.[0-9]\/[0-9]
(I haven't tested this.)
Or you can try different approach, using what i would call 'rules' instead of templates: for each piece of information that you need from the page, you can define jQuery expression(s) that extracts the text. Often when page change is small, the same well written jQuery expressions would still give the same results.
Then you can use Jerry (jQuery in Java), with the almost the same expressions to fetch the text you are looking for. So its not only about selectors, but you also have other jQuery methods for walking/filtering the DOM tree.
For example, rule for some Director text would be (in sort of sudo-java-jerry-code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.
If you are OO person, each rule may be defined in its own implementation. If you are groovy person, you can even rewrite rules when needed, without recompiling your project, and still being in java. Etc.
As you see, the core idea here is to define rules how to find your text; and not to match to patterns as that may be fragile to minor changes - imagine if just a space has been added between two divs:). In this example of mine, I've used jQuery-alike syntax (actually, it's Jerry-alike syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too; at the end you can define your own syntax (depending on parsing tool you are using): for example, you may parse HTML into DOM tree and then write rules using your helper methods how to traverse it to the place of interest. Jerry also gives you access to underlaying DOM tree, too.
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates a RSS feed out of here the leading real estate website in spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page
Transform the HTML into XHTML. I used this this library I guess there might be today better options available
Use XPath to navigate the XHTML to the information you're interesting in
Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)

JSP Document/JSPX: what determines how tabs/space/linebreaks are removed in the output?

I've got a "JSP Document" ("JSP in XML") nicely formatted and when the webpage is generated and sent to the user, some linebreaks are removed.
Now the really weird part: apparently the "main" .jsp always gets all its linebreak removed but for any subsequent .jsp included from the main .jsp, linebreaks seems to be randomly removed (some are there, others aren't).
For example, if I'm looking at the webpage served from Firefox and ask to "view source", I get to see what is generated.
So, what determines when/how linebreaks are kept/removed?
This is just an example I made up... Can you force a .jsp to serve this:
<body><div id="page"><div id="header"><div class="title">...
or this:
<body>
<div id="page">
<div id="header">
<div class="title">...
?
I take it that linebreaks are removed to save on bandwidth, but what if I want to keep them? And what if I want to keep the same XML indentation as in my .jsp file?
Is this doable?
EDIT
Following skaffman's advice, I took a look at the generated .java files and the "main" one doesn't have lots of out.write but not a single one writing tabs nor newlines. Contrary to that file, all the ones that I'm including from that main .jsp have lots of lines like:
out.write("\t...\n");
So I guess my question stays exactly the same: what determines how tabs/space/linebreaks are included/removed in the output?
As per the JSP specification:
JSP.6.2.3 Semantic Model
...
To clearly explain the processing of whitespace, we follow the structure of the
XSLT specification. The first step in processing a JSP document is to identify the
nodes of the document. Then, all textual nodes that have only white space are
dropped from the document; the only exception are nodes in a jsp:text element,
which are kept verbatim. The resulting nodes are interpreted as described in the
following sections. Template data is either passed directly to the response or it is
mediated through (standard or custom) actions.
So, if you want to preserve whitespace, you need to wrap the desired parts in <jsp:text>.
Not a very precise answer but according to this message, Jasper might be doing this (I admit I didn't check myself):
I had a search around, and don't think
this one has been asked before.
I've been writing my JSP pages as XML
documents (.jspx suffix), and one
difference that annoys me a little
between using the terse XML document
syntax and the (pre-2.0) legacy syntax
are line breaks, or lack of in the
former.
The processed XML document results in
the XHTML where the tags are unbroken
by either spacing or line breaks.
I understand fully the concept behind
why this is the case; the first XML
document is parsed down to it's node
tree, and then the XHTML is generated
based on this tree.
However, I am of the understanding
that the white spacing doesn't have to
be lost along the way. I checked the
jasper code, particularly
org.apache.jasper.compiler.Node, and a
cursory inspection reveals a lot of
.trim() calls, that may be
unnecessarily removing the spacing.
However, I am not so familiar with
this code to say decisively.
In summation, this may be a bug/area
of improvement. Regardless, is there a
way to embed line breaks via
configuration? I feel confident to be
able to achieve these with ugly, ugly
CDATA sections, or some like technique
(jsp:text?)... but before I go to that
trouble, is there an easier/neater
way?
The message is pretty old but the same behavior is also reported in this comment and your question seems to imply it still applies.
Not sure you can change this behavior (without using a Filter).
<trim-directive-whitespaces>true</trim-directive-whitespaces>
Whitespace inside <jsp:text> is stripped if you use the above directive in your web.xml. Given the semantic model quoted in BalusC's answer, this directive is only useful if you're not using XML syntax.
(Tomcat 6.0.16; Servlet 2.5 declarations; JSP 2.1 declarations in XML syntax)

Categories