I would like to let users customize two pages, let's call them A and B. So basically I want to provide a hyperlink to a JSP page with a big text box where a user can enter any text or HTML (to appear on page A), with the ability to preview it and save it.
I haven't really dealt with this sort of issue before and would appreciate help on how to implement it (examples and references would be very helpful too).
Thanks
Are you using any kind of web framework (Spring MVC / Struts / Tapestry / etc.)? If you are, they all have tutorials on dealing with user input / form submission, so take a look at those. They all differ slightly in how user input is processed, so it's impossible to answer this question generically.
If you're not (e.g. this is straight JSP), take a look at this tutorial.
Basically, what you want to do is define an HTML form on your page B with a textarea where the user can input custom HTML. When the form is submitted, you'll get the text the user entered as a request parameter, and you can store it somewhere (in a database / flat file / memory / what have you). On your page A you'll need to retrieve that text and bind it to request or page scope; you can then display it using <%= %> or <jsp:getProperty> tags.
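For illustration, a minimal sketch of the two pages; the parameter name "customHtml" and the "save" URL are placeholders, not anything prescribed:

<%-- pageB.jsp: the edit form --%>
<form action="save" method="post">
  <textarea name="customHtml" rows="20" cols="80"></textarea>
  <input type="submit" value="Save" />
</form>

// In the servlet (or JSP) handling the POST:
String customHtml = request.getParameter("customHtml");
// ... store it in a database / flat file / wherever ...

<%-- pageA.jsp: assuming the saved text was bound to request scope as "customHtml" --%>
<%= request.getAttribute("customHtml") %>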
To ChssPly76's answer I'd just add that if you're going to allow HTML entry on a web page (or anywhere, really), you'll want to provide some kind of validation and a mechanism to give feedback when the HTML is bad. You might dispense with this for a raw internal tool, but anything for public consumption will need it. E.g. what do you do if someone enters
<b>sometext
You can deal with this with simple rules that strip away HTML tags, a preview that lets people see how they're doing so far, à la Stack Overflow, a rich-text input option, or just validation that throws up a big honking "Try again" if the tags don't balance. Either way, you'll want some kind of check so that you won't just be putting up broken pages.
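As one sketch of the "validate or clean up" route: Jsoup (which comes up later on this page) can both check input against a whitelist and balance unclosed tags. Note that Safelist was called Whitelist in older versions:

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

String submitted = "<b>sometext";                          // the unbalanced input from above
boolean ok = Jsoup.isValid(submitted, Safelist.basic());   // true only if all tags/attributes are allowed
String cleaned = Jsoup.clean(submitted, Safelist.basic()); // "<b>sometext</b>" - tags get balanced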
Let's say I have a simple CRUD application with a form to add a new object and edit an existing one. From a security point of view, I want to defend against cross-site scripting. First I would validate the submitted data on the server. But after that, I would also escape the values being displayed in the view, because I may have more than one application writing to my database (some developer might insert unvalidated data into the DB by mistake in the future). So I will have this JSP:
<%@ taglib prefix="esapi" uri="http://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API" %>
<form ...>
<input name="myField" value="<esapi:encodeForHTMLAttribute>${myField}</esapi:encodeForHTMLAttribute>" />
</form>
<esapi:encodeForHTMLAttribute> does almost the same thing as <c:out>: it HTML-escapes sensitive characters like <, > and ".
Now, if I load an object that somehow was saved in the database with myField=abc<def, the input will correctly display the value abc<def, while the value in the HTML behind it will be abc&lt;def.
The problem is that when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page, abc<def. So this is not correct. How should I implement the protection in this case?
The problem is that when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page, abc<def.
Easy. In this case HTML decode the value, and then validate.
Though as noted in a few comments, you should see how we operate in the OWASP ESAPI-Java project. By default we always canonicalize the data, which means we run a series of decoders to detect multiple/mixed encoding, as well as to create a string that is safe to validate against with regex.
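A minimal sketch of that decode-then-validate flow with ESAPI (the parameter name is just an example):

import org.owasp.esapi.ESAPI;

String raw = request.getParameter("myField");         // e.g. "abc&lt;def" from the form
String canonical = ESAPI.encoder().canonicalize(raw); // decodes HTML entities, detects double encoding
// now run your regex / business validation against "canonical", not "raw"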
For the part that really guarantees you protection, however: you normally want to have raw text stored on the server, not anything that contains HTML characters, so you may wish to store the unescaped string, if only so that you can safely encode it when you send it back to the user.
Encoding is the best protection for XSS, and I would in fact recommend it BEFORE input validation if for some reason you had to choose.
I say may because in general I think it's a bad practice to store altered data. It can make troubleshooting a chore. This can be even more complicated if you're using a technology like TinyMCE, a rich-text editor in the browser. It also renders HTML, so it's like dealing with a browser within a browser.
I want to retrieve a set of results, consisting of all results produced by looping over all the options of one of the request-form fields.
I'm using the Java language and the HtmlUnit API.
I have managed to do this looping form-fill by using the URL to 'fill' the field's variables (I don't know if it's the best method, and I'm actually quite worried it's one of the worst... but it's the one I could do with the knowledge I have).
But I'm having problems figuring out how to make the program submit the form in order to reach the result page, and how to download (scrape) that page before moving on to the next.
NOTES:
- If you have a better way of filling the 'request form', that is welcome as well.
UPDATE:
This solves the issue when using the HtmlUnit API (thank you, touti):
// click the submit button named "buscar" and capture the resulting page
HtmlPage resultado = pageNow.getElementByName("buscar").click();
// dump the result page as plain text
System.out.println(resultado.asText());
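For reference, a fuller sketch of driving the whole loop with HtmlUnit instead of building URLs by hand; the element names "form", "campo" and "buscar" are placeholders for whatever the page actually uses:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.*;

WebClient webClient = new WebClient();
HtmlPage pageNow = webClient.getPage("http://example.com/search");
HtmlSelect campo = pageNow.getFormByName("form").getSelectByName("campo");
for (HtmlOption option : campo.getOptions()) {
    campo.setSelectedAttribute(option, true);                     // pick the next option
    HtmlPage resultado = pageNow.getElementByName("buscar").click();
    System.out.println(resultado.asText());                       // scrape the result page here
}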
A better way than loading both the request and response pages is still hugely welcome, though!
You can simulate the click on your submit input using jQuery, like this:
$("#submit_id").trigger("click");
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API can do this, i.e. get the different textual parts/blocks, splitting each one out rather than returning a single lump of text (all in only one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, publicity, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I'm thinking of excluding content inside "menu" and "banner" classes from the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not an external application (if this can be possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
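For example, a trivial sketch assuming the page sits in a local file:

Reader reader = new FileReader("page.html");
String text = ArticleExtractor.INSTANCE.getText(reader);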
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let Boilerpipe segment the HTML and classify the segments for you:
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into Boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
// block.isContent() tells you if it's likely to be content or not
// block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as pages that have a single body of content.
So one can crudely classify web pages into three kinds with respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, but may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use a library like Goose. It works best on articles/news.
You can also check out JavaScript code that does similar extraction to Goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE number or be more selective with your selectors (i.e. not retrieve div elements)?
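For instance, a minimal Jsoup sketch along the lines of the question's own "exclude menu/banner classes" idea; the selector list here is a guess and would need tuning per site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.parse(html);
// drop navigation, decoration and script elements before extracting text
doc.select(".menu, .banner, nav, header, footer, script, style").remove();
String mainText = doc.body().text();   // the remaining text, without HTML tags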
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design XML files that the RoboServer API reads to parse webpages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags, and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
I need to tidy user input in a web application so that I remove certain HTML tags and encode < to &lt;, etc.
I've made a couple of simple util methods that strips the HTML, but I find myself adding these EVERYWHERE in my application.
Is there a smarter way to tidy the user input? E.g. in the binding process, or as a filter somehow?
I've seen JTidy, which can act as a servlet filter, but I'm not sure that this is what I want, because I need to clean the user input, not the output of my JSPs.
From JTidy's homepage:
It can be used as a tool for cleaning up malformed and faulty HTML generated by your dynamic web application.
It can Validate HTML without changing the output and generate warnings for each page so you could identify JSP or Servlet that need to be fixed.
It can save you hours of time. The more HTML you write in JSP or Servlets, the more time you will save. Don't waste time manually looking for problems, figuring out why your HTML doesn't display like it should.
In addition to JTidy validation you could submit dynamically generated pages to online HTML validators for example W3C Markup Validation Service, WAVE Accessibility Tool or WDG HTML Validator even if you are behind the firewall.
I find myself adding these EVERYWHERE in my application.
Really? It's unusual to have many user inputs that accept HTML. Most inputs should be plain text, so that when the user types < they literally get a less-than sign, not a (potentially-tidied/filtered-out) tag. This requires HTML-encoding at the output stage. Typically you'd get that from the <c:out> tag.
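For example (the variable name is arbitrary):

<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%-- escapeXml defaults to true, so "<" is emitted as "&lt;" --%>
<p>You said: <c:out value="${userComment}"/></p>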
(Old-school JSP before JSTL, lamentably, provided no HTML-encoder, so if for some reason that's what you're working with, you would have to provide your own HTML-encoding method built out of string replacements, or use one of the many third-party tools that contain one.)
For the usually-few-if-any ‘rich text’ fields that are deliberately meant to accept user-supplied HTML, you should be filtering them strongly to prevent JavaScript injection from the markup. This is a difficult job! A “couple of simple util methods that strips the HTML” are highly unlikely to do it correctly and securely.
The proper way to do this is to parse the input HTML into a DOM; walk over it checking that only known-safe element and attribute names are used; then serialise it back to well-formed [X]HTML. There are a number of tools that can do this and yes, jTidy is one. You would use the method Tidy.parseDOM on the input field value, remove unwanted items from the resulting DOM with removeChild and removeAttribute, then reserialise using pprint.
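A rough sketch of that parse / walk / reserialise cycle with JTidy; the whitelist here is a bare-minimum example for illustration, not a vetted safe list:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.w3c.dom.*;
import org.w3c.tidy.Tidy;

public class HtmlWhitelistFilter {
    static final Set<String> SAFE_TAGS =
            new HashSet<>(Arrays.asList("b", "i", "em", "strong", "p", "ul", "ol", "li"));

    public static String filter(String input) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), null);
        scrub(doc.getElementsByTagName("body").item(0));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);                    // reserialise the cleaned DOM
        return out.toString();
    }

    static void scrub(Node node) {
        // copy the child list first, since we mutate it while walking
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < node.getChildNodes().getLength(); i++)
            children.add(node.getChildNodes().item(i));
        for (Node child : children) {
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            if (!SAFE_TAGS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child);          // drop unknown elements entirely
            } else {
                stripAttributes((Element) child); // no attributes survive this sketch
                scrub(child);
            }
        }
    }

    static void stripAttributes(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--)
            el.removeAttribute(attrs.item(i).getNodeName());
    }
}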
A good alternative to HTML-based rich text is to give the user a simpler form of textual markup that you can then convert to known-safe HTML tags. Like this SO text box I'm typing into now.
There's the Interceptor interface in Spring MVC which may be used to do some common work on every request. Regardless of the tool you are using for tidying, you can use it to hook in what you need at one point. See this manual for how to use it. Just put the tidying routine into the preHandle method and walk through the data in the HttpServletRequest to update it.
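One wrinkle: the parameters in an HttpServletRequest are effectively read-only, so to actually change what your controllers see you usually wrap the request. A minimal sketch using a plain servlet Filter instead, where tidy() stands in for whatever cleaning routine you already have:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

public class TidyInputFilter implements Filter {
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // every getParameter() call downstream now returns tidied text
        chain.doFilter(new HttpServletRequestWrapper((HttpServletRequest) req) {
            @Override
            public String getParameter(String name) {
                String value = super.getParameter(name);
                return value == null ? null : tidy(value);
            }
        }, res);
    }

    private String tidy(String value) {
        // placeholder: swap in your real stripping/encoding routine
        return value.replace("<", "&lt;").replace(">", "&gt;");
    }

    public void init(FilterConfig cfg) {}
    public void destroy() {}
}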
I am developing a Java project in which I will take the text from a user in a textArea and generate an HTML page with that text.
I want to allow the user to add links in the generated HTML Page. As I am taking the page contents from the user, I also need to take the link name and the URL from the user.
How can I provide this functionality?
This is a bit vague ...
You don't say if you also somehow know where in the text the links need to be generated. If you don't, your problem is not solvable.
If you do, then I guess you just need to add some logic that tests whether the current location in the output stream corresponds to one of the locations where there should be a link, and generates the link.
The syntax for an HTML link can be found anywhere, and I would assume you know it already, but just to be complete, it looks like this:
<a href="http://example.com">link to example.com</a>
Replace the part inside the quotes with the reference, and the text between > and < with the link name.
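If the link name and URL come from the user, it's worth escaping both when you build the tag; a minimal sketch, where userUrl and userName are placeholders for your textArea inputs:

String url = userUrl.replace("&", "&amp;").replace("\"", "&quot;");
String name = userName.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
String link = "<a href=\"" + url + "\">" + name + "</a>";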
You'll need a database to store the links, unless you only want to show them once.
String links = request.getParameter("textArea_name");
// then save it to your preferred database
To show them, you can then load them from the database and just output them.