One request opens multiple pages - java

I have a problem that is not easy to summarize in the title of the question. Here's the situation:
There is one very simple form that contains only one field for selecting an XML file and one submit button. When I submit this form, some work is done and I get an org.w3c.dom.Document. This Document contains all the information I need. Until now, the controller that was handling this returned a page that renders this XML.
What I want to do now is, in addition to rendering the XML, display one or more pages that will contain some data that is inside this XML. Of course, the number of additional pages depends on how many particular items there are in that XML, but I don't think it would be more than 3 or 4 in the worst-case scenario.
Any suggestion would be helpful, if this is even possible.

Related

Plausibility of plan to use DOM4J with JTables & Action Listeners

I am preparing to embark on a large solo project at my place of employment. First, let me describe the project. I have been asked to create a Java program that can take a CamT54 file (which is just an XML file) and display the information in table form. Then users should be given the ability to remove certain components from the table and have it go back to XML format with the changes.
I'm not well versed in dealing with XML in Java, so this is going to be a learn-as-I-work task. Before I begin investing time, I would like to know that my approach is the best approach.
My plan is to use DOM4J to do the parsing and handling of the XML. I will use a JTable to display the data and incorporate some buttons into the GUI that allow modification of the data through the use of some action listeners.
Would this be a plausible plan? Can DOM4J effectively allow XML data to be displayed in a table format, and furthermore, could that data be easily modified or deleted and then resaved to a new XML file?
I thought I would go ahead and answer this as I finished the program and wanted to post what I thought was the easiest solution in case anyone else needs help.
It turned out the easiest approach (for me at least) was to use the standard DOM parser. Here are the steps I took (a rough sketch follows below):
1. Parsed the entire XML into String array lists. XPath was required for this; I also had to convert the elements into Strings and remove the extra tag information from each string using substrings, since I only wanted the actual value.
2. Populated a JTable with these arrays.
3. Once users finished editing and clicked a save button, another DOM parser would take the original XML and change each and every attribute using the values from the arrays (which were cleared and repopulated with the JTable cell values when the user clicked "save").
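For anyone taking the same route, here is a minimal sketch of those three steps using the standard DOM parser, XPath, and a DefaultTableModel. The element names (Ntry/Amt), file names, and single-column table are placeholders, and real camt.054 files are namespaced, so the XPath may need a NamespaceContext or local-name() workarounds.

import java.io.File;
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CamtTableSketch {
    public static void main(String[] args) throws Exception {
        // 1. Parse the XML and pull the values out with XPath.
        //    getTextContent() avoids the substring/tag-trimming step entirely.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("camt54.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList amounts = (NodeList) xpath.evaluate("//Ntry/Amt", doc, XPathConstants.NODESET);

        // 2. Populate a JTable with the extracted values.
        DefaultTableModel model = new DefaultTableModel(new Object[] {"Amount"}, 0);
        for (int i = 0; i < amounts.getLength(); i++) {
            model.addRow(new Object[] {amounts.item(i).getTextContent()});
        }
        JTable table = new JTable(model); // put this in a JScrollPane inside your frame

        // 3. On "save": copy the (possibly edited) cell values back into the DOM
        //    and serialize the modified document to a new file.
        for (int i = 0; i < amounts.getLength(); i++) {
            amounts.item(i).setTextContent(model.getValueAt(i, 0).toString());
        }
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(new File("camt54-edited.xml")));
    }
}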

Displaying results on same page

I am new to servlets. I would like to display results on the very same page that I am on when I click a search button. How can I achieve that without going to another JSP? Or, if I am supposed to do it behind the scenes by moving to another page without the user noticing that it's another page, how do I make it seem as if it's the same page with the results on it? Any light shed on this is highly appreciated.
You have to use JavaScript to receive the search results from the server. This technique is called AJAX. JavaScript libraries like jQuery (or many others) can help you a lot with this.
Your form submits the search term to the current page, i.e. you can leave the form's action attribute empty.
In your servlet, check whether a search term has been submitted. If this is the case, execute your search function and write the result to the response (a sketch follows below).
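A minimal sketch of that servlet-side check, assuming the AJAX call sends the term in a hypothetical "q" parameter and that the normal page is rendered by search.jsp:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/search")
public class SearchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String query = req.getParameter("q");
        if (query != null && !query.isEmpty()) {
            // AJAX request: write only the result fragment to the response
            resp.setContentType("text/plain;charset=UTF-8");
            resp.getWriter().write(doSearch(query));
        } else {
            // Normal request: render the full search page
            req.getRequestDispatcher("/search.jsp").forward(req, resp);
        }
    }

    // Placeholder for your real search logic
    private String doSearch(String query) {
        return "Results for: " + query;
    }
}

On the client, something like jQuery's $.get("search", {q: term}, callback) would fetch that fragment and insert it into the page without a full reload.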
You don't mention where your data is coming from, but that is fairly unimportant for this question. The data could come from a local variable, from DOM storage, or have been served to you using AJAX. Here is an example of setting text data in a textarea element (there are nearly unlimited ways to format what you want; this is one example only). Dynamically changing your current page in this manner means that you do not need to navigate to another one.
It works by getting a reference to the element by its id ("data" in this case) using document.getElementById.
We then set the element's value to your data, and it is displayed in the textarea.
HTML
<input type="text">
<div>
<textarea id="data"></textarea>
</div>
JavaScript
var data = "Here is your data that was returned from your source";
document.getElementById("data").value = data;

Similar content on multiple pages -- Jasper report

My boss is killing me with his 'awesome' idea to generate a Jasper report. As long as I am still alive, I have to ask the experts here how to make it happen.
The original requirement was to generate a PDF report with text and a form (tables) on one page. I did that.
But the new requirement asks for a report that contains four pages, each of which repeats that original one-page content with only one change: different text inserted on every page.
I did some research and didn't find an easy way to do it, so I hope someone on SO can give me a hint. Thanks a lot!
Could you hide and show columns based on an expression? For example, map fields over fields in your template to account for all 4 pages, and for a given field only show data when PAGE_NUMBER = x or COLUMN_NUMBER = x.
In your query, you could create a constant to identify the page where each row of data should be printed and group on that constant.
Or -- 4 detail bands, each set to print only on the appropriate page.
If it's the same layout and only the text changes, put the text in String objects; then you can fill the same report with different data objects that hold the String for each page (a rough sketch follows below).
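For the "same report, different data objects" route, here is a hedged sketch in plain Java: fill the same compiled report once per page with a different value for a hypothetical PAGE_TEXT parameter, merge the resulting pages, and export a single PDF. The file names, parameter name, and the empty data source are assumptions; swap in your real ones.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jasperreports.engine.JREmptyDataSource;
import net.sf.jasperreports.engine.JRPrintPage;
import net.sf.jasperreports.engine.JasperCompileManager;
import net.sf.jasperreports.engine.JasperExportManager;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.JasperReport;

public class FourPageReport {
    public static void main(String[] args) throws Exception {
        JasperReport report = JasperCompileManager.compileReport("report.jrxml");
        List<String> pageTexts = Arrays.asList("Text A", "Text B", "Text C", "Text D");

        JasperPrint merged = null;
        for (String text : pageTexts) {
            Map<String, Object> params = new HashMap<>();
            params.put("PAGE_TEXT", text); // the report shows $P{PAGE_TEXT} in a text field
            // Fill the same report; replace JREmptyDataSource with your real data source
            JasperPrint print = JasperFillManager.fillReport(report, params, new JREmptyDataSource());
            if (merged == null) {
                merged = print;
            } else {
                for (JRPrintPage page : print.getPages()) {
                    merged.addPage(page); // append this fill's pages to the combined document
                }
            }
        }
        JasperExportManager.exportReportToPdfFile(merged, "four-pages.pdf");
    }
}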

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this, i.e. returns the different textual parts/blocks, each split out in some way rather than lumped together as a single text (everything in only one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of those pages.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the main content of the page.
Taking this page as an example, I don't want the menus above nor the links in the footer.
Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, as well as consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if this is possible).
I tried an approach that parses the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds with respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use a library like Goose. It works best on articles/news.
You can also look at the JavaScript code that does similar extraction to Goose in the Readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. not retrieve div elements)?
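For what it's worth, a rough sketch of that Jsoup route; the clutter selectors (nav, footer, .menu, .banner) are only examples and would need tuning for your pages:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MainTextExtractor {
    public static String extract(String html) {
        Document doc = Jsoup.parse(html);
        // Throw away elements that are usually navigation or advertising clutter
        doc.select("nav, footer, script, style, .menu, .banner").remove();

        // Keep only block elements that actually carry text
        StringBuilder sb = new StringBuilder();
        for (Element block : doc.select("p, h1, h2, h3, li")) {
            String text = block.text().trim();
            if (!text.isEmpty()) {
                sb.append(text).append('\n');
            }
        }
        return sb.toString();
    }
}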
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract data from web pages and it integrates well with Java.
You use a provided application to design XML files that are read by the RoboServer API to parse web pages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using commercial software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags and then build per-site rules.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.

Customizing jsp pages

I would like to let users customize pages; let's call them A and B. Basically, I want to provide a hyperlink to a JSP page with a big text box where a user should be able to enter any text or HTML (to appear on page A), with the ability to preview it and save it.
I haven't really dealt with this sort of issue before and would appreciate help on how to implement it (examples and references would be very helpful too).
Thanks
Are you using any kind of web framework (Spring MVC / Struts / Tapestry / etc.)? If you are, they all have tutorials on dealing with user input / form submission, so take a look at those. They all differ slightly in how user input is processed, so it's impossible to answer this question generically.
If you're not (e.g. this is straight JSP), take a look at this tutorial.
Basically, what you want to do is define an HTML form on your page B with a textarea where the user can input custom HTML. When the form is submitted, you'll get the text the user entered as a request parameter and you can store it somewhere (in a database / flat file / memory / what have you). On your page A you'll need to retrieve that text and bind it to request or page scope; you can then display it using <%= %> or <jsp:getProperty> tags.
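A minimal sketch of that flow, assuming the textarea is named customHtml and using the session as the storage (a database would replace it in anything real):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/customize")
public class CustomizeServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Text the user typed into the textarea on page B
        String customHtml = req.getParameter("customHtml");
        // Store it; a session attribute stands in for real persistence here
        req.getSession().setAttribute("customHtml", customHtml);
        // Page A can then render it with <%= session.getAttribute("customHtml") %>
        resp.sendRedirect("pageA.jsp");
    }
}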
To ChssPly76's answer I'd just add that if you're going to provide text entry of HTML on a web page (or anywhere, really), you're going to want to provide some kind of validation and a mechanism to give feedback if the HTML is bad. You might dispense with this for a rough internal tool, but anything for public consumption will need it. E.g., what do you do if someone enters
<b>sometext
You can deal with this with simple rules that strip away HTML tags, a preview that lets people see how they're doing so far (a la Stack Overflow), an RTF input option, or just validation that shows a big honking "Try again" if the tags don't balance. Either way, you'll want some kind of check so that you won't just be putting up broken pages.
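One hedged way to do that check in Java is to let Jsoup parse and re-serialize the input, which closes unbalanced tags and strips anything outside a whitelist (the class is Whitelist in older Jsoup releases and Safelist in newer ones):

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class HtmlSanitizer {
    // Returns a cleaned-up version of the user's HTML: unbalanced input such as
    // "<b>sometext" comes back properly closed, and disallowed tags are stripped.
    public static String sanitize(String userHtml) {
        return Jsoup.clean(userHtml, Safelist.basic());
    }
}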
