Are there any tools to isolate the content of a webpage?

I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website we could make a parser to filter that sort of extraneous stuff out specifically for that site, but we are hoping to work on arbitrary sites that we may never have encountered before.
I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.

I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library to do exactly this. Performance, in my simple tests, is similar to Readability.
http://code.google.com/p/boilerpipe/

You could try an unofficial API of arc90's Readability.
Basically, Readability extracts the content of a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds content on a webpage are gone.

I'm also a bit late to this conversation, but...
The Java Boilerpipe extractors are probably what you want (most likely ArticleSentencesExtractor), although there is at least one port of arc90's Readability to Java on GitHub.
If you want to build a poor man's boilerpipe, you might try diffing two pages from the same site (assuming they use the same template, you will likely get an interesting result).
The main difference between boilerpipe, Readability, and a diff-based hack is that boilerpipe will strip out all HTML but preserve some structure.
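The diff-based hack mentioned above can be sketched in a few lines of plain Java. This is only a rough illustration, not how boilerpipe or Readability work: the class name `TemplateDiff`, its methods, and the sample pages are made up for the example, and it assumes the shared template shows up as identical lines in both pages. As the question notes, this approach is imperfect and will not remove per-page extras such as comments.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: keeps only the text lines of one page that a sibling page does not share. */
final class TemplateDiff {

    /** Removes tags crudely with a regex; a real project would use an HTML parser instead. */
    static String stripTags(String line) {
        return line.replaceAll("<[^>]*>", "").trim();
    }

    /** Returns the non-empty text lines of {@code page} that do not appear in {@code sibling}. */
    static List<String> uniqueLines(String page, String sibling) {
        // Collect every text line of the sibling page; these are treated as template.
        Set<String> shared = new HashSet<>();
        for (String line : sibling.split("\n")) {
            shared.add(stripTags(line));
        }
        // Keep only the lines that the sibling page does not contain.
        List<String> content = new ArrayList<>();
        for (String line : page.split("\n")) {
            String text = stripTags(line);
            if (!text.isEmpty() && !shared.contains(text)) {
                content.add(text);
            }
        }
        return content;
    }

    public static void main(String[] args) {
        // Two made-up pages that share a nav bar and a footer.
        String pageA = "<nav>Home | About</nav>\n<p>First article body.</p>\n<footer>Site footer</footer>";
        String pageB = "<nav>Home | About</nav>\n<p>Second article body.</p>\n<footer>Site footer</footer>";
        System.out.println(uniqueLines(pageA, pageB));
    }
}
```

Comparing against several sibling pages instead of one would likely filter more of the template, at the cost of more downloads.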

I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.

Related

Why is all code I see confined to only a few lines?

I'm fairly new to programming and have been taking courses on Lynda to learn the fundamentals. I have some knowledge of Java and HTML, but I wanted to refresh my memory so I can start learning Objective-C. The Lynda course has us working in JavaScript because of its pretty core syntax. So, in order to get a point of reference, I tried downloading some .js files integrated into HTML pages from various sources. However, this proved to be unhelpful and I am at a loss for understanding because of the way the files are formatted. It seems as if most files put one line of code after the other. I realize that because of flexible whitespace restrictions with JavaScript that this does not hinder the way the code runs, but why did the developers choose to put it all on one line like that? They obviously didn't write the code that way, as that would be extremely tedious and hard to work with, so why did it come out that way when I try to view it? Is it just something that happens when you try to download the resources of a page? Any clarification would be appreciated.
Below is a photo of a JavaScript file I tried viewing. As you can see, all the code is restricted to one single line.
Also, if anyone could offer some insight about where to go after I've finished my course if I'm looking to develop for iOS, that would be greatly appreciated. Lynda also offers an Objective-C Essentials course, as well as an iOS Development course, but I feel like it's a pretty linear path that could be expanded on greatly with some literature or other online documentation.

What you are seeing is a minified version of the JavaScript file. The main advantage of minification is that it reduces the amount of data that needs to be transferred (bandwidth usage).
If you wish to view the code in human readable format, you can use online tools like this

Yes, as karthikr said, it's minified, which means it's all there but without the line breaks. So to see it all you have to scroll right.
Or you can use http://jsbeautifier.org/ to bring back the line breaks.

There are several reasons for minifying JavaScript. One is that it makes the code less readable (yes, some devs don't want you to "steal" functions or easily see what they do). Another, and the bigger reason, is that it reduces bandwidth. A file with long variable names and whitespace everywhere can be several times bigger than its minified version, so minifying improves performance.

Bandwidth costs money, especially for users and especially if they're on mobile devices with a bandwidth limit.
So to solve this problem, developers will minimize the file size of whatever they can.
The JavaScript files you are seeing have been minified by tools such as Uglify or YUI Compressor (list not exhaustive).
Doing this takes out unnecessary whitespace and shortens the names of variables and functions that are not globally exported.
Developers may also gzip the files, which reduces the file size even further.
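To get a feel for the numbers, here is a small Java sketch that measures how much a script shrinks when whitespace is collapsed and when it is then gzipped. The class `SizeSketch` and its methods are invented for this example, and `collapseWhitespace` is only a crude stand-in: real minifiers such as Uglify parse the code and also rename local variables, whereas blindly collapsing whitespace can break scripts that contain line comments or whitespace inside string literals.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper that compares payload sizes; not a real minifier. */
final class SizeSketch {

    /** Crude stand-in for minification: replaces every run of whitespace with one space. */
    static String collapseWhitespace(String source) {
        return source.replaceAll("\\s+", " ").trim();
    }

    /** Returns the number of bytes the text occupies after gzip compression. */
    static int gzippedSize(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Writing to an in-memory buffer should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        // Build a made-up script by repeating one small function.
        StringBuilder script = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            script.append("function add(first, second) {\n    return first + second;\n}\n");
        }
        String original = script.toString();
        String collapsed = collapseWhitespace(original);

        System.out.println("original bytes:  " + original.length());
        System.out.println("collapsed bytes: " + collapsed.length());
        System.out.println("gzipped bytes:   " + gzippedSize(collapsed));
    }
}
```

The sample script is highly repetitive, so gzip does unusually well on it; real files will compress less dramatically.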

Is there a well-designed, maintained RSS-parsing library for Java?

I know this question has been asked before, but that was several years ago, and of the two answers, Rome and Abdera, the first no longer seems to be maintained (there aren't even any download links on the website, nor can I find documentation). The latter appears rather complicated, and neither appears up to contemporary standards of Java library design.
Are there any new alternatives out there that are well designed, and well maintained?

Sorry, I do not know of any library. That said, since RSS is an XML format, you should be able to roll your own using SAX, JAXB, or DOM. Which one to use depends on whether you want ease of integration with Java (JAXB) or speed (SAX). DOM is a middle ground.
RSS is not a complicated format, so I think you could just develop the features you need as you come across them. That will be faster (and the skills you learn more transferable) than exhaustively searching for a library if one cannot be found easily.
Hope this helps.
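As a rough sketch of the roll-your-own route, the following uses the JDK's built-in DOM parser to pull the item titles out of an RSS 2.0 document. The class name `RssTitles` and the sample feed are made up for illustration, and it assumes the feed has already been fetched into a string; a real reader would also handle links, dates, Atom feeds, and malformed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical minimal RSS reader built on the JDK's DOM parser. */
final class RssTitles {

    /** Returns the title of every item element in the given RSS document. */
    static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds come from untrusted sources, so refuse DOCTYPE declarations (guards against XXE).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look for the title inside this item only, so the channel title is skipped.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // A made-up two-item feed.
        String feed = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```

DOM loads the whole document into memory, which is fine for typical feeds; SAX would be the choice if feeds were very large.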

I did find this class, RSSDigester. It might help. I don't really have the time to investigate it right now, sorry.

RSS reading hasn't really needed changing for some time. ROME really is quite nice, and as far as fetching it you can get it from http://download.java.net/maven/2/rome/.

I eventually found HorroRSS, which is exactly what I was hoping for. It's simple, easy to use, and appears robust.

Suggestion with best customizable crawlers and scrapers

I have a website that is pretty good but has very little information.
So I would like to add information such as news about particular sectors (e.g. politics, Hollywood). I believe crawlers are the best approach for this. Is my understanding correct? Please suggest any other way to gather information from various sources without using crawlers.
Secondly, I have been researching for the last two days and cannot find a single tool capable of doing this. I want crawlers to find information, normalize it, and store it in a MySQL database. It sounds pretty simple, but it isn't for me.
Since this is very resource- and time-consuming, what should I take into consideration before choosing a crawler? I also wish to customize it, so any open-source tool that is easy to customize would be great.
Any source that explains crawlers, or the factors to consider when creating one, would be great.
I prefer coding in Java, but I can code in another language if you have one to recommend.
I hope I have given enough information. Please don't hesitate to ask if you need any more information to make a suggestion.

You can use HTTrack to copy a target website. There is also a Firefox plugin named SpiderZilla. However, these will only save the pages.
If you want to parse the data in the pages, you can use simple_html_dom and store the information in MySQL.

Try the GNU Wget tool. You can add a lot of intelligence to the way it crawls and creates data dumps of web pages. It is open-source and customisable as well, and very fast too.

Are there some good and modern alternatives to Javadoc? [closed]

Let's face it: You don't need to be a designer to see that default Javadoc looks ugly.
There are some resources on the web which offer re-styled Javadoc. But the default behaviour represents the product and should be reasonably good-looking.
Another problem is that the usability of Javadoc is not up to date compared to other similar resources.
Huge projects in particular are hard to navigate using Firefox's quick search.
Practical question:
Are there any standalone (desktop) applications which are able to browse existing Javadoc in a more usable way than a browser would?
I'm thinking about something like Mono's documentation browser.
Theoretical question:
Does anyone know if there are plans to evolve Javadoc in a somehow-standardized way?
EDIT: A useful link to Sun's wiki on this topic.

I have created a Markdown (java) Doclet which will take source comments in Markdown formatted text and create the same HTML Javadocs.
The new doclet also does some restyling on the text, but the HTML generated is not changed at this stage.
That goes some way towards addressing the HTML-in-Java-comments issue, which is probably the biggest usability problem with current Javadoc.

I don't think that the concepts of Javadoc are outdated. As far as I can see, these concepts are rooted years ago in a product named doxygen, which is still available for other languages (e.g. Objective-C, where it is heavily used). Even this has its predecessors - have a look at the programming environment used by Donald Knuth to create TeX (literate programming).
Nevertheless, it is an intriguing idea to have a single source for program code and documentation.
Besides that, the presentation of the documentation can be customized to your special needs using the plug-in system supported by the Javadoc tool. You might provide a plug-in (as we do) that publishes directly into a database which is directly accessible via the web. Using collaboration features, anyone can provide additional comments or clarifications to the documentation that might find their way back into the original source.

Javadoc is the best source code auto-documentation generation system I've ever seen. A large part of that is that it's so simple - I can browse Javadocs even with my 5-year-old cell phone if I want to! While I agree that a bit of a facelift could be in order, and the JDK especially is a pain to browse through, I wouldn't dare reinvent the wheel entirely, because what we currently have is a RESTful, easy-to-use solution for its purpose which works just about anywhere.

I recently got a mail forwarded that Sun is working on modernizing the Javadoc HTML output. From said mail:
We are proposing improvements to javadoc/doclet for JDK7. The project wiki page is located at http://wikis.sun.com/display/Javadoc/Home. As a part of the proposed improvements, the UI of the javadoc output will be revamped. The new design screenshots are uploaded to the project wiki. The javadoc output markup will be modified to be valid HTML and WCAG 2.0 compliant.
So there is definitely still work going on there, even if somewhat late. However, in my eyes one of the biggest drawbacks of Javadoc is its very close coupling with HTML. Many classes have Javadoc which includes literal HTML and relies on the output being HTML, too. Unfortunate, but I don't think this will change anytime soon. Still, this means that developers are free to include whatever HTML they want there, which might well be invalid, not well-formed, etc. So adapting the output of the javadoc tool is only one part of this; the other part won't and can't change, and thus remains.
As for browsing documentation I also find the HTML documentation a little unwieldy. I usually use the Javadoc view in Eclipse. It has drawbacks as well (slow and you can't really search) but it's Good Enough™ for most things.

Personally I still find Javadoc to be very useful. Especially since it is standardized. I don't know of any major documentation style that I find easier to navigate (that might very well be subjective, but I personally find MSDN horrible to use, for example).
For the search: Use the Javadoc Search Frame, it makes using Javadoc of all kinds a lot easier. It's available as a Userscript for Firefox and as a Google Chrome Extension.

To answer your practical question, I googled and asked friends and came up with these: Forrestdoc, doclet, and doxygen.
As for the second question, I would say that yes, it's not very "Web-oh-twoeye", but at least you're guaranteed that it works in an offline environment, and it's small enough to ship along with your API. I despise the use of frames, but they work rather well for Javadoc. I have not seen any plans to change it.
Eclipse has some support for Javadoc as far as reading, interpreting, and generating it goes.

You might want to phrase that in a less aggressive and overbearing manner. Most people don't care what a technical resource looks like, and "It's not Web 2.0 enough!" sounds like vapid marketroidspeak.
And what exactly would you consider "more usable"? Personally, I would definitely like a full-text search and a better usage browser, and AJAX could probably help with those.
Well, the nice thing about Javadoc is that it's the opposite of outdated - it's arbitrarily extensible. Why don't you go ahead and write a doclet that produces the kind of API doc you want?
Why nobody else has done that so far (which apparently is the case) is anyone's guess - maybe nobody else feels as strongly about it as you.

There's a DocBook doclet. DocBook is a richer document type than (X)HTML and is better for describing technical content. From DocBook source you can generate all sorts of different output formats.

I personally would like a more readable "comment documentation" standard than the HTML (and hence tag-wieldy) JavaDoc.
For example, Markdown, as used here, would be excellent: human-readable in the source and nicely formatted outside it.
With the current Javadoc, I imagine many people use Javadoc comments but don't actually document to the extent they could. I'm sure everyone has browsed an API's online Javadoc that was undocumented or barely documented, and thus far harder to use than it should be.
This isn't helped by code reformatters (e.g., within Eclipse, or maybe upon source commit) that totally destroy any readable structure you might have put within a Javadoc comment (e.g., a list of items), turning it into one big blob of text unless you literally use two carriage returns where you wish to use one.

Does anyone know if there are plans to evolve Javadoc in a somehow-standardized way?
The corresponding JSR (JSR 260), which specifies enhancements to Javadoc, has been voted out of JDK 7 (for now). An overview of what was planned (from this site):
Upgrade Javadoc to provide a richer set of tags to allow more structured presentation of Javadoc documentation. This JSR covers: categorization of methods and fields, semantical index of classes and packages, distinction of static, factory, deprecated methods from ordinary methods, distinction of property accessors, combining and splitting information into views, embedding of examples and common use-cases, and more.
The overall outlook for JDK 7 is pretty grim.

JavaDoc is itself extremely flexible because you can replace the standard doclet with a custom doclet to provide something that meets your project's specific needs.
On the project I've been working on, we created an HTML/XML-based documentation system (using client-side XSLT 2.0 on JS) for our product with JavaDoc fully integrated. For this, a custom doclet was used to produce JavaDoc data in XML. It used tagsoup to ensure that even HTML markup within code comments was well formed.
With this, we were able to deliver an interactive user experience using a single-page app (similar to a desktop tool), but all from within the browser - without any server-side code/infrastructure. The viewer included standard features such as search, tree navigation etc.
Here's a link to a sample entry point in the rather vast documentation:
JavaDoc viewer sample

A smart searchable Javadoc viewer:
I have often struggled with browsing Javadoc. I was looking for something like the search option in the Android docs, and at last I found it. If you use Firefox, the solution is here.
Install the Greasemonkey plugin, which lets you customize the way a web page is displayed. (We need to customize any Javadoc page so that we can search by class name.)
https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/
For Greasemonkey to work, we need a user script for the customization. Greasemonkey can download this automatically. Install the userscript from JavaDoc search frame or JavaDoc incremental search.
This works great for me.

Has anyone migrated from Struts 1 to another web framework?

On my current project, we've been using Struts 1 for the last few years, and ... ahem ... Struts is showing its age. We're slowly migrating our front-end code to an Ajax client that consumes XML from the servers. I'm wondering if any of you have migrated a legacy Struts application to a different framework, and what challenges you faced in doing so.

Sure. Moving from Struts to an AJAX framework is a very liberating experience. (Though we used JSON rather than XML. Much easier to parse.) However, you need to be aware that it's effectively a full rewrite of your application.
Instead of the classic Database/JSP/Actions scheme for MVC, you'll find yourself moving to a Servlet/Javascript scheme whereby the model is represented by HTTP GET requests, actions are represented by POST/PUT/DELETE requests, and the view is rendered on the fly by the web browser. This leads to interesting challenges in each area:
Server Side - On the server side you will need to develop a standard for exposing data to the client. The simplest and easiest method is to adopt a REST methodology that best matches your data's hierarchy. This is fairly simple to implement with servlets, but Sun has also developed a Java 1.6 scheme using annotations that looks pretty cool.
Another aspect of the server side is to choose a transmission protocol. I know you mentioned XML already, but you might want to reconsider. XML parsers vary greatly between browsers. One browser might make the document root the first child, another one might add a special content object, and they all parse whitespace differently. Even worse, the normalize() function doesn't seem to be correctly implemented by the major browsers. Which means that XML parsing is liable to be full of hacks.
JSON is much easier to parse and more consistent in its results. Javascript and Actionscript (Flash) can both translate JSON directly to objects. This makes accessing the data a simple matter of x.y or x[y]. There are also plenty of APIs to handle JSON in every language imaginable. Because it's so easy to parse, it's almost supported BETTER than XML!
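To make the server-side idea concrete, here is a minimal sketch of a REST-style resource that answers in JSON. It is not the setup described above (which used servlets); to stay self-contained it uses the JDK's built-in `com.sun.net.httpserver` package instead. The `CustomerService` class, the `/customers/` path, port 8080, and the hand-rolled JSON writer are all assumptions made for the example; a real application would use a JSON library and proper error handling.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical REST resource: GET reads the model, PUT and DELETE act on it. */
final class CustomerService {
    static final String PREFIX = "/customers/";   // assumed URL layout

    // In-memory model: customer id -> name (sorted, thread-safe).
    private final Map<String, String> names = new ConcurrentSkipListMap<>();

    /** Hand-rolled JSON for flat string maps; only escapes backslashes and quotes. */
    static String toJson(Map<String, String> map) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(escape(entry.getKey())).append("\":\"")
                .append(escape(entry.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Applies one request to the model and returns all customers as JSON, or null if unsupported. */
    String apply(String method, String id, String body) {
        switch (method) {
            case "GET":
                break;                    // reads never change the model
            case "PUT":
                names.put(id, body);      // action: create or replace one customer
                break;
            case "DELETE":
                names.remove(id);         // action: remove one customer
                break;
            default:
                return null;              // the HTTP layer turns this into a 405
        }
        return toJson(names);
    }

    /** Adapts apply() to the JDK HTTP server. */
    void handle(HttpExchange exchange) throws IOException {
        // The context is registered under PREFIX, so the path always starts with it.
        String id = exchange.getRequestURI().getPath().substring(PREFIX.length());
        String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
        String json = apply(exchange.getRequestMethod(), id, body);
        int status = (json == null) ? 405 : 200;
        byte[] reply = (json == null ? "{\"error\":\"method not allowed\"}" : json)
                .getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(status, reply.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(reply);
        }
    }

    public static void main(String[] args) throws IOException {
        CustomerService service = new CustomerService();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext(PREFIX, service::handle);
        server.start();   // try: curl -X PUT -d Ada http://localhost:8080/customers/42
    }
}
```

Keeping the request logic in `apply` separate from the HTTP plumbing means it can be exercised without opening a socket, and the same method could sit behind a servlet instead.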
Client Side - The first issue you're going to run into is the fact that no one understands how to write Javascript. ESPECIALLY those who think they do. If you have any books on Javascript, throw them out the window NOW. There are practically no good books on the language as they all follow the same "hacking" pattern without really diving into what they are doing.
From the lowest level, your team is going to need remedial training on Javascript development. Start with the Javascript Client Guide. It's the de facto source of information on the language. The next stop is Douglas Crockford's videos on Javascript. I don't agree with everything he has to say, but he's one of the few experts on the language.
Once you've got that down, consider what frameworks, if any, you want to use. Generally speaking, I dislike stuff like Prototype and Mootools. They tend to take a simple problem and make it worse. None the less, you can feel free to evaluate these tools and decide if they'll work for you.
If you absolutely feel that you cannot live without a framework because your team is too inexperienced, then GWT might fit the bill. GWT allows you to quickly write DHTML web apps in Java code, then compile them to Javascript. The PROBLEM is that you're giving up massive amounts of flexibility by doing this. The Javascript language is far more powerful than GWT exposes. However, GWT does let Java developers get up to speed faster. So pick your battles.
Those are the key areas I can think of. I can say that you'll heave a sigh of relief once you get struts out of your application. It can be a bit of a beast. Especially if you've had inexperienced developers working on your Struts model. :-)
Any questions?
Edit 1: I forgot to add that your team should study the W3C specs religiously. These are the APIs available to you in modern browsers. If you catch anyone using the DOM 0 APIs (e.g. document.forms['myform'].blah.value instead of document.getElementById("blah").value) force them to transcribe the entire DOM 1 specification until they understand it top to bottom.
Edit 2: Another key issue to consider is how to document your fancy new AJAX application. REST-style interfaces lend themselves well to being documented in a wiki. What I did was have a top-level page that listed each of the services with a description. By clicking on the service path, you would be taken to a document with detailed information on each of the sub-paths. In theory, this scheme can document as deep as you need the tree to go.
If you go with JSON, you will need to develop a scheme to document the objects. I just listed out the possible properties in the Wiki as documentation. That works well for simple object trees, but can get complex with larger, more sophisticated objects. You can consider supplementing with something like IDL or WebIDL in that case. (Can't be much worse than XML DTDs and Schemas. ;-))
The DHTML code is a bit more classical in its documentation. You can use a tool like JSDoc to create JavaDoc-style documentation. There's just one caveat: Javascript code does not lend itself well to being documented in-code, if for no other reason than that it bloats the download. In addition, you may find yourself regularly writing code that operates as a cohesive object but is not coded behind the scenes as such an object. Thus the best solution is to create JSDoc skeleton files that represent and document the Javascript objects.
If you're using GWT, documentation should be a no-brainer.

Check out the Stripes Framework. If you are familiar with Struts then Stripes will make sense to you, but it's so much better. They have a Stripes vs Struts section on their website. You could check that out and see if it interests you. It allows you to work with any Ajax framework you want, and I don't think it would take long to migrate from Struts to Stripes.
