Looking for a simple Java spider [closed] - java

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I need to supply a base URL (such as http://www.wired.com) and need to spider through the entire site outputting an array of pages (off the base URL). Is there any library that would do the trick?
Thanks.

I have used Web Harvest a couple of times, and it is quite good for web scraping.
Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.
Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
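As a rough illustration of the XPath step only, here is a minimal sketch that uses the XPath support built into the JDK (javax.xml.xpath). It assumes the page has already been converted to well-formed XHTML (for example by JTidy) and that the markup carries no DOCTYPE or default namespace; the class and method names are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression that selects attribute nodes
// and returns their values. Input must already be well-formed XML/XHTML.
class LinkExtractor {
    static List<String> extractLinks(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Parse the string and collect every node matched by the expression.
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue()); // value of the href attribute
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the input", e);
        }
    }
}
```

Calling LinkExtractor.extractLinks(page, "//a[contains(@href,'wired')]/@href") returns only the hrefs that contain "wired". Real pages usually need the tidy-up step first, and XHTML that declares a default namespace would need a namespace-aware expression instead.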

'Simple' is perhaps not a relevant concept here. It's a complex task. I recommend Nutch.
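To give an idea of the moving parts even in the simplest case, here is a sketch of a breadth-first crawl that stays on the host of the base URL. It is not how Nutch or Web-Harvest work internally. The linkFetcher parameter is a hypothetical stand-in for "download the page and return its href values"; the sketch also ignores robots.txt, politeness delays, redirects and URL normalisation, all of which a real spider has to handle.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a same-host, breadth-first spider.
class SiteSpider {
    static List<String> crawl(String baseUrl,
                              Function<String, List<String>> linkFetcher,
                              int maxPages) {
        String baseHost = URI.create(baseUrl).getHost();
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(baseUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String current = frontier.poll();
            if (!visited.add(current)) {
                continue; // already crawled this page
            }
            for (String href : linkFetcher.apply(current)) {
                URI target;
                try {
                    target = URI.create(current).resolve(href); // handle relative links
                } catch (IllegalArgumentException e) {
                    continue; // skip malformed links
                }
                // Only follow links that stay on the base host.
                boolean sameHost = baseHost != null
                        && baseHost.equalsIgnoreCase(target.getHost());
                if (sameHost && !visited.contains(target.toString())) {
                    frontier.add(target.toString());
                }
            }
        }
        return new ArrayList<>(visited);
    }
}
```

The returned list is the "array of pages" the question asks for, in the order they were discovered. The maxPages cap is there because a large site can otherwise keep the loop running for a very long time.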

Related

Java library for filtering user-entered content? [closed]

Closed 9 years ago.
I am doing a Java-based web application. It allows users to enter content, which is displayed to other users.
Naturally for security reasons, I have to filter user content to prevent XSS and other attacks.
I understand that filtering user content is a much-discussed topic. I found many posts at SO, but they are related to theory discussion, PHP, ideas, etc. I need a Java library to use to avoid re-writing/inventing everything. I feel there must be one out there.
Is there such a library I can use?
Thanks for any info!
If you want to sanitise user input to prevent XSS then OWASP provide the standard implementation for doing that in their AntiSamy project.
There is a better implementation of this on Google Code called owasp-java-html-sanitizer. It allows you to define policies programmatically and then run the suspect HTML through the policy, which strips out everything the policy does not allow.
Here is an example from their website:
PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS);
String safeHTML = policy.sanitize(untrustedHTML);
This creates a policy that only allows formatting and links in the suspect HTML; everything else is removed.

What is a good technology compatible with Java to add simple field boxes in HTML? [closed]

Closed 9 years ago.
I have a web service and I'm trying to add a simple web user interface with the ability to add some text and upload a file. What is the simplest and easiest technology to use that is compatible with Java? I'm using Eclipse to develop my application.
I don't need a lot of support; I just want it to be easy to use.
I can recommend Wicket (http://wicket.apache.org/): you won't have licence restrictions (it's under the Apache licence) and it's a solution that has stood the test of time. JSF is too complex for simple use cases, and Struts is just as complicated and a rather old technology. GWT is too complex and time-consuming for small projects.
You can of course use bare Servlets or JSPs if your use case really is very simple.
Best Regards,
Zied Hamdi
http://1vu.fr

Is there a dynamic word/tag cloud Java API somewhere? [closed]

Closed 8 years ago.
There are loads of great word and tag clouds available, the most prominent being wordle.net. But I am looking to display something akin to what some folks did for a Twitter replay of the 2010 World Cup, just not using Flash. I'm not too familiar with R, but it seems to be the best tool for generating some statistical decay of font size over time. Is there a Java API (or combination of APIs) that might make this capability easier from the start?
I'm not aware of a good R package for that. There are some functions, like cloud in the snippets package, and maybe other functions, but nothing compared to http://wordle.net, http://tagcrowd.com/, or Many Eyes. Drew Conway has done some nice stuff with tm + ggplot2; I also played with it a while ago, but that was more to experiment with a 3D tag cloud (with rgl) than to reproduce Wordle.
In Python or Processing, there are some ongoing projects detailed in this related question. To my knowledge, Tagxedo looks great but it has no API and it relies on Silverlight.
Pierre Lindenbaum also has some Java code, see his blog post Playing with the Wordle algorithm: a tag cloud of Mesh Terms.
It's not great, but there is an open-source project (alas, in PHP) that does word clouds over time. The example uses presidential speeches.
http://chir.ag/projects/preztags/
Here is one that I created in Java as part of a larger project for deriving information from unstructured data: https://github.com/regunathb/Sift. The "tagcloud" project has all the required classes for generating a tag cloud and writing it to multiple output image formats.
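For the "decay of font size over time" part of the question, the arithmetic itself does not need R or a dedicated library. The sketch below is one possible approach, not taken from any of the projects mentioned above: each mention of a tag contributes a weight that halves every halfLifeSeconds, and the summed weights are mapped linearly to point sizes. All names and the half-life model are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: exponential decay of tag weights, then linear font scaling.
class DecayingTagCloud {
    // Weight of a single mention that is ageSeconds old; halves every half-life.
    static double decayedWeight(double ageSeconds, double halfLifeSeconds) {
        return Math.pow(0.5, ageSeconds / halfLifeSeconds);
    }

    // Sums the decayed weights of all mentions per tag (timestamps in seconds).
    static Map<String, Double> scores(Map<String, List<Long>> mentions,
                                      long nowSeconds, double halfLifeSeconds) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, List<Long>> entry : mentions.entrySet()) {
            double sum = 0.0;
            for (long timestamp : entry.getValue()) {
                sum += decayedWeight(nowSeconds - timestamp, halfLifeSeconds);
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    // Maps each score to a size between minPt and maxPt, relative to the top score.
    static Map<String, Integer> fontSizes(Map<String, Double> scores, int minPt, int maxPt) {
        double maxScore = 0.0;
        for (double score : scores.values()) {
            maxScore = Math.max(maxScore, score);
        }
        Map<String, Integer> sizes = new HashMap<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            double share = maxScore > 0 ? entry.getValue() / maxScore : 0.0;
            sizes.put(entry.getKey(), (int) Math.round(minPt + (maxPt - minPt) * share));
        }
        return sizes;
    }
}
```

Recomputing scores with a later nowSeconds shrinks tags that have not been mentioned recently, which gives the replay effect. Laying the words out on screen is a separate problem that the projects linked above address.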

Java Charting libraries [closed]

Closed 9 years ago.
I'm looking for a good charting library for Java. It can be open source or not, and I need it to work in a standalone client application rather than a web-based one.
However, we do have some dynamic charts, originally done in MS Chart, that scroll across the screen as data are provided. They will need to be redone, so I'm not sure whether JChart will accomplish this in an acceptable manner.
Are there any Java charting libraries right in the J2SE API? I've also run across Oracle Chart Builder, but can't seem to find any information on it other than this link: http://download.oracle.com/docs/html/A96127_01/jcb_intro.htm. Has anyone ever heard of it before?
JFreeChart is an excellent open source charting library for Java.
The samples demo (Java Web Start version or in the distribution) contains a section under Miscellaneous called Dynamic Charts (in addition to lots of others). The source code for the demos is available via the official documentation (the purchase of which supports the project).
http://www.jfree.org/jfreechart/ - I used it for a small project. Rendering dynamic data was quite complex but possible.
It's open source, but you will probably have to pay for the documentation if you want to do something serious.

Are there faster XML parsers in Java than Xalan/Xerces? [closed]

Closed 9 years ago.
I haven't found many ways to increase the performance of a Java application that does intensive XML processing other than to leverage hardware such as Tarari or Datapower. Does anyone know of any open source ways to accelerate XML parsing?
Take a look at StAX (streaming) parsers. See the Sun reference manual. One of the implementations is the Woodstox project.
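For a feel of the streaming style, here is a minimal sketch using the StAX API that ships with the JDK (javax.xml.stream). It walks the document event by event and counts element names without ever building a tree in memory. The class name and the counting task are just an illustration. As far as I know, an alternative implementation such as Woodstox is picked up by XMLInputFactory.newInstance() when it is on the classpath, but check that library's documentation.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: count how often each element name occurs, in streaming fashion.
class StaxElementCounter {
    static Map<String, Integer> countElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTD processing is not needed here, so switch it off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        Map<String, Integer> counts = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // Pull events one at a time; only start tags are of interest.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        counts.merge(reader.getLocalName(), 1, Integer::sum);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return counts;
    }
}
```

Because only the current event is held in memory, this style suits large documents. For real workloads the reader would be created from an InputStream rather than a String.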
Since it hasn't been directly mentioned, I'll throw in Aalto, which is the fastest Java XML parser according to some measurements, such as:
JVM-serializers (which compares XML, JSON, protobuf, Thrift, etc.)
Alternative serialization methods for WSTest (Java web services)
which were not written by the Aalto developers.
VTD-XML is very fast.
It has a DOM-like API and even XPath queries.
Piccolo claims to be pretty fast. Can't say I've used it myself though. You might also try JDOM. As ever, benchmark with representative data of your real load.
It partly depends on what you're trying to do. Do you need to pull the whole document into memory, or can you operate in a streaming manner? Different approaches have different trade-offs and are better for different situations.
Depending on the complexity of your XML messages, you might find that a custom parser can be 10x faster (though more work to write). If performance is critical, I wouldn't suggest using a generic parser. (Also, I wouldn't suggest using XML, as it's not designed for performance, but that's another story... ;)
Check Javolution as well.
