Suggestions for good customizable crawlers and scrapers - Java

I have a website that works well but contains very little information.
So I felt like adding information such as news about particular sectors (e.g. politics, Hollywood, etc.). I believe crawlers are the best approach to do so. Is my understanding correct? Please suggest any other way you see of gathering information from various sources without using crawlers.
Secondly, I have been researching this for the last two days and cannot find a particular source capable of doing it. I want crawlers to find information, normalize it, and store it in a MySQL database. Sounds pretty simple, eh? But it isn't for me.
As this is very resource- and time-consuming, what should I take into consideration before choosing a crawler? I also wish to customize it, so any open-source tool that lends itself well to customization would be great.
Any source giving information and research about the factors to consider while creating crawlers, or educating me about crawlers, would be great.
I prefer coding in Java, but I can code in another language if you feel it is better suited.
I hope I have given enough information. Please don't hesitate to ask if you need more information to make a suggestion.

You can use HTTrack to copy a target website. There is also a Firefox plugin named SpiderZilla. But they will only save the pages.
If you want to parse the data in the pages, you can use simple_html_dom (a PHP library) and store the information in MySQL.
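Since the question prefers Java, here is a minimal sketch of the parse-and-extract step using only the standard library. The regex-based approach and the `LinkExtractor` class name are illustrative only; a real HTML parser (such as jsoup) is far more robust against malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Naive href extractor; a real HTML parser (e.g. jsoup) handles
    // unquoted attributes, entities, and broken markup much better.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/news\">News</a>"
                    + "<a href=\"/politics\">Politics</a>";
        // Prints the two extracted URLs
        System.out.println(extractLinks(html));
    }
}
```

The extracted links (or any other fields you pull out the same way) can then be inserted into MySQL via plain JDBC.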

Try the GNU Wget tool. You can add a lot of intelligence to the way it crawls and creates data dumps of web pages. It is open source, customisable, and very fast too.
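As a concrete starting point, a polite recursive crawl might look like the following. The flags are standard Wget options; the URL, depth, and output directory are placeholders to tune for your own use case.

```shell
# Mirror a site two levels deep, waiting 1s between requests,
# keeping only HTML pages and rewriting links for local browsing
wget --recursive --level=2 \
     --wait=1 \
     --accept html,htm \
     --convert-links \
     --directory-prefix=dump \
     http://example.com/
```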


Run Apache Sling's samples

I want to run the simple-demo from the samples, but I haven't had any success. What is its URL, and how do I reach its content?
The internationalization of usermanager-ui also doesn't work, although I installed the org.apache.sling.i18n bundle. If someone can give me some guidance, I will be very happy.
In general, why is everything so poorly explained? The motto of Sling is "Bringing Back the Fun!", but in the last few days I haven't had much fun! It's really painful to try to test or build anything. Is there a good tutorial or book about this framework?
P.S. If I manage to run all the samples without problems, I will create a detailed tutorial.
My personal opinion is that simple-demo is not a very current sample; I'd rather recommend that you look at the slingbucks or espblog samples, as mentioned at http://sling.apache.org/documentation/getting-started/discover-sling-in-15-minutes.html
We might need to clean up the Sling samples at some point and concentrate on a few representative ones - I've put that on my way-too-long list of things to do.
Just go through the readme.txt of the respective samples.

Producing statistics on Google App Engine

I want to show my users some statistics, such as hits/second, on Google App Engine. I started to roll my own:
On each page view, add 1 to a count in memcache.
Each minute:
Read and reset the count and also set a "since" variable to now.
Divide the number of hits by the amount of time since I last calculated.
Save the data to an entity in the datastore.
Throw out data that's really old.
I then realised that this is non-trivial and there must be a library to do it; however, I can't find one that works for me. I looked briefly at rrd4j and JRobin, but I'm not sure they're usable on Google App Engine without quite a lot of rewriting. Does anyone have any more ideas?
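The per-minute scheme above can be sketched in plain Java. This is a stand-in for illustration only: the counter would really live in memcache (via `MemcacheService.increment`) and the flushed rate would be saved to a datastore entity; the `HitRateTracker` class and its method names are made up.

```java
public class HitRateTracker {
    private long count = 0;       // stand-in for the memcache counter
    private long sinceMillis;     // when the current window started

    public HitRateTracker(long nowMillis) {
        this.sinceMillis = nowMillis;
    }

    // Called on each page view (memcache increment in a real GAE app).
    public synchronized void recordHit() {
        count++;
    }

    // Called once a minute: returns hits/second over the elapsed window,
    // then resets the count and the "since" timestamp.
    // In a real app the result would be persisted to the datastore.
    public synchronized double flush(long nowMillis) {
        double seconds = (nowMillis - sinceMillis) / 1000.0;
        double rate = seconds > 0 ? count / seconds : 0.0;
        count = 0;
        sinceMillis = nowMillis;
        return rate;
    }
}
```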
Try the new technique mentioned in this post: http://googleappengine.blogspot.com/2012/07/analyzing-your-google-app-engine-logs.html.
It requires some additional work, but it's worth trying. I'm using Mache (a Java framework) to ingest App Engine logs into BigQuery, and the BigQuery API to query for results. Now pick a fancy JavaScript charting library and impress your users. A very powerful, flexible, and scalable solution.
Perhaps ProdEagle works for you. I think they do pretty much exactly what you want, and I believe they also have logic for handling data that is evicted from memcache without leaving a big hole in your graph.
I seem to remember that Twitter Commons has what you need, but I don't know whether it could easily be ported to GAE: https://github.com/twitter/commons
Consider using Mixpanel. You can submit arbitrary events and then extract aggregate information from an API... or just use the provided charts & graphs.
Since no one seems to have an answer for me, I'm going to assume that there's no common library for doing this and that I'll have to write one. I'll open-source it and link it from here if it turns out to be good code.
You could probably use Google Analytics. You'd just need to copy and paste some JavaScript into your templates.

Java Swing for Web Development by converting to Javascript

Good day!
With regard to my previous question about Java Swing being used for web development: I have a job interview today, and they told me that their company builds UIs in Swing, converts them to JavaScript, and deploys them on the web.
Can anyone explain this to me in more detail? What books or websites should I study to understand how this is done? Is this a good or common practice?
Thank you very much.
You can take a look at CreamTec's AjaxSwing. I've played around with it several times and it's the only product I know so far that takes your existing Swing GUI and converts it into something displayable in your browser.
Whether this is good practice or not is not really easy to answer. This solution works well as long as your application does not need to scale heavily. CreamTec states that their solution is suited to about 50 clients, IIRC.
The markup generated by AjaxSwing can in no way be called semantic but that is a common thing with these kinds of generators.
You can try AjaxSwing pretty easily since it does not require you to do much configuration but my recommendation is to use a dedicated web framework if you want higher scalability.
It sounds like GWT as well. Granted, that is not what the person said, but if it was a recruiter, they may have been confused about the exact technology.
AjaxSwing is a run-time tool and needs a server license for commercial use.
You might want to try Mia Transformer (www.mia-software.com). It converts Swing Java code to GWT Java code, and GWT compiles that to JavaScript; then, if you want, you can use the Google V8 engine for faster execution. Of course, it is not 100% complete. We are going to try it on a large project and see whether it works.
The other link given to us was http://swingweb.sourceforge.net/swingweb/. I have not checked it out, though.
I will keep you posted. If you come across a workable solution, please share.

Information Gathering on Dynamic Website Building & choosing its Architecture

I want to build a dynamic website. The architecture I am planning to use is Linux + Apache + MySQL + JSP/Java/Servlets. I have heard a great deal about the LAMP stack, but I don't know PHP. Please cite some differences between the two architectures in terms of scalability, security, code reuse, and so on.
Also, having said all this, I need to know where to get started. Any case study that could give me insight into how to go about building a complete dynamic website would help.
Thanks.
The short answer is: Architecture isn't about languages, it is about usages. You can make a really slow, non-scalable, insecure, kludgy mess of java just as easily as PHP.
That said.
PHP is traditionally a less structured language. It is not type-safe, and that is a double-edged sword, not a negative. My advice, as always, is to stick with what you know for anything mission-critical. But if you want to fool around with PHP, the best way is to install it and start playing.
Good resources are this site, php.net, and Google. PHP has a huge hacker culture around it; you'll find loads of info about just about every possible topic - good, bad, and everything in between.
EDIT:
One thing I would do is avoid learning a framework at the start. Learn raw, uncut PHP first.

Are there any tools to isolate the content of a webpage?

I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could write a parser to filter out that sort of extraneous material for that site specifically, but we hope to work on arbitrary sites that we may never have encountered before.
I feel it's a bit much to hope for, so I won't be surprised if nothing like this exists, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages against others from the same site, but it's imperfect and leaves comments and the like.
I am working in Java but would welcome anything open source, in any language, that I can use for ideas.
I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library that does exactly this. Performance, in my simple tests, is similar to Readability's.
http://code.google.com/p/boilerpipe/
You could try an unofficial API for arc90's Readability.
Basically, what Readability does is extract the content of a webpage and present it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds the content on a webpage are gone.
I'm also a bit late to this conversation, but...
The Java Boilerpipe extractors are probably what you want (ArticleSentencesExtractor, probably), although there is at least one port of arc90's Readability to Java on GitHub.
If you want to build a poor man's Boilerpipe, you might try diffing two pages from the same site (assuming they use the same template, you will likely get an interesting result).
The main difference between Boilerpipe, Readability, and a diff-based hack is that Boilerpipe strips out all HTML but preserves some structure.
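The diff-based hack rests on a simple heuristic: lines that appear on both pages are probably template (nav bars, footers), while lines unique to one page are probably content. A rough sketch, with an illustrative class name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TemplateStripper {
    // Keep only the lines of `page` that do not also occur in `otherPage`
    // from the same site; shared lines are assumed to be boilerplate.
    public static List<String> uniqueLines(String page, String otherPage) {
        Set<String> boilerplate = new HashSet<>(Arrays.asList(otherPage.split("\n")));
        List<String> content = new ArrayList<>();
        for (String line : page.split("\n")) {
            if (!boilerplate.contains(line)) {
                content.add(line);
            }
        }
        return content;
    }

    public static void main(String[] args) {
        String a = "NAV MENU\nArticle about crawlers\nFOOTER";
        String b = "NAV MENU\nArticle about parsers\nFOOTER";
        // Only the article line survives; nav and footer are shared
        System.out.println(uniqueLines(a, b));
    }
}
```

It is imperfect, as noted above: comments and any other per-page noise survive, and dynamic template fragments (timestamps, "related posts") leak through as false content.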
I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.
