Crawlers in JSP/Struts/Session controlled Webapps

I have a Struts web application (running on Tomcat 6). All files except the first one, which invokes a starting action, are located in WEB-INF, and you always need a session to use the app; otherwise you are redirected to the starting action and the starting page again.
The app's main function is a search that returns products from a database. How does a crawler navigate my app? Does it trigger the search, which could lead it to error pages? Or can it only follow links that are not embedded in forms? (Struts turns nearly everything into forms, so there are only a few links and mostly onclick redirects and form actions.)
How can I provide useful, indexable information to a crawler in an app like this?
Thanks for any advice :)

Sounds like you would be best off reading up on some SEO guidelines: http://www.google.com.au/search?q=seo+guidelines&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&safe=high
To answer your questions:
Crawlers will generally navigate to your app from external links on the web, or after you submit your site to the search engine.
The crawler won't fill in inputs and submit forms; it will follow hyperlinks between your pages.
If you want the crawler to index your search results (can't really see why you would want this), you can put links to common searches on one of your already indexed pages.
You should make sure that your product pages are SEO friendly and are indexed instead of your search results.
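A minimal sketch of the "links to common searches" idea, assuming a hypothetical search action at /search.do that accepts a q parameter over GET (neither name comes from the question). It only builds plain anchor tags, which a crawler can follow without submitting a form:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

class CommonSearchLinks {

    // Hypothetical GET-accessible search action; substitute your own mapping.
    static final String SEARCH_ACTION = "/search.do";

    // Builds one plain <a href> element for a search term.
    static String linkFor(String term) {
        try {
            String query = URLEncoder.encode(term, "UTF-8");
            return "<a href=\"" + SEARCH_ACTION + "?q=" + query + "\">"
                    + escapeHtml(term) + "</a>";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Escapes the characters that would break the link text.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Illustrative search terms only.
        List<String> commonSearches = Arrays.asList("garden chairs", "led lamps");
        for (String term : commonSearches) {
            System.out.println(linkFor(term));
        }
    }
}
```

For this to help, the linked action has to answer without an existing session, since a crawler's first request arrives without one.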

Related

JSP/Tomcat: Navigation system with sub-folders but one page

My JSP project is the back-end of a fairly simple site whose purpose is to present many submissions. They are organized in categories, basically similar to a typical forum.
The content is loaded entirely from a database since making separate files for everything would be extremely redundant.
However, I want to give the users the possibility to navigate properly on my site and also give unique links to each submission.
So for example a link can be: site.com/category1/subcategory2/submission3.jsp
I know how to generate those links, but is there a way to automatically redirect all the theoretically possible links to the main site.com/index.jsp?
The Java code of the JSP needs access to the original link of course.
Hope someone has an idea..
Big thanks in advance! :)

Alright, in case someone stumbles across this one day...
I was able to solve this by using a Servlet. Eclipse lets you create one directly in the project, and the wizard even lets you set the URL mapping, for example /main/*, so you don't have to edit web.xml yourself.
The doGet method simply contains the forward, as follows:
request.getRequestDispatcher("/index.jsp").forward(request,response);
Unfortunately, this causes all relative links in the web page to fail: a forward happens on the server, so the browser still sees the originally requested URL and resolves relative links against it. This can be solved by linking from the root directory, for example. See the neat responses here for alternatives: Browser can't access/find relative resources like CSS, images and links when calling a Servlet which forwards to a JSP
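To give the JSP access to the original link, the servlet can read the part of the URL after its mapping with request.getPathInfo() and split it into category, subcategory and submission. The following is only a sketch of that splitting step in plain Java; the class and method names are hypothetical, and the input string stands in for what getPathInfo() would return under a /main/* mapping:

```java
import java.util.ArrayList;
import java.util.List;

class SubmissionPath {

    // Splits a path such as "/category1/subcategory2/submission3.jsp"
    // into its segments and drops a trailing ".jsp" extension.
    static List<String> segments(String pathInfo) {
        List<String> result = new ArrayList<>();
        if (pathInfo == null) {
            return result; // no extra path after the servlet mapping
        }
        for (String part : pathInfo.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            if (part.endsWith(".jsp")) {
                part = part.substring(0, part.length() - ".jsp".length());
            }
            result.add(part);
        }
        return result;
    }

    public static void main(String[] args) {
        // In the servlet this string would come from request.getPathInfo().
        System.out.println(segments("/category1/subcategory2/submission3.jsp"));
    }
}
```

In doGet, the resulting list could be stored with request.setAttribute(...) before the forward so that index.jsp can look up the matching database entry.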

how to extract HTML data from a webpage which scrolls down for a fixed number of times?

I want to extract HTML data from a website using Java. The problem is that the webpage keeps loading more content each time the user scrolls to the bottom of the page. The number of times this happens is fixed. My Java code can only extract the first part. How do I extract the content from the remaining scrolls? Is there a way to load the whole page at once with Java? Any help would be appreciated :)

This might be the type of thing that PhantomJS (http://phantomjs.org/) was designed for. It loads entire web pages and even executes JavaScript, using a "real" browser in headless mode. I suggest stopping what you're doing with Java and taking a look at PhantomJS instead. It could save you a LOT of time. :)

This type of behavior is implemented in the browser, which interprets the user's scrolling actions to load more content via AJAX and dynamically modifies the in-memory DOM. Consider that your Java runs in a web container on the server, and that the web container (e.g. Tomcat, JBoss) provides a huge amount of underlying code so your app doesn't have to worry about the plumbing.
Conceptually, a similar thing occurs at the client, with the DHTML web page running in its own "container" (the browser), which provides a wealth of functionality, from UI to networking to the DOM. If you remove the browser from the equation and replace it with a Java program, you will need to provide the equivalent of the browser in which the DHTML/JavaScript can execute.
I believe that HtmlUnit may fit the bill, but I have not worked with it personally.

how can I know if the user has left the page in wicket?

I am looking for a way to detect when the user leaves the page without having saved their changes, and then show Wicket's modal window (preferable, but a confirmation box would also do).
Additional info:
The solution should have minimal impact on the code, because I have about 30 pages that need this behavior. All my web pages extend from one called LayoutPage, something similar to this
I tried a pure JavaScript solution like the one in this question, but the application sends a lot of data via AJAX requests, so I couldn't find a clean way to know whether the data had been sent to the server
Then I started looking at the source code of Wicket's Form class. It has a nice method called isSubmitted(); I could use it if I were able to know from Wicket that the user is about to leave the page.
I don't want to write a validation for each page in the system.

Simply generate your browser's onbeforeunload handler using https://cwiki.apache.org/WICKET/calling-wicket-from-javascript.html. In the callback you can then check the state of your form or page.

When is it acceptable to use a FRAMESET

In my Java EE application page, I have a header.jsp, a side menu.jsp, a body.jsp and a footer.jsp. The side menu contains the jQuery dynatree plugin. When a user clicks a menu item in the tree, the body should be replaced with the appropriate page (also a .jsp). I am using the Tiles framework, where I import all JS code in the layout.jsp page. I want to achieve an effect replicating a frameset, but without actually using a frameset. I think framesets are difficult to manage and take time to load.
Can anyone suggest how I can approach this problem? If I use AJAX to fetch each page when dynatree node is activated, then I have to manually update the page. If I use an IFRAME in body.jsp, then I have to reimport all plugin js code as the frame will not be able to access js functionality on the main page.
I want efficient html page management.

Since you are using jQuery, you should be able to use AJAX in combination with the live method of applying events (see the docs or here). This method is called "event delegation", and even though jQuery will do it for you like magic, you should understand what is happening. Depending on what version of jQuery you are using, you might use delegate instead of live - essentially the same thing.
Framesets are actually deprecated in HTML5 -- you should avoid using them because soon they will not be supported at all in newer user agents. See http://www.useit.com/alertbox/9612.html for a lengthy discussion that should hopefully dissuade you from considering that approach.
The IFRAME approach is a hack. You might be able to make it work, but you're hammering a square peg into a round hole.
Bottom line, if you don't want to directly deep link to inner pages, AJAX is the best and preferred solution. In combination with event delegation, it really is superior to any older or hacky solution. And, be sure to use the idea of "progressive enhancement" -- if someone clicks those links and has javascript turned off, the content should still load. That means you start with regular direct links, then add the fancy stuff on to it for those users that have javascript enabled. Otherwise, you close a percentage of users off from anything past your home page.
When you use AJAX for your navigation, you still need to plan for a user that doesn't understand the difference between when they click a link on your site or any other site. They'll use the browser's "back" button and end up back at Google instead of on the last page! That's because their navigation through your site does not look like unique pages to their browser. There are tools in newer browsers to deal with this, but the details are a little beyond the scope of this answer. Check out this article on MDN for more info on manipulating the browser history.
Documentation
jQuery's live - http://api.jquery.com/live/
jQuery's delegate - http://api.jquery.com/delegate/
David Walsh on event delegation - http://davidwalsh.name/event-delegate
Jakob Nielsen on Framesets - http://www.useit.com/alertbox/9612.html
MDN on Browser History Modification - https://developer.mozilla.org/en/DOM/Manipulating_the_browser_history
jQuery for Designers blog with a sample use-case for delegate - http://jqueryfordesigners.com/simple-use-of-event-delegation/
Wikipedia article about progressive enhancement - http://en.wikipedia.org/wiki/Progressive_enhancement

What technologies to use for web app hosting web page and logic to interrogate web page

I would like to create a web application that holds a web site in the right half of the screen, and some widgets with logic in the left half.
For example, the user might specify that they want www.somesite.com to be analysed, so that site will show in the right half of my web page. I then want the user to be able to hit something like a 'record' button on the LHS, and that will then record what they do on the RHS (in the www.somesite.com) until the 'record' button is pressed again.
My first thought was an IFrame holding the web site (www.somesite.com in this case) and an applet in the LHS. The user hits record, the applet gets hold of the DOM for the web page and adds listeners on every textbox/button etc. I am not sure how the applet would get hold of the DOM (netscape.javascript.JSObject looked promising), but instead of fighting that, I thought I would ask if there is more appropriate technology other than IFrame/Applet? Any solution would have to be Java based. The above explanation of what the web app does is a vast over-simplification; I am only after help on the technologies that could be used to achieve something like it and not on the web app itself, which is really just for illustration purposes.
Many Thanks,
Paul

Javascript/Java from your site will -not- be able to access properties of the DOM from another domain, whether in an iFrame or otherwise. This is a basic security principle that browsers implement, and rightly so.
See this Wikipedia page: http://en.wikipedia.org/wiki/Same_origin_policy
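As a rough illustration of what "same origin" means, the sketch below compares two URLs by scheme, host and port, which is the core of the rule. It is a simplification written for this answer, not the browser's actual implementation, and www.mysite.example is a made-up host standing in for your own application:

```java
import java.net.URI;

class SameOrigin {

    // Returns the explicit port, or the default port for http/https.
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        if ("https".equalsIgnoreCase(uri.getScheme())) {
            return 443;
        }
        if ("http".equalsIgnoreCase(uri.getScheme())) {
            return 80;
        }
        return -1;
    }

    // Two URLs share an origin when scheme, host and port all match.
    static boolean sameOrigin(String first, String second) {
        URI a = URI.create(first);
        URI b = URI.create(second);
        return a.getScheme() != null && a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getHost() != null && a.getHost().equalsIgnoreCase(b.getHost())
                && effectivePort(a) == effectivePort(b);
    }

    public static void main(String[] args) {
        // The hosting page (hypothetical) versus the framed site from the question.
        System.out.println(sameOrigin("http://www.mysite.example/app",
                                      "http://www.somesite.com/"));
        // Same scheme and host, default port written explicitly.
        System.out.println(sameOrigin("http://www.somesite.com/a",
                                      "http://www.somesite.com:80/b"));
    }
}
```

The first comparison is the situation in the question: the page with the widgets and the framed www.somesite.com differ in host, so script on one cannot reach the DOM of the other.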
