Get URL hierarchy from a base link - java

Before asking my question (which is basically what the title says) I want to provide some background, to give a better picture of my situation.
I am writing a small application in Java, mainly for academic purposes, but also with a very specific task in mind. What this application does is build a URL hierarchy starting from a base URL, and later on give the user the ability to organize the links and perform some actions on them.
Imagine the following URLs:
http://www.example.com
http://www.example.com/sub001
http://www.example.com/sub002
http://www.example.com/sub002/ultrasub
I would like my program to retrieve this hierarchy when provided with the base URL http://www.example.com (or http://www.example.com/).
In my code I have a class capable of encoding URLs and I have already thought of a way to validate them; I just couldn't find a way to discover the URL hierarchy beneath the base URL.
Is there a direct way of doing it, or do I just have to download the files from the base URL and build the hierarchy from the relative and absolute links present in each file?
I am not asking for specific code, just a (somewhat) complete explanation of what way I could take to do it, with maybe some skeleton code to guide me.
Also, I am storing the URLs in a TreeMap<URL,Boolean> structure, in which the Boolean states whether the URL has already been analyzed. I chose this structure after a quick peek at the Java 7 API specification, but would you suggest a structure better suited to this purpose?
Thanks in advance :)

There is no way in the HTTP protocol to request all the URLs that are 'under' a given URL. You are out of luck.
Some protocols (ftp://..., for example) do have explicit mechanisms for this.
Some HTTP servers will return an index page if you request a 'directory', but this practice is not recommended and not many servers will do that.
The bottom line is that you have to follow links in order to determine what the server hierarchy is, and even then you may not discover a link to every area of the hierarchy.
EDIT: I should add that, as a well-behaved netizen, you should obey the robots.txt file on any servers you access.
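To illustrate the link-following approach, here is a minimal sketch of a crawler that only keeps links under the base URL. It uses the Jsoup library for the HTML parsing (an arbitrary choice, any HTML parser would do), and the base URL is just a placeholder. Note that it keys the map by String rather than URL: java.net.URL does not implement Comparable, so a TreeMap<URL,Boolean> would need an explicit Comparator, and URL's equals()/hashCode() can even trigger DNS lookups.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.Map;
import java.util.TreeMap;

public class SimpleCrawler {

    // Maps each discovered URL to a flag saying whether it has been analyzed yet
    private final Map<String, Boolean> discovered = new TreeMap<>();

    public void crawl(String baseUrl) {
        discovered.put(baseUrl, Boolean.FALSE);
        boolean foundNew = true;
        while (foundNew) {
            foundNew = false;
            // Iterate over a copy so new entries can be added while analyzing
            for (Map.Entry<String, Boolean> entry : new TreeMap<>(discovered).entrySet()) {
                if (!entry.getValue()) {
                    analyze(entry.getKey(), baseUrl);
                    discovered.put(entry.getKey(), Boolean.TRUE);
                    foundNew = true;
                }
            }
        }
    }

    private void analyze(String pageUrl, String baseUrl) {
        try {
            Document doc = Jsoup.connect(pageUrl).get();
            for (Element link : doc.select("a[href]")) {
                String abs = link.absUrl("href"); // resolves relative links against the page
                if (abs.startsWith(baseUrl) && !discovered.containsKey(abs)) {
                    discovered.put(abs, Boolean.FALSE);
                }
            }
        } catch (Exception e) {
            // Non-HTML content, 404s, timeouts, ... - just skip this URL
        }
    }

    public static void main(String[] args) {
        SimpleCrawler crawler = new SimpleCrawler();
        crawler.crawl("http://www.example.com/");   // placeholder base URL
        for (String url : crawler.discovered.keySet()) {
            System.out.println(url);
        }
    }
}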
EDIT2: (after comment on FTP mechanism)
The FTP protocol has many commands: see this wiki list. One of the commands is NLST, which "Returns a list of file names in a specified directory."
The URL specification (RFC 1738) makes special provision in the URL format for FTP URLs; in section 3.2.2:
The url-path of a FTP URL has the following syntax:
<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
....
If the typecode is "d", perform a NLST (name list) command with <name> as the argument, and interpret the results as a file directory listing.
I can see the effects when I try this from the commandline (not from a browser):
rolfl@home ~ $ curl 'ftp://sunsite.unc.edu/README'
Welcome to ftp.ibiblio.org, the public ftp server of ibiblio.org. We
hope you find what you're looking for.
If you have any problems or questions, please see
http://www.ibiblio.org/help/
Thanks!
and with ;type=d appended I get:
rolfl@home ~ $ curl 'ftp://sunsite.unc.edu/README;type=d'
HEADER.images
incoming
HEADER.html
pub
unc
README
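The same listing can be retrieved from Java. Here is a minimal sketch using the Apache Commons Net FTPClient (the host is the one from the curl example above; the anonymous credentials are placeholders):

import org.apache.commons.net.ftp.FTPClient;

public class FtpListing {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        try {
            ftp.connect("sunsite.unc.edu");
            ftp.login("anonymous", "guest@example.com");
            ftp.enterLocalPassiveMode();

            // Issues an NLST under the hood: returns just the names in the directory
            String[] names = ftp.listNames("/");
            if (names != null) {
                for (String name : names) {
                    System.out.println(name);
                }
            }
            ftp.logout();
        } finally {
            if (ftp.isConnected()) {
                ftp.disconnect();
            }
        }
    }
}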


Unable to access certain URLs with Selenium Java

I am in the process of writing a program whose purpose is centered around generating custom URLs for intelius.com and then extracting data from them with selenium. I have observed interesting behavior that I am unsure how to address.
My program creates URLs after the following pattern: https://intelius.com/people-search/LASTNAME/CITY-STATE, but I have found that attempting to access these constructed links consistently leads to a timeout error.
For example, http://intelius.com/people-search/Williams/Brooklyn-NY does not load the expected results page.
Digging around in the website's source, I have found what appears to be a link validator script (what exactly that means, I do not know) and am unsure how to proceed.
How exactly would I go about authenticating my queries, without programming selenium to manually input the data into the search textbox and to press the submit button? Is my link-construction approach flawed in some blatantly obvious manner? I am a bit lost and would appreciate some direction. Thanks!
I think your problem is using http instead of https, and omitting www from the URL. So this works:
https://www.intelius.com/people-search/Williams/Brooklyn-NY
The problem lies in the way the URL is being formed. You need to construct and pass the arguments the way the web application understands them. The following works:
https://www.intelius.com/people-search/William-Brooklyn/NY
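For illustration, a minimal Selenium sketch that builds the URL following the second pattern and loads it; the last name, city, and state values come from the question, and ChromeDriver is an arbitrary choice of driver:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class InteliusSearch {
    public static void main(String[] args) {
        // https scheme, www host, last name and city in one path segment, state in the next
        String lastName = "Williams";
        String city = "Brooklyn";
        String state = "NY";
        String url = "https://www.intelius.com/people-search/"
                + lastName + "-" + city + "/" + state;

        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            System.out.println("Loaded: " + driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}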

Create new site with REST API

I want to create a new site in Alfresco through the REST API. First I tried the URL /alfresco/service/api/sites; the site was created, but I could not open it. I read the method description and it says:
Note: this method only creates a site at the repository level, it
does not create a fully functional site. It should be considered for
internal use only at the moment. Currently, creating a site
programmatically needs to be done in the Share context, using the
create-site module. Further information can be found at the address
http://your_domain:8080/share/page/index/uri/modules/create-site.post
within your Alfresco installation.
I tried to go to the suggested URL but it gives a 404!
Any help or suggestions?
Note - the links/references in this answer all assume you've got the Alfresco Share application installed and available at http://localhost:8081/share/ - tweak as needed for your machine
When wanting to understand or discover webscripts, the first place you want to head to is http://localhost:8081/share/service/index - the Share WebScripts home. (The Alfresco repo tier has an equivalent one too, available at a similar URL).
When there, you'll see all of the Share-side WebScripts listed, of which there are a lot. You could search through that list for create-site. However, you can restrict the webscript listing by URL or by module. For the Create Site webscripts, the URL to see just those is http://localhost:8081/share/page/index/uri/modules/create-site
Head there, and you'll discover there are two create site related webscripts, a get and a post. As you've discovered already, the one you'll want is the POST webscript. Click on that to get the details, at http://localhost:8081/share/page/script/org/alfresco/modules/create-site.post - that's the latest (Alfresco 5.x) URL for the thing you've been directed to in your question. If your Share installation is at a different URL, then once you've navigated from the Share webscripts home, you'll get the specific one for your machine
Finally, you'd need to post the required JSON to that webscript's URI, which is given in the webscript listing, e.g. http://localhost:8081/share/page/modules/create-site. The easiest way to see exactly what JSON you need is to use Firebug / developer tools / etc. to see the handful of keys/values that Share sends when you create a site through the UI.
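As a rough sketch (not a definitive recipe), such a POST with a plain HttpURLConnection might look like the following. The field names shortName, title, description, visibility and sitePreset are assumptions about what the Share UI sends, so verify them with your browser's dev tools; the webscript also runs in the Share context, so the request needs an authenticated Share session, represented here by a placeholder cookie:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateSite {
    public static void main(String[] args) throws Exception {
        // Assumed field names - check what Share actually sends when creating a site in the UI
        String json = "{"
                + "\"shortName\": \"my-new-site\","
                + "\"title\": \"My New Site\","
                + "\"description\": \"Created via the create-site module\","
                + "\"visibility\": \"PUBLIC\","
                + "\"sitePreset\": \"site-dashboard\""
                + "}";

        URL url = new URL("http://localhost:8081/share/page/modules/create-site");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        // Placeholder: replace with a real authenticated Share session cookie
        conn.setRequestProperty("Cookie", "JSESSIONID=<your-share-session-id>");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Response code: " + conn.getResponseCode());
    }
}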

JSP/Tomcat: Navigation system with sub-folders but one page

My JSP project is the back-end of a fairly simple site whose purpose is to present many submissions. They are organized in categories, much like a typical forum.
The content is loaded entirely from a database since making separate files for everything would be extremely redundant.
However, I want to give the users the possibility to navigate properly on my site and also give unique links to each submission.
So for example a link can be: site.com/category1/subcategory2/submission3.jsp
I know how to generate those links, but is there a way to automatically redirect all the theoretically possible links to the main site.com/index.jsp?
The Java code of the JSP, of course, needs access to the original link.
Hope someone has an idea.
Big thanks in advance! :)
Alright, in case someone stumbles across this one day...
The way I've been able to solve this was by using a servlet. Eclipse allows you to create one directly in the project, and the wizard even lets you set the url-mapping, for example /main/*, so you don't have to mess with web.xml yourself.
The doGet function simply contains the redirection as follows:
request.getRequestDispatcher("/index.jsp").forward(request,response);
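For reference, a complete version of such a servlet might look like this (a sketch assuming a Servlet 3.0+ container, so the @WebServlet annotation replaces the web.xml mapping; the class name is arbitrary):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Every request under /main/* is forwarded to index.jsp
@WebServlet("/main/*")
public class NavigationServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // index.jsp can recover the originally requested link via
        // request.getPathInfo() or request.getRequestURI()
        request.getRequestDispatcher("/index.jsp").forward(request, response);
    }
}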
This kind of forwarding unfortunately causes all relative links in the webpage to fail, because the browser still shows the original URL and resolves relative resources against it. This can be worked around by linking resources from the root directory, for example. See the neat responses here for alternatives: Browser can't access/find relative resources like CSS, images and links when calling a Servlet which forwards to a JSP

Deleting a file that resides on an internal Linux server via my application

I've recently been working on an enhancement for an application at work that will allow users to delete presentations stored not on their local machine but on a Linux server present on the internal network. My problem is that I am not sure how to go about performing this delete. The location of the files are as follows:
http://ipaddress/dataconf/productusers/**ACCOUNT**/presentations/
I have access to the ACCOUNT name which is a parameter that will need to be passed in to navigate to the right directory. I will also have access to the presentation name which will be needed to specify the correct presentation to delete.
What I am having trouble with is where to begin.
I am using the Spring framework so my code is a mixture of Java, JSP, and JavaScript.
Essentially I have a .jsp page where I lay out the presentations that are associated with each account, i.e. when you click on an account it makes a call to a database and lists the presentations that are associated with that account. You can then select individual presentations and delete them, or press one delete-all button and delete them all.
I currently have it working so that when you delete a presentation in my application, it deletes the appropriate record from the database, but I also need to delete the physical presentation which is the basis for this question. Just as an FYI, these requests (get presentations from database, remove presentations from database) are all being handled through AJAX and JSON.
I am hoping to learn how to create a connection to the correct server, navigate to the proper directory as specified above, and issue the Linux command "sudo rm file-name" all in the same delete process that I described in the prior paragraph.
If you could point me in the right direction, any help would be much appreciated. Also, if you need any further clarification please feel free to let me know.
Thanks again,
Dave
This will not be easy. Or maybe it will. Please understand first that simply knowing where some files are published on an HTTP server is basically useless in terms of manipulating those files.
So I understand the following: You have your own web application on server A, a database somewhere, and some files located on another web server B. Internally on server B, the files will be in some weird directory e.g. /var/www/docs/whoknowswhat/somefolder/dataconf/productusers.
What you need to do is to somehow expose this folder on server B over the network to your server A. Talk to your admin people. Maybe NFS is an option, or maybe Samba, or SSHFS. Make sure you have write permissions, and also make sure that no one else does.
Once you have mounted the location from B in your server A and it is available to you as some directory /mnt/serverB/productusers, then all you have to do is something like this, i.e. File f = ...; f.delete();
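A minimal sketch of that last step, assuming the remote folder ends up mounted at /mnt/serverB/productusers (a placeholder path):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PresentationCleaner {

    // Placeholder for wherever the remote folder is mounted on server A
    private static final Path BASE = Paths.get("/mnt/serverB/productusers");

    public static void deletePresentation(String account, String presentationName) throws IOException {
        Path file = BASE.resolve(account).resolve("presentations").resolve(presentationName);
        // deleteIfExists avoids an exception if the DB record existed but the file is already gone
        Files.deleteIfExists(file);
    }
}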
I did a little research and stumbled upon a neat solution to accomplish what I am trying to do. If you take a look at the following link:
http://www.journaldev.com/246/java-program-to-run-shell-commands-on-ssh-enabled-system
The above site describes a method in which you can open an SSH connection in Java and execute commands as if you were running them from the terminal. It has come in handy for my problem and I hope that, if anyone else is experiencing the same problem, this will help them as well. Feel free to let me know what you think.
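That article is based on the JSch library. As a rough sketch of the same idea (the host, credentials, and server-side path are placeholders; running a bare rm assumes the SSH user already has permission to delete the files, otherwise you would need passwordless sudo or similar):

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteDelete {

    public static void deletePresentation(String account, String presentationName) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("deploy-user", "server-b.internal", 22); // placeholders
        session.setPassword("secret");                       // or use key-based auth
        session.setConfig("StrictHostKeyChecking", "no");    // acceptable on a trusted internal network
        session.connect();

        // Server-side path is a placeholder - adjust to the real directory behind the HTTP URL
        String command = "rm /var/www/dataconf/productusers/" + account
                + "/presentations/" + presentationName;

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand(command);
        channel.connect();
        while (!channel.isClosed()) {
            Thread.sleep(100);            // wait for the command to finish
        }
        int exitStatus = channel.getExitStatus();
        channel.disconnect();
        session.disconnect();

        if (exitStatus != 0) {
            throw new IllegalStateException("rm failed with exit status " + exitStatus);
        }
    }
}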

Make GWT Crawlable (SEO)

I would like to make my GWT app crawlable by the Google bot. I found this article (https://developers.google.com/webmasters/ajax-crawling/). It states there should be a servlet filter that serves a different view to the Google bot. But how can this work? If I use, for example, the Activities and Places pattern, then the page changes happen on the client side only and there is no servlet involved, so a servlet filter does not work here.
Can someone give me an explanation? Or is there another good tutorial tailored to gwt how to do this?
If you use Activities & Places, your "pages" will have a bookmarkable URL (usually composed of the HTML host page, a #, and some tokens separated by ! or another character).
Thus, you can place plain links (<a> elements) in your application to make it crawlable. If a link contains the proper structure (the one with # and tokens), it will navigate to the proper Place.
Have a look at https://developers.google.com/web-toolkit/doc/latest/DevGuideMvpActivitiesAndPlaces
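For illustration, a small sketch of such a crawlable, history-token-based link using GWT's Hyperlink widget; the token format ("!detail:42") and the "navigation" element id are only placeholders, since the real token is whatever your PlaceHistoryMapper / PlaceTokenizer produces:

import com.google.gwt.user.client.ui.Hyperlink;
import com.google.gwt.user.client.ui.RootPanel;

public class NavigationLinks {

    public void addLinks() {
        // Clicking this changes the history token, which the PlaceHistoryHandler
        // maps to a Place; the same URL can also be followed by a crawler.
        Hyperlink detailLink = new Hyperlink("Show item 42", "!detail:42");
        RootPanel.get("navigation").add(detailLink);   // placeholder element id
    }
}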
So here is the solution to the actual problem:
I wanted to make my GWT app (running on Google App Engine) crawlable by the Google bot and followed this documentation: "https://developers.google.com/webmasters/ajax-crawling/". I was trying to apply a servlet filter that filters every request to my app, checks for the special fragment that the Google bot adds to the escaped URL, and presents a special view to the bot with a headless browser.
But the filter did not work for the "MyApp.html" file. I then found out that all files are treated as static files and are not affected by the filter. I had to exclude the ".html" files from these static files. I did this by adding an exclude entry for the ".html" files to the static-files section of "appengine-web.xml".
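For anyone trying the same thing, a rough sketch of such a filter is below. The crawler rewrites "#!token" URLs into "?_escaped_fragment_=token", so the filter only has to check for that parameter; the "/snapshot" target that actually produces the pre-rendered view is a placeholder and not shown here:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;

@WebFilter("/*")
public class CrawlerFilter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        // The Google bot rewrites "#!token" into "?_escaped_fragment_=token"
        String fragment = httpRequest.getParameter("_escaped_fragment_");
        if (fragment != null) {
            // Hand the request to whatever renders the static snapshot for this token
            // (headless browser, pre-rendered HTML, ...); URL-encode the token in real code
            request.getRequestDispatcher("/snapshot?fragment=" + fragment)
                   .forward(request, response);
        } else {
            chain.doFilter(request, response);
        }
    }

    @Override
    public void destroy() { }
}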
I hope this will help some people with the same problem to save some time :)
Thanks and best regards
jan
