Setting HTTP_REFERER for URLs on webpages? - java

I have a webpage which links to other pages.
I want to be able to set the HTTP_REFERER on the URLs that are clicked.
What options do I have?

What options do I have?
None really. The browser sets this automatically.
The only thing you can do is redirect to a script (under your control) like
http://example.com/redirect.php?url=........
That script (in this case, PHP) would then issue a header redirect to the target, and its URL would show up in the receiving site's HTTP_REFERER header.
Also, linking from an https:// page to a plain http:// one will typically drop the referrer. See the Wikipedia article on referrer hiding.
Other than that, there is nothing you can do to alter it. There is definitely no way to set it to an arbitrary value from within a web site.
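Since the question is tagged java, the same idea as the PHP example can be sketched as a small servlet. This is only a sketch: the /redirect path, the RedirectServlet class name and the url parameter are assumptions, not anything that exists in your application.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical intermediate endpoint: link to /redirect?url=http://target.example
// instead of linking to the target directly.
@WebServlet("/redirect")
public class RedirectServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String target = req.getParameter("url");
        if (target == null || target.isEmpty()) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "missing url parameter");
            return;
        }
        // Issue the header redirect to the target; what Referer the receiving
        // site finally sees is still decided by the visitor's browser.
        resp.sendRedirect(target);
    }
}
In practice you would also want to whitelist the allowed targets, otherwise the endpoint can be abused as an open redirect.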

JAVA: how to download webpage dynamically created by servlet

I want to download the source of a webpage to a file (*.htm) (i.e. the entire content, with all HTML markup) from this URL:
http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353
which works perfectly fine with the FileUtils.copyURLToFile method.
However, the said URL also has some links, for instance one which I'm very interested in:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
This link works perfectly fine if I open it with a regular browser, but when I try to download it in Java by means of FileUtils, I get only a page with no content and a single message "trwa ladowanie danych" (which means: "loading data...") -- then nothing happens and the target page is never loaded.
Could anyone help me with this? From the URL I can see that the page uses servlets -- is there a special way to download pages created with servlets?
Regards --
This isn't a servlet issue - that just happens to be the technology used to implement the server, but generally clients don't need to care about that. I strongly suspect it's just that the server is responding with different data depending on the request headers (e.g. User-Agent). I see a very different response when I fetch it with curl compared to when I load it in Chrome, for example.
I suggest you experiment with curl, making a request which looks as close as possible to a request from a browser, and then fiddling until you can find out exactly which headers are involved. You might want to use Wireshark or Fiddler to make it easy to see the exact requests/responses involved.
Of course, even if you can fetch the original HTML correctly, there's still all the Javascript - it would be entirely feasible for the HTML to contain none of the data, but for it to include Javascript which does the actual data fetching. I don't believe that's the case for this particular page, but you may well find it happens for other pages.
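If you want to run that experiment from Java rather than curl, here is a minimal sketch using HttpURLConnection. The User-Agent string, and the assumption that the headers are what makes the difference, are guesses to refine against what your browser actually sends.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchWithBrowserHeaders {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Mimic a browser request; tweak these until the response matches the browser's.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) Gecko/20100101 Firefox/40.0");
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}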
Try using Selenium WebDriver with HtmlUnitDriver to load the main page (the boolean constructor argument enables JavaScript support):
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

HtmlUnitDriver driver = new HtmlUnitDriver(true); // true = enable JavaScript
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
driver.get(baseUrl); // baseUrl = the DetailsServlet page from the question
and then navigate to the link:
driver.findElement(By.name("name of link")).click(); // replace with a locator matching the actual link
UPDATE: I checked the following: if I turn off the cookies in Firefox and then try to load my page:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
then I get the incorrect result, just like in my Java app (i.e. the page with the "loading data" message instead of the proper content).
Now, how can I manage cookies in Java so that this page downloads properly?
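One possible approach (a sketch, untested against this particular site): install a default CookieManager so that the URL connections used by FileUtils keep the session cookies, load the DetailsServlet page first to obtain them, and only then fetch the RelatedServlet URL.
import java.io.File;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import org.apache.commons.io.FileUtils;

public class FetchWithCookies {
    public static void main(String[] args) throws Exception {
        // Accept and store cookies for every URL connection opened by this JVM.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        // First request: lets the server set its session cookies.
        FileUtils.copyURLToFile(
                new URL("http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353"),
                new File("details.htm"));

        // Second request: sent with the cookies obtained from the first one.
        FileUtils.copyURLToFile(
                new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true"),
                new File("related.htm"));
    }
}
If the page still comes back with only the "loading data" message, cookies alone are not the whole story and the HtmlUnitDriver route above is the safer bet.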

App context root path prepended to external link

Organizations registering on my application can provide their external website's URL for their profile page. The resulting HTML when displaying the link to their site is example.com (confirmed by inspecting the page in Chrome). When hovering over the link, or actually clicking it, the URL is apparently interpreted as relative and https://localhost:8443/MyWebApp/profile/ is prepended to it.
Do I have to check and possibly modify links that users input, or is there likely something in my configuration that is causing this behavior?
EDIT: Is there a simple method of countering this, such as a JSP tag or a URL rewriter (Tuckey)?
This is the expected behaviour. Since the provided URL does not begin with a protocol (http, https, ftp, whatever) it is considered relative, and since it does not start with a /, it is considered relative to the current URL.
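If you want to keep accepting whatever the organizations type in, one simple counter-measure is to normalize the stored value before rendering the href. A sketch follows; the helper name toAbsoluteUrl is made up, and defaulting to http is an assumption you may want to change.
// Hypothetical helper: make a user-supplied URL absolute before putting it in an href.
public static String toAbsoluteUrl(String userUrl) {
    if (userUrl == null || userUrl.trim().isEmpty()) {
        return "";
    }
    String url = userUrl.trim();
    // Already has a scheme (http:, https:, ftp:, ...) or is protocol-relative: leave it alone.
    if (url.matches("(?i)^[a-z][a-z0-9+.-]*:.*") || url.startsWith("//")) {
        return url;
    }
    // Otherwise assume a bare host like "example.com" and default to http.
    return "http://" + url;
}
With that, example.com is rendered as http://example.com, which the browser treats as absolute instead of resolving it against /MyWebApp/profile/.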

how to force tomcat to always redirect the 1st request to the login page in case of form-based authentication (tomcat realm)

I have a somewhat strange requirement. My application is written in JSP and the server is Tomcat 7. I am using form-based authentication. Here is my problem description.
Let's say I am logged in to my application in one IE browser tab. Now I open a new tab and click the bookmarked application URL. As expected, since I was already logged in and the browser session is detected, instead of landing on the login page the application goes straight to the status home page.
The requirement is that even if the user is logged in in one browser tab and a valid browser session is available, the user should always be taken to the login page, rather than landing directly on the home page, when trying to log in in another browser tab.
Quick help would be appreciated.
I do not think your client fully understands what they are asking of you.
Imagine we could invent something quite nasty in JavaScript, or with the Referer header, or something like that, in order to achieve what they want. What if your user then entered different credentials in tab #2? Is your client aware that the session open in tab #1 is the same for both of them?
Make them understand that they are trying to override basic behaviour of web browsers, and that even if they succeeded it would be useless. Besides that, from a usability point of view it would harm your application, since it would trick naive users into thinking they can open many sessions in the same browser instance... good luck!
Have the domain name (assuming that is the URL that is bookmarked) redirect to the login page, and ensure this page is displayed even if the user is already logged in.
Then, if necessary (if they currently use a URL that's just your domain name), change your 'home' link, logo link, etc. to the URL of your home page.
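If you go down that route, one way to sketch the "always land on the login page first" part is a trivial welcome servlet mapped to the context root that redirects to the login page. This is only a sketch: the class name and the login.jsp path are assumptions, and whether the container then re-authenticates cleanly when a session already exists is exactly the problem the other answer warns about.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical welcome servlet: every request to the bare domain is sent to the login page.
@WebServlet("")
public class WelcomeServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Drop any existing session if the requirement really is
        // "type credentials again in every new tab".
        if (req.getSession(false) != null) {
            req.getSession(false).invalidate();
        }
        resp.sendRedirect(req.getContextPath() + "/login.jsp");
    }
}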

Advice on crawling website content

I am trying to crawl some website content using a jsoup and Java combination, save the relevant details to my database, and repeat the same activity daily.
But here is the deal: when I open the website in a browser I get the rendered HTML (with all the element tags in place). The JavaScript part, when I test it, works just fine (the one which I'm supposed to use to extract the correct data).
But when I do a parse/get with jsoup (from a Java class), only the initial page is downloaded for parsing. In other words, there are some dynamic parts of the website, and since they are rendered after the GET, asynchronously in the browser, I'm unable to capture them with jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.
You need to check first whether the website you're crawling requires any of the following to show all of its content:
Authentication with login/password
Some sort of session validation in the HTTP headers
Cookies
Some sort of time delay to load all the content (sites that lean heavily on JavaScript libraries, CSS and asynchronous data may need this)
A specific browser User-Agent
A proxy password if, for example, you're inside a corporate network security configuration
If anything on this list is needed, you can supply that data via the parameters of your jsoup.connect() call, as in the sketch below the link. Please refer to the official doc:
http://jsoup.org/cookbook/input/load-document-from-url
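As a rough sketch of what that can look like (the URL, User-Agent, cookie and timeout values are placeholders, not the ones your particular site needs):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CrawlExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/page")
                .userAgent("Mozilla/5.0 (Windows NT 6.1) Gecko/20100101 Firefox/40.0") // specific browser UA
                .cookie("JSESSIONID", "value-captured-from-a-real-session")            // placeholder cookie
                .timeout(30 * 1000)                                                    // give slow pages some time
                .get();
        System.out.println(doc.title());
    }
}
Keep in mind that jsoup only fetches and parses the HTML it is given; it does not run JavaScript, so content the site injects asynchronously after the initial GET will still be missing no matter which headers or cookies you send.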

URL switching between subdomains

I have a site, say http://info.sys.com
I want the info part of the URL to be replaced by knowledge when I select the Knowledge tab on my website, i.e. info.sys.com should become knowledge.sys.com.
I use JDK 1.5 update 9 and Tomcat 6.0.16.
Looking forward to your reply.
If you change the URL (location.href = 'http://knowledge.sys.com';), the page will be reloaded -- well, actually, the page at that address will be loaded (whether that's the same page or not will depend on your server).
There are games you can play with anchors, though (the "hash" part of the location). Check out Really Simple History for more on that.
Changing the URL field programmatically on the client side will trigger the browser to refresh the page with the updated URL.
This is considered a security feature which guarantees that the URL field is always showing the address of the rendered resource.
You can use a URL Rewriting Engine on your server if you cannot host your knowledge base at knowledge.sys.com. This could be configured to handle requests to knowledge.sys.com without having to change your application file structure.
