I am interested in logging where a user comes from when they access my web app.
I thought of using HTTP's referrer header for that, but from e.g. the HTTP referer article on Wikipedia it seems that this is not an accurate/reliable way, since in many cases it is not sent.
I was wondering: is the referrer header the only way? Is there a better/standard approach?
A reliable way would be to have a GET parameter such as ?ref=somehash.
For example:
Consider this site, SO: it has a list of questions, and there is a portlet that streams the recent questions to some other site, for example abcd.com. To see whether the user clicked the link from abcd.com, you pass a parameter ?ref=423jahjaghr, where that string maps to abcd.com.
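A minimal sketch of that idea as a Java servlet; the class name, the second token and the token-to-site mapping are made up for illustration:

    import java.io.IOException;
    import java.util.Map;
    import javax.servlet.http.*;

    // Hypothetical servlet: logs which partner site a visitor came from,
    // based on a ?ref=... token handed out to each partner.
    public class RefLoggingServlet extends HttpServlet {

        // Illustrative mapping only; in practice this would live in a database.
        private static final Map<String, String> REF_SOURCES = Map.of(
                "423jahjaghr", "abcd.com",
                "9f8e7d6c", "example.org");

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String ref = req.getParameter("ref");
            String source = (ref != null) ? REF_SOURCES.getOrDefault(ref, "unknown") : "direct";
            System.out.println("Visit from: " + source);  // replace with your real logging
            resp.sendRedirect("/");  // continue to the real landing page
        }
    }

Unlike the referrer header, the ref token is fully under your control, but you do have to generate and distribute a token for every partner link you hand out.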
The referrer header isn't the only way, but it is the most standard one.
You can consider using Google Analytics, which has extra referrer capabilities, but you'd have to manually set up collecting the data from their services to feed into your logging infrastructure.
Nothing is going to be 100% foolproof, though. It's pretty straightforward to block Google Analytics and spoof referrers, and HTML5 will make it even easier to prevent sending referrer information.
If it's mission critical that you know the referrer of all inbound traffic, you'll have to come up with a more draconian approach (like @Jigar Joshi has suggested).
Depending on the browser, you may or may not get the referrer header; you will not always get it. To get the referrer reliably, you have to use a request parameter or a form field instead.
The HTTP Referer header is a good way to analyze logs and to maintain analytics on user interactions. However, a browser or any other system that displays web pages and follows links might not send this header.
You might also consider using a third-party application like Google Analytics, but you should check whether such a tool is legal in your country; most of them have data-privacy issues.
One important point: for analytics it is OK to tolerate a certain amount of error in the results, but never do any security-related checks on the HTTP referer. Someone can send whatever they want as the referer.
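A minimal sketch of logging the header in a Java servlet, for logging only and never for security decisions (the servlet class name is made up):

    import java.io.IOException;
    import javax.servlet.http.*;

    public class ReferrerLoggingServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // The header may be missing or spoofed; treat it as best-effort data only.
            String referer = req.getHeader("Referer");  // note the historical misspelling
            System.out.println("Referer: " + (referer != null ? referer : "(none)"));
            // ... serve the page as usual ...
        }
    }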
I know this is a pretty noob question, but I've been reading some manuals and documentation and can't figure something out.
I have an automation suite (in Java/Groovy) that in some cases needs to query an email inbox to check that a message with a given subject has been received, and also probably delete all messages older than X. That's pretty much all I need to do, and I've been looking into creating a Gmail account and using the Google API Java client that's available here -> https://developers.google.com/api-client-library/java/apis/gmail/v1 but I can't figure out how to actually do it.
Right now the part I have absolutely no clue how to do is the authentication. I can probably figure out how to interact with emails by going through the methods/code, but I can't find any examples of how to authenticate so that the code can get access.
I tried looking for examples here and checking the code here. I know the answer is there but I still can't wrap my head around how to implement the code to sign in/authorize based on a username and password.
Thanks!
This is the link you need. That page explains the authentication mechanism for the Google APIs. They use OAuth 2.0, which is probably the most widely used authentication method nowadays.
There is a standard flow that takes the client from credentials to an access token that can be used to perform authorised requests. This flow is described in the OAuth specification, which is very useful to understand. Many APIs use it.
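A rough sketch of that flow with the Google API Java client, along the lines of Google's Gmail quickstart of the time; it assumes you have created OAuth client credentials in the Google Developers Console and downloaded them as client_secret.json (the file name, token directory and application name below are placeholders):

    import com.google.api.client.auth.oauth2.Credential;
    import com.google.api.client.extensions.java6.auth.oauth2.AuthorizationCodeInstalledApp;
    import com.google.api.client.extensions.jetty.auth.oauth2.LocalServerReceiver;
    import com.google.api.client.googleapis.auth.oauth2.GoogleAuthorizationCodeFlow;
    import com.google.api.client.googleapis.auth.oauth2.GoogleClientSecrets;
    import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
    import com.google.api.client.http.javanet.NetHttpTransport;
    import com.google.api.client.json.JsonFactory;
    import com.google.api.client.json.jackson2.JacksonFactory;
    import com.google.api.client.util.store.FileDataStoreFactory;
    import com.google.api.services.gmail.Gmail;
    import com.google.api.services.gmail.GmailScopes;
    import java.io.File;
    import java.io.FileReader;
    import java.util.Collections;

    public class GmailAuthSketch {

        public static Gmail buildClient() throws Exception {
            NetHttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
            JsonFactory jsonFactory = JacksonFactory.getDefaultInstance();

            // client_secret.json is the OAuth client file downloaded from the Developers Console.
            GoogleClientSecrets secrets =
                    GoogleClientSecrets.load(jsonFactory, new FileReader("client_secret.json"));

            GoogleAuthorizationCodeFlow flow = new GoogleAuthorizationCodeFlow.Builder(
                    transport, jsonFactory, secrets,
                    Collections.singleton(GmailScopes.GMAIL_MODIFY))
                    // Cache tokens on disk so the browser consent step only happens once.
                    .setDataStoreFactory(new FileDataStoreFactory(new File("tokens")))
                    .setAccessType("offline")
                    .build();

            // Opens a browser for the one-time consent screen and returns a Credential.
            Credential credential =
                    new AuthorizationCodeInstalledApp(flow, new LocalServerReceiver()).authorize("user");

            return new Gmail.Builder(transport, jsonFactory, credential)
                    .setApplicationName("my-automation-suite")
                    .build();
        }
    }

Note that with OAuth 2.0 you authorize once in a browser and the library caches the token; the code never signs in with a raw username and password.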
If you have specific questions, please let us know.
I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.
Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled in the client browser, no content except the front page is offered. But I heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
One suggestion I heard is to do a reverse lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, it is legitimate and not someone impersonating Googlebot. I am using Java on a Linux server, so I am looking for a Java-based solution.
I only want to let in the top search engine spiders such as Google, Yahoo, Bing, Alexa, etc. and keep the others out to reduce server load. But it's very important that the top spiders index my site.
For a more complete answer to your question, you can't rely on only one approach. The problem is the conflicting nature of what you want to do. Essentially you want to allow good bots to access your site and index it so you can appear on search engines; but you want to block bad bots from sucking up all your bandwidth and stealing your information.
First line of defense:
Create a robots.txt file at the root of your site. See http://www.robotstxt.org/ for more information about that. This will keep good, well behaved bots in the areas of the site that make the most sense. Keep in mind that robots.txt relies on the User-Agent string if you provide different behavior for one bot vs. another bot. See http://www.robotstxt.org/db.html
Second line of defense:
Filter on User-Agent and/or IP address. I've already been criticized for suggesting that, but it's surprising how few bots disguise who and what they are--even the bad ones. Again, it's not going to stop all bad behavior, but it provides a level of due diligence. More on leveraging User-Agent later.
Third line of defense:
Monitor your Web server's access logs. Use a log analyzer to figure out where the bulk of your traffic is coming from. These logs include both the IP address and the user-agent string, so you can detect how many instances of a bot are hitting you, and whether it is really who it says it is: see http://www.robotstxt.org/iplookup.html
You may have to whip up your own log analyzer to find out the request rate from different clients. Anything above a certain threshold (like maybe 10/second) would be a candidate to rate limit later on.
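A rough Java sketch of that kind of check, assuming an Apache common-log-format access log where the client IP is the first field; the log file name is made up, and the threshold just follows the 10/second suggestion above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Crude rate check: counts requests per client IP per second
    // and flags anything above a threshold as a rate-limit candidate.
    public class SimpleLogAnalyzer {
        public static void main(String[] args) throws Exception {
            final int threshold = 10;  // requests per second
            Map<String, Integer> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(" ");
                    if (parts.length < 4) continue;
                    String ip = parts[0];
                    String second = parts[3];  // e.g. [10/Oct/2000:13:55:36
                    counts.merge(ip + " " + second, 1, Integer::sum);
                }
            }
            counts.forEach((key, count) -> {
                if (count > threshold) {
                    System.out.println("Rate-limit candidate: " + key + " -> " + count + " req/s");
                }
            });
        }
    }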
Leveraging User Agent for Alternative Site Content:
An approach we had to take to protect our users from even legitimate bots hammering our site is to split traffic based on the User-Agent. Basically, if the User-Agent was a known browser, they got the full featured site. If it was not a known browser it was treated as a bot, and was given a set of simple HTML files with just the meta information and links they needed to do their job. The bot's HTML files were statically generated four times a day, so there was no processing overhead. You can also render RSS feeds instead of stripped down HTML which provide the same function.
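A minimal sketch of that kind of split as a servlet filter; the browser token list and the static-content path are made up for illustration, and a real list of browser identifiers needs ongoing maintenance:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical filter: User-Agents that look like known browsers get the
    // dynamic site; everything else is forwarded to pre-generated static HTML.
    public class BotSplitFilter implements Filter {

        // Illustrative tokens only; bots can spoof these, so this is due diligence, not security.
        private static final List<String> BROWSER_TOKENS =
                Arrays.asList("Firefox", "Chrome", "Safari", "MSIE", "Trident", "Opera");

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest req = (HttpServletRequest) request;
            String ua = req.getHeader("User-Agent");
            boolean looksLikeBrowser = ua != null
                    && BROWSER_TOKENS.stream().anyMatch(ua::contains);
            if (looksLikeBrowser) {
                chain.doFilter(request, response);                  // full-featured site
            } else {
                // Serve the simple, statically generated pages to everything else.
                req.getRequestDispatcher("/static" + req.getServletPath())
                   .forward(request, response);
            }
        }

        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}
    }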
Final Note:
You only have so many resources, and not every legitimate bot is well behaved (i.e. some ignore robots.txt and put a lot of stress on your server). You will have to update your approach over time. For example, if one IP address turns out to be a custom search bot your client (or their client) made, you may have to resort to rate-limiting that IP address instead of blocking it completely.
Essentially you are trying to get a good balance between serving your users, and keeping your site available for search engines. Do enough to keep your site responsive to the users, and only resort to the more advanced tactics as necessary.
I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.
The normal approach to this is to configure a robots.txt file to allow the crawlers that you want and disallow the rest (see the example after the list below). Of course, this does depend on crawlers following the rules, but for those that don't you can fall back on things like user-agent strings, IP address checking, etc.
The nice things about "robots.txt" are:
It is simple to set up.
It has minimal impact on your site. A well behaved crawler would fetch the file, and (assuming you disallowed the crawler) just go away.
You can specify what parts of your site can be crawled.
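For instance, a robots.txt that lets Googlebot crawl everything and turns every other crawler away might look like the sketch below; further allowed crawlers would get their own User-agent blocks, and of course this only restrains crawlers that honour the file:

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /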
Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled in the client browser, no content except the front page is offered. But I heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
I believe so. See Google's view on what you are doing.
One suggestion I heard is to do a reverse lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, it is legitimate and not someone impersonating Googlebot.
It probably will, but it is rather expensive. Robots.txt is a simpler approach, and easy to implement in the first instance.
The correct and fast way to identify Googlebot is:
check the user-agent string
if Googlebot, verify the IP by DNS
Only clients that identify themselves as Googlebot pay a one-time price for the IP/DNS verification, assuming of course that you locally cache the result per IP for a while.
For the user-agent check, you can use simple Java String functionality, something like userAgent.contains("Googlebot") according to https://support.google.com/webmasters/answer/1061943, or else you can use this library: https://github.com/before/uadetector
Regarding DNS, this is what Google recommends (https://support.google.com/webmasters/answer/80553):
You can verify that a bot accessing your server really is Googlebot
(or another Google user-agent) by using a reverse DNS lookup,
verifying that the name is in the googlebot.com domain, and then doing
a forward DNS lookup using that googlebot name. This is useful if
you're concerned that spammers or other troublemakers are accessing
your site while claiming to be Googlebot.
For example:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
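A rough JDK-only sketch of the same reverse-then-forward check; no caching is shown, and production code would cache the verdict per IP as suggested above:

    import java.net.InetAddress;

    // Reverse-then-forward DNS check as recommended by Google:
    // 1) the reverse lookup of the client IP must be in the googlebot.com domain,
    // 2) the forward lookup of that host name must resolve back to the same IP.
    public class GooglebotVerifier {

        public static boolean isRealGooglebot(String clientIp) {
            try {
                InetAddress addr = InetAddress.getByName(clientIp);
                String hostName = addr.getCanonicalHostName();   // reverse lookup
                if (!hostName.endsWith(".googlebot.com")) {
                    return false;
                }
                for (InetAddress forward : InetAddress.getAllByName(hostName)) {  // forward lookup
                    if (forward.getHostAddress().equals(clientIp)) {
                        return true;
                    }
                }
                return false;
            } catch (Exception e) {
                return false;  // any resolution failure counts as "not verified"
            }
        }

        public static void main(String[] args) {
            System.out.println(isRealGooglebot("66.249.66.1"));
        }
    }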
Bing works the same way with bingbot, see http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
Because I needed the same thing, I've put some Java code into a library and published it on GitHub: https://github.com/optimaize/webcrawler-verifier. It's available from Maven Central, and here's a blog post describing it: http://www.flowstopper.org/2015/04/is-that-googlebot-user-agent-really-from-google.html
Check out this site:
http://www.user-agents.org/
They also have an XML version of the database that you can download and incorporate. They classify known "User-Agent" header identities by whether they are a Browser, Link/Server-Checker, Downloading tool, Proxy Server, Robot/Spider/Search Engine, or Spam/bad-bot.
NOTE
I have come across a couple of User-Agent strings that simply identified the Java runtime, sent by something someone hacked together to scrape a site. In that case it turned out to be someone writing their own search engine scraper, but it could just as well have been a spider downloading all your content for offsite/disconnected use.
Is there any web language that allows the client itself to create HTTP POSTs to external sites?
I know that JavaScript does this with XMLHttpRequest, but it does not allow cross-domain posting, unless the recipient domain wants to allow the sending domain.
I want to post data to an external site (that I don't control) and have the request be authenticated with what the client's browser already has (cookies, etc).
Is this possible? I tried cURL, but it seems to make a server-side HTTP POST, not a client-side one.
Edit:
A bit more insight into what I am trying to do:
I am trying to POST JSON to the website using the user's session (I said cookies, but I believe they are PHP sessions, which I guess I still consider cookies).
The website does NOT check the referrer (poor security #1)
I can execute JavaScript and HTML on the web page using my personal homepage (poor security #2)
The JSON will still work even if the Content-Type is form-encoded (poor security #3)
There is no security checking at all, just PHP session checking.
The form idea is wonderful, and it works. The problem again is that it's JSON. Having sent the post data as foo={"test":"123", "test2":"456"}, the whole foo= part messes it up. Plus, forms seem to URL-encode the JSON, so it's sending:
foo=%7B%22test%22%3A+%22123%22%2C+%22test2%22%3A+%22456%22%7D
when I need it to send:
{"test":"123", "test2":"456"}
So given all of that, is there any chance of sending the raw JSON or not?
I don't think so: you won't get hold of the user's auth cookies for the third-party site from the server side (because of the Same Origin Policy), and you can't make Ajax requests to the third-party site.
The best you can do is probably create a <form> (maybe in an <iframe>), point it at the third-party site, populate it with data, and have the user submit it (or auto-submit it). You will not be able to get hold of the request results programmatically (again because of the Same Origin Policy), but maybe it'll do; you can still show the request results to the user.
I think for obvious reasons this is not allowed. If it were allowed, what would stop a malicious person from posting form data from a person's browser to any number of sites via some hidden iframe or popup window?
If this is a design of your application you need to rethink what you are trying to accomplish.
EDIT: As @Pekka pointed out, I know you can submit a form to a remote site using a typical form submit. I was referring to using a client-side Ajax solution. Sorry for the confusion.
You should follow the way OpenID and other single-sign-on systems work. The way OpenID works is that your website POSTs a token to the OpenID service and in return gets an authentication result. Refer to the "How Does it Work?" section here.
Yes, you can use a special Flash library that supports cross-domain calls: the YUI connection manager.
Added: not sure about the cookie authentication issue though...
The client cannot post to an external site directly; it's a breach of the basic cross-domain security model. The exception is JSONP, which works by loading JavaScript from the other domain. What you describe would require access to a user's cookies for another website, which is impossible, as the browser only allows cookie access within the same domain/path.
You would need to use a server-side proxy to make cross-domain requests, but you still cannot access external cookies: http://jquery-howto.blogspot.com/2009/04/cross-domain-ajax-querying-with-jquery.html
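If a server-side proxy is acceptable, a minimal Java sketch of the POST could look like this; the target URL is a placeholder, and note that the request goes out with your server's identity, so the user's session cookies for the target site are not included:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Server-side proxy POST: sends the raw JSON body, but from *your* server,
    // so it cannot reuse the user's cookies/session on the target site.
    public class ProxyPost {
        public static void main(String[] args) throws Exception {
            String json = "{\"test\":\"123\", \"test2\":\"456\"}";
            URL url = new URL("https://example.com/endpoint");  // placeholder target
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Response code: " + conn.getResponseCode());
        }
    }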
On the very first access to a website, is it possible to know from the GET request whether browser cookies are enabled?
Is it possible in particular in a J2EE context, on the server side (no Ajax/JS solution)?
Short answer: No.
The HTTP request simply does not carry that kind of information. It is only testable implicitly, by sending a cookie to the client and seeing whether it comes back. There are probably also various JavaScript options, but you explicitly did not want one of those.
You can send a cookie with the first page and then redirect to some index page. If a client requests any page other than the first one without sending the cookie back, it does not support (or has disabled) cookies.
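A minimal sketch of that set-then-redirect check as a servlet; the cookie name and the ?check marker are made up:

    import java.io.IOException;
    import javax.servlet.http.*;

    // Hypothetical entry-point servlet: sets a test cookie on the first hit,
    // then redirects; the follow-up request tells us whether cookies came back.
    public class CookieCheckServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            if (req.getParameter("check") == null) {
                resp.addCookie(new Cookie("cookietest", "1"));
                resp.sendRedirect(req.getRequestURI() + "?check=1");
                return;
            }
            boolean cookiesEnabled = false;
            Cookie[] cookies = req.getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    if ("cookietest".equals(c.getName())) {
                        cookiesEnabled = true;
                    }
                }
            }
            resp.setContentType("text/plain");
            resp.getWriter().println("Cookies enabled: " + cookiesEnabled);
        }
    }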
This problem relates to the Restlet framework and Java
When a client wants to discover the resources available on a server, it must send an HTTP request with OPTIONS as the request type. This is fine, I guess, for non-human clients, i.e. code rather than a browser.
The problem I see here is that browsers (driven by humans) using GET will NOT be able to quickly discover the resources available to them and find some extra help documentation etc., because they do not use OPTIONS as a request type.
Is there a way to make a browser send an OPTIONS/GET request so the server can fire back formatted XML to the client (as this is what happens in Restlet - i.e. the server response is to send all information back as XML), and display this in the browser?
Or have I got my thinking all wrong, i.e. is the point of OPTIONS that it is meant to be used inside a client's code and not meant to be read via a browser?
Use the TunnelService (which by default is already enabled) and simply add the method=OPTIONS query parameter to your URL.
(The Restlet FAQ Q19 is a similar question.)
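For example, a plain browser GET to http://localhost:8182/myresource?method=OPTIONS is then tunnelled as an OPTIONS call (the host, port and path are placeholders). If the tunnel has been switched off, it can be re-enabled in your Application along these lines; this is only a sketch, so check the TunnelService API of your Restlet version:

    import org.restlet.Application;
    import org.restlet.Restlet;
    import org.restlet.routing.Router;

    public class MyApplication extends Application {

        @Override
        public Restlet createInboundRoot() {
            // Method tunnelling via a query parameter; it is on by default,
            // so these calls are shown only for completeness.
            getTunnelService().setEnabled(true);
            getTunnelService().setMethodTunnel(true);
            getTunnelService().setMethodParameter("method");  // default parameter name

            Router router = new Router(getContext());
            // router.attach("/myresource", MyResource.class);  // attach your resources here
            return router;
        }
    }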
I think OPTIONS is not designed to be 'user-visible'.
How would you dispatch an OPTIONS request from the browser? (Note that the form element only allows GET and POST.)
You could send it using XmlHttpRequest and then get back XML in your Javascript callback and render it appropriately. But I'm not convinced this is something that your user should really know about!