Finding out your website visitor IP address in Java

Is there a simple and reliable way to detect your website visitors' IP addresses using Java? I am trying to make use of Akismet to detect spam on my blog posts/comments. The API requires me to specify the IP address of the commenter.
Thanks =)

A call to ServletRequest.getRemoteAddr() should do it.

ServletRequest.getRemoteAddr() does this in the simplest scenarios. If you're behind a load balancer, you may instead want to look at the X-Forwarded-For header, as getRemoteAddr() will return the address of your load balancer. The header is a comma-separated list of IP addresses, where the last one is the address that connected to your load balancer. That last address is the only one you can really trust (it is appended by the load balancer itself); the others may be spoofed.
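A minimal sketch of that logic in a servlet environment (illustrative only; which entry of the header you can trust depends on how many proxies of your own sit in front of the app):

    import javax.servlet.http.HttpServletRequest;

    public final class ClientIp {
        // Best-effort client address: first X-Forwarded-For entry when the
        // request came through a proxy/load balancer we control, otherwise
        // the direct peer address. Note the first entry is the *claimed*
        // client and can be spoofed by anything in front of your own proxy.
        public static String from(HttpServletRequest request) {
            String forwarded = request.getHeader("X-Forwarded-For");
            if (forwarded != null && !forwarded.isEmpty()) {
                // Header format: client, proxy1, proxy2, ...
                return forwarded.split(",")[0].trim();
            }
            return request.getRemoteAddr();
        }
    }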

If you are using JSP on the server-side, then you can look at this link:
http://www.rgagnon.com/javadetails/java-0363.html
If you are using a servlet then you can use HttpServletRequest.getRemoteAddr()

Problem traffic is about 80% people who will work hard to make sure they don't have to do any real work. Every site I have seen stay up uses some sort of human-has-to-think authentication. IPv4 is a constant source of spoofing, intrusions, and news reports (which you want to stay out of); IPv6 approaches the matter with engineering-grade work.
At that point, I think they will move over to using human shields or something.

Related

jQuery.post() with URL set as a Java weblet

I have a java app on my server and I can access it with my browser by going to server.com:8080/app.
I've been trying to get my application to access this weblet, but because of XSS, jQuery.post() gives me errors. Both the app and the weblet are on the same server, but since I have to access the weblet through port 8080, JavaScript thinks it's another server.
My question: Is there a way to avoid this XSS issue?
I don't want to use a PHP proxy or .htaccess. I also don't want to use the $.getJSON(url + '&callback?') method.
I'm looking for any other solutions.
Thanks in advance.
It's the SOP (Same Origin Policy) that's stopping you here, not XSS. XSS is a security vulnerability that breaks SOP. And yes, it limits access so that both pages have to run on the same protocol, port, and domain.
Can you use a reverse proxy from the web server on port 80 to 8080? If not, you could take a look at easyXDM. Another alternative is to have the 8080 service return the access control header mentioned in one of your comments, but this is not supported in older browsers.
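If you go the access-control-header route, here is a minimal sketch of a servlet filter on the 8080 service that adds it (the allowed origin is a placeholder for your actual site; older browsers will still ignore it):

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    public class CorsFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain) throws IOException, ServletException {
            // Tell the browser that pages served from port 80 may call us.
            ((HttpServletResponse) res)
                    .setHeader("Access-Control-Allow-Origin", "http://server.com");
            chain.doFilter(req, res);
        }

        public void init(FilterConfig config) {}
        public void destroy() {}
    }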

What is the reliable approach to get the end user IP address from the server side?

I use JSP on the server side and want to validate that a user must not log in from two different IP addresses. What is the method to do this validation?
And some say that the client IP address cannot be obtained on the server side because of proxies involved. Then how are Google and Facebook doing this? Will this be reliable in a production environment? Please explain. Thank you!
Then how are Google and Facebook doing it?
They are probably using the X-Forwarded-For header that a lot of proxy servers add to the request on the way through.
This is only reliable to the extent that the proxies are telling the truth.
Well, getting the IP address is as easy as using ServletRequest#getRemoteAddr()
But as you've noted already, there is no way to get this reliably - if the client is using a proxy, the connection will appear to originate at that IP address. I don't believe Facebook or Google can get around this either - which is why you can access US only features (for example, Google Voice in Gmail) by logging in via a proxy/ssh tunnel that has a US IP address.
If you just want to stop a user from logging in from two different IP addresses simultaneously, all you need to do is track what IP address their current session (if one exists) originates from and either
stop the second login attempt, or
expire the first session
I'm not sure what value there might be in preventing a login from different IP addresses at different times since that's very likely going to happen with users who are travelling around or moving from network to network.
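A minimal sketch of the expire-the-first-session option, keeping an in-memory map from user to active session (this assumes a single application server; a cluster would need shared storage such as a database or cache):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.servlet.http.HttpSession;

    public class SingleSessionRegistry {
        // user name -> session from that user's most recent login
        private final Map<String, HttpSession> active = new ConcurrentHashMap<>();

        // Call after a successful login; kicks out any earlier session.
        public void register(String user, HttpSession session) {
            HttpSession previous = active.put(user, session);
            if (previous != null && previous != session) {
                try {
                    previous.invalidate(); // expire the first session
                } catch (IllegalStateException alreadyInvalidated) {
                    // the old session had already timed out; nothing to do
                }
            }
        }

        // Call on logout or session destruction.
        public void unregister(String user) {
            active.remove(user);
        }
    }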

How to detect top legit search engine bots?

I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.
Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled in the client browser, no content except the front page is offered. But I heard search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
One suggestion I heard is to do a reverse lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, it's legitimate and not someone impersonating Googlebot. I am using Java on a Linux server, so I am looking for a Java-based solution.
I am only letting in the top good search engine spiders, such as Google, Yahoo, Bing, and Alexa, and keeping the others out to reduce server load. But it's very important that the top spiders index my site.
For a more complete answer to your question, you can't rely on only one approach. The problem is the conflicting nature of what you want to do. Essentially you want to allow good bots to access your site and index it so you can appear on search engines; but you want to block bad bots from sucking up all your bandwidth and stealing your information.
First line of defense:
Create a robots.txt file at the root of your site. See http://www.robotstxt.org/ for more information about that. This will keep good, well behaved bots in the areas of the site that make the most sense. Keep in mind that robots.txt relies on the User-Agent string if you provide different behavior for one bot vs. another bot. See http://www.robotstxt.org/db.html
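For example, a robots.txt that lets Googlebot and Bingbot crawl everything while asking all other bots to stay out might look like this (remember these directives are advisory, not enforced):

    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: *
    Disallow: /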
Second line of defense:
Filter on User-Agent and/or IP address. I've already been criticized for suggesting that, but it's surprising how few bots disguise who and what they are--even the bad ones. Again, it's not going to stop all bad behavior, but it provides a level of due diligence. More on leveraging User-Agent later.
Third line of defense:
Monitor your web server's access logs. Use a log analyzer to figure out where the bulk of your traffic is coming from. These logs include both the IP address and the User-Agent string, so you can detect how many instances of a bot are hitting you, and whether it really is who it says it is: see http://www.robotstxt.org/iplookup.html
You may have to whip up your own log analyzer to find out the request rate from different clients. Anything above a certain threshold (maybe 10 requests per second) would be a candidate for rate limiting later on.
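A minimal sketch of that kind of counting, assuming the client IP is the first field of each access-log line (as in Apache's common log format) and a one-minute log slice:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class LogRateCheck {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> hitsPerIp = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get("access.log"))) {
                String ip = line.split(" ")[0]; // first field = client IP
                hitsPerIp.merge(ip, 1, Integer::sum);
            }
            // Over one minute, 600 hits equals the 10-requests/second threshold.
            hitsPerIp.forEach((ip, hits) -> {
                if (hits > 600) {
                    System.out.println("Rate-limit candidate: " + ip + " (" + hits + " hits)");
                }
            });
        }
    }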
Leveraging User Agent for Alternative Site Content:
An approach we had to take to protect our users from even legitimate bots hammering our site was to split traffic based on the User-Agent. Basically, if the User-Agent was a known browser, the client got the full-featured site. If it was not a known browser, it was treated as a bot and given a set of simple HTML files with just the meta information and links needed to do its job. The bots' HTML files were statically generated four times a day, so there was no processing overhead. You can also render RSS feeds instead of stripped-down HTML; they provide the same function.
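A minimal sketch of that split as a servlet filter. The bot-detection heuristic and the /bot-pages/ path are illustrative placeholders; real User-Agent matching needs a proper list, since many bots also claim to be Mozilla:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;

    public class BotSplitFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain) throws IOException, ServletException {
            String ua = ((HttpServletRequest) req).getHeader("User-Agent");
            if (looksLikeBot(ua)) {
                // Bots get the statically generated pages.
                req.getRequestDispatcher("/bot-pages/index.html").forward(req, res);
            } else {
                chain.doFilter(req, res); // browsers get the full site
            }
        }

        private boolean looksLikeBot(String ua) {
            if (ua == null) return true; // no User-Agent at all: assume bot
            String lower = ua.toLowerCase();
            return lower.contains("bot") || lower.contains("spider") || lower.contains("crawl");
        }

        public void init(FilterConfig config) {}
        public void destroy() {}
    }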
Final Note:
You only have so many resources, and not every legitimate bot is well behaved (i.e. some ignore robots.txt and put a lot of stress on your server). You will have to update your approach over time. For example, if one IP address turns out to be a custom search bot your client (or their client) made, you may have to resort to rate-limiting that IP address instead of blocking it completely.
Essentially you are trying to get a good balance between serving your users, and keeping your site available for search engines. Do enough to keep your site responsive to the users, and only resort to the more advanced tactics as necessary.
I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.
The normal approach to this is to configure a robots.txt file to allow the crawlers that you want and disallow the rest. Of course, this does depend on crawlers following the rules, but for those that don't, you can fall back on things like User-Agent strings, IP address checking, etc.
The nice things about "robots.txt" are:
It is simple to set up.
It has minimal impact on your site. A well behaved crawler would fetch the file, and (assuming you disallowed the crawler) just go away.
You can specify what parts of your site can be crawled.
Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled in the client browser, no content except the front page is offered. But I heard search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
I believe so. See Google's view on what you are doing.
One suggestion I heard is to do a reverse lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, it's legitimate and not someone impersonating Googlebot.
It probably will, but it is rather expensive. Robots.txt is a simpler approach, and easy to implement in the first instance.
The correct and fast way to identify Googlebot is:
check the user-agent string
if Googlebot, verify the IP by DNS
Only clients that identify themselves as Googlebot pay a one-time price for the IP/DNS verification, assuming that you locally cache the result per IP for a while, of course.
For the user-agent checking, you can use simple Java String functionality, something like userAgent.contains("Googlebot") according to https://support.google.com/webmasters/answer/1061943, or else you can use this library: https://github.com/before/uadetector
Regarding DNS, this is what Google recommends (https://support.google.com/webmasters/answer/80553):
You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example:
    host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Bing works the same way with bingbot, see http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
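A minimal Java sketch of that reverse-then-forward check using java.net.InetAddress (as suggested above, cache the verdict per IP, since DNS lookups are slow):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class GooglebotVerifier {
        // Reverse-then-forward DNS check, per Google's recommendation.
        public static boolean isGooglebot(String ip, String userAgent) {
            if (userAgent == null || !userAgent.contains("Googlebot")) {
                return false; // cheap check first; only claimed Googlebots pay for DNS
            }
            try {
                // Reverse lookup: IP -> host name.
                String host = InetAddress.getByName(ip).getCanonicalHostName();
                if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                    return false; // not in Google's crawl domains (or reverse lookup failed)
                }
                // Forward lookup: the host name must resolve back to the original IP.
                for (InetAddress addr : InetAddress.getAllByName(host)) {
                    if (addr.getHostAddress().equals(ip)) {
                        return true;
                    }
                }
            } catch (UnknownHostException e) {
                // fall through: treat unresolvable addresses as not verified
            }
            return false;
        }
    }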
Because I needed the same thing, I've put some Java code into a library and published it on GitHub: https://github.com/optimaize/webcrawler-verifier. It's available from Maven Central, and here's a blog post describing it: http://www.flowstopper.org/2015/04/is-that-googlebot-user-agent-really-from-google.html
Check out this site:
http://www.user-agents.org/
They also have an XML version of the database that you can download and incorporate. They classify known "User-Agent" header identities by whether they are a Browser, Link/Server-Checker, Downloading tool, Proxy Server, Robot/Spider/Search Engine, or Spam/bad-bot.
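If you bake that database into your application, the lookup itself can stay simple. Here is a sketch with a hand-maintained pattern map standing in for the downloaded data (the entries and categories are illustrative):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UserAgentClassifier {
        // Substring -> category. In practice, populate this from the
        // user-agents.org data instead of hardcoding a few entries.
        private static final Map<String, String> PATTERNS = new LinkedHashMap<>();
        static {
            PATTERNS.put("Googlebot", "Robot/Spider/Search Engine");
            PATTERNS.put("bingbot", "Robot/Spider/Search Engine");
            PATTERNS.put("Wget", "Downloading tool");
            PATTERNS.put("Java/", "Unknown scraper"); // bare Java runtime, see note below
        }

        public static String classify(String userAgent) {
            if (userAgent == null) return "Unknown";
            for (Map.Entry<String, String> entry : PATTERNS.entrySet()) {
                if (userAgent.contains(entry.getKey())) return entry.getValue();
            }
            return "Unknown";
        }
    }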
NOTE
I have come across a couple of User-Agent strings representing a Java runtime that someone hacked together to scrape a site. In that case it turned out someone was writing their own search engine scraper, but it could just as well be a spider downloading all your content for offsite/disconnected use.

Voting mechanism with IP validation, allowing only one vote per user: problem getting the user's IP address with two app servers and Apache in front

We have a voting mechanism that we want to restrict to one vote per user.
We've tried to validate by IP address, but the problem is that when we get the user's IP address on the application server, it always shows the Apache IP address (we have two application servers with Apache in front of them).
We are using the ColdFusion variable CGI.REMOTE_ADDR to get the user's IP.
Anyone knows how to fix this?
We would like to avoid the use of sessions or cookies.
Thanks in advance.
You probably want to use the X-Forwarded-For header instead of the source IP, assuming your Apache instances are putting it into the request.

Displaying a website in a website?

What I need to do is this:
<iframe src="http://www.google.com" width="800" height="600"></iframe>
But the constraint is, I want my website to fetch a requested website and display it in a frame. That is, the client's browser must only have a connection with my web server. My website in turn will fetch the requested URLs and display them to the client.
The only way I have thought I could do this is perhaps passing the URL to an application that in turn downloads the page and then redirects the client's browser to the page (now stored locally on my web server). The problem, however, is that this would only work with rather boring and static sites; I require the website in the website to be fully functioning, i.e. streaming video, secure connections...
What would be the best way to do this?
I hate to break it to you, but I don't think there's a foolproof way to do this. What you're trying to do is make a proxy, and there are several ways to do it, but either way you won't be able to take things like Flash and JavaScript into account. I've used a lot of different proxies to get around the filter at my school and not one of them has been 100% effective. In fact, I don't think a single one has been able to load the music player on either PureVolume or MySpace.
If you still want to give it a try, read this article: Using Apache As A Proxy Server
If one of your requirements is
... secure connections
that is not possible at all. By definition, a secure end-to-end connection cannot go through a proxy (see man-in-the-middle).
I have found a solution; to whoever mentioned it and then deleted their answer, thanks.
Making use of a reverse proxy could do this, http://docsrv.sco.com/INT_Proxy/revpxy.htm shows some ways in which a reverse proxy may be used.
Paramesh Gunasekaran wrote a tutorial on creating your own reverse proxy with code supplied.
http://www.codeproject.com/KB/IP/reverseproxy.aspx
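If you are fronting this with Apache, a minimal reverse-proxy configuration looks something like this (mod_proxy must be enabled; the /external/ prefix and target URL are placeholders):

    # httpd.conf: serve http://www.example.com/ under /external/ on this host
    ProxyRequests Off
    ProxyPass        /external/ http://www.example.com/
    ProxyPassReverse /external/ http://www.example.com/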
