I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise, the usual user registration/login is required to view that content.
Note that I also make use of cookies to let users access some content without being registered, so if cookies are disabled in the client browser, no content except the front page is offered. But I have heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
One suggestion I heard is to do a reverse DNS lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, the client is legitimate and not someone impersonating Googlebot. I am using Java on a Linux server, so I am looking for a Java-based solution.
I only want to let in the top search engine spiders, such as Google, Yahoo, Bing, and Alexa, and keep the others out to reduce server load. But it is very important that the top spiders index my site.
For a more complete answer to your question, you can't rely on only one approach. The problem is the conflicting nature of what you want to do. Essentially you want to allow good bots to access your site and index it so you can appear on search engines; but you want to block bad bots from sucking up all your bandwidth and stealing your information.
First line of defense:
Create a robots.txt file at the root of your site. See http://www.robotstxt.org/ for more information about that. This will keep good, well-behaved bots in the areas of the site that make the most sense. Keep in mind that robots.txt relies on the User-Agent string if you want to provide different behavior for one bot vs. another. See http://www.robotstxt.org/db.html
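For example, a minimal robots.txt that invites Googlebot and Bingbot but asks every other crawler to stay out of the site entirely (purely advisory; only well-behaved bots honor it):

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```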
Second line of defense:
Filter on User-Agent and/or IP address. I've already been criticized for suggesting that, but it's surprising how few bots disguise who and what they are, even the bad ones. Again, it's not going to stop all bad behavior, but it provides a level of due diligence. More on leveraging User-Agent later.
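As a rough illustration of this kind of filtering, here is a minimal sketch of a Java servlet Filter that flags requests whose User-Agent matches a known crawler name. The class name and bot list are examples only, and since the header can be spoofed, the flag should only be trusted after the DNS verification discussed further down.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class UserAgentFilter implements Filter {

    // Illustrative list only; remember that User-Agent can be spoofed.
    private static final List<String> KNOWN_BOTS =
            Arrays.asList("Googlebot", "bingbot", "Slurp");

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String ua = request.getHeader("User-Agent");
        boolean claimsToBeKnownBot =
                ua != null && KNOWN_BOTS.stream().anyMatch(ua::contains);

        if (claimsToBeKnownBot) {
            // Mark the request so downstream code can skip the login requirement,
            // ideally only after the IP has been verified via DNS.
            request.setAttribute("claimsToBeKnownBot", Boolean.TRUE);
        }
        chain.doFilter(req, res);
    }

    @Override
    public void init(FilterConfig config) {}

    @Override
    public void destroy() {}
}
```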
Third line of defense:
Monitor your Web server's access logs. Use a log analyzer to figure out where the bulk of your traffic is coming from. These logs include both IP addresses and user-agent strings, so you can detect how many instances of a bot are hitting you, and whether it is really who it says it is: see http://www.robotstxt.org/iplookup.html
You may have to whip up your own log analyzer to find out the request rate from different clients. Anything above a certain threshold (like maybe 10/second) would be a candidate to rate limit later on.
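If you do write your own analyzer, a rough sketch like the one below is enough to find the heaviest clients. It assumes the common/combined log format, where the client IP is the first whitespace-separated field; the log path and threshold are examples only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RequestCounter {

    public static void main(String[] args) throws IOException {
        // Path and threshold are placeholders; adjust them for your server.
        String logFile = args.length > 0 ? args[0] : "/var/log/apache2/access.log";
        long threshold = 10_000;

        try (Stream<String> lines = Files.lines(Paths.get(logFile))) {
            // Group requests by client IP (first field of each log line) and count them.
            Map<String, Long> hitsPerIp = lines
                    .map(line -> line.split(" ", 2)[0])
                    .collect(Collectors.groupingBy(ip -> ip, Collectors.counting()));

            hitsPerIp.entrySet().stream()
                    .filter(e -> e.getValue() > threshold)
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue() + " requests"));
        }
    }
}
```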
Leveraging User Agent for Alternative Site Content:
An approach we had to take to protect our users from even legitimate bots hammering our site was to split traffic based on the User-Agent. Basically, if the User-Agent was a known browser, it got the full-featured site. If it was not a known browser, it was treated as a bot and given a set of simple HTML files with just the meta information and links it needed to do its job. The bots' HTML files were statically generated four times a day, so there was no processing overhead. You can also render RSS feeds instead of stripped-down HTML, which serves the same function.
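A minimal sketch of that split, assuming a servlet container and a batch job that drops the static bot pages under a path such as /static-for-bots (the path and the crude bot test are both illustrative):

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

/** Sends anything that doesn't look like a browser to pre-generated static pages. */
public class BotContentFilter implements Filter {

    // Crude stand-in for the "known browser" check described above: the sketch just
    // looks for common crawler markers; a real version would match the User-Agent
    // against a maintained list of known browsers (see user-agents.org below).
    private static boolean treatedAsBot(String ua) {
        if (ua == null) {
            return true;
        }
        String lower = ua.toLowerCase();
        return lower.contains("bot") || lower.contains("crawler") || lower.contains("spider");
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        if (treatedAsBot(request.getHeader("User-Agent"))) {
            // Serve the static snapshot generated by the batch job.
            request.getRequestDispatcher("/static-for-bots" + request.getServletPath())
                   .forward(req, res);
        } else {
            chain.doFilter(req, res);   // full-featured site for browsers
        }
    }

    @Override
    public void init(FilterConfig config) {}

    @Override
    public void destroy() {}
}
```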
Final Note:
You only have so many resources, and not every legitimate bot is well behaved (i.e. one may ignore robots.txt and put a lot of stress on your server). You will have to update your approach over time. For example, if one IP address turns out to be a custom search bot your client (or their client) made, you may have to resort to rate-limiting that IP address instead of blocking it completely.
Essentially you are trying to get a good balance between serving your users, and keeping your site available for search engines. Do enough to keep your site responsive to the users, and only resort to the more advanced tactics as necessary.
I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise, the usual user registration/login is required to view that content.
The normal approach to this is to configure a robots.txt file to allow the crawlers that you want and disallow the rest. Of course, this does depend on crawlers following the rules, but for those that don't you can fall back on things like user-agent strings, IP address checking, etc.
The nice things about "robots.txt" are:
It is simple to set up.
It has minimal impact on your site. A well-behaved crawler would fetch the file, and (assuming you disallowed the crawler) just go away.
You can specify what parts of your site can be crawled.
Note that I also make use of cookies to let users access some content without being registered, so if cookies are disabled in the client browser, no content except the front page is offered. But I have heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
I believe so. See Google's view on what you are doing.
One suggestion I heard is to do a reverse DNS lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, the client is legitimate and not someone impersonating Googlebot.
It probably will, but it is rather expensive. Robots.txt is a simpler approach, and easy to implement in the first instance.
The correct and fast way to identify Googlebot is:
check the user-agent string
if Googlebot, verify the IP by DNS
Only clients that identify as Googlebot pay a one-time price for the IP/DNS verification, assuming, of course, that you locally cache the result per IP for a while.
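A minimal sketch of that per-IP caching, where verifyViaDns is a placeholder for the reverse/forward DNS check described below and the 24-hour TTL is just an example:

```java
import java.util.concurrent.ConcurrentHashMap;

public class BotVerificationCache {

    private static final long TTL_MILLIS = 24L * 60 * 60 * 1000; // re-check each IP once a day

    private static final class Entry {
        final boolean verified;
        final long checkedAt;
        Entry(boolean verified, long checkedAt) {
            this.verified = verified;
            this.checkedAt = checkedAt;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    /** Returns the cached verdict for this IP, doing the DNS work only on a miss or expiry. */
    public boolean isVerifiedGooglebot(String ip) {
        Entry e = cache.get(ip);
        if (e == null || System.currentTimeMillis() - e.checkedAt > TTL_MILLIS) {
            e = new Entry(verifyViaDns(ip), System.currentTimeMillis());
            cache.put(ip, e);
        }
        return e.verified;
    }

    // Placeholder: plug in the reverse + forward DNS verification shown below.
    private boolean verifyViaDns(String ip) {
        return false;
    }
}
```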
For the user-agent checking, you can use simple Java String functionality. Something like userAgent.contains("Googlebot") according to https://support.google.com/webmasters/answer/1061943 or else you can use this library: https://github.com/before/uadetector
Regarding DNS, that's what Google recommends https://support.google.com/webmasters/answer/80553
You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
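The same check can be done in Java with java.net.InetAddress. This is only a sketch: the accepted domains follow Google's documentation, and a production version should add timeouts plus the per-IP caching mentioned above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class GooglebotVerifier {

    /**
     * Returns true if the IP reverse-resolves to a googlebot.com or google.com host
     * whose forward lookup contains the original IP.
     */
    public static boolean isRealGooglebot(String ip) {
        try {
            // Reverse DNS: IP -> host name (falls back to the IP string if it fails).
            String host = InetAddress.getByName(ip).getCanonicalHostName();
            if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                return false;
            }
            // Forward DNS: host name -> IPs; one of them must be the original IP.
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (addr.getHostAddress().equals(ip)) {
                    return true;
                }
            }
        } catch (UnknownHostException e) {
            // Treat lookup failures as "not verified".
        }
        return false;
    }
}
```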
Bing works the same way with bingbot, see http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
Because I needed the same thing, I've put some Java code into a library and published it on GitHub: https://github.com/optimaize/webcrawler-verifier. It's available from Maven Central. And here's a blog post describing it: http://www.flowstopper.org/2015/04/is-that-googlebot-user-agent-really-from-google.html
Check out this site:
http://www.user-agents.org/
They also have an XML version of the database that you can download and incorporate. They classify known "User-Agent" header identities by whether they are a Browser, Link/Server-Checker, Downloading tool, Proxy Server, Robot/Spider/Search Engine, or Spam/bad-bot.
NOTE
I have come across a couple of User-Agent strings that identify the Java runtime someone used to hack together a scraper. In that case it turned out that someone was building their own search engine scraper, but it could just as well be a spider downloading all your content for offsite/disconnected use.
Related
What I want to do is build a web application (a proxy) that a user uses to request the webpage they want. Then:
my application forwards the request to the main server,
modifies the HTML code,
and sends the modified page back to the client.
The question now is:
How do I keep my application between the client and the main server
(for example, when the user clicks any link inside the modified page,
makes an AJAX request, submits a form, and so on)?
In other words:
How do I guarantee that any request (after the first URL request) from the client is sent to my proxy, and that any response comes back through my proxy first?
The question is: why do you need a proxy? Why do you want to build it yourself, and why not use an already existing one like HAProxy?
EDIT: sorry, I didn't read your whole post correctly. You can start with:
http://www.jtmelton.com/2007/11/27/a-simple-multi-threaded-java-http-proxy-server/
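Whether you go socket-level like that tutorial or stay at the servlet level, the basic shape is "fetch on behalf of the client and stream the result back". Here is a bare-bones servlet sketch of that idea; the class and parameter names are made up, and note that an unrestricted url parameter is effectively an open proxy, so real code must validate it:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Fetches the page named in the "url" parameter and streams it back to the client. */
public class FetchProxyServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String target = req.getParameter("url");   // e.g. /proxy?url=http://example.com/
        if (target == null) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "missing url parameter");
            return;
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(target).openConnection();
        conn.setRequestProperty("User-Agent", req.getHeader("User-Agent"));
        resp.setContentType(conn.getContentType());

        // Copy the upstream body to the client. A real proxy would also rewrite links
        // in HTML responses so that follow-up requests come back through the proxy.
        try (InputStream in = conn.getInputStream();
             OutputStream out = resp.getOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```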
If the user is willing to, or can be forced¹ to configure his clients (e.g. web browser) to use a web proxy, then your problem is already solved. Another way to do this (assuming that the user is cooperative) is to get them to install a trusted browser plugin that dynamically routes selected URLs through your proxy. But you can't do this using an untrusted webapp: the browser sandbox won't (shouldn't) let you.
Doing it without the user's knowledge and consent requires some kind of interference at the network level. For example, a "smart" switch could recognize TCP/IP packets on port 80 and deliberately route them to your proxy instead of the IP address that the client's browser specifies. This kind of thing is known as "deep packet inspection". It would be very difficult to implement yourself, and it requires significant compute power in your network switch if you are going to achieve high network rates through the switch.
The second problem is that making meaningful on-the-fly modifications to arbitrary HTML + Javascript responses is a really difficult problem.
The final problem is that this is only going to work with HTTP. HTTPS protects against "man in the middle" attacks ... such as this ... that monitor or interfere with the requests and responses. The best you could hope to do would be to capture the encrypted traffic between the client and the server.
1 - The normal way to force a user to do this is to implement a firewall that blocks all outgoing HTTP connections apart from those made via your proxy.
UPDATE
The problem now is: what should I change in the HTML code to force the client to request everything through my app? For example, a link's href attribute might become www.aaaa.com?url=www.google.com, but what should I do for AJAX and forms?
Like I said, it is a difficult task. You have to deal with the following problems:
Finding and updating absolute URLs in the HTML. (Not hard; see the sketch after this list.)
Finding and dealing with the base URL (if any). (Not hard)
Dealing with the URLs that you don't want to change; e.g. links to CSS, javascript (maybe), etc. (Harder ...)
Dealing with HTML that is syntactically invalid ... but not to the extent that the browser can't cope. (Hard)
Dealing with cross-site issues. (Uncertain ...)
Dealing with URLs in requests being made by javascript embedded in / called from the page. This is extremely difficult, given the myriad ways that javascript could assemble the URL.
Dealing with HTTPS. (Impossible to do securely; i.e. without the user not trusting the proxy to see private info such as passwords, credit card numbers, etc that are normally sent securely.)
and so on.
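As a sketch of the first item on that list, here is roughly what rewriting absolute links could look like if you use the jsoup HTML parser (an assumption; any HTML parser would do). Forms, scripts, and dynamically built AJAX URLs are exactly the parts this does not cover.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkRewriter {

    /** Rewrites every <a href> so it points back through the proxy's ?url= endpoint. */
    public static String rewriteLinks(String html, String baseUrl, String proxyPrefix)
            throws UnsupportedEncodingException {
        Document doc = Jsoup.parse(html, baseUrl);      // baseUrl resolves relative links
        for (Element link : doc.select("a[href]")) {
            String absolute = link.attr("abs:href");    // absolute form of the href
            if (!absolute.isEmpty()) {
                link.attr("href", proxyPrefix + URLEncoder.encode(absolute, "UTF-8"));
            }
        }
        return doc.outerHtml();
    }
}
```

Usage would be something like rewriteLinks(html, "http://www.google.com/", "http://www.aaaa.com?url="), matching the href scheme mentioned in the update above.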
Recently I used a Mac application called Spotflux. I think it's written in Java (because if you hover over its icon it literally says "java"...).
It's just a VPN app. However, to support itself, it can show you ads... while browsing. You can be browsing on chrome, and the page will load with a banner at the bottom.
Since it is a VPN app, it obviously can control what goes in and out of your machine, so I guess that it simply appends some html to any website response before passing it to your browser.
I'm not interested in making a VPN or anything like that. The real question is: how, using Java, can you intercept the HTML response from a website and append more HTML to it before it reaches your browser? Suppose that I want to make an app that literally puts a picture at the bottom of every site you visit.
This is, of course, a hypothetical answer - I don't really know how Spotflux works.
However, I'm guessing that as part of its VPN, it installs a proxy server. Proxy servers intercept HTTP requests and responses, for a variety of reasons - most corporate networks use proxy servers for caching, monitoring internet usage, and blocking access to NSFW content.
As a proxy server can see all HTTP traffic between your browser and the internet, it can modify that HTTP; often, a proxy server will inject an HTTP header, for instance; injecting an additional HTML tag for an image would be relatively easy.
Here's a sample implementation of a proxy server in Java.
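A naive sketch of the injection step itself, ignoring complications such as fixing the Content-Length header, compressed responses, and non-HTML content types:

```java
/** Appends an image tag just before the closing body tag, leaving the rest of the page alone. */
public class HtmlInjector {

    public static String injectBanner(String html, String imageUrl) {
        String banner = "<img src=\"" + imageUrl + "\" style=\"position:fixed;bottom:0;left:0\"/>";
        int idx = html.toLowerCase().lastIndexOf("</body>");
        if (idx < 0) {
            return html + banner;   // no closing body tag; just append at the end
        }
        return html.substring(0, idx) + banner + html.substring(idx);
    }
}
```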
There are many ways to do this. Probably the easiest would be to proxy HTTP requests through a web proxy like RabbIT (written in java). Then just extend the proxy to mess with the response being sent back to the browser.
In the case of Rabbit, this can either be done with custom code, or with a special Filter. See their FAQ.
WARNING: this is not as simple as you think. Adding an image to the bottom of every page will be hard to do well, depending on what kind of HTML is returned by the server. Depending on the CSS, JavaScript, etc. that the remote site uses, you can't just insert the same HTML markup everywhere and expect it to act the same.
I am interested in logging where a user comes from when accessing my web app.
I thought of using HTTP's referrer header for that, but, judging from e.g. the HTTP referer Wikipedia article, it seems that this is not an accurate/reliable way, since in many cases it is not sent.
I was wondering: is the referrer header the only way? Is there a better/standard approach?
A reliable way would be to have a GET parameter like ?ref=somehash.
For example:
Consider this site, SO: it has a list of questions, and there is a portlet that streams the recent questions to some other site, for example abcd.com. Now, to see whether the user clicked the link from abcd.com, you pass a parameter such as ?ref=423jahjaghr, where this string maps to abcd.com.
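A minimal servlet sketch of reading and logging that parameter; the token-to-site mapping and the names are of course just examples:

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LandingServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String ref = req.getParameter("ref");                    // e.g. /landing?ref=423jahjaghr
        String source = (ref != null) ? lookupSource(ref) : "unknown";
        log("inbound link from: " + source);                     // or write to your analytics store
        resp.sendRedirect("/");                                  // continue to the real page
    }

    // Placeholder: map the opaque token back to the site it was issued to.
    private String lookupSource(String ref) {
        return "423jahjaghr".equals(ref) ? "abcd.com" : "unknown";
    }
}
```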
Referrer header isn't the only way, but it is the most standard.
You can consider using Google Analytics, which has extra referrer capabilities, but you'd have to manually set up collecting the data from their services to feed into your logging infrastructure.
Nothing is going to be 100% foolproof, though. It's pretty straightforward to block Google Analytics and spoof referrers, and HTML5 will make it even easier to prevent sending referrer information.
If it's mission-critical that you know the referrer of all inbound traffic, you'll have to come up with a more draconian approach (like @Jigar Joshi has suggested).
Depending on the browser, you may or may not get the referrer header; you won't always get it. To be certain, you have to use a request parameter or a form field to carry the referrer yourself.
The HTTP Referer header is a good way to analyze logs and to maintain analytics on user interactions. However, a browser or any other system that displays web pages and can traverse them might not send this header.
You might also consider using a third-party application like Google Analytics, but you should check whether such a third-party tool is legal in your country; most of them have data-privacy issues.
Very important: for analytics it is fine to have a certain amount of error in the expected outcome, but never do any security-related checks on the HTTP referer. Someone can send whatever they want as the referer.
My program needs to download object definitions (basically xml files, maybe binary files) on demand via the net. The program will request objects from my server during runtime. The only thing the program has to send the server is a string that identifies the object it needs (e.g. RedCubeIn3DSpace23). So a basic Key, Value system. My app also has to have some basic authentication mechanism to make sure only legitimate programs access my server’s info. Maybe send the license number and a password.
What is the best way to go about implementing this? I have 0 web knowledge so I'm not sure exactly what technologies I need. I have implemented socket programs in college so maybe that is what I need? Are there frameworks for this type of thing? There could be thousands of users/clients simultaneously; maybe more but I don’t know.
One super important requirement is that I need security to be flawless on the server side. That is, I can't have some hacker replacing object definitions with malicious one that clients download. That would be disastrous.
My first thoughts:
-Set up an FTP server and have each XML file named by its key value. The program logs in with its product_id and a fixed password and just downloads files. If I use a good FTP server, that is pretty impervious to a hacker modifying definitions. The drawback is that it's not very expandable or flexible.
-A RESTful-type system. I just learned about this when searching Stack Overflow. I can make categories of objects using URLs, but how do I do authentication and other actions? It might be hard to program, but is this a better approach? Is there a prebuilt library for this?
-Sockets using Java/C#. Java/C# would protect me from overflow attacks, and then it is just a matter of spawning a thread for each connection and setting up a simple messaging protocol and file transfers.
-SOAP. Just learned about it while searching. Don't know much.
-EC2. I think it (and other cloud services?) adds a DB layer over it.
That's what I can come up with, what do you think given my requirements? I just need a little guidance.
HTTP seems a better fit than ftp, since you only want to download stuff. That is, you would set up a web server (e.g. Apache), configure it for whatever authentication scheme you need, and have it serve that content.
SOAP is clearly overkill for this, and using raw sockets would be reinventing the wheel (i.e. a web server).
I'd do security at the socket level, using HTTPS. That way, the client will verify the identity of the server when establishing the connection, and nobody can intercept the password sent to the server. Again, a decent web server will support this out of the box; you just need to configure it properly.
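As a sketch of what the client side could look like with plain HttpsURLConnection and HTTP Basic authentication (the host, path, and class names are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.net.ssl.HttpsURLConnection;

public class ObjectDownloader {

    /** Downloads the XML definition for the given key, e.g. "RedCubeIn3DSpace23". */
    public static String fetchDefinition(String key, String user, String password) throws Exception {
        URL url = new URL("https://objects.example.com/definitions/" + key + ".xml");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();

        // HTTP Basic auth over TLS: the credentials are encrypted on the wire.
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```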
What I need to do is this:
<iframe src="http://www.google.com" width="800" height="600"></iframe>
But the constraint is that I want my website to fetch the requested website and display it in a frame. That is, the client's browser must only have a connection with my web server; my website in turn will fetch the requested URLs and display them to the client.
The only way I have thought of to do this is perhaps passing the URL to an application that in turn downloads the page and then redirects the client's browser to the page (now stored locally on my web server). The problem, however, is that this would only work with rather boring, static sites; I need the website within the website to be fully functioning, i.e. streaming video, secure connections, and so on.
What would be the best way to do this?
I hate to break it to you, but I don't think there's a foolproof way to do this. What you're trying to do is make a proxy, and there are several ways to do it, but either way you won't be able to take things like Flash and JavaScript into account. I've used a lot of different proxies to get around the filter at my school and not one of them has been 100% effective. In fact, I don't think a single one has been able to load the music player on either PureVolume or MySpace.
If you still want to give it a try, read this article: Using Apache As A Proxy Server
If one of your requirements is
... secure connections
that is not possible at all. By definition, a secure end-to-end connection cannot go through a proxy (see man-in-the-middle).
I have found a solution. To whoever mentioned it and then deleted their answer: thanks.
Making use of a reverse proxy can do this; http://docsrv.sco.com/INT_Proxy/revpxy.htm shows some ways in which a reverse proxy may be used.
Paramesh Gunasekaran wrote a tutorial on creating your own reverse proxy with code supplied.
http://www.codeproject.com/KB/IP/reverseproxy.aspx