Java web crawling - static URLS


I still have a lot to learn about crawling techniques, but I was wondering what the best approach to handling static URLs is. I would guess it has to do with cookies, but I'm not positive.
E.g., I search my query on a site, example.com: example.com/search?string=blah then sends me to a URL that is specific to the search string. From there I can go further (to the data I actually want), but the link to the results is a static URL, example.com/results.php?id=33, and the id stays the same regardless of the search string. So the only logical explanation is that a cookie is being passed, right? If so, how would I have Java open a connection, grab the cookies, then open a new connection and pass the cookies? I tried something like this with two methods: one that opens the initial connection and grabs the cookies, and another that opens a new connection and sends them.
There are definitely multiple cookies, if that helps.
Also, any links/resources that you think I might find helpful would be greatly appreciated.
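One possible way to do what the question describes, offered as a minimal sketch rather than a definitive answer: the JDK's java.net.CookieManager can be installed as the default CookieHandler, after which HttpURLConnection stores Set-Cookie headers and sends them back on later requests to the same site. The class name CookieSession and its method names are hypothetical, and the sketch assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
final class CookieSession {

    // Installs a JVM-wide in-memory cookie store. After this call,
    // HttpURLConnection saves Set-Cookie response headers and replays
    // the matching cookies on later requests.
    static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Fetches a page as text; cookie handling is done by the installed manager.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

With this in place, calling install() once, then fetch("http://example.com/search?string=blah") followed by fetch("http://example.com/results.php?id=33") should carry all cookies from the first response into the second request, however many there are.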

Related

Whitelist validation for http request

I am trying to create a servlet request filter that filters incoming requests based on a whitelist of characters.
I want to accept only characters that match the whitelist pattern, so that an attacker cannot get malicious code executed in the form of a script or a modified URL.
Does anyone know which whitelist characters should be used for filtering an HTTP request string?
Any help would be appreciated
Thanks in Advance

Implement a pattern-matching mechanism that finds the whitelisted characters in your URL by using a regular expression.
Follow this link1
Or you can try:
if (inputUrl.contains(whiteList)) {
// your code goes here
}
Or, if you need to know where it occurs, you can use indexOf:
int index = inputUrl.indexOf(whiteList);
if (index != -1) // -1 means "not found"
{
...
}
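Note that contains and indexOf only test whether one substring is present; they do not check that every character of the input is allowed. A minimal sketch of the regular-expression approach mentioned first could look like the following. The character set is only an assumption for illustration, not a recommended whitelist, and WhitelistCheck is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the allowed character set is an assumption.
final class WhitelistCheck {

    // Letters, digits and a few characters that commonly appear in URLs.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9/_.~?&=%-]+");

    // True only if the whole input consists of whitelisted characters.
    static boolean isAllowed(String inputUrl) {
        return inputUrl != null && ALLOWED.matcher(inputUrl).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("/search?string=blah")); // prints true
        System.out.println(isAllowed("/search?q=<script>"));  // prints false
    }
}
```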
Thanks,
~Chandan

The problem is that "malicious" is a very broad term. You should have a clear idea of which types of attack you are trying to protect against, and then take measures to prevent them.
You cannot specify a general set of characters to filter out; you need to know the domain in which the input from the URL will be used. Generally it is not the URL itself that is dangerous but the URL parameters, which are provided by your users and then interpreted by your application. Depending on how your application uses this input, you need to take specific precautions. For example:
A URL parameter is used to determine the target of a redirect. An attacker can use this to send a victim to a malicious site, one that masquerades as your site but steals the user's credentials, and so on. In that case you should construct a whitelist of the destinations your application expects and forbid all others. See OWASP Top Ten - Unvalidated Redirects and Forwards.
You save data from a URL parameter to a database. You should prevent SQL injection by using parameterized queries. See the OWASP SQL Injection Cheat Sheet.
URL parameter data will be displayed as HTML. You should sanitize the HTML with a proven sanitizer such as OWASP HTML Sanitizer or AntiSamy to prevent cross-site scripting.
And so on...
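To illustrate the first case (a whitelist of redirect destinations), here is a minimal sketch. The host names are placeholders and RedirectGuard is a hypothetical class; a real application would load its own list of expected destinations.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the allowed hosts are placeholders.
final class RedirectGuard {

    private static final Set<String> ALLOWED_HOSTS = Set.of("example.com", "www.example.com");

    // Accepts only absolute http/https targets whose host is on the whitelist.
    static boolean isAllowedRedirect(String target) {
        if (target == null) {
            return false;
        }
        try {
            URI uri = URI.create(target);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            return host != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme))
                    && ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI
        }
    }
}
```

Comparing the parsed host, rather than searching the raw string for an allowed name, is meant to reject targets such as https://example.com@evil.example.net/ where the allowed name appears only in the user-info part.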
The point is, there is no silver bullet that protects you from all malicious attack vectors, and certainly not whitelisting certain characters in a servlet filter. You should know where potentially malicious data is used and process it with that specific usage in mind, because different targets have different vulnerabilities and require different protective measures.
A good start for a high-level overview of security issues and the measures for protecting against them is the OWASP Top Ten. After that, I recommend some of the more detailed guides and resources provided by OWASP.

Cleaning up URLs to remove personal information

Are there rules to identify and remove any PII information from URLs? I would like this to be generic and handle all sorts of urls which we might encounter on the internet.
Clarification : I have a list of urls of people browsing the internet and want to remove PII from those.

To answer the question as restated in your reply to snemarch:
Yes, I understand that. I meant: what considerations do I need to keep in mind to identify PII in URLs? What are the various ways in which PII might occur in URLs?
HTTP GET information can be transmitted in many different ways. Some, and likely most, will look like this:
example.com/form.php?key=value.
Other websites, including Stack Overflow, may use a URL rewrite to transform the link "example.com/form/value" into the equivalent "example.com/form.php?key=value". This URL rewrite is completely dependent on the configuration of the server, and there is no simple way to detect and strip off PII presented this way.
With this in mind, there is really no way to 100% remove all PII from a list of different urls, as such information can be indiscernible from a URL without any PII. You can, at the very least, strip out information that is DEFINITELY PII, such as a URL in the form "example.com/form.php?key=value." I would be willing to bet that any URL with a "=" has some sort of variable in it, and should be filtered. Past that, you're going to have to manually parse a majority of the list.
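The one step that can be automated reliably, dropping everything from the first "?" onward, might be sketched as follows. UrlScrubber is a hypothetical name, and the sketch also drops any "#fragment", which is an assumption beyond what is said above.

```java
// Hypothetical helper; it removes the query string (and fragment) only.
final class UrlScrubber {

    // Returns the URL up to, but not including, the first '?' or '#'.
    static String stripQuery(String url) {
        int cut = url.length();
        int question = url.indexOf('?');
        int hash = url.indexOf('#');
        if (question >= 0) {
            cut = Math.min(cut, question);
        }
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        return url.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(stripQuery("example.com/form.php?key=value")); // prints example.com/form.php
    }
}
```

A rewritten URL such as example.com/form/value passes through unchanged, which is exactly the limitation described above.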
Depending on how big the list is and how serious you are about filtering it, you could research the mod_rewrite schemes used by popular products and attempt to match them in your list, scrape the URLs to determine additional information about them, and write some complicated and likely ugly algorithms that attempt to guess what may be a variable in a URL, possibly by taking into account similar URLs a user has visited and comparing the tokens of each URL. Similar URLs with slightly different text in a given token probably contain variables and should be filtered.
Good luck!

You should never pass sensitive user information in a URL via GET. If you use POST instead, just make sure the connection is HTTPS.

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter in your zip code and it will give you all of the nearest locations. The program will just have an empty box for user input for their zip code, and it will query the actual chipotle server to retrieve the nearest locations. How do I do that, and also how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!

First you need to know the parameters needed to execute the query and the URL which these parameters should be submitted to (the action attribute of the form). With that, your application will have to do an HTTP request to the URL, with your own parameters (possibly only the zip code). Finally parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
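As a rough sketch of those steps: the HttpClient mentioned above most likely refers to the Apache library, whereas the example below swaps in the JDK's built-in java.net.http.HttpClient (Java 11 and later) so that it needs no extra dependency. The endpoint URL and the parameter name zip are made up; the real ones have to be read from the form's action attribute and input names.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; endpoint and parameter name are assumptions.
final class LocationQuery {

    // Builds the request URI, URL-encoding the user's input.
    static URI buildUri(String zip) {
        String encoded = URLEncoder.encode(zip, StandardCharsets.UTF_8);
        return URI.create("https://www.example.com/locations?zip=" + encoded);
    }

    // Sends a GET request and returns the response body as text,
    // ready to be handed to whatever parser suits the format.
    static String fetch(String zip) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildUri(zip)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```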

This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing to do could divert visitors away from their site, with a resulting loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they could already be using technical means to make life difficult for people who scrape their service.

strange problem with java.net.URL and java.net.URLConnection

I'm trying to download an image from a url.
The process I wrote works for everyone except for ONE content provider that we're dealing with.
When I access their JPGs via Firefox, everything looks kosher (happy Passover, btw). However, when I use my process I either:
A) get a 404
or
B) in the debugger when I set a break point at the URL line (URL url = new URL(str);)
then after the connection I DO get a file but it's not a .jpg, but rather some HTML that they're producing with generic links and stuff. I don't see a redirect code, though! It comes back as 200.
Here's my code...
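import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL(urlString);
URLConnection uc = url.openConnection();
// Header field 0 is the HTTP status line, e.g. "HTTP/1.1 200 OK"
String val = uc.getHeaderField(0);
// contType was not defined in the snippet; assumed here to be the Content-Type header
String contType = uc.getContentType();
System.out.println("FOUND OBJECT OF TYPE:" + contType);
InputStream is = null; // declaration assumed; it was not shown in the snippet
if (val == null || !val.contains("200")) {
    // problem
} else {
    is = uc.getInputStream();
}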
Has anyone seen anything of this nature? I'm thinking maybe it's some mime type issue, but that's just a total guess... I'm completely stumped.

Maybe the site is just using some kind of protection to prevent others from hotlinking their images or to disallow mass downloads.
They usually check either the HTTP referrer (it must be from their own domain), or the user agent (must be a browser, not a download manager). Set both and try it again.
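A minimal sketch of setting both headers on a plain URLConnection follows. The User-Agent string is just an example value and BrowserLikeRequest is a hypothetical name; note that the header is spelled "Referer" in HTTP.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper; the User-Agent value is an example only.
final class BrowserLikeRequest {

    // Prepares (but does not yet send) a request that carries a
    // Referer and a browser-style User-Agent header.
    static URLConnection open(String imageUrl, String referer) throws IOException {
        URLConnection uc = URI.create(imageUrl).toURL().openConnection();
        uc.setRequestProperty("Referer", referer);
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        return uc;
    }
}
```

The request is only sent once the caller asks for the response, for example with getInputStream().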

if(!val.contains("200")) // ...
First of all, I would suggest using the class HttpURLConnection, which provides the method getResponseCode().
Searching the whole data for some '200' implies:
performance issues, and
inconsistency (binary files can contain some '200')
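A minimal sketch of the getResponseCode() approach follows. Turning off automatic redirect following is an addition of this sketch, not part of the suggestion above; it is there so that a 301 or 302 is reported instead of being followed silently. StatusProbe is a hypothetical name.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for inspecting the numeric HTTP status.
final class StatusProbe {

    // Returns the status code of the first response, without following redirects.
    static int statusOf(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```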

Have you tried using Wireshark to see exactly what packets are going back and forth? This is often the fastest way to see what is different. That is:
First run Wireshark while using Firefox to get the image, and then
Run Wireshark while using your code to get it.
Then compare and contrast the packets in both directions and I almost guarantee that you'll see something different in the HTTP headers or some other part of the traffic that will explain the problem.

All good guesses, but the "right" answer reward, I think, has to go to ivan_pertrovich_ivanovich_harkovich_rostropovitch_o'neil because using HttpURLConnection I was able to see that, in fact, before getting a 404, I'm first getting a 301. So, now, it's just a matter of finding out from these people what they're expecting in the header, which would make them less inclined to redirect me.
Thanks for the suggestion.

Multiple questions in Java (Validating URLs, adding applications to startup, etc.)

I have an application to build in Java, and I've got some questions to put.
Is there some way to know whether the URL of a webpage is real? The user enters the URL and I have to test whether it's real or not.
How can I know whether a webpage has changed since a given date, or what the date of its last update is?
In Java, how can I make an application run at PC boot? The application must run from the moment the user turns on the computer.

I'm not sure what kind of application you want to build. I'll assume it's a desktop application. In order to check if a URL exists, you should make an HTTP HEAD request and parse the results. HEAD can also be used to check if the page has been modified. In order for an application to start when the PC boots, you have to add a registry entry under Windows; this process is explained here.
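A HEAD request along those lines might be sketched like this. HeadCheck is a hypothetical name, and treating any 2xx or 3xx status as "exists" is an assumption of the sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper built on HttpURLConnection.
final class HeadCheck {

    // Prepares a HEAD request, which asks for headers only, not the body.
    private static HttpURLConnection head(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        return conn;
    }

    // True if the server answers with a 2xx or 3xx status.
    static boolean exists(String address) throws IOException {
        HttpURLConnection conn = head(address);
        try {
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } finally {
            conn.disconnect();
        }
    }

    // Value of the Last-Modified header in epoch milliseconds,
    // or 0 if the server does not send one.
    static long lastModifiedMillis(String address) throws IOException {
        HttpURLConnection conn = head(address);
        try {
            return conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }
}
```

Many dynamic pages do not send Last-Modified at all, so a result of 0 means "unknown" rather than "never changed".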

To check whether a url is valid you could try using a regular expression (regex for urls).
To know if a webpage has changed you can take a look at the http headers (reading http headers in java).
You can't make the program start up automatically on boot; the user must do that. However, you can write code to help the user set the program as a startup app; this, however, depends on the operating system.

I'm not sure what you mean by "real". If you mean "valid", then you can just construct a java.net.URL from a String and catch the resulting MalformedURLException if it's not valid. If you mean that there's actually something there, you could issue an HTTP HEAD request like Geo says, or you could just retrieve the content. HTTPUnit is particularly handy for retrieving web content.
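The "valid" check could be sketched as below. UrlSyntax is a hypothetical name. Keep in mind that the java.net.URL constructor is fairly lenient (it mainly rejects a missing or unknown protocol) and that it is deprecated in recent JDKs, hence the annotation.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; checks syntax only, not whether anything is there.
final class UrlSyntax {

    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
    static boolean isWellFormed(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://example.com/index.html")); // prints true
        System.out.println(isWellFormed("not a url"));                     // prints false
    }
}
```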
HTTP headers may indicate when the content has changed, as nan suggested above. If you don't want to count on that, you can just retrieve the page and store it or, even better, store a hash of the page content. See DigestOutputStream for generating a hash. On a subsequent check for changes you would simply compare the new hash with the one you stored last time.
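The hash comparison could be sketched as follows. For brevity it calls MessageDigest directly on the downloaded bytes instead of wrapping a stream in DigestOutputStream; SHA-256 and the name PageFingerprint are choices made for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; SHA-256 is an assumed choice of algorithm.
final class PageFingerprint {

    // Hex-encoded SHA-256 digest of the page content.
    static String sha256Hex(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True if the freshly downloaded content no longer matches the stored hash.
    static boolean hasChanged(String storedHash, byte[] newContent) throws NoSuchAlgorithmException {
        return !sha256Hex(newContent).equals(storedHash);
    }
}
```

Pages that embed timestamps, ads or session tokens will hash differently on every fetch, so this works best on content that is otherwise stable.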
Nan is right about start on boot. What OS are you targeting?
