Cleaning up URLs to remove personal information - java

Are there rules to identify and remove any PII from URLs? I would like this to be generic and handle all sorts of URLs which we might encounter on the internet.
Clarification: I have a list of URLs from people browsing the internet and want to remove PII from those.

To answer the question as restated in your reply to snemarch:
Yes, I understand that. I meant: what considerations do I need to keep in mind to identify PII in URLs? What are the various ways in which PII might occur in URLs?
HTTP GET information can be transmitted in many different ways. Some, and likely most, will look like this:
example.com/form.php?key=value.
Other websites, including Stack Overflow, may use a URL rewrite to transform the link "example.com/form/value" into the equivalent "example.com/form.php?key=value". This URL rewrite is completely dependent on the configuration of the server, and there is no simple way to detect and strip off PII presented this way.
With this in mind, there is really no way to remove 100% of the PII from a list of different URLs, as such information can be indiscernible from a URL without any PII. You can, at the very least, strip out information that is DEFINITELY PII, such as a URL in the form "example.com/form.php?key=value". I would be willing to bet that any URL with an "=" has some sort of variable in it and should be filtered. Past that, you're going to have to manually parse a majority of the list.
Depending on how big the list is and how serious you are about filtering it, you could research popular mod_rewrite methods for popular products and attempt to match them in your list, scrape URLs to determine additional information about them, and write some complicated and likely ugly algorithms to guess at what may be a variable in a URL - possibly taking into account similar URLs a user has visited and comparing the tokens of those URLs. Similar URLs with slightly different text in a given token probably contain variables and should be filtered.
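As a concrete starting point, here is a minimal Java sketch of the "strip the query string" idea (the sample URL is made up; mod_rewrite-style parameters embedded in the path are not handled, for the reasons above):

import java.net.URI;
import java.net.URISyntaxException;

public class PiiStripper {

    // Drops the query string and fragment, the most obvious carriers of
    // key=value data. Parameters hidden in rewritten paths are NOT detected here.
    public static String stripQuery(String url) throws URISyntaxException {
        URI in = new URI(url);
        return new URI(in.getScheme(), in.getAuthority(), in.getPath(), null, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(stripQuery("http://example.com/form.php?key=value"));
        // prints: http://example.com/form.php
    }
}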
Good luck!

You should never pass sensitive user information in a URL via GET. If you use POST instead, then just make sure the connection is HTTPS.

Related

Whitelist validation for HTTP requests

I am trying to create a servlet request filter which filters any incoming request based on a whitelist of characters.
I want to accept only those characters which match the whitelist pattern, to prevent an attacker from executing malicious code in the form of a script or a modified URL.
Does anyone know which whitelist characters should be used for filtering any HTTP request string?
Any help would be appreciated
Thanks in Advance
Implement a pattern-matching mechanism that finds whitelisted characters in your URL by using a regex.
Follow this link.
Or you can try:

if (inputUrl.contains(whiteList)) {
    // your code goes here
}

Or, if you need to know where it occurs, you can use indexOf:

int index = inputUrl.indexOf(whiteList);
if (index != -1) { // -1 means "not found"
    ...
}
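If the goal is to check that a URL consists only of whitelisted characters (rather than that it contains a whitelisted substring), a regex is a better fit. A minimal sketch, where the allowed character set is only an assumption and should be adjusted to your needs:

import java.util.regex.Pattern;

public class WhitelistFilter {

    // Hypothetical whitelist: letters, digits and a few common URL punctuation characters.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9/_\\-.?=&]*");

    public static boolean isClean(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isClean("/search?q=hello"));                      // true
        System.out.println(isClean("/search?q=<script>alert(1)</script>"));  // false
    }
}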
Thanks,
~Chandan
The problem is that "malicious" is a very broad term. You should have a clear idea of what types of attacks you are trying to protect against and then take measures to prevent them.
You cannot specify a general set of characters that need to be filtered out; you need to know the domain in which your URL input will be used. Generally it is not the URL itself that is dangerous, but the URL parameters which are provided by your users and then interpreted by your application. Depending on how your application will use this input, you need to take specific precautions. For example:
A URL param is used to determine the target of a redirect. An attacker can use this to navigate a victim to a malicious site, or to a site which masquerades as yours and steals the user's credentials, and so on. In that case you should construct a whitelist of allowed destinations expected by your application and forbid all others (a minimal sketch of this follows the list). See OWASP Top Ten - Unvalidated Redirects and Forwards.
You save data from a URL param to the DB. You should prevent SQL injection by using parameterized queries. See the OWASP SQL Injection Prevention Cheat Sheet.
URL param data will be displayed as HTML. You should sanitize the HTML with an already proven sanitizer such as the OWASP HTML Sanitizer or AntiSamy to prevent cross-site scripting.
And so on...
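As an illustration of the first point, here is a minimal sketch of a redirect whitelist; the allowed targets and the fallback are assumptions, not something taken from the question:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RedirectValidator {

    // Hypothetical whitelist of redirect targets this application expects.
    private static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("/home", "/account", "/orders"));

    // Returns the requested target if it is whitelisted, otherwise a safe default.
    public static String safeTarget(String requested) {
        return ALLOWED.contains(requested) ? requested : "/home";
    }

    public static void main(String[] args) {
        System.out.println(safeTarget("/account"));             // /account
        System.out.println(safeTarget("http://evil.example"));  // /home
    }
}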
The point is, there is no silver bullet to protect you from all malicious attack vectors, and certainly not by whitelisting certain characters in a servlet filter. You should know where potentially malicious data is used and process it with its specific usage in mind, because different targets have different vulnerabilities and require different measures of protection.
A good starting point for a high-level overview of security issues and measures for protecting against them is the OWASP Top Ten. After that, I recommend the more detailed guides and resources provided by OWASP.

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client does work in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching them, but I would have no idea how to pull out information such as the licensor and the restrictions on the agreement from a paragraph. These are not hashes but just long contracts. Even if I were to search for 'Licensor', it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful. JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this:
If you are concerned only with 'Licensor' followed by a name, you could identify the format of that particular instance and search for something similar using a regex you create. This can be extrapolated to other kinds of searching.
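By way of illustration, a minimal sketch of that regex idea, assuming the contracts spell it out roughly as 'Licensor: Some Name' (the pattern and the sample text are made up and would need tuning against real documents):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LicensorExtractor {

    // Hypothetical pattern: the word "Licensor", an optional colon, then up to
    // four capitalized tokens that we treat as the name.
    private static final Pattern LICENSOR =
            Pattern.compile("Licensor\\s*:?\\s+((?:[A-Z][\\w&.']*\\s*){1,4})");

    public static void main(String[] args) {
        String text = "This Agreement is made between Licensor: Acme Corp. and Licensee: Foo Ltd.";
        Matcher m = LICENSOR.matcher(text);
        while (m.find()) {
            System.out.println("Possible licensor: " + m.group(1).trim());
        }
    }
}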
For getting text from an image, try using the API's on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once the text has been pulled out of the PDF it is just text, so you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and a StringBuffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances (made by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mention (thousands), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
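For what it's worth, a minimal sketch of plain-text extraction with the Tika facade class; the file name is just a placeholder, and the tika-parsers dependency is assumed to be on the classpath:

import java.io.File;
import org.apache.tika.Tika;

public class TextExtractor {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the format (PDF, HTML, Word, plain text, ...) and returns the text content.
        String text = tika.parseToString(new File("contract.pdf"));
        System.out.println(text);
    }
}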

URL structure of a web page with different languages, different characters

I'm working on a news site that is created with JSP. I would like to change the link structure to use the news item's title, not only its ID.
In the following screenshot, the website puts the exact title into the URL although it contains some special characters.
I would like to generate URLs like: mydomain.com/news/id-title
I have some questions about that:
1- Is it a correct approach to use URLs like this with special characters? If not, how can I create a URL for a Russian title (completely different characters)?
2- Should I change these characters? What are the advantages and disadvantages (according to SEO)?
3- Does putting the title in the URL have any SEO benefit compared with a URL created from only the content ID?
The approach is good, but I would recommend against using special language characters in URLs; it often leads to errors and confusion, unless they are parameter values, which often need to hold special values (though even in those cases it is better to refrain from using special characters).
Instead, it would be good to use the English title, for example.
For example, take a look at this URL:
http://www.teamliquid.net/forum/starcraft-2/437452-scelight-50-run-without-java-merged-accounts?page=21
It is much more readable and doesn't depend on who sees it; also, if you use services like Google Analytics, the resource/page is directly readable from the URL, etc.
[Stack Overflow is not the correct place for SEO advice (these questions are off-topic here). You could ask it on Webmasters SE, but this question is most likely answered there already. So my following answer will leave out any SEO aspects.]
You have to percent-encode the URL:
Some characters are allowed directly (e.g., a-z, A-Z, 0-9 etc.) -- you don’t have to percent-encode them,
some characters are allowed directly but have a reserved meaning -- you only have to percent-encode them if you don’t want this meaning,
and most characters are not allowed (including everything non-ASCII) -- you always have to percent-encode them.
Check the URI specification to learn which characters are allowed in which component.
Most programming languages have methods for percent-encoding URLs. For JSP, see for example these questions:
How to encode a URL with the special character "percentage"?
How to URL encode a URL in JSP?
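To make this concrete, here is a small sketch using the standard java.net.URI class; the domain and the /news/id-title layout are placeholders following the scheme from the question:

import java.net.URI;
import java.net.URISyntaxException;

public class SlugEncoder {
    public static void main(String[] args) throws URISyntaxException {
        String title = "Пчёлы";  // hypothetical Russian title
        URI uri = new URI("http", "mydomain.com", "/news/42-" + title, null);
        // toASCIIString() percent-encodes the non-ASCII characters in the path.
        System.out.println(uri.toASCIIString());
        // prints: http://mydomain.com/news/42-%D0%9F%D1%87%D1%91%D0%BB%D1%8B
    }
}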
Take for example the Russian Wikipedia page about bees. In your browser’s address bar, the URL will most likely look like
http://ru.wikipedia.org/wiki/Пчёлы
But the real URL is
http://ru.wikipedia.org/wiki/%D0%9F%D1%87%D1%91%D0%BB%D1%8B
You can easily check this yourself by copy-pasting the URL from the address bar to a text document.

Java web crawling - static URLS

I'm going to look a bit more into the techniques, because obviously there is a lot to learn, but I was wondering what the best approach to handling static URLs is. I would guess it has to do with cookies, but I'm not positive.
E.g., I run my query on a site example.com; example.com/search?string=blah then sends me to a URL that is specific to the search string. From there I can go further (to the data I actually want), but the link to the results is a static URL, example.com/results.php?id=33, and the id stays the same regardless of the search string. So the only logical explanation is a cookie being passed, right? If so, how would I have Java open a connection, grab the cookies, then open a new connection and pass the cookies along? I tried something like this with two methods: one that opens the initial connection and grabs the cookies, then opens a new connection and passes the cookies to that method.
There are definitely multiple cookies if that helps.
Also, any links/resources that you think I might find helpful would be greatly appreciated.
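For reference, the cookie-passing approach described above is roughly what java.net.CookieManager automates: install it as the default handler and HttpURLConnection will store cookies from the first response and send them on later requests. A minimal sketch with placeholder URLs:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieCrawler {

    public static void main(String[] args) throws Exception {
        // Cookies received on the first request are replayed automatically on the second.
        CookieHandler.setDefault(new CookieManager());

        fetch("http://example.com/search?string=blah");
        fetch("http://example.com/results.php?id=33");
    }

    private static void fetch(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        System.out.println(address + " -> " + conn.getResponseCode());
        conn.disconnect();
    }
}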

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter in your zip code and it will give you all of the nearest locations. The program will just have an empty box for user input for their zip code, and it will query the actual chipotle server to retrieve the nearest locations. How do I do that, and also how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters needed to execute the query and the URL which these parameters should be submitted to (the action attribute of the form). With that, your application will have to do an HTTP request to the URL, with your own parameters (possibly only the zip code). Finally parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
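For illustration, a minimal sketch using Apache HttpClient 4.x; the endpoint and the zip parameter are made-up placeholders, and the real ones have to be read from the form's action attribute and input names:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StoreLocatorClient {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/store-locator?zip=10001";  // hypothetical endpoint
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            // The body is whatever the site returns (HTML, XML or JSON) and still needs parsing.
            String body = EntityUtils.toString(response.getEntity());
            System.out.println(body);
        }
    }
}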
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded JavaScript), you may need to use something that evaluates the JavaScript as part of the content-extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing to do could result in a diversion of visitors to their site and a resulting loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they could already be using technical means to make life difficult for people who scrape their service.
