Java: How to easily check if a URL was already shortened? - java

If I have a general url (not restricted to twitter or google) like this:
http://t.co/y4o14bI
is there an easy way to check if this url is shortened?
In the above case, I as a human can of course see that it was shortened, but is there an automatic and elegant way?

You could make a request to the URL, check whether you get redirected and, if so, assume it's a shortening service. For this you'd have to read the HTTP status codes.
On the other hand, you could whitelist some URL shortening services (t.co, bit.ly, and so on) and assume all links to those domains are shortened.
The drawback of the first method is that it isn't certain, because some sites use redirects internally. The drawback of the second method is that you'd have to keep adding shortening services, although only a few are used widely.

One signal may be to request the URL and see if it results in a redirect to another domain. However, without a good definition of what "shortened" means, there is no generic way.

If you know all the domains that can be used to shorten your URLs, check whether the URL starts with one of them:
String[] domains = {"bit.ly", "t.co"...};
for (String domain : domains) {
    if (url.startsWith("http://" + domain)) {
        return true;
    }
}
return false;
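A prefix check like the one above also matches hosts such as t.co.example.com and misses https links. A minimal sketch of a host-based variant follows, assuming Java 9 or later. The class name KnownShorteners and the three-entry list are illustrative assumptions, not a maintained list of services.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

class KnownShorteners {
    // Illustrative list only; a real list would have to be maintained over time.
    private static final Set<String> DOMAINS = Set.of("bit.ly", "t.co", "tinyurl.com");

    /** Returns true if the URL's host is exactly one of the listed domains. */
    static boolean isKnownShortener(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // no host part, e.g. a relative reference
            }
            host = host.toLowerCase(Locale.ROOT);
            if (host.startsWith("www.")) {
                host = host.substring(4); // treat www.bit.ly like bit.ly
            }
            return DOMAINS.contains(host);
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        System.out.println(isKnownShortener("http://t.co/y4o14bI"));
        System.out.println(isKnownShortener("https://t.co.example.com/page"));
    }
}
```

Comparing the parsed host rather than the string prefix makes the scheme irrelevant and avoids accidental matches on longer host names.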

You can't: you will have to work by assumption.
Assumptions:
Does www exist in the URL?
Does the server name end with a valid domain (e.g. com, edu, etc.), or does it have co.xx, where xx is a valid country or organization code?
You can add more assumptions based on other URL-shortening links.

You can't.
You can only check if you list a couple of shorteners and check if the url starts with it.
You can also try checking whether the url is shorter than a given length (and contains path/query string), but some shorteners (tinyurl for example) may have longer urls than normal sites (aol.com)
I would prefer the list of known shorteners.

Here's what you could do in Java, Groovy and the like.
Get the URL you want to test.
Open the URL with HttpURLConnection.
Check the response code.
If it is a valid code, 200 for example, then you can retrieve the URL string from the connection object: in long form if it was shortened, or in its original form if it wasn't.
We all love to see some code, don't we? It's crude, but hey!
String addr = "http://t.co/y4o14bI";
URL url = new URL(addr);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// HttpURLConnection follows redirects by default (except those that switch
// between http and https), so after the response arrives getURL() reports
// the final address.
if (connection.getResponseCode() == 200) {
    String longUrl = connection.getURL().toString();
    System.out.println(longUrl);
} else {
    // You decide what you want to do here!
}

Actually, you as a human can't. The only way you know that it's shortened is that it's a t.co domain. The y4o14bI could be a CMS identifier for all you know.
The best way would be to use a list of known shortener urls, and lookup against that.
And even then you would have problems. I use bit.ly with a personal domain, wtn.gd
So http://wtn.gd/random would also be a shortened URL.
You could maybe do an HTTP HEAD request and check for a 301/302?

If you request a URL like this, your HTTP client should receive an HTTP redirect instead of an HTML page. This wouldn't be proof, but at least a hint.
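The redirect hint could be checked along these lines. This is only a sketch: the class and method names are made up for illustration, the 5-second timeouts are arbitrary, and a redirect to another host is still just a hint (a site moving from example.com to www.example.com would also count).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

class RedirectProbe {
    /** True for the 3xx codes that send the client to the Location header. */
    static boolean isRedirectStatus(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** True if the redirect target resolves to a different host than the original URL. */
    static boolean pointsToOtherHost(String originalUrl, String location) {
        if (location == null) {
            return false;
        }
        URI original = URI.create(originalUrl);
        URI target = original.resolve(location); // copes with relative Location values
        String from = original.getHost();
        String to = target.getHost();
        return from != null && to != null && !from.equalsIgnoreCase(to);
    }

    /** Sends a single HEAD request without following the redirect. */
    static boolean looksShortened(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        connection.setInstanceFollowRedirects(false);
        connection.setRequestMethod("HEAD");
        connection.setConnectTimeout(5000); // arbitrary timeouts for the sketch
        connection.setReadTimeout(5000);
        try {
            int status = connection.getResponseCode();
            return isRedirectStatus(status)
                    && pointsToOtherHost(address, connection.getHeaderField("Location"));
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Only touches the network for URLs passed on the command line.
        for (String address : args) {
            System.out.println(address + " -> " + looksShortened(address));
        }
    }
}
```

Keeping the status and host comparison in separate methods lets the decision logic be tested without network access.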

Evaluate the URL and look for some clues:
the Path meets certain criteria
only has one step (i.e. not multiple slashes)
does not end with filename extensions
not longer than X characters (would need to evaluate various URL shortening services and adjust the upper bounds for the max token length)
HttpURLConnection returns a redirect response code (i.e. 301, 302)
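The path-based clues in this list could be combined roughly as follows. The upper bound of 10 characters is a placeholder assumption standing in for X, and the class and method names are hypothetical. Ordinary pages such as http://example.com/about also pass, so this is a weak signal on its own.

```java
import java.net.URI;

class ShortUrlHeuristics {
    // Placeholder for "X"; would need tuning against real shortening services.
    static final int MAX_TOKEN_LENGTH = 10;

    /** Checks the path clues: a single step, no file extension, short token. */
    static boolean pathLooksLikeShortToken(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a URI
        }
        if (path == null || path.length() < 2 || !path.startsWith("/")) {
            return false; // no path, or just "/"
        }
        String token = path.substring(1);
        if (token.endsWith("/")) {
            token = token.substring(0, token.length() - 1); // ignore one trailing slash
        }
        boolean singleStep = !token.isEmpty() && token.indexOf('/') < 0;
        boolean noExtension = token.indexOf('.') < 0;
        boolean shortEnough = token.length() <= MAX_TOKEN_LENGTH;
        return singleStep && noExtension && shortEnough;
    }

    public static void main(String[] args) {
        System.out.println(pathLooksLikeShortToken("http://t.co/y4o14bI"));
        System.out.println(pathLooksLikeShortToken("http://example.com/articles/2011/index.html"));
    }
}
```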

I would suggest using android.util.Patterns.WEB_URL
public static List<String> findUrls(String input) {
List<String> links = new ArrayList<>();
Matcher m = android.util.Patterns.WEB_URL.matcher(input);
while (m.find()) {
String url = m.group();
links.add(url);
}
return links;
}

Use the unshorten URL service like https://unshorten.me
They have an API as well https://unshorten.me/api
If the URL is shortened it will return the original URL.
If not you will get the same one back.

Related

Java - Retain only the Rest API base URL of a Rest API

Small question regarding how to use java to retain only the base URL of a rest API please.
As input, many strings, all valid rest APIs.
For instance, the inputs:
https://some-host.com/v1/someapi
https://another-host.fr/api/compute
https://somewhere.host.com/public/api/v3/getsomething
I would like to only retain the bold part, basically, the https, the : and the slashes, the host name. Everything that comes after the host, I would like to discard it.
Currently, I am trying some kind of string.split based on the / character, then trying to re-concat the arrays, but I have a feeling I am not going to the right direction.
What would be the most appropriate way please?
Thank you.
You could just try java.net.URL or java.net.URI. They behave very similarly.
For example:
URL url = new URL("http://example.com/a/b/c");
url.getProtocol();
url.getHost();
url.getPath();
or:
URI uri = new URI("http://example.com/a/b/c");
uri.getScheme();
uri.getHost();
uri.getPath();
There are several methods in both classes to extract lots of different parts.
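Putting those getters together, one way to keep only the scheme and host of the inputs from the question might look like this. It is a sketch; the class name BaseUrl and the choice to throw on relative input are assumptions.

```java
import java.net.URI;

class BaseUrl {
    /** Returns scheme://authority and drops path, query and fragment. */
    static String baseOf(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        // getAuthority() keeps an explicit port (and user info) if one is present.
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(baseOf("https://some-host.com/v1/someapi"));
        System.out.println(baseOf("https://another-host.fr/api/compute"));
        System.out.println(baseOf("https://somewhere.host.com/public/api/v3/getsomething"));
    }
}
```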

Render Tapestry page and get it as Stream/String resource

Is there any convenient way to dynamically render a page inside the application and then retrieve its contents as an InputStream or String?
For example, the simplest way is:
// generate url
Link link = linkSource.createPageRenderLink("SomePageLink");
String urlAsString = link.toAbsoluteURI() + "/customParam/" + customParamValue;
// get info stream from url
HttpGet httpGet = new HttpGet(urlAsString);
httpGet.addHeader("cookie", request.getHeader("cookie"));
HttpResponse response = new DefaultHttpClient().execute(httpGet);
InputStream is = response.getEntity().getContent();
...
But it seems there must be an easier way to achieve the same result. Any ideas?
I created tapestry-offline for exactly this purpose. Please be aware of the issue here (workaround included).
It's probably best to understand your exact use case. If, for example, you are generating emails in a scheduled task, it's probably better to configure jenkins or cron to hit a URL.
It's probably also worth mentioning the capture component from tapestry-stitch.
This is only useful in situations where you want to capture part of a page as a String during page / component render.

Open a link from Java, how to hide GET parameter

I want to open a link from Java. I tried this:
public static void main(String[] args) {
try {
//Set your page url in this string. For eg, I m using URL for Google Search engine
String url = "http://myurl.com?id=xx";
java.awt.Desktop.getDesktop().browse(java.net.URI.create(url));
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
It is working fine but the problem is that the query string is in that url. I don't want to pass it as a query string because it is a secret key. It should be passed as hidden to webpage request. How can I do this?
You can't, directly.
You'd need to use a POST instead of a GET to hide the value, and a URL does not encode the method used to access it, so opening it will always use GET.
You could conceivably write out a HTML file that automagically does a POST to the desired URL (using some JavaScript) and open that (using a file:// URL).
But note that "hiding" the parameter like this adds no real security! An interested user that wants to know the value that his PC sends to some site will be able to see it. It might take slightly more effort to find it, but it's definitely not impossible.
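The HTML-file idea could be sketched roughly like this. The class name, the form layout and the temp-file location are assumptions for illustration, and the value still ends up readable in a file on disk, so the caveat above applies in full.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class AutoPostPage {
    /** Escapes a value for use inside a double-quoted HTML attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a page whose only job is to submit one hidden field via POST. */
    static String buildHtml(String actionUrl, String name, String value) {
        return "<!DOCTYPE html>\n"
                + "<html><body onload=\"document.forms[0].submit()\">\n"
                + "<form method=\"post\" action=\"" + escapeAttribute(actionUrl) + "\">\n"
                + "<input type=\"hidden\" name=\"" + escapeAttribute(name)
                + "\" value=\"" + escapeAttribute(value) + "\">\n"
                + "</form></body></html>\n";
    }

    /** Writes the page to a temporary file and returns its path. */
    static Path writeTempPage(String actionUrl, String name, String value)
            throws IOException {
        Path file = Files.createTempFile("autopost", ".html");
        Files.write(file, buildHtml(actionUrl, name, value).getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path page = writeTempPage("http://myurl.com", "id", "xx");
        System.out.println(page.toUri());
        // To open it in the default browser:
        // java.awt.Desktop.getDesktop().browse(page.toUri());
    }
}
```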
If there is no need to show the particular url in a browser, then you could handle the link as an HttpURLConnection (see JavaDoc).
And here you have an example.

Search for redirected website path

I have two websites, A and B. When I open website A, I am redirected to website B automatically.
Which function can I use to find the full path on website A that the redirect came from?
I was trying to start with:
logger.info(request.getPathInfo());
logger.info(request.getPathTranslated());
logger.info(request.getServletPath());
logger.info(request.getLocalName());
logger.info(request.getRemoteAddr());
logger.info(request.getRemoteHost());
logger.info(request.getRequestURI());
logger.info(request.getServerName());
but none of them is correct.
For redirecting I use response.sendRedirect inside Controller.
Thanks for help.
You can try using the optional referer header:
request.getHeader("referer");
But it is important to note that this may not always be populated (specifically IE).
A better solution, if you are in control of both of the websites, is to pass the value somehow when you are doing the redirect. For example, as a GET or POST parameter.
Edit:
As suggested above, you can append query strings to your redirect URL. For example, you might try something like this:
String redirectUrl = "http://my.redirect.com/";
redirectUrl += "?referer=";
redirectUrl += URLEncoder.encode(request.getRequestURL().toString(), "UTF-8");
Then you can just pull this out of the request on the other side.
Use this as a starting point. You may need to manually append other query parameters that may not be part of the getRequestURL() output.
None of these would get you the page that redirected you to the current page. What you can try is:
String refererPage = request.getHeader("referer");
However keep in mind that this is also browser dependent and may not always be present.
Try this
request.getHeader("referer");
Please try
request.getHeader("referer");

Very Simple Regex Question

I have a very simple regex question. Suppose I have 2 conditions:
url =http://www.abc.com/cde/def
url =https://www.abc.com/sadfl/dsaf
How can I extract the baseUrl using regex?
Sample output:
http://www.abc.com
https://www.abc.com
Like this:
String baseUrl = null;
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
// find() rather than matches(): the pattern covers only the start of the URL
if (m.find())
    baseUrl = m.group(1);
However, you should use the URI class instead, like this:
URI uri = new URI(str);
A one liner without regexp:
String baseUrl = url.substring(0, url.indexOf('/', url.indexOf("//")+2));
/^(https?\:\/\/[^\/]+).*/$1/
This will capture ANYTHING that starts with http and $1 will contain everything from the beginning to the first / after the //
Except for write-and-throw-away scripts, you should always refrain from parsing complex syntaxes (e-mail addresses, URLs, HTML pages, etc.) using regexes.
Believe me, you will get bitten eventually.
I'm pretty sure that there is a Java class that will allow path manipulations, but if it has to be a regex,
https?://[^/]+
would work. (s? included to also handle https:)
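Applied from Java, that pattern might be used as below. This is a sketch with a hypothetical class name; the pattern is anchored with ^ here so that it only matches at the start of the string, and null is returned when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseUrlRegex {
    // Scheme (http or https), "://", then everything up to the next slash.
    private static final Pattern BASE = Pattern.compile("^https?://[^/]+");

    /** Returns the matched prefix, or null if the input does not start with http(s)://. */
    static String baseOf(String url) {
        Matcher m = BASE.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(baseOf("http://www.abc.com/cde/def"));
        System.out.println(baseOf("https://www.abc.com/sadfl/dsaf"));
    }
}
```

A URL without any slash after the host but with a query string (for example http://www.abc.com?x=1) would be returned whole, which is one of the cases where a real URL parser is safer.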
Looks like the simplest solution to your two specific examples would be the pattern:
[^/]*//[^/]*
i.e.: non-slash (0 or more times), two slashes, non-slash (0 or more times). You can be stricter than that if you wish, as the two existing answers are doing in different ways -- one will reject e.g. URLs starting with ftp:, the other will reject domains with underscores (but accept URLs without a leading protocol://, thereby being even broader than mine in that respect). This variety of answers (all correct wrt your scant specs;-) should suggest to you that your specs are too vague and should be tightened.
Here's a regex that should satisfy the problem as given.
https?://[^/]*
I'm assuming you're asking this partly to gain more knowledge of regexes. If, however, you're trying to pull the host from a URL, it's arguably much more correct to use Java's more robust parsing methods:
String urlStr = "https://www.abc.com/stuff";
URL url = new URL(urlStr);
String host = url.getHost();
String protocol = url.getProtocol();
URL baseUrl = new URL(protocol, host, ""); // URL has no two-argument constructor; the third argument is the file part
This is better, as it should catch more cases if your input URL isn't as strict as described above.
Old post.. thought I might as well put a simple answer to a simple regex Q:
(http|https):\/\/(www.)?(\w+)?\.(\w+)?
