Recursively expanding short URLs in a big text (Java)

I have collected about 2 * 10^5 rows of text, and I have to expand the shortened URLs in them. The problem is that some expanded URLs redirect to yet another shortened URL, which means the original URL was shortened two or more times. How can I handle that efficiently? If I use a while loop, it takes too much time.
The problem with the code below is that it causes about 2 * 10^5 context switches to my MySQL DB. I extract the URL from each text and then expand it. That makes an HTTP request, and then the result has to be written to my DB. It takes too much time.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
public class urlExpander {
public static void main(String[] args) throws IOException {
String shortenedUrl = "https://www.youtube.com/watch?v=4oZTPGvG3s8&feature=youtu.be";
String expandedURL = expandUrl(shortenedUrl);
System.out.println(shortenedUrl + "-->" + expandedURL);
}
public static String expandUrl(String shortenedUrl) throws IOException {
URL url = new URL(shortenedUrl);
// open connection
HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
// stop following browser redirect
httpURLConnection.setInstanceFollowRedirects(false);
// extract location header containing the actual destination URL
String expandedURL = httpURLConnection.getHeaderField("Location");
httpURLConnection.disconnect();
return expandedURL;
}
}
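A hop-by-hop loop does not have to be slow if each short URL is resolved at most once and the chain length is capped. The sketch below is one possible approach, not the asker's code: ChainExpander, singleHop and maxHops are hypothetical names, and the single-hop lookup is passed in as a function so that something like the expandUrl method above (which returns the Location header, or null when there is no redirect) can be plugged in. Results are cached in memory, so a URL that appears in many rows costs only one chain of HTTP requests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Follows a redirect chain one hop at a time, with a hop limit and an in-memory cache. */
final class ChainExpander {
    // Assumption: returns the next URL, or null when the URL does not redirect.
    private final UnaryOperator<String> singleHop;
    private final int maxHops;
    private final Map<String, String> cache = new HashMap<>();

    ChainExpander(UnaryOperator<String> singleHop, int maxHops) {
        this.singleHop = singleHop;
        this.maxHops = maxHops;
    }

    String expand(String url) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = url;
        for (int hop = 0; hop < maxHops; hop++) {
            String known = cache.get(current);
            if (known != null) {          // already resolved earlier: reuse the result
                current = known;
                break;
            }
            if (!seen.add(current)) {     // redirect loop: stop instead of spinning
                break;
            }
            visited.add(current);
            String next = singleHop.apply(current);
            if (next == null) {           // no further redirect: current is the final URL
                break;
            }
            current = next;
        }
        // Remember the result for every URL seen on the way.
        for (String v : visited) {
            cache.put(v, current);
        }
        return current;
    }
}
```

To use it with expandUrl, the checked IOException would have to be wrapped inside the lambda, and the cache could be filled for all rows first so that the database is written to in batches afterwards rather than once per row.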

Related

WebDriver getCurrentUrl() returning malformed URI

I'm involved in writing a (Java/Groovy) browser-automation app with Selenium 2 and FireFox driver.
Currently there is an issue with some URLs we find in the wild that are apparently using bad URI syntax. (specifically curly braces ({}), |'s and ^'s).
String url = driver.getCurrentUrl(); // http://example.com/foo?key=val|with^bad{char}acters
When trying to construct a java.net.URI from the string returned by driver.getCurrentUrl() a URISyntaxException is thrown.
new URI(url); // java.net.URISyntaxException: Illegal character in query at index ...
Encoding the whole url before constructing the URI will not work (as I understand it).
The whole URL is encoded, and it doesn't preserve any pieces of it that I can parse in any normal fashion. For example, with this URI-safe string, URI can't tell the difference between an & used as the query-string parameter delimiter and %26 (its encoded value) inside the content of a single query-string parameter.
String encoded = URLEncoder.encode(url, "UTF-8") // http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval%7Cwith%5Ebad%7Bchar%7Dacters
URI uri = new URI(encoded)
URLEncodedUtils.parse(uri, "UTF-8") // []
Currently the solution is, before constructing the URI, running the following (groovy) code:
["|", "^", "{", "}"].each {
url = url.replace(it, URLEncoder.encode(it, "UTF-8"))
}
But this seems dirty and wrong.
I guess my question is multi-part:
Why does FirefoxDriver return a String rather than a URI?
Why is this String malformed?
What is best practice for dealing with this kind of thing?
One option is to percent-encode only the offending characters in the query string parameters, as discussed in the comments; that should work.
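As a rough Java illustration of partial encoding, the sketch below percent-encodes only the four characters named in the question and leaves delimiters such as ?, & and = alone. The class and method names are hypothetical, and the character list is an assumption taken from the question; other illegal characters (spaces, for instance) would have to be added to it.

```java
import java.net.URI;

/** Percent-encodes a fixed set of characters that java.net.URI rejects. */
final class LenientUriFixer {
    // Assumption: only the characters mentioned in the question need escaping.
    private static final String ILLEGAL = "|^{}";

    static String escapeIllegal(String url) {
        StringBuilder sb = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (ILLEGAL.indexOf(c) >= 0) {
                // All four characters are ASCII, so one %XX escape is enough.
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static URI toUri(String url) {
        return URI.create(escapeIllegal(url));
    }
}
```

With the URL from the question, toUri(...).getQuery() should give back the decoded query (key=val|with^bad{char}acters), while getRawQuery() keeps the escaped form.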
Another way is to use the galimatias library:
import io.mola.galimatias.GalimatiasParseException;
import io.mola.galimatias.URL;
import java.net.URI;
import java.net.URISyntaxException;
public class Main {
public static void main(String[] args) throws URISyntaxException {
String example1 = "http://example.com/foo?key=val-with-a-|-in-it";
String example2 = "http://example.com?foo={bar}";
try {
URL url1 = URL.parse(example1);
URI uri1 = url1.toJavaURI();
System.out.println(url1);
System.out.println(uri1);
URL url2 = URL.parse(example2);
URI uri2 = url2.toJavaURI();
System.out.println(url2);
System.out.println(uri2);
} catch (GalimatiasParseException ex) {
// Do something with non-recoverable parsing error
}
}
}
Output:
http://example.com/foo?key=val-with-a-|-in-it
http://example.com/foo?key=val-with-a-%7C-in-it
http://example.com/?foo={bar}
http://example.com/?foo=%7Bbar%7D
driver.getCurrentUrl() gets a string from the browser; before turning it into a URL, you should URL-encode the string.
See Java URL encoding of query string parameters for an example of this in Java.
Will this work for you?
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URL;
import java.net.URLEncoder;
public class Sample {
public static void main(String[] args) throws UnsupportedEncodingException {
String urlInString="http://example.com/foo?key=val-with-a-{-in-it";
String encodedURL=URLEncoder.encode(urlInString, "UTF-8");
URI encodedURI=URI.create(encodedURL);
System.out.println("Actual URL:"+urlInString);
System.out.println("Encoded URL:"+encodedURL);
System.out.println("Encoded URI:"+encodedURI);
}
}
Output:
Actual URL:http://example.com/foo?key=val-with-a-{-in-it
Encoded URL:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Encoded URI:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Another solution is to split the fetched URL into its parts and then use them to build the URI you want. This ensures that you get all the features of the URL class.
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
public class Sample {
public static void main(String[] args) throws UnsupportedEncodingException,
URISyntaxException, MalformedURLException {
String uri1 = "http://example.com/foo?key=val-with-a-{-in-it";
String scheme=uri1.split(":")[0];
String authority=uri1.split("//")[1].split("/")[0];
String path=uri1.split("//")[1].split("/")[1].split("\\?")[0];
String query=uri1.split("\\?")[1];
URI uri = null;
uri = new URI(scheme, authority, "/"+path, query,null);
URL url = null;
url = uri.toURL();
System.out.println("URI's Query:"+uri.getQuery());
System.out.println("URL's Query:"+url.getQuery());
}
}
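The same idea can probably be written without manual splitting, because java.net.URL parses leniently and the multi-argument URI constructor quotes illegal characters in each component. This is only a sketch with a hypothetical class name (UrlToUri); it assumes the input is not already percent-encoded, since that constructor always quotes a literal % again.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

/** Builds a URI from a string that java.net.URI would reject, using URL to split it. */
final class UrlToUri {
    static URI toUri(String raw) throws MalformedURLException, URISyntaxException {
        // URL does not validate characters, so "{" and friends are accepted here.
        URL url = new URL(raw);
        // The five-argument constructor quotes illegal characters per component.
        return new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
                url.getQuery(), url.getRef());
    }
}
```

Unlike the split-based version, this also copes with URLs that have no path or no query string, because URL returns an empty path or a null query in those cases.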

Cannot access azure blobs through rest api

I was able to create a container in a storage account and upload a blob to it through client-side code.
I was able to make the blob available for public access as well, so that when I hit the following URL from my browser, I can see the image I uploaded.
https://MYACCOUNT.blob.core.windows.net/MYCONTAINER/MYBLOB
I now have a requirement to use the REST service to retrieve the contents of the blob. I wrote the following Java code.
package main;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
public class GetBlob {
public static void main(String[] args) {
String url="https://MYACCOUNT.blob.core.windows.net/MYCONTAINER/MYBLOB";
try {
System.out.println("RUNNIGN");
HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
connection.setRequestProperty("Authorization", createQuery());
connection.setRequestProperty("x-ms-version", "2009-09-19");
InputStream response = connection.getInputStream();
System.out.println("SUCCESSS");
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(response));
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
} catch (IOException e) {
e.printStackTrace();
}
}
public static String createQuery()
{
String dateFormat="EEE, dd MMM yyyy hh:mm:ss zzz";
SimpleDateFormat dateFormatGmt = new SimpleDateFormat(dateFormat);
dateFormatGmt.setTimeZone(TimeZone.getTimeZone("UTC"));
String date=dateFormatGmt.format(new Date());
String Signature="GET\n\n\n\n\n\n\n\n\n\n\n\n" +
"x-ms-date:" +date+
"\nx-ms-version:2009-09-19" ;
// I do not know CANOCALIZED RESOURCE
//WHAT ARE THEY??
// +"\n/myaccount/myaccount/mycontainer\ncomp:metadata\nrestype:container\ntimeout:20";
String SharedKey="SharedKey";
String AccountName="MYACCOUNT";
String encryptedSignature=(encrypt(Signature));
String auth=""+SharedKey+" "+AccountName+":"+encryptedSignature;
return auth;
}
public static String encrypt(String clearTextPassword) {
try {
MessageDigest md = MessageDigest.getInstance("SHA-256");
md.update(clearTextPassword.getBytes());
return new sun.misc.BASE64Encoder().encode(md.digest());
} catch (NoSuchAlgorithmException e) {
}
return "";
}
}
However, I get the following error when I run this main class:
RUNNIGN
java.io.IOException: Server returned HTTP response code: 403 for URL: https://klabs.blob.core.windows.net/delete/Blob_1
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at main.MainClass.main(MainClass.java:61)
Question1: Why this error, did I miss any header/parameter?
Question2: Do I need to add headers in the first place, because I am able to hit the request from the browser without any issues.
Question3: Can it be an SSL issue? What is the concept of certificates, and how and where to add them? Do I really need them? Will I need them later, when I do bigger operations on my blob storage(I want to manage a thousand blobs)?
Will be thankful for any reference as well, within Azure and otherwise that could help me understand better.
:D
AFTER A FEW DAYS
Below is my new code for Put Blob in Azure. I believe I have resolved all header and parameter issues and my request is perfect. However, I am still getting the same 403. I do not know what the issue is. Azure is proving to be pretty difficult.
A thing to note is that the container's name is delete, and I want to create a blob inside it, say newBlob. I tried to initialize urlPath in the code below with both "delete" and "delete/newBlob".
Neither works.
package main;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URISyntaxException;
import java.net.URL;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import com.sun.org.apache.xml.internal.security.exceptions.Base64DecodingException;
import com.sun.org.apache.xml.internal.security.utils.Base64;
public class Internet {
static String key="password";
static String account="klabs";
private static Base64 base64 ;
private static String createAuthorizationHeader(String canonicalizedString) throws InvalidKeyException, Base64DecodingException, NoSuchAlgorithmException, IllegalStateException, UnsupportedEncodingException {
Mac mac = Mac.getInstance("HmacSHA256");
mac.init(new SecretKeySpec(base64.decode(key), "HmacSHA256"));
String authKey = new String(base64.encode(mac.doFinal(canonicalizedString.getBytes("UTF-8"))));
String authStr = "SharedKey " + account + ":" + authKey;
return authStr;
}
public static void main(String[] args) {
System.out.println("INTERNET");
String key="password";
String account="klabs";
long blobLength="Dipanshu Verma wrote this".getBytes().length;
File f = new File("C:\\Users\\Dipanshu\\Desktop\\abc.txt");
String requestMethod = "PUT";
String urlPath = "delete";
String storageServiceVersion = "2009-09-19";
SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:sss");
fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
String date = fmt.format(Calendar.getInstance().getTime()) + " UTC";
String blobType = "BlockBlob";
String canonicalizedHeaders = "x-ms-blob-type:"+blobType+"\nx-ms-date:"+date+"\nx-ms-version:"+storageServiceVersion;
String canonicalizedResource = "/"+account+"/"+urlPath;
String stringToSign = requestMethod+"\n\n\n"+blobLength+"\n\n\n\n\n\n\n\n\n"+canonicalizedHeaders+"\n"+canonicalizedResource;
try {
String authorizationHeader = createAuthorizationHeader(stringToSign);
URL myUrl = new URL("https://klabs.blob.core.windows.net/" + urlPath);
HttpURLConnection connection=(HttpURLConnection)myUrl.openConnection();
connection.setRequestProperty("x-ms-blob-type", blobType);
connection.setRequestProperty("Content-Length", String.valueOf(blobLength));
connection.setRequestProperty("x-ms-date", date);
connection.setRequestProperty("x-ms-version", storageServiceVersion);
connection.setRequestProperty("Authorization", authorizationHeader);
connection.setDoOutput(true);
connection.setRequestMethod("POST");
System.out.println(String.valueOf(blobLength));
System.out.println(date);
System.out.println(storageServiceVersion);
System.out.println(stringToSign);
System.out.println(authorizationHeader);
System.out.println(connection.getDoOutput());
DataOutputStream outStream = new DataOutputStream(connection.getOutputStream());
// Send request
outStream.writeBytes("Dipanshu Verma wrote this");
outStream.flush();
outStream.close();
DataInputStream inStream = new DataInputStream(connection.getInputStream());
System.out.println("BULLA");
String buffer;
while((buffer = inStream.readLine()) != null) {
System.out.println(buffer);
}
// Close I/O streams
inStream.close();
outStream.close();
} catch (InvalidKeyException | Base64DecodingException | NoSuchAlgorithmException | IllegalStateException | UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I know only a proper code reviewer might be able to help me, please do it if you can.
Thanks
Question1: Why this error, did I miss any header/parameter?
Most likely you're getting this error because of an incorrect signature. Please refer to the MSDN documentation for creating a correct signature: http://msdn.microsoft.com/en-us/library/azure/dd179428.aspx. Unless your signature is correct, you will not be able to perform operations using the REST API.
Question2: Do I need to add headers in the first place, because I am able to hit the request from the browser without any issues.
In your current scenario, because you can access the blob directly (which in turn means the container in which the blob exists has a Public or Blob ACL), you don't really need to use the REST API. You can simply make an HTTP request using Java and read the response stream, which will have the blob contents. You would need to go down this route if the container ACL is Private, because in that case your requests need to be authenticated, and the code above creates an authenticated request.
Question3: Can it be an SSL issue? What is the concept of certificates, and how and where to add them? Do I really need them? Will I need them later, when I do bigger operations on my blob storage(I want to manage a thousand blobs)?
No, it is not an SSL issue. It's an issue with an incorrect signature.
Finally found the mistake!!
In the code above, I was using the String "password" as the key for my HMAC-SHA256 signature:
base64.decode(key)
It should have been the access key associated with my Azure storage account.
Silly one!! Took me 2 weeks to find.
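For reference, the signing step can be sketched with JDK classes only (java.util.Base64 and javax.crypto.Mac) instead of the internal com.sun Base64 class. SharedKeySigner and its parameter names are hypothetical, and the string-to-sign is assumed to be built elsewhere according to the Azure documentation linked above; the sketch only shows that the account key is Base64-decoded before it is used as the HMAC key.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Computes an HMAC-SHA256 signature and formats it as a SharedKey Authorization value. */
final class SharedKeySigner {
    static String authorizationHeader(String account, String base64AccountKey, String stringToSign)
            throws GeneralSecurityException {
        // Assumption: base64AccountKey is the Base64 access key of the storage account,
        // not a placeholder such as "password".
        byte[] key = Base64.getDecoder().decode(base64AccountKey);
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] signature = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
        return "SharedKey " + account + ":" + Base64.getEncoder().encodeToString(signature);
    }
}
```

Note that a placeholder like "password" happens to be valid Base64, so decoding it does not fail; the request is simply signed with the wrong key, which shows up as a 403 rather than as an exception.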

Multiple Response Headers with the same name in Java

In Java, is it possible to view multiple response headers on a HttpURLConnection if they have the same name?
In the Oracle documentation for getHeaderField, it states:
If called on a connection that sets the same header multiple times
with possibly different values, only the last value is returned.
My question is, how do I view all the different values for a header that is set multiple times?
Use getHeaderFields
List<String> values = conn.getHeaderFields().get("X-Header-Of-Interest");
Complete example
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
public class UrlConnectionTest {
public static void main (String[] args) throws IOException {
URL url = new URL("http://localhost:8888/");
URLConnection conn = url.openConnection();
conn.getContent(); // Force request
System.out.println(conn.getHeaderFields().get("X-Funky-Header"));
}
}
On Linux you can create a simple single-request server with netcat for testing
$ echo -e 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nX-Funky-Header: value1\r\nX-Funky-Header: value2\r\n\r\nContent' | nc -l 8888 &
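One caveat worth hedging: HTTP header names are case-insensitive, but the Map returned by getHeaderFields() is looked up by exact key, and the status line is typically stored under a null key. A small helper along these lines (HeaderValues and valuesOf are hypothetical names) collects all values regardless of how the server capitalized the name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Collects every value of a header from a getHeaderFields()-style map, ignoring case. */
final class HeaderValues {
    static List<String> valuesOf(Map<String, List<String>> headers, String name) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey(); // may be null (status line entry)
            if (key != null && key.equalsIgnoreCase(name)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }
}
```

It would be called as HeaderValues.valuesOf(conn.getHeaderFields(), "X-Funky-Header"); the order of the returned values depends on the connection implementation.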

how to differentiate xml from html links in java

I have a list of links to HTML and XML pages. How can I extract the XML links from the list in Java?
Thanks
You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic content site could return either HTML or XML. Or it could return something else entirely, such as a JPG file.
There are a lot of simple heuristics you can use for detecting HTML vs. XML, simply by looking at the beginning of the file. For example, you could look for the <!DOCTYPE ...> declaration, check for the <?xml ...?> directive, and check to see if the file contains a root <html> tag. Of course, these should all be case-insensitive checks.
You can also try to identify the type of file based on its MIME type (for example, text/html or text/xml). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party MimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.
package com.example.mimetype;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;
public class MimeUtils {
// After calling this method, you can retrieve a list of URLs for each mimetype.
public static Map<String, List<String>> sortLinksByMimeType(List<String> links) {
Map<String, List<String>> mapMimeTypesToLinks = new HashMap<String, List<String>>();
for (String url : links) {
try {
String mimetype = getMimeType(url);
System.out.println(url + " has mimetype " + mimetype);
// If this mimetype hasn't already been initialized, initialize it.
if (! mapMimeTypesToLinks.containsKey(mimetype)) {
mapMimeTypesToLinks.put(mimetype, new ArrayList<String>());
}
List<String> lst = mapMimeTypesToLinks.get(mimetype);
lst.add(url);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return mapMimeTypesToLinks;
}
public static String getMimeType(String url) throws MalformedURLException, IOException, MagicParseException, MagicMatchNotFoundException, MagicException {
// first attempt at determining MIME type--returned null for all URLs that I tried
// FileNameMap filenameMap = URLConnection.getFileNameMap();
// return filenameMap.getContentTypeFor(url);
// second attempt at determining MIME type--worked better, but still returned null for many URLs
// URLConnection c = new URL(url).openConnection();
// InputStream in = c.getInputStream();
// String mimetype = URLConnection.guessContentTypeFromStream(in);
// in.close();
// return mimetype;
URLConnection c = new URL(url).openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
byte[] content = new byte[100];
in.read(content);
in.close();
return Magic.getMagicMatch(content, false).getMimeType();
}
public static void main(String[] args) {
List<String> links = new ArrayList<String>();
links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
links.add("http://stackoverflow.com");
links.add("http://stackoverflow.com/feeds");
links.add("http://amazon.com");
links.add("http://google.com");
sortLinksByMimeType(links);
}
}
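The content-sniffing heuristics mentioned above (the XML declaration, the DOCTYPE and a root html tag, all checked case-insensitively) could be sketched roughly as follows. MarkupSniffer and Kind are hypothetical names, and the input is assumed to be the first few hundred characters of the response already decoded to a String; treating XHTML (an XML declaration followed by an html root) as HTML is a design choice of this sketch, not a rule.

```java
import java.util.Locale;

/** Guesses whether the beginning of a document looks like XML or HTML. */
final class MarkupSniffer {
    enum Kind { XML, HTML, UNKNOWN }

    static Kind sniff(String head) {
        // Drop a byte-order mark and leading whitespace, then compare case-insensitively.
        String s = head.replace("\uFEFF", "").trim().toLowerCase(Locale.ROOT);
        if (s.startsWith("<!doctype html") || s.startsWith("<html")) {
            return Kind.HTML;
        }
        if (s.startsWith("<?xml")) {
            // XHTML also starts with an XML declaration; an html root tips it to HTML.
            return s.contains("<html") ? Kind.HTML : Kind.XML;
        }
        if (s.contains("<html")) {
            return Kind.HTML;
        }
        return Kind.UNKNOWN;
    }
}
```

A result of UNKNOWN is the point at which falling back to the Content-Type header or a library such as jMimeMagic would make sense.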
I'm not certain if your links are some sort of Link object, but as long as you can access the value as a string this should work I think.
List<String> xmlLinks = new ArrayList<String>();
for (String link : list) {
if (link.contains(".xml")) { // contains() already covers the endsWith(".xml") case
xmlLinks.add(link);
}
}

Rails action not responding to Java POST

Really simple, or so I thought.
Java Code
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
public class UrlConnectionTest {
private static final String TEST_URL = "http://localhost:3000/test/hitme";
public static void main(String[] args) throws IOException {
URLConnection urlCon = null;
URL url = null;
OutputStreamWriter osw = null;
try {
url = new URL(TEST_URL);
urlCon = url.openConnection();
urlCon.setDoOutput(true);
urlCon.setRequestProperty("Content-Type", "text/plain");
osw = new OutputStreamWriter(urlCon.getOutputStream());
osw.write("HELLO WORLD");
} catch (Exception e) {
e.printStackTrace();
} finally {
if (osw != null) {
osw.close();
}
}
}
}
TestController#hitme
def hitme
puts "SOMEONE IS HITTING ME!" * 100
puts request.env.inspect
end
When I run the Java code, I see nothing in my Rails Server Console. However, when I hit the URL in my browser, I get output as specified in TestController#hitme. I thought it would be simple, but haven't had any luck. Any ideas?
Thanks in advance!
You're probably getting an exception, which you aren't seeing, because you're swallowing it. At least print the exception in the catch block.
Even if this isn't the problem, you're going to chase your tail a lot if you make a habit of swallowing errors.
I don't think you're actually sending any data until you call
urlCon.getInputStream();
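A minimal sketch of that point, assuming the default buffered mode of HttpURLConnection: the body written to the output stream is only sent once the response is requested, for example through getResponseCode() or getInputStream(). PostSketch and post are hypothetical names; the URL would be the TEST_URL from the question.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Sends a plain-text POST and returns the HTTP status code. */
final class PostSketch {
    static int post(String url, String body) throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(url).toURL().openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        try (OutputStream out = con.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Asking for the status code is what actually completes the request.
        int status = con.getResponseCode();
        con.disconnect();
        return status;
    }
}
```

Printing the returned status also helps with debugging, since a 4xx or 5xx from the Rails side would otherwise go unnoticed.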
Is it that your URL in your java code shows the controller name of "test" (test/hitme) but you mention that your controller name is TestController? i.e., the URL in your java code should be changed.
private static final String TEST_URL = "http://localhost:3000/TestController/hitme";
Don't fiddle around with URLConnection yourself, let Resty handle it.
Here's the code you would need to write (I assume you are getting text back):
import static us.monoid.web.Resty.*;
import us.monoid.web.Resty;
...
new Resty().text(TEST_URL, content("HELLO WORLD")).toString();
