I'm involved in writing a (Java/Groovy) browser-automation app with Selenium 2 and the Firefox driver.
Currently there is an issue with some URLs we find in the wild that apparently use bad URI syntax (specifically curly braces ({}), pipes (|) and carets (^)).
String url = driver.getCurrentUrl(); // http://example.com/foo?key=val|with^bad{char}acters
When trying to construct a java.net.URI from the string returned by driver.getCurrentUrl() a URISyntaxException is thrown.
new URI(url); // java.net.URISyntaxException: Illegal character in query at index ...
Encoding the whole URL before constructing the URI will not work (as I understand it).
Once the whole URL is encoded, none of its pieces are preserved in a form I can parse in any normal fashion. For example, given such a URI-safe string, URI can't tell the difference between an & acting as the query-string-parameter delimiter and a %26 (its encoded value) inside the content of a single parameter.
String encoded = URLEncoder.encode(url, "UTF-8") // http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval%7Cwith%5Ebad%7Bchar%7Dacters
URI uri = new URI(encoded)
URLEncodedUtils.parse(uri, "UTF-8") // []
Currently the workaround is to run the following (Groovy) code before constructing the URI:
["|", "^", "{", "}"].each {
url = url.replace(it, URLEncoder.encode(it, "UTF-8"))
}
But this seems dirty and wrong.
I guess my question is multi-part:
Why does FirefoxDriver return a String rather than a URI?
Why is this String malformed?
What is best practice for dealing with this kind of thing?
As discussed in the comments, partially encoding the URL (encoding only the query-string values) should work; a sketch of that idea follows.
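A minimal sketch of that approach, assuming the keys are already URI-safe and the values are not yet percent-encoded (the encodeQueryValues helper is illustrative, not from the original answer):

import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class PartialEncode {

    // Encode only the query-string values; scheme, host, path and the ?/&/= delimiters stay intact.
    // Note: URLEncoder turns spaces into '+', and already-encoded values would be double-encoded.
    static String encodeQueryValues(String url) throws UnsupportedEncodingException {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        StringBuilder sb = new StringBuilder(url.substring(0, q + 1));
        String[] pairs = url.substring(q + 1).split("&");
        for (int i = 0; i < pairs.length; i++) {
            if (i > 0) {
                sb.append('&');
            }
            int eq = pairs[i].indexOf('=');
            if (eq < 0) {
                sb.append(URLEncoder.encode(pairs[i], "UTF-8"));
            } else {
                sb.append(pairs[i], 0, eq + 1);
                sb.append(URLEncoder.encode(pairs[i].substring(eq + 1), "UTF-8"));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException, URISyntaxException {
        String url = "http://example.com/foo?key=val|with^bad{char}acters";
        URI uri = new URI(encodeQueryValues(url));
        System.out.println(uri); // http://example.com/foo?key=val%7Cwith%5Ebad%7Bchar%7Dacters
    }
}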
Another way is to use the galimatias library:
import io.mola.galimatias.GalimatiasParseException;
import io.mola.galimatias.URL;
import java.net.URI;
import java.net.URISyntaxException;
public class Main {
    public static void main(String[] args) throws URISyntaxException {
        String example1 = "http://example.com/foo?key=val-with-a-|-in-it";
        String example2 = "http://example.com?foo={bar}";
        try {
            URL url1 = URL.parse(example1);
            URI uri1 = url1.toJavaURI();
            System.out.println(url1);
            System.out.println(uri1);

            URL url2 = URL.parse(example2);
            URI uri2 = url2.toJavaURI();
            System.out.println(url2);
            System.out.println(uri2);
        } catch (GalimatiasParseException ex) {
            // Do something with the non-recoverable parsing error
        }
    }
}
Output:
http://example.com/foo?key=val-with-a-|-in-it
http://example.com/foo?key=val-with-a-%7C-in-it
http://example.com/?foo={bar}
http://example.com/?foo=%7Bbar%7D
driver.getCurrentUrl() gets a string from the browser; before making it into a URL, you should URL-encode the string.
See Java URL encoding of query string parameters for an example of this in Java.
Will this work for you?
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

public class Sample {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String urlInString = "http://example.com/foo?key=val-with-a-{-in-it";
        String encodedURL = URLEncoder.encode(urlInString, "UTF-8");
        URI encodedURI = URI.create(encodedURL);
        System.out.println("Actual URL:" + urlInString);
        System.out.println("Encoded URL:" + encodedURL);
        System.out.println("Encoded URI:" + encodedURI);
    }
}
Output:
Actual URL:http://example.com/foo?key=val-with-a-{-in-it
Encoded URL:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Encoded URI:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Another solution is to split the fetched URL into its components and then use them to create the URL you want. This ensures that you get all the features of the URL class.
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
public class Sample {
    public static void main(String[] args) throws UnsupportedEncodingException,
            URISyntaxException, MalformedURLException {
        String uri1 = "http://example.com/foo?key=val-with-a-{-in-it";
        String scheme = uri1.split(":")[0];
        String authority = uri1.split("//")[1].split("/")[0];
        String path = uri1.split("//")[1].split("/")[1].split("\\?")[0];
        String query = uri1.split("\\?")[1];

        URI uri = new URI(scheme, authority, "/" + path, query, null);
        URL url = uri.toURL();
        System.out.println("URI's Query:" + uri.getQuery());
        System.out.println("URL's Query:" + url.getQuery());
    }
}
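As a related sketch (my own variation, not part of the answer above), you can also let java.net.URL do the splitting and hand its components to the multi-argument URI constructor, which percent-encodes illegal characters in each component:

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlToUriSketch {
    public static void main(String[] args) throws MalformedURLException, URISyntaxException {
        // java.net.URL tolerates the raw characters; the multi-argument URI constructor escapes them.
        URL raw = new URL("http://example.com/foo?key=val|with^bad{char}acters");
        URI uri = new URI(raw.getProtocol(), raw.getUserInfo(), raw.getHost(), raw.getPort(),
                raw.getPath(), raw.getQuery(), raw.getRef());
        System.out.println(uri); // http://example.com/foo?key=val%7Cwith%5Ebad%7Bchar%7Dacters
    }
}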
Related
How to handle URL Encoded Characters like colon (%3A) in JSoup connect function?
What you could basically do is encode the URL before you use it in JSoup.
I believe what you are trying to do here is pass some parameters to the host in the URL itself.
To encode the URL, use the code below:
String url = "https://google.com?q=i wish to search something";
String encodeURL=URLEncoder.encode( url, "UTF8" );
Here's the answer to your comment:
package com.abk;
import java.io.IOException;
import java.net.URLDecoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(URLDecoder.decode("https://siccode.com/en/business-list/sic%3A2211%22", "UTF8")).get();
        String title = doc.title();
        System.out.println("title is: " + title);
    }
}
This should work like a charm :)
Use
String decodedString1 = URLDecoder.decode("siccode.com/en/business-list/sic%3A2211", "UTF-8");
As the URL is encoded, you need to decode it before using it.
A sample for JS:
var str = decodeURIComponent("siccode.com/en/business-list/sic%3A2211");
console.log(str);
I have collected about 2 * 10^5 rows of text and have to expand the shortened URLs in it. The problem is that some expanded URLs redirect to yet another shortened URL, which means the original URL has been shortened two or more times. How do I handle that efficiently? If I simply wrap the expansion in a while loop, it takes too much time.
The problem with the code below is the roughly 2 * 10^5 round trips involving my MySQL DB: I extract the URL from each text, expand it (which makes an HTTP request), and then write the result back into my DB. It is taking too much time.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
public class urlExpander {
    public static void main(String[] args) throws IOException {
        String shortenedUrl = "https://www.youtube.com/watch?v=4oZTPGvG3s8&feature=youtu.be";
        String expandedURL = expandUrl(shortenedUrl);
        System.out.println(shortenedUrl + "-->" + expandedURL);
    }

    public static String expandUrl(String shortenedUrl) throws IOException {
        URL url = new URL(shortenedUrl);
        // Open the connection without a proxy.
        HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        // Stop following browser redirects.
        httpURLConnection.setInstanceFollowRedirects(false);
        // Extract the Location header containing the actual destination URL.
        String expandedURL = httpURLConnection.getHeaderField("Location");
        httpURLConnection.disconnect();
        return expandedURL;
    }
}
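For what it's worth, here is a minimal sketch of the repeated-expansion loop mentioned above, assuming a cap on the number of hops so a redirect cycle cannot spin forever (the expandFully name and maxHops parameter are illustrative, not from the question):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

public class RedirectChainSketch {

    // Follow Location headers until the response is no longer a redirect, or maxHops is reached.
    public static String expandFully(String shortenedUrl, int maxHops) throws IOException {
        String current = shortenedUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection(Proxy.NO_PROXY);
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("HEAD"); // a HEAD request is usually enough to see the Location header
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                break; // no further redirect: current is the final URL
            }
            current = location; // note: some servers send relative Location values; resolving them is omitted here
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(expandFully("https://www.youtube.com/watch?v=4oZTPGvG3s8&feature=youtu.be", 5));
    }
}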
I have a question regarding UTF-8 encoding when sending strings containing special characters using HttpServiceClient (Apache).
I have the small piece of code below, where the method takes a string and sends it via HTTP (the code is not fully complete).
Although the decoded string seems to work without problems, I would like to know whether method.addParameter or httpClient.execute(method) encodes the string again. We have the problem that on the client side the strings appear to be doubly encoded!
e.g. strReq = äöü
import java.io.IOException;
import java.util.concurrent.CancellationException;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.net.URLCodec;
import org.apache.commons.httpclient.methods.PostMethod;

public class Demo {
    public static void Test(String strReq) throws CancellationException, IOException, DecoderException {
        PostMethod method = new PostMethod("www.example.com");

        // Encode the XML document.
        URLCodec codec = new URLCodec();
        String requestEncoded = strReq;
        try {
            requestEncoded = codec.encode(strReq);
        } catch (EncoderException e) {
        }
        System.out.println("encoded req = " + requestEncoded);
        method.addParameter(Constants.Hdr, requestEncoded); // Constants.Hdr is defined elsewhere in the asker's code
        String str2 = codec.decode(requestEncoded);
        System.out.println("str2 = " + str2);
    }
}
I have a list of links containing links to HTML and XML pages. How can I extract the XML links from the list, in Java?
Thanks.
You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic content site could return either HTML or XML. Or it could return something else entirely, such as a JPG file.
There are a lot of simple heuristics you can use for detecting HTML vs. XML, simply by looking at the beginning of the file. For example, you could look for the <!DOCTYPE ...> declaration, check for the <?xml ...?> directive, and check to see if the file contains a root <html> tag. Of course, these should all be case-insensitive checks.
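To make that concrete, here is a rough sketch of such checks, offered only as an illustration (the looksLikeXml name is mine, and real-world documents can defeat any of these heuristics):

public class ContentSniffer {

    // Very rough classification based on the first few hundred characters of a document.
    static boolean looksLikeXml(String head) {
        String h = head.trim().toLowerCase();
        if (h.startsWith("<?xml")) {
            return true;                       // XML declaration
        }
        if (h.contains("<!doctype html") || h.contains("<html")) {
            return false;                      // clearly HTML
        }
        return h.startsWith("<");              // some other markup: treat as XML by default
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXml("<?xml version=\"1.0\"?><feed/>"));      // true
        System.out.println(looksLikeXml("<!DOCTYPE html><html><body/></html>")); // false
    }
}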
You can also try to identify the type of file based on its MIME type (for example, text/html or text/xml). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party MimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.
package com.example.mimetype;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;
public class MimeUtils {

    // After calling this method, you can retrieve a list of URLs for each mimetype.
    public static Map<String, List<String>> sortLinksByMimeType(List<String> links) {
        Map<String, List<String>> mapMimeTypesToLinks = new HashMap<String, List<String>>();
        for (String url : links) {
            try {
                String mimetype = getMimeType(url);
                System.out.println(url + " has mimetype " + mimetype);
                // If this mimetype hasn't already been initialized, initialize it.
                if (!mapMimeTypesToLinks.containsKey(mimetype)) {
                    mapMimeTypesToLinks.put(mimetype, new ArrayList<String>());
                }
                List<String> lst = mapMimeTypesToLinks.get(mimetype);
                lst.add(url);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return mapMimeTypesToLinks;
    }

    public static String getMimeType(String url) throws MalformedURLException, IOException,
            MagicParseException, MagicMatchNotFoundException, MagicException {
        // First attempt at determining the MIME type--returned null for all URLs that I tried:
        // FileNameMap filenameMap = URLConnection.getFileNameMap();
        // return filenameMap.getContentTypeFor(url);

        // Second attempt at determining the MIME type--worked better, but still returned null for many URLs:
        // URLConnection c = new URL(url).openConnection();
        // InputStream in = c.getInputStream();
        // String mimetype = URLConnection.guessContentTypeFromStream(in);
        // in.close();
        // return mimetype;

        URLConnection c = new URL(url).openConnection();
        BufferedInputStream in = new BufferedInputStream(c.getInputStream());
        byte[] content = new byte[100];
        in.read(content);
        in.close();
        return Magic.getMagicMatch(content, false).getMimeType();
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<String>();
        links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
        links.add("http://stackoverflow.com");
        links.add("http://stackoverflow.com/feeds");
        links.add("http://amazon.com");
        links.add("http://google.com");
        sortLinksByMimeType(links);
    }
}
I'm not certain whether your links are some sort of Link object, but as long as you can access the value as a string, I think this should work:
List<String> xmlLinks = new ArrayList<String>();
for (String link : list) {
    if (link.endsWith(".xml") || link.contains(".xml")) {
        xmlLinks.add(link);
    }
}
I'm having trouble building an absolute URL from a relative URL without resorting to String hackery...
Given
http://localhost:8080/myWebApp/someServlet
Inside the method:
public void handleRequest(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
}
What's the most "correct" way of building:
http://localhost:8080/myWebApp/someImage.jpg
(Note, must be absolute, not relative)
Currently, I'm doing it through building the string, but there MUST be a better way.
I've looked at various combinations of new URI / URL, and I end up with
http://localhost:8080/someImage.jpg
Help greatly appreciated
Using java.net.URL
URL baseUrl = new URL("http://www.google.com/someFolder/");
URL url = new URL(baseUrl, "../test.html");
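With that base URL, "../test.html" resolves to http://www.google.com/test.html.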
How about:
String s = request.getScheme() + "://" + request.getServerName() + ":" + request.getServerPort() + request.getContextPath() + "/someImage.jpg";
Looks like you already figured out the hard part, which is what host you are running on. The rest is easy:
String url = host + request.getContextPath() + "/someImage.jpg";
Should give you what you need.
This code will work on Linux; it just combines the path segments. If you want more, the constructor of URI could be helpful.
URL baseUrl = new URL("http://example.com/first");
URL targetUrl = new URL(baseUrl, Paths.get(baseUrl.getPath(), "second", "/third", "//fourth//", "fifth").toString());
If your path contains something that needs to be escaped, use URLEncoder.encode to escape it first.
URL baseUrl = new URL("http://example.com/first");
URL targetUrl = new URL(baseUrl, Paths.get(baseUrl.getPath(), URLEncoder.encode(relativePath, StandardCharsets.UTF_8), URLEncoder.encode(filename, StandardCharsets.UTF_8)).toString());
Example:
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Main {
    public static void main(String[] args) {
        try {
            URL baseUrl = new URL("http://example.com/first");
            Path relativePath = Paths.get(baseUrl.getPath(), "second", "/third", "//fourth//", "fifth");
            URL targetUrl = new URL(baseUrl, relativePath.toString());
            System.out.println(targetUrl.toString());
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }
}
Output:
http://example.com/first/second/third/fourth/fifth
baseUrl.getPath() is very important; don't forget it.
A wrong example:
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Main {
    public static void main(String[] args) {
        try {
            URL baseUrl = new URL("http://example.com/first");
            Path relativePath = Paths.get("second", "/third", "//fourth//", "fifth");
            URL targetUrl = new URL(baseUrl, relativePath.toString());
            System.out.println(targetUrl.toString());
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }
}
Output:
http://example.com/second/third/fourth/fifth
We have lost the /first from the base URL.