How to handle URL-encoded characters in Jsoup - java

How can I handle URL-encoded characters such as the colon (%3A) in Jsoup's connect function?

What you could do is encode the parameter values before you use the URL in Jsoup.
I believe what you are trying to do here is pass some parameters to the host in the URL itself.
To encode a parameter value, use the code below (encoding the whole URL would also escape the scheme, the slashes and the question mark):
String query = "i wish to search something";
String encodedUrl = "https://google.com?q=" + URLEncoder.encode(query, "UTF-8");
Here's the answer to your comment:
package com.abk;
import java.io.IOException;
import java.net.URLDecoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest{
public static void main( String[] args ) throws IOException{
Document doc = Jsoup.connect(URLDecoder.decode("https://siccode.com/en/business-list/sic%3A2211%22","UTF8")).get();
String title = doc.title();
System.out.println("title is: " + title);
}
}
This should work like a charm :)

Use
String decodedString1 = URLDecoder.decode("siccode.com/en/business-list/sic%3A2211", "UTF-8");
Since the string is URL-encoded, you need to decode it before using it.
Sample for JS.
var str = decodeURIComponent("siccode.com/en/business-list/sic%3A2211");
console.log(str);
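To see what decoding does to the URL from the question, here is a minimal sketch that uses only java.net (no Jsoup and no network access). The class and method names are made up for illustration, and the Charset overloads assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentDecodeSketch {

    // Turns percent-escapes such as %3A back into the characters they stand for.
    public static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // Encodes a single component (for example one query value), not a whole URL.
    public static String encodeComponent(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The colon comes back once the string is decoded.
        System.out.println(decode("https://siccode.com/en/business-list/sic%3A2211"));
        // Encoding only the last path piece escapes the colon again.
        System.out.println(encodeComponent("sic:2211"));
    }
}
```

The decoded string is what would then be handed to Jsoup.connect(...), as in the answers above.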

Related

Create a String variable from a URL using JSoup and Regex in Java?

So I am trying to make a program that retrieves the iframe tag from a website, opens the link and downloads the video. Currently, it retrieves the iframe tag, but I can't figure out how to ignore the actual tags. I am pretty sure I can use .split(), but I don't know how to write a regex that pulls only the data from inside the quotes. I also tried using Jsoup's .html(), but it just printed a blank line. Here is what I have (it mostly splits correctly, except that the URL contains "id=...", which causes it to split again):
package com.trentmenard;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Main {
public static void main(String[] args) {
Document website;
try{
website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
System.out.println("Website Found! Title: " + website.title());
Element videoLink = website.select("iframe").first();
System.out.println("Found Video Link: " + videoLink);
videoLink.removeAttr("width");
videoLink.removeAttr("height");
videoLink.removeAttr("scrolling");
videoLink.removeAttr("allowfullscreen");
System.out.println("Modified: " + videoLink);
String link = videoLink.toString();
String[] stringArray = link.split("=");
for(String a : stringArray){
System.out.println(a);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Output: https://i.stack.imgur.com/ZXTiV.png
Thanks in advance!
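Since the Element is already parsed, the attribute value can be read directly with Jsoup's videoLink.attr("src") (or videoLink.absUrl("src") for an absolute URL), which avoids splitting on "=" altogether. If a regex on the tag string is still preferred, a rough sketch could look like the following. The sample tag and the class name are invented for illustration; the real page's markup may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IframeSrcSketch {

    // Captures whatever sits between the double quotes of a src="..." attribute.
    private static final Pattern SRC = Pattern.compile("\\ssrc\\s*=\\s*\"([^\"]*)\"");

    // Returns the src value, or null when the tag has no double-quoted src attribute.
    public static String extractSrc(String tag) {
        Matcher m = SRC.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical iframe markup; the "=" inside the URL is kept intact.
        String tag = "<iframe src=\"https://video.example/embed?id=123\" frameborder=\"0\"></iframe>";
        System.out.println(extractSrc(tag));
    }
}
```

Because the capture group stops at the closing quote rather than at "=", a query string such as "id=..." inside the URL no longer causes an extra split.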

Recursively expanding short URLs in a big text - Java

I have collected text of about 2 * 10^5 rows, and I have to expand the shortened URLs in the text. The problem is that some expanded URLs redirect to yet another shortened URL, which means the original URL was shortened two or more times. How can I handle that efficiently? If I use a while loop, it takes too much time.
The problem with the code below is the roughly 2 * 10^5 context switches from my MySQL DB. I extract the URL from each text and then expand it. This makes an HTTP request, and the result then has to be entered into my DB, which takes too much time.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
public class urlExpander {
public static void main(String[] args) throws IOException {
String shortenedUrl = "https://www.youtube.com/watch?v=4oZTPGvG3s8&feature=youtu.be";
String expandedURL = expandUrl(shortenedUrl);
System.out.println(shortenedUrl + "-->" + expandedURL);
}
public static String expandUrl(String shortenedUrl) throws IOException {
URL url = new URL(shortenedUrl);
// open connection
HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
// stop following browser redirect
httpURLConnection.setInstanceFollowRedirects(false);
// extract location header containing the actual destination URL
String expandedURL = httpURLConnection.getHeaderField("Location");
httpURLConnection.disconnect();
return expandedURL;
}
}
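One way to follow multi-level redirects without repeating work is to loop with a hop limit and cache every URL seen along the way. The sketch below is only an illustration of that idea: the class name, the nextHop hook and the example URLs are assumptions, and the network call is abstracted behind a function so the logic can be run without HTTP access.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class RedirectChainSketch {

    // Final destination of every URL resolved so far, so repeats cost no request.
    private final Map<String, String> cache = new HashMap<>();
    // Hypothetical hook: returns the Location target of a URL, or null if it does not redirect.
    private final Function<String, String> nextHop;
    private final int maxHops;

    public RedirectChainSketch(Function<String, String> nextHop, int maxHops) {
        this.nextHop = nextHop;
        this.maxHops = maxHops;
    }

    public String expandFully(String url) {
        Set<String> seen = new HashSet<>();
        String current = url;
        // seen.add() returns false on a revisit, which ends redirect loops.
        for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
            String known = cache.get(current);
            if (known != null) {
                current = known;
                break;
            }
            String next = nextHop.apply(current);
            if (next == null) {
                break; // no Location header: this is the final URL
            }
            current = next;
        }
        // Every URL visited on the way maps to the same end point.
        for (String visited : seen) {
            cache.put(visited, current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Fake redirect table standing in for real HTTP responses.
        Map<String, String> fake = new HashMap<>();
        fake.put("http://s.example/a", "http://t.example/b");
        fake.put("http://t.example/b", "http://example.com/final");
        RedirectChainSketch expander = new RedirectChainSketch(fake::get, 10);
        System.out.println(expander.expandFully("http://s.example/a"));
    }
}
```

In real use, nextHop could wrap the expandUrl method from the question, converting its IOException into an unchecked exception. Collecting the distinct URLs first and expanding each one only once also keeps the number of requests below the number of rows.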

WebDriver getCurrentUrl() returning malformed URI

I'm involved in writing a (Java/Groovy) browser-automation app with Selenium 2 and the Firefox driver.
Currently there is an issue with some URLs we find in the wild that apparently use bad URI syntax (specifically curly braces ({}), pipes (|) and carets (^)).
String url = driver.getCurrentUrl(); // http://example.com/foo?key=val|with^bad{char}acters
When trying to construct a java.net.URI from the string returned by driver.getCurrentUrl() a URISyntaxException is thrown.
new URI(url); // java.net.URISyntaxException: Illegal character in query at index ...
Encoding the whole url before constructing the URI will not work (as I understand it).
The whole URL gets encoded, and it doesn't preserve any pieces that I can parse in the normal fashion. For example, with this URI-safe string, URI can't tell the difference between a & used as the query-string parameter delimiter and %26 (its encoded value) in the content of a single query-string parameter.
String encoded = URLEncoder.encode(url, "UTF-8") // http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval%7Cwith%5Ebad%7Bchar%7Dacters
URI uri = new URI(encoded)
URLEncodedUtils.parse(uri, "UTF-8") // []
Currently the solution is, before constructing the URI, running the following (groovy) code:
["|", "^", "{", "}"].each {
url = url.replace(it, URLEncoder.encode(it, "UTF-8"))
}
But this seems dirty and wrong.
I guess my question is multi-part:
Why does FirefoxDriver return a String rather than a URI?
Why is this String malformed?
What is best practice for dealing with this kind of thing?
As discussed in the comments, we can partially encode the query string parameters, and that should work.
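A plain-Java version of that partial encoding might look like the sketch below. It escapes only the four characters named in the question and leaves everything else (including & and =) untouched; the class and method names are made up, and the character list is an assumption that may need extending for other URLs.

```java
import java.net.URI;

public class IllegalCharEscaper {

    // Characters from the question that java.net.URI rejects when left unescaped.
    private static final String ILLEGAL = "|^{}";

    // Percent-encodes only the characters listed above; all are ASCII, so one byte each.
    public static String escapeIllegal(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        for (char c : url.toCharArray()) {
            if (ILLEGAL.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "http://example.com/foo?key=val|with^bad{char}acters";
        URI uri = URI.create(escapeIllegal(raw));
        System.out.println(uri);            // escaped form
        System.out.println(uri.getQuery()); // getQuery() decodes the escapes again
    }
}
```

Because the delimiters are never touched, the resulting URI can still be split into parameters in the usual way.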
Another way is to use the galimatias library:
import io.mola.galimatias.GalimatiasParseException;
import io.mola.galimatias.URL;
import java.net.URI;
import java.net.URISyntaxException;
public class Main {
public static void main(String[] args) throws URISyntaxException {
String example1 = "http://example.com/foo?key=val-with-a-|-in-it";
String example2 = "http://example.com?foo={bar}";
try {
URL url1 = URL.parse(example1);
URI uri1 = url1.toJavaURI();
System.out.println(url1);
System.out.println(uri1);
URL url2 = URL.parse(example2);
URI uri2 = url2.toJavaURI();
System.out.println(url2);
System.out.println(uri2);
} catch (GalimatiasParseException ex) {
// Do something with non-recoverable parsing error
}
}
}
Output:
http://example.com/foo?key=val-with-a-|-in-it
http://example.com/foo?key=val-with-a-%7C-in-it
http://example.com/?foo={bar}
http://example.com/?foo=%7Bbar%7D
driver.getCurrentUrl() gets a string from the browser, and before turning it into a URL you should URL-encode the string.
See Java URL encoding of query string parameters for an example of this in Java.
Will this work for you?
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
public class Sample {
public static void main(String[] args) throws UnsupportedEncodingException {
String urlInString="http://example.com/foo?key=val-with-a-{-in-it";
String encodedURL=URLEncoder.encode(urlInString, "UTF-8");
URI encodedURI=URI.create(encodedURL);
System.out.println("Actual URL:"+urlInString);
System.out.println("Encoded URL:"+encodedURL);
System.out.println("Encoded URI:"+encodedURI);
}
}
Output:
Actual URL:http://example.com/foo?key=val-with-a-{-in-it
Encoded URL:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Encoded URI:http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval-with-a-%7B-in-it
Another solution is to split the fetched URL into its parts and then use them to build the URL you want. This ensures that you get all the features of the URL class.
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
public class Sample {
public static void main(String[] args) throws UnsupportedEncodingException,
URISyntaxException, MalformedURLException {
String uri1 = "http://example.com/foo?key=val-with-a-{-in-it";
String scheme=uri1.split(":")[0];
String authority=uri1.split("//")[1].split("/")[0];
String path=uri1.split("//")[1].split("/")[1].split("\\?")[0];
String query=uri1.split("\\?")[1];
URI uri = null;
uri = new URI(scheme, authority, "/"+path, query,null);
URL url = null;
url = uri.toURL();
System.out.println("URI's Query:"+uri.getQuery());
System.out.println("URL's Query:"+url.getQuery());
}
}

String Encoding in java

I have a String, String a = "123+>jo". I want to encode the String so that I can redirect it to a URL. I initially tried URLEncoder, but in URLDecoder the + (plus) is removed while decoding, so I lost my data. What is the right way to encode so that I get the same string back when decoding?
URLEncoder works perfectly. The plus sign is successfully encoded as %2B.
Encoding: Works
Here is the IDEONE project:
http://ideone.com/zMDur
import java.net.URLEncoder;
// ...
public static void main (String[] args) throws java.lang.Exception
{
String str = "123+>jo";
String str2 = "http://1.com/23+>jo";
// The one-argument encode(String) is deprecated; name the charset explicitly.
System.out.println(URLEncoder.encode(str, "UTF-8"));
System.out.println(URLEncoder.encode(str2, "UTF-8"));
}
prints:
123%2B%3Ejo
http%3A%2F%2F1.com%2F23%2B%3Ejo
Encoding + Decoding: Works
The IDEONE project with decoding as well: http://ideone.com/Ypfv4
import java.net.URLEncoder;
import java.net.URLDecoder;
// ...
public static void main (String[] args) throws java.lang.Exception
{
String str = "123+>jo";
String str2 = "http://1.com/23+>jo";
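// The one-argument overloads are deprecated; name the charset explicitly.
System.out.println(URLDecoder.decode(URLEncoder.encode(str, "UTF-8"), "UTF-8"));
System.out.println(URLDecoder.decode(URLEncoder.encode(str2, "UTF-8"), "UTF-8"));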
}
Prints:
123+>jo
http://1.com/23+>jo
So everything works using the java.net.URLEncoder and java.net.URLDecoder.
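One possible explanation for the lost plus sign (an assumption, since the question does not show its code) is that the value gets decoded twice, for example by a framework that already decodes request parameters, or that it is decoded without ever having been encoded. In both cases URLDecoder treats a literal + as an encoded space. A small sketch of that effect, with a made-up class name and the Charset overloads from Java 10 or later:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PlusSignSketch {

    public static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "123+>jo";
        String encoded = encode(raw);

        // Encode once, decode once: the original string comes back.
        System.out.println(decode(encoded));
        // Decoding a value that was never encoded turns + into a space.
        System.out.println(decode(raw));
        // Decoding twice has the same effect on the plus sign.
        System.out.println(decode(decode(encoded)));
    }
}
```

If this is the cause, the fix is to make sure each value is encoded exactly once and decoded exactly once.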

Full Link Extraction using java

My goal is to always get the same string (which is the URI in my case) while reading the href property from a link. Example:
Suppose an HTML file has many links like:
<a href="index.html"> where the base URL is http://www.domainname.com/index.html
<a href="../index.html"> where the base URL is http://www.domainname.com/dit/index.html
How can I get every link correctly, meaning the full link including the domain name?
How can I do that in Java?
The input is HTML; that is, the correct links need to be extracted from a bunch of HTML code.
You can do this using a full-fledged HTML parser like Jsoup. There's a Node#absUrl() which does exactly what you want.
package com.stackoverflow.q3394298;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
public static void main(String... args) throws Exception {
// Jsoup.connect() takes the URL as a String.
String url = "https://stackoverflow.com/questions/3394298/";
Document document = Jsoup.connect(url).get();
Element link = document.select("a.question-hyperlink").first();
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
}
which prints (correctly) the following for the title link of your current question:
/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
Jsoup may have other (as yet undiscovered) advantages for your purpose as well.
Related questions:
What are the pros and cons of the leading HTML parsers in Java?
Update: if you want to select all links in the document, then do as follows (this also needs import org.jsoup.select.Elements):
Elements links = document.select("a");
for (Element link : links) {
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
Use the URL object:
URL url = new URL(URL context, String spec)
Here's an example:
import java.net.*;
public class Test {
public static void main(String[] args) throws Exception {
URL base = new URL("http://www.java.com/dit/index.html");
URL url = new URL(base, "../hello.html");
System.out.println(base);
System.out.println(url);
}
}
It will print:
http://www.java.com/dit/index.html
http://www.java.com/hello.html
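The same resolution can also be done with java.net.URI, whose resolve method applies the relative-reference rules and normalizes ".." segments. The sketch below uses the two cases from the question; the class and method names are made up for illustration.

```java
import java.net.URI;

public class ResolveSketch {

    // Resolves an href found in a page against the URL of that page.
    public static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.domainname.com/index.html", "index.html"));
        System.out.println(resolve("http://www.domainname.com/dit/index.html", "../index.html"));
    }
}
```

Both calls should yield http://www.domainname.com/index.html. An href that is already absolute is returned unchanged.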
