This is a URL that I have in a text file. When my application reads the file in and converts it to a string, it ends up with strange characters being added.
Before:
<p><a href="http://www.w3schools.com" target="iframe_a">W3Schools.com</a></p>
After:
"%3Cp%3E%3Ca%20href%3D%22http%3A//www.w3schools.com%22%20target%3D%22iframe_a%22%3EW3Schools.com%3C/a%3E%3C/p%3E"
I'm aware that this is probably an encoding error; my question is why it happens, and why only to URLs. How can I stop it from happening in Java and iOS?
The string you posted is not a URL; it's HTML that contains a URL.
You can't treat HTML as if it is a valid URL because it's not.
As @Duncan_C said, there's HTML markup around your URL. You can use the Jsoup library to strip it out; once you've done that, the URL will encode properly.
Here's how to do that:
String html = "<p><a href=\"http://www.w3schools.com\" target=\"iframe_a\">W3Schools.com</a></p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String url = link.attr("href");
System.out.println(url);
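With the input above this prints http://www.w3schools.com, the bare URL. The "strange characters" are simply the surrounding HTML markup being percent-encoded along with the URL, which is why only these strings seem affected; once you extract just the href, the problem goes away.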
I am using the Jsoup HTML parser to replace hyperlinks in an HTML document. I want the original case, elements, and line breaks to stay as they are even after updating the document, but Jsoup is lowercasing the tags, changing a few elements, and removing the line breaks. I have also tried ParseSettings, but with parse settings, doc.select("a[href]") does not return the elements. Below is the code I am using.
Can someone point me to the right way to replace hyperlinks in Java while keeping the HTML document as-is?
File input = new File(fileEntry.getPath());
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true));
Document doc = parser.parseInput(input.toString(), "UTF-8");
Elements anchorLinks = doc.select("a[href]");
The documentation is your friend… even when there is no description in that documentation.
Notice the first argument is named html and the second argument is named baseUri.
The first argument needs to be actual HTML content, not a filename. Your code is trying to parse a filename as if it’s HTML.
The second argument needs to be a URI, or an empty string. "UTF-8" is not a valid URI at all, though since you aren’t trying to resolve the links, it may not be a critical mistake.
You probably want the Jsoup.parse method which takes both an InputStream and a customized Parser:
Document doc;
try (InputStream content = new BufferedInputStream(
        new FileInputStream(input))) {
    doc = Jsoup.parse(content, null, "", parser);
}
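If you also need the replacement step itself, here is a minimal end-to-end sketch under a few assumptions: rewrite(...) stands in for your own URL mapping, ParseSettings.preserveCase is just the built-in equivalent of new ParseSettings(true, true), and prettyPrint(false) stops Jsoup from reflowing your line breaks:

Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
Document doc;
try (InputStream content = new BufferedInputStream(
        new FileInputStream(input))) {
    doc = Jsoup.parse(content, "UTF-8", "", parser);
}
doc.outputSettings().prettyPrint(false); // keep the original line breaks and indentation

for (Element link : doc.select("a[href]")) {
    link.attr("href", rewrite(link.attr("href"))); // rewrite(...) is your own placeholder mapping
}

try (Writer out = new OutputStreamWriter(
        new FileOutputStream(input), StandardCharsets.UTF_8)) {
    out.write(doc.outerHtml());
}

If select("a[href]") still comes back empty with case preservation enabled, print doc.html() first to confirm what the parsed tree actually contains; selector behaviour around preserved tag case has varied between Jsoup versions.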
I need to change some tags in an HTML string, so I am using Jsoup. For now I am only converting and reversing (no edits yet), and the round trip already changes things:
First I load a String from a URL => str1
Then I make a Document from str1 to edit with Jsoup:
Document doc = Jsoup.parse(str1);
Then I use doc.html() or doc.toString() to convert doc back to a String => str2
I load str1 and str2 into a WebView with loadDataWithBaseURL
And I see that str2 does not render the same as str1 (for example, a video frame does not fit the screen when str2 is used).
Why, and how do I fix it?
Jsoup changes relative URLs from the input to absolute URLs on the output, using the base href you provide.
org.jsoup.Jsoup.parse(String)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
You likely need to add a <base href> tag to your input content, or call this method instead:
org.jsoup.Jsoup.parse(String html, String baseUri)
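For the WebView case from the question above, a minimal sketch might look like this; baseUrl is an assumed placeholder for the page you loaded str1 from, and prettyPrint(false) is an optional extra to keep the output markup closer to the original:

String baseUrl = "http://example.com/";    // assumption: the URL str1 was loaded from
Document doc = Jsoup.parse(str1, baseUrl); // relative links now resolve against baseUrl
doc.outputSettings().prettyPrint(false);   // optional: don't re-indent the markup
String str2 = doc.html();
// Pass the same base to the WebView so str2 resolves links the way str1 did:
// webView.loadDataWithBaseURL(baseUrl, str2, "text/html", "UTF-8", null);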
I am using Jsoup to parse an HTML page and extract all the text from it. The code below works fine with other URLs, but it gives weird output with this one: http://gumgum-public.s3.amazonaws.com/numbers.html
Document doc = null;
doc = Jsoup.connect("http://gumgum-public.s3.amazonaws.com/numbers.html").maxBodySize(0).get();
String parsedText = doc.body().text();
System.out.println("Output-"+parsedText);
Output-
Output-This is a test page
Output-This is a test page
The HTML page contains a large set of numbers. Please help.
Thanks
Then your solution is the following (a rough sketch follows the list):
Download the page.
Slice it into smaller parts.
Add a tag before and after each part.
Send each part to Jsoup.
Get your content.
Concatenate the parts.
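A rough sketch of that outline, assuming the page is essentially one giant block of text (the 1 MB chunk size and the wrapping <div> are arbitrary choices; slicing real markup mid-tag would need more care):

StringBuilder raw = new StringBuilder();
URL source = new URL("http://gumgum-public.s3.amazonaws.com/numbers.html");
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(source.openStream(), StandardCharsets.UTF_8))) {
    char[] buf = new char[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
        raw.append(buf, 0, n);        // 1. download the page
    }
}

int chunkSize = 1_000_000;            // 2. slice it into parts (size is an arbitrary assumption)
StringBuilder text = new StringBuilder();
for (int i = 0; i < raw.length(); i += chunkSize) {
    String part = raw.substring(i, Math.min(i + chunkSize, raw.length()));
    Document doc = Jsoup.parseBodyFragment("<div>" + part + "</div>"); // 3 + 4. wrap and parse
    text.append(doc.body().text());   // 5. get the content
}
System.out.println(text);             // 6. the concatenated result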
I have a Java web crawler. It is opening this type of URL:
http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=
The final URL is different from this one, which I guess means that a redirect is involved. I can get and parse the returned Document, but is there any way to get the "final", "real" URL too?
That URL is not doing a redirect; it is returning a page which has this meta header:
<meta http-equiv="refresh" content="0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158">
You can see your "final" URL there.
You can parse the document for this tag with, for example, select("meta[http-equiv=refresh]"),
and then parse the content attribute.
Summing up:
Document doc = Jsoup.connect("http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=").get();
Elements select = doc.select("meta[http-equiv=refresh]");
String content = select.first().attr("content");
String prefix = "url=";
String url = content.substring(content.indexOf(prefix) + prefix.length());
System.out.println(url);
This will give you the desired URL:
https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158
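As a side note, if you ever do run into a real HTTP redirect rather than a meta refresh, Jsoup can report the final location itself. A small sketch (startUrl is a placeholder; followRedirects is already on by default and shown only for clarity):

Connection.Response response = Jsoup.connect(startUrl).followRedirects(true).execute();
System.out.println(response.url()); // the URL Jsoup actually ended up on
Document doc = response.parse();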
I hope this helps.
I am using Jsoup to parse a number of things.
I am trying to parse this tag:
<pre>HEllo Worl<pre>
But I just can't get it to work.
How would I parse this using Jsoup?
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Here is what I have so far.
This works fine for me with the latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
    System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup doesn't interpret or execute JS). Perhaps the site expects a real browser instead of a bot (change the user agent, then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure all this out with a real web browser like Firefox or Chrome.
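If you want to rule out the user-agent case quickly, a small sketch along these lines may help (the user-agent string is just an example of something browser-like):

Document document = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")
        .get();
System.out.println(document.html());        // dump exactly what Jsoup received
System.out.println(document.select("pre")); // empty output means there really is no <pre>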