When I run the program below, the statement content=content.replaceAll("&.*?;",""); does not remove "&#____;" sequences from the extracted text.
Every sequence like "&#____;" is shown as a question mark "?". How do I remove "&#____;" from the text?
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
class JavaSoup1
{
public static void main(String []args)throws Exception
{
Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();
String content=doc.text();
content=content.replaceAll("&.*?;","");
System.out.println(content);
}
}
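A possible explanation, offered as an assumption because the page content is not shown: jsoup's text() returns text with character references already decoded, so the string no longer contains a literal "&#____;" for the regex to match, and the "?" marks are non-ASCII characters that the console encoding cannot display. The sketch below uses only the standard library; the class name, method names, and sample string are hypothetical. It shows three options: stripping literal numeric references (useful only on raw HTML), dropping non-ASCII characters, and printing through a UTF-8 stream.

```java
import java.io.PrintStream;

public class EntityCleanup {

    // Removes literal numeric character references such as "&#2361;" or "&#x939;".
    // Only useful on raw HTML; already-decoded text has none left.
    static String stripNumericEntities(String s) {
        return s.replaceAll("&#[xX]?[0-9a-fA-F]+;", "");
    }

    // Removes every character outside the 7-bit ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for the result of doc.text().
        String decoded = "IIT (BHU) \u0935\u093E\u0930\u093E\u0923\u0938\u0940 Varanasi";

        // Option 1: drop the characters the console cannot show.
        System.out.println(stripNonAscii(decoded));

        // Option 2: keep them and write UTF-8 explicitly
        // (the terminal must also be set to UTF-8 for this to display correctly).
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(decoded);
    }
}
```

If the non-ASCII text matters, the second option keeps it intact rather than deleting it.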
Related
So I am trying to make a program that retrieves the IFrame tag from a website, opens the link and downloads the video. Currently, it retrieves the IFrame tag, but I can't figure out how to ignore the actual tags. I am pretty sure I can use the .split() feature, but I don't know how to create a regex code to only pull the data from inside of the quotes. I also tried using JSoup's .html, but it just printed a blank statement. Here is what I have (It mostly split correctly, except in the URL there is "id=..." which causes it to split again):
package com.trentmenard;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Main {
public static void main(String[] args) {
Document website;
try{
website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
System.out.println("Website Found! Title: " + website.title());
Element videoLink = website.select("iframe").first();
System.out.println("Found Video Link: " + videoLink);
videoLink.removeAttr("width");
videoLink.removeAttr("height");
videoLink.removeAttr("scrolling");
videoLink.removeAttr("allowfullscreen");
System.out.println("Modified: " + videoLink);
String link = videoLink.toString();
String[] stringArray = link.split("=");
for(String a : stringArray){
System.out.println(a);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Output: https://i.stack.imgur.com/ZXTiV.png
Thanks in advance!
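Splitting on "=" breaks whenever the URL itself contains "=", as with the "id=..." parameter. With jsoup, videoLink.attr("src") returns the attribute value directly, without any string splitting. If a regex is still preferred, the sketch below captures only the text between the quotes after src=; the class name, method name, and sample tag are hypothetical, and it assumes the attribute value is double-quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IframeSrc {

    // Captures whatever sits between the double quotes that follow src=.
    private static final Pattern SRC = Pattern.compile("\\bsrc\\s*=\\s*\"([^\"]*)\"");

    // Returns the src value, or an empty string when the tag has no src attribute.
    static String extractSrc(String tag) {
        Matcher m = SRC.matcher(tag);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical tag; the "=" inside the URL does not cause a split.
        String tag = "<iframe src=\"https://example.com/embed?id=123\" frameborder=\"0\"></iframe>";
        System.out.println(extractSrc(tag));
    }
}
```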
I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks
on a particular web page. When I use the Stack Overflow link, I am unable to get all the hyperlinks on that page, but when I change it to
javatpoint it works.
Can someone explain why?
Here is the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
import org.jsoup.nodes.Document;
class Repo {
// String html;
public static void main(String s[]) throws IOException {
try {
Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
// Document doc=Jsoup.connect("http://www.stackoverflow.com").get();
System.out.println("doc");
// Elements link=(Elements)doc.select("span[class]");
// Elements link = doc.select("span").first();
// Elements link = (Elements)doc.select("span");
Elements link = (Elements) doc.select("a[href]");
for (Element el : link) {
// System.out.print("-");
// System.out.println(el.attr("class"));
String str = el.attr("href");
System.out.println(str);
}
} catch (Exception e) {
}
}
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
.connect("http://www.stackoverflow.com")
.userAgent("Mozilla/5.0")
.get();
Side note:
You should never catch exceptions and then silently ignore the failure. At least do some logging there; otherwise your programs will be very hard to debug.
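As a minimal sketch of the logging advice, using java.util.logging from the standard library: the fetch method here is a hypothetical stand-in for the jsoup call, and the class and method names are assumptions for illustration only.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedFetch {

    private static final Logger LOG = Logger.getLogger(LoggedFetch.class.getName());

    // Hypothetical stand-in for a network call such as Jsoup.connect(url).get().
    static String fetch(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("HTTP error fetching URL");
        }
        return "ok";
    }

    // Logs the failure with its stack trace instead of swallowing it,
    // then falls back to an empty result.
    static String fetchOrEmpty(boolean fail) {
        try {
            return fetch(fail);
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Could not fetch page", e);
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchOrEmpty(false));
        System.out.println(fetchOrEmpty(true));
    }
}
```

With the log entry in place, a rejected request shows up as a warning rather than as an unexplained empty output.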
I want to remove the script elements when reading from a URL rather than from a file. Please help me.
Document connect = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm");
Elements selects = connect.select("div.middle-col");
System.out.println(selects.removeAttr("script").html());
This is how you can remove the script elements:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TestJsoup {
public static void main(String args[]) throws IOException {
Document doc = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
Elements selects = doc.select("div.middle-col");
for (Element script : selects) {
Elements scripts = script.select("script");
scripts.remove();
}
System.out.println(selects.html());
}
}
Additionally, you can use Jsoup.clean(html, whitelist), which keeps only the tags and attributes the whitelist allows. Recent jsoup versions call the whitelist class Safelist.
I have the following code that is supposed to extract data from an HTML document. I used Eclipse. It gives me two errors, even though the code is copied and pasted from a tutorial on the jsoup site. The errors are on 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class TestClass
{
public static void main(String args[]) throws IOException
{
try{
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
}//try
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}//catch
}
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.
Hi, I am trying to extract the value of the href attribute from a line of HTML. For example:
<link rel="stylesheet" href="style.css" type="text/css">
I want to get "style.css" or:
<a href="target0.html"><img align="center" src="thumbnails/image001.jpg" width="154" height="99">
I want to get "target0.html"
What would be the correct Java code to do this?
public static String getHref(String str)
{
int startIndex = str.indexOf("href=");
if (startIndex < 0)
return "";
return str.substring(startIndex + 6, str.indexOf("\"", startIndex + 6));
}
This method assumes that the HTML is well formed and that the href value is in double quotes, and it only works for the first href in the string, but I'm sure you can extrapolate from here.
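One way to extrapolate, as a sketch under the same assumptions (well-formed HTML, double-quoted values): keep searching from the end of each match until no further href is found. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class HrefScanner {

    // Collects every double-quoted href value in the string, in order of appearance.
    static List<String> getHrefs(String html) {
        List<String> result = new ArrayList<>();
        String marker = "href=\"";
        int from = 0;
        while (true) {
            int start = html.indexOf(marker, from);
            if (start < 0) {
                break; // no more href attributes
            }
            int valueStart = start + marker.length();
            int valueEnd = html.indexOf('"', valueStart);
            if (valueEnd < 0) {
                break; // unterminated value, stop scanning
            }
            result.add(html.substring(valueStart, valueEnd));
            from = valueEnd + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">"
                + "<a href=\"target0.html\"><img src=\"thumbnails/image001.jpg\">";
        System.out.println(getHrefs(html));
    }
}
```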
I realize you asked about using regular expressions, but jsoup makes this so simple and is much less error prone:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HrefExtractor {
// Parsing an in-memory string throws no checked exceptions, so no throws clause is needed.
public static void main(final String[] args) {
final Document document = Jsoup.parse("<a href=\"target0.html\"><img align=\"center\" src=\"thumbnails/image001.jpg\" width=\"154\" height=\"99\">");
final Elements links = document.select("a[href]");
for (final Element element : links) {
System.out.println(element.attr("href"));
}
}
}
I have not tried the following, but it should be something like this:
Pattern.compile("<(?:link|a)\\s+[^>]*href=\"(.*?)\"")
But I'd recommend using one of the available HTML or even XML parsers for this task.
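For completeness, here is a runnable sketch of the regex approach, with the backslash in \s doubled as Java string literals require. The class and method names are hypothetical, and the pattern only handles double-quoted href values on link and a tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegex {

    // Group 1 captures the double-quoted href value of a <link> or <a> tag.
    private static final Pattern HREF =
            Pattern.compile("<(?:link|a)\\s+[^>]*href=\"(.*?)\"");

    // Returns the href value of every matching tag, in order of appearance.
    static List<String> findHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">\n"
                + "<a href=\"target0.html\"><img align=\"center\" src=\"thumbnails/image001.jpg\">";
        System.out.println(findHrefs(html));
    }
}
```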