I'm trying to understand how to use HtmlUnit and jsoup together and have been successful with the basics. However, when I try to store the text from a specific webpage into a string, it only returns a single line rather than the whole text.
I know the code I've written works, because when I print out p.text() it returns the whole text stored within the website.
private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}
When I introduce a string to store the text from p.text(), it only returns a single line rather than the whole text.
private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text = p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
That's basically it. Since jsoup already takes care of stripping the HTML tags, doc.text() gives you the content of the whole page as plain text.
for (Element p : paragraphs)
    text += p.text(); // Append the text.
In your code, you are overwriting the value of the variable text on every iteration. That's why only the last line is returned by the function.
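If you want to keep the paragraphs separated, a minimal variation of the same idea (using the same selector and variables as in the question) is to append to a StringBuilder with a newline between paragraphs:
StringBuilder text = new StringBuilder();
for (Element p : paragraphs) {
    text.append(p.text()).append("\n"); // append each paragraph instead of overwriting
}
return text.toString();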
I think it is a strange idea to use the HtmlUnit result as the starting point for jsoup. There are various drawbacks to your approach (e.g. think about cookies), and of course HtmlUnit has already parsed the HTML, so you are doing the work twice.
I hope this code fulfills your requirements without jsoup.
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}
I have a problem using jsoup. What I am trying to do is fetch a document from a URL that redirects to another URL via a meta refresh, but it is not working. To explain clearly: if I enter the URL http://www.amerisourcebergendrug.com, it automatically redirects to http://www.amerisourcebergendrug.com/abcdrug/ based on the meta refresh URL, but my jsoup call still sticks with http://www.amerisourcebergendrug.com and does not follow the redirect and fetch from http://www.amerisourcebergendrug.com/abcdrug/.
Document doc = Jsoup.connect("http://www.amerisourcebergendrug.com").get();
I have also tried using,
Document doc = Jsoup.connect("http://www.amerisourcebergendrug.com").followRedirects(true).get();
but neither works.
Any workaround for this?
Update: the page may use meta refresh redirect methods.
Update (case-insensitive and fairly fault tolerant): the content attribute is parsed (almost) according to spec, and the first successfully parsed meta refresh is used.
public static void main(String[] args) throws Exception {
    URI uri = URI.create("http://www.amerisourcebergendrug.com");
    Document d = Jsoup.connect(uri.toString()).get();
    for (Element refresh : d.select("html head meta[http-equiv=refresh]")) {
        Matcher m = Pattern.compile("(?si)\\d+;\\s*url=(.+)|\\d+")
                           .matcher(refresh.attr("content"));
        // find the first refresh directive that is valid
        if (m.matches()) {
            if (m.group(1) != null)
                d = Jsoup.connect(uri.resolve(m.group(1)).toString()).get();
            break;
        }
    }
    System.out.println(d.location()); // the URL the document was finally fetched from
}
Outputs correctly:
http://www.amerisourcebergendrug.com/abcdrug/
Old answer:
Are you sure that it isn't working? For me:
System.out.println(Jsoup.connect("http://www.ibm.com").get().baseUri());
.. outputs http://www.ibm.com/us/en/ correctly..
For better error handling and to address the case-sensitivity problem:
try
{
    Document doc = Jsoup.connect("http://www.ibm.com").get();
    Elements meta = doc.select("html head meta");
    if (meta != null)
    {
        String lvHttpEquiv = meta.attr("http-equiv");
        if (lvHttpEquiv != null && lvHttpEquiv.toLowerCase().contains("refresh"))
        {
            String lvContent = meta.attr("content");
            if (lvContent != null)
            {
                String[] lvContentArray = lvContent.split("=");
                if (lvContentArray.length > 1)
                    doc = Jsoup.connect(lvContentArray[1]).get();
            }
        }
    }
    // get page title
    return doc.title();
}
catch (IOException e)
{
    e.printStackTrace();
}
I tried to create content on a new page using doc.newPage(). It worked fine, but when I tried to use the same code inside a for loop it's not working. Below is my code:
String first = "First Page";
String second = "second Page";
ByteArrayInputStream is = new ByteArrayInputStream(first.getBytes());
worker.parseXHtml(pdfWriter, doc, is);
doc.newPage();
is = new ByteArrayInputStream(second.getBytes());
worker.parseXHtml(pdfWriter, doc, is);
This code works fine, but when I put it in a for loop it does not create the content on two pages.
ByteArrayInputStream is = null;
List<String> strList = new ArrayList<String>();
String first = "First Page";
String second = "second Page";
strList.add(first);
strList.add(second);
for (String string : strList)
{
    doc.newPage();
    is = new ByteArrayInputStream(second.getBytes());
    worker.parseXHtml(pdfWriter, doc, is);
}
How to overcome this issue?
Apparently adding text without HTML tags fails when you use XMLWorker. This is a known bug. For now a workaround would be to check whether a string has HTML elements or not and if it doesn't, you could wrap it in a paragraph tag.
I adjusted your sample to make it work:
List<String> strList = new ArrayList<String>();
String first = "<p>First Page</p>";
String second = "<p>second Page</p>";
strList.add(first);
strList.add(second);
for (String string : strList)
{
    doc.newPage();
    // I just prefer StringReaders over ByteArrayInputStreams
    worker.parseXHtml(pdfWriter, doc, new StringReader(string));
}
Also, for future reference: your code threw an exception, and it would be best to include any thrown exceptions in future questions.
I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is to simply extract every headline and its respective text paragraphs.
I am having some trouble understanding how I can take only the specific text of each section; here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        doc = Jsoup.parse(doc.toString());
        Elements titles = doc.select(".mw-headline");
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : doc.select(".mw-headline"))
        {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle = h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("deu merda");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't work out how to get the text of each section so I can print it accordingly. I was thinking maybe the headline's ID would help, but no dice.
Thank you for any help!
Depending on the tag structure of the page (if any), that could be complicated. A better alternative is to iterate over all the elements in document order, detecting headlines as you go. Every time you reach a new headline (or the end of the elements), the previous section is complete: all elements seen up to that point belong to the previous headline (or to the "header" of the article if there is no previous headline). A sketch of this idea is shown below.
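A minimal jsoup sketch of that approach; the container selector (div#mw-content-text) and the assumption that headlines and paragraphs are direct children of it are assumptions about the Wikipedia markup and may need adjusting:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Albert_Einstein").get();
String currentTitle = "Introduction"; // text that appears before the first headline
StringBuilder currentText = new StringBuilder();
for (Element el : doc.select("div#mw-content-text").first().children()) {
    if (!el.select(".mw-headline").isEmpty()) {
        // a new headline: the previous section is complete, so print it
        System.out.println(currentTitle + "\n" + currentText + "\n");
        currentTitle = el.select(".mw-headline").first().text();
        currentText.setLength(0);
    } else if (el.tagName().equals("p")) {
        // paragraphs belong to the current section
        currentText.append(el.text()).append("\n");
    }
}
System.out.println(currentTitle + "\n" + currentText); // the last section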
I am trying to search for all image tags on a specific page. An example page would be www.chapitre.com.
I am using the following code to search for all images on the page:
HtmlPage page = HTMLParser.parseHtml(webResponse, webClient.openWindow(null, "testwindow"));
List<?> imageList = page.getByXPath("//img");
ListIterator li = imageList.listIterator();
while (li.hasNext()) {
    HtmlImage image = (HtmlImage) li.next();
    URL url = new URL(image.getSrcAttribute());
    // For now, only load 1x1 pixels
    if (image.getHeightAttribute().equals("1") && image.getWidthAttribute().equals("1")) {
        System.out.println("This is an image: " + url + " from page " + webRequest.getUrl());
    }
}
This doesn't return all the image tags on the page. For example, an image tag with the attributes src="http://ace-lb.advertising.com/site=703223/mnum=1516/bins=1/rich=0/logs=0/betr=A2099=[+]LP2" width="1" height="1" should be captured, but it's not. Am I doing something wrong here?
Any help is really appreciated.
Cheers!
That's because
URL url = new URL(image.getSrcAttribute());
is throwing an exception :)
Try this code:
public Main() throws Exception {
    WebClient webClient = new WebClient();
    webClient.setJavaScriptEnabled(false);
    HtmlPage page = webClient.getPage("http://www.chapitre.com");
    List<HtmlImage> imageList = (List<HtmlImage>) page.getByXPath("//img");
    for (HtmlImage image : imageList) {
        try {
            new URL(image.getSrcAttribute());
            if (image.getHeightAttribute().equals("1") && image.getWidthAttribute().equals("1")) {
                System.out.println(image.getSrcAttribute());
            }
        } catch (Exception e) {
            System.out.println("You didn't see this coming :)");
        }
    }
}
You can even get those 1x1 pixel images by xpath.
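For example, a minimal sketch (using the same unchecked cast as in the code above) that selects only the 1x1 images directly with an XPath expression:
List<HtmlImage> pixels = (List<HtmlImage>) page.getByXPath("//img[@width='1' and @height='1']");
for (HtmlImage pixel : pixels) {
    System.out.println(pixel.getSrcAttribute());
}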
Hope this helps.
As the title shows, how do I return a list of URLs from <a href> references and write them to a text file? The code below returns the HTML from a website.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Main {
    public static void main(String[] args) {
        try {
            URL my_url = new URL("http://www.placeofjo.blogspot.com/");
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
It sounds like you want to be using an HTML parsing library such as HtmlUnit, rather than getting into the hassle of parsing the HTML yourself. The HtmlUnit code would be as simple as:
final WebClient webClient = new WebClient();
webClient.setJavaScriptEnabled(false);
final HtmlPage page = webClient.getPage("http://www.placeofjo.blogspot.com/");

// Then iterate through the anchors
for (DomElement element : page.getElementsByTagName("a")) {
    String link = ((HtmlAnchor) element).getHrefAttribute();
    System.out.println(link);
}
Gives output of:
http://www.twitter.com/jozefinfin/
http://www.facebook.com/jozefinfin/
http://placeofjo.blogspot.com/2008_08_01_archive.html
... etc etc
http://placeofjo.blogspot.com/2011_02_01_archive.html
http://endlessdance.blogspot.com
http://blogskins.com/me/aaaaaa
http://weheartit.com
You might want to try parsing the HTML with jsoup and collecting all the anchor tags from the page, as in the sketch below.
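A minimal sketch of that idea, assuming the same URL as in the question; absUrl resolves relative links against the page URL:
Document doc = Jsoup.connect("http://www.placeofjo.blogspot.com/").get();
for (Element a : doc.select("a[href]")) {
    System.out.println(a.absUrl("href")); // absolute form of the href attribute
}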
Edit (2)
If you're looking for a robust solution (or might need to extend to parsing more HTML), then check out one of the other answers here. If you just want a quick-and-dirty, one-time solution, you might consider regex.
If I understand you correctly, you want to extract the href values for all <a> tags in the HTML you're fetching.
You can use regular expressions. Something like
String regex = "<a\\s.*?href=['\"](.*?)['\"].*?>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while (m.find())
{
    String urlStr = m.group(1); // the captured href value
}
Edit (1)
Corrected the regex - we want reluctant quantifiers otherwise we'll end up getting everything!