How to scrape a specific URL from multiple URLs in a webpage in Java - java

I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage that contains multiple links (help, click here, etc.). How can I get the specific URL and ignore the random links? From the linked page I only want to get "The SEC adopted changes to the exempt offering framework" and ignore the other links. How do I do that in Java? I was able to extract all the URLs, but I am not sure how to get the specific one. Below is my code:
while (rs.next()) {
    String content = rs.getString("Content");
    doc = Jsoup.parse(content);

    // Email extraction (the separator between local part and domain is @)
    Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
    Matcher matcher = p.matcher(doc.text());
    Set<String> emails = new HashSet<String>();
    while (matcher.find()) {
        emails.add(matcher.group());
    }
    System.out.println(emails);

    // Title extraction
    String title = doc.title();
    System.out.println("Title: " + title);
}

// Link extraction (runs after the loop, so it uses the last parsed document)
Elements links = doc.select("a");
for (Element link : links) {
    String url = link.attr("href");
    System.out.println("\nlink: " + url);
    System.out.println("text: " + link.text());
}

// Image extraction
System.out.println("Getting all the images");
Elements images = doc.getElementsByTag("img");
for (Element src : images) {
    System.out.println("src " + src.attr("abs:src"));
}
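One possible approach (a sketch, not from the original post: it assumes the target link can be identified by its visible anchor text) is to filter the anchors with Jsoup's :contains selector, or to compare link.text() against the phrase you want:

// Sketch: keep only the anchor whose visible text contains the phrase of interest.
// The phrase is taken from the question; adjust it for your actual page.
Elements candidates = doc.select("a:contains(exempt offering framework)");
for (Element link : candidates) {
    System.out.println("Matched URL: " + link.attr("href") + " | text: " + link.text());
}

// Alternative: iterate all anchors and keep the one whose text matches exactly.
for (Element link : doc.select("a[href]")) {
    if (link.text().equalsIgnoreCase("The SEC adopted changes to the exempt offering framework")) {
        System.out.println("Target URL: " + link.attr("href"));
    }
}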

Related

How can I set link inside a text in Android?

So, I am using Jsoup for web scraping. I can scrape the data from the web, but the problem is that I am getting the links and the text separately. I want those links to be set inside my text. I am using a SpannableStringBuilder; there are a lot of links and a lot of text, so I can't figure out how to deal with this, as I am new to Android development.
private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final SpannableStringBuilder builder = new SpannableStringBuilder();
            try {
                Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
                String title = doc.title();
                Elements links = doc.select("a[href]");
                builder.append(title).append("\n");
                for (Element link : links) {
                    final String url = link.attr("href");
                    builder.append("\n")
                            .append("Link: ")
                            .append(url, new URLSpan(url),
                                    Spannable.SPAN_EXCLUSIVE_EXCLUSIVE)
                            .append("\n")
                            .append("Text: ")
                            .append(link.text());
                }
            } catch (IOException e) {
                builder.append("Error : ")
                        .append(e.getMessage()).append("\n");
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    textView.setText(builder.toString());
                    textView.setMovementMethod(LinkMovementMethod.getInstance());
                }
            });
        }
    }).start();
}
I am getting output in this format:
Link : //en.wikipedia.org/
Text : English 5 678 000+ articles
Link : //ja.wikipedia.org/
Text : 日本語 1 112 000+ 記事
Link : //es.wikipedia.org/
Text : Español 1 430 000+ artículos
......
......
I want the output in this format:
Text: English 5 678 000+ articles
with the link //en.wikipedia.org/ attached to that line as a hyperlink, so that I can click the text and go directly to the webpage, like in MS Word.
You are looking to set text values using HTML; see the documentation for Html.fromHtml. Here is some sample code:
String str = "Do you want to search on " + "<a href=\"http://www.google.com\">" +
        "Google" + "</a>" + " or " + "<a href=\"http://www.yahoo.com\">" +
        "Yahoo" + "</a>" + "?";
if (Build.VERSION.SDK_INT >= 24) {
    viewToSet.setText(Html.fromHtml(str, Html.FROM_HTML_MODE_LEGACY));
} else {
    viewToSet.setText(Html.fromHtml(str));
}
With this approach you can set values using HTML, and you can also apply colors, bold, italics, and so on, as long as you use the corresponding HTML tags and attributes.
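A minimal sketch (not part of the original answer) of applying the same idea to the Jsoup loop from the question: build one HTML string from the scraped links, then set it on the TextView on the UI thread. It assumes textView and doc exist as in the question's code.

// Sketch: turn the scraped anchors into clickable links via Html.fromHtml.
StringBuilder html = new StringBuilder();
for (Element link : doc.select("a[href]")) {
    String url = link.absUrl("href"); // resolves protocol-relative links like //en.wikipedia.org/
    html.append("<a href=\"").append(url).append("\">")
            .append(link.text())      // production code should HTML-escape this text
            .append("</a><br>");
}
final Spanned spanned = Build.VERSION.SDK_INT >= 24
        ? Html.fromHtml(html.toString(), Html.FROM_HTML_MODE_LEGACY)
        : Html.fromHtml(html.toString());
runOnUiThread(new Runnable() {
    @Override
    public void run() {
        textView.setText(spanned);
        textView.setMovementMethod(LinkMovementMethod.getInstance());
    }
});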

Get text snippets from google results

I want to extract the snippets from the Google results. I'm using the following code, which parses the Google results page:
Scanner scanner = new Scanner(System.in);
System.out.println("Please enter the search term.");
String searchTerm = scanner.nextLine();
System.out.println("Please enter the number of results. Example: 5 10 20");
int num = scanner.nextInt();
scanner.close();

String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num;
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
Elements results = doc.select("//div//div//span[contains(@class, 'st')]/text()");
for (Element result : results) {
    String linkText = result.text();
    System.out.println("Text::" + linkText); //1000+ ", URL::" + linkHref.substring(6, linkHref.indexOf("&")));
}
It extracts the resulting URL and the caption; the problem is that the snippets are in HTML tags at a lower level, as in the attached image. So how can I extract them?
With an XPath query:
'//em[.="Stack Overflow"]/following-sibling::text()'
or
'//em[text()="Stack Overflow"]/following-sibling::text()'
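Note that Jsoup's select() expects CSS selectors, not XPath, so an XPath expression like the one above would need an XPath-capable library. A Jsoup-only sketch is below; it assumes the snippet markup still uses the span class st that Google historically applied to result snippets (Google's markup changes often, so verify the class names before relying on them):

// Sketch: select Google result snippets with a CSS selector instead of XPath.
// The "span.st" class is an assumption based on Google's historical markup.
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
Elements snippets = doc.select("span.st");
for (Element snippet : snippets) {
    System.out.println("Snippet: " + snippet.text());
}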

JSoup Parse text and links in sequence from html file

I am trying to extract the text and the links from an HTML file. At the moment I can extract both easily using Jsoup, but I can only do it separately.
Here is my code:
try {
    doc = (Document) Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = ((Element) doc).select("td.text");
    for (Element p : paragraphs) {
        // System.out.println(p.text() + "\r\n" + "***********************************************************" + "\r\n");
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + link.text() + ">\r\n" + linkHref + "\r\n");
    }
} catch (IOException e) {
    // The original snippet had no catch block; Jsoup.parse(File, String) throws IOException.
    e.printStackTrace();
}
I have placed a .text class on the outermost td wherever there is text. What I would like to achieve is this: when the program finds a td with the .text class, it checks it for any links and extracts them from that section in order. So you would have:
Text
Link
Text
Link
I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the whole page. Can anyone help?
Try selecting the links from within each td.text element, so each block of text stays grouped with its own links:
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");
for (Element p : paragraphs) {
    System.out.println(p.text());
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
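If the text and the links must come out strictly in document order within each cell, one variant (a sketch, not from the original answer, and it only looks at direct children of each td.text element) is to walk the child nodes and handle text nodes and a elements as they appear:

// Sketch: emit text and links in the order they appear inside each td.text cell.
for (Element p : doc.select("td.text")) {
    for (Node node : p.childNodes()) {
        if (node instanceof TextNode) {
            String text = ((TextNode) node).text().trim();
            if (!text.isEmpty()) {
                System.out.println("Text: " + text);
            }
        } else if (node instanceof Element && ((Element) node).tagName().equals("a")) {
            Element link = (Element) node;
            System.out.println("Link: " + link.text() + " -> " + link.attr("href"));
        }
    }
}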

Extracting atom feeds from URL sets

I have a huge list of URLs, and my task is to feed them to Java code that should print out the Atom contents. Is there an API or library for this, and how can I access them? I tried the code below, but it does not show any output, and I don't know what went wrong.
try {
    // Note: the URL must be a single string literal (it was split across lines in the post)
    URL url = new URL("https://www.google.com/search?hl=en&q=robbery&tbm=blg&output=atom");
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(url));
    System.out.println("Feed Title: " + feed.getTitle());
    for (SyndEntry entry : (List<SyndEntry>) feed.getEntries()) {
        System.out.println("Title: " + entry.getTitle());
        System.out.println("Unique Identifier: " + entry.getUri());
        System.out.println("Updated Date: " + entry.getUpdatedDate());
        for (SyndLinkImpl link : (List<SyndLinkImpl>) entry.getLinks()) {
            System.out.println("Link: " + link.getHref());
        }
        for (SyndContentImpl content : (List<SyndContentImpl>) entry.getContents()) {
            System.out.println("Content: " + content.getValue());
        }
        for (SyndCategoryImpl category : (List<SyndCategoryImpl>) entry.getCategories()) {
            System.out.println("Category: " + category.getName());
        }
    }
} catch (Exception ex) {
    // An empty catch block silently swallows errors; print them so failures are visible.
    ex.printStackTrace();
}
You can use Rome (http://rometools.org) to process Atom feeds.
Every Atom feed has a "feed" root element, so you can read the URL and check whether it contains a "feed" tag.
In Java you can use the built-in XML parser to do that:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(url); // url is the feed URL as a String (use url.toString() for a java.net.URL)
doc.getDocumentElement().normalize();
if (doc.getElementsByTagName("feed").getLength() > 0) {
    // It is an Atom feed; process it (e.g. hand it to Rome)
}
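For reference, a minimal self-contained sketch of reading a feed with Rome. The package names assume a recent Rome release (com.rometools.*); older releases used com.sun.syndication.*, and the feed URL below is only a placeholder:

import com.rometools.rome.feed.synd.SyndEntry;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;

import java.net.URL;

public class AtomReader {
    public static void main(String[] args) throws Exception {
        // Placeholder feed URL; substitute each URL from your list.
        URL url = new URL("https://example.com/feed.atom");
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(url));
        System.out.println("Feed title: " + feed.getTitle());
        for (SyndEntry entry : feed.getEntries()) {
            System.out.println(entry.getTitle() + " -> " + entry.getLink());
        }
    }
}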

Search Function in HTML

How can I search for text in an HTML document and return the first and last index of that word/sentence, ignoring the tags while searching?
Searching: stackoverflow
html: <p class="red">stack<b>overflow</b></p>
This should return indexes 15 and 31, just like when searching within web pages in a browser.
If you want to do that in Java, here is a rough example using Jsoup. Of course, you should work out the details so that the code parses properly for any given HTML.
String html = "<html><head><title>First parse</title></head>"
        + "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";
String search = "stackoverflow";
Document doc = Jsoup.parse(html);
String pPlainText = doc.body().getElementsByTag("p").first().text(); // "stackoverflow"
if (pPlainText.contains(search)) {
    System.out.println("text found in html");
    String pElementString = doc.body().html(); // <p class="red">stack<b>overflow</b></p>
    String firstWord = doc.body().getElementsByTag("p").first().ownText(); // "stack"
    String secondWord = doc.body().getElementsByTag("p").first().children().first().ownText(); // "overflow"
    // Search the text in pElementString
    int start = pElementString.indexOf(firstWord); // 15
    int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
    System.out.println(start + " >> " + end);
} else {
    System.out.println("cannot find searched text");
}
