I want to extract the snippets from the Google results. I'm using the following code, which parses the Google results page:
Scanner scanner = new Scanner(System.in);
System.out.println("Please enter the search term.");
String searchTerm = scanner.nextLine();
System.out.println("Please enter the number of results. Example: 5 10 20");
int num = scanner.nextInt();
scanner.close();
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num;
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
Elements results = doc.select("//div//div//span[contains(@class, 'st')]/text()");
for (Element result : results) {
String linkText = result.text();
System.out.println("Text::" + linkText );//1000+ ", URL::" + linkHref.substring(6, linkHref.indexOf("&")));
}
It extracts the resulting URL and the caption; the problem is that the snippets are inside HTML tags at a lower level, as in the attached image:
So how can I extract them?
With an XPath query:
'//em[.="Stack Overflow"]/following-sibling::text()'
or
'//em[text()="Stack Overflow"]/following-sibling::text()'
I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage where there are multiple links (help, click here, etc.). How can I get the specific URL and ignore the random links? In this link I only want to get "The SEC adopted changes to the exempt offering framework" and ignore the other links. How do I do that in Java? I was able to extract all the URLs, but I am not sure how to get the specific one. Below is my code:
while (rs.next()) {
String Content = rs.getString("Content");
doc = Jsoup.parse(Content);
//email extract
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
emails.add(matcher.group());
}
System.out.println(emails);
//title extract
String title = doc.title();
System.out.println("Title: " + title);
}
Elements links = doc.select("a");
for(Element link: links) {
String url = link.attr("href");
System.out.println("\nlink :"+ url);
System.out.println("text: " + link.text());
}
System.out.println("Getting all the images");
Elements image = doc.getElementsByTag("img");
for(Element src:image) {
System.out.println("src "+ src.attr("abs:src"));
}
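One way to pick out only that link with Jsoup is to match on the anchor's visible text instead of looping over every <a>. A minimal sketch, reusing the doc you already parsed (the phrase below is taken from the question; adjust it to the real anchor text):
// Select only anchors whose visible text contains the target phrase (Jsoup's :contains is case-insensitive)
Elements targetLinks = doc.select("a:contains(exempt offering framework)");
for (Element link : targetLinks) {
    System.out.println("link: " + link.attr("abs:href"));
    System.out.println("text: " + link.text());
}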
I am trying to extract the value of an HTML table element from a website and compare it to a user-input value, but it seems that the nested loop is not being entered when I run the program. It runs with no errors, but I am not getting any output in Eclipse. I'm new to Selenium with Java and still learning.
See my code below:
String inputString = basePrem;
try {
//Print to console the value of Base Prem
WebElement table = driver.findElement(By.xpath(".//td[text()='Base Premium']/following-sibling::*"));
List<WebElement> allrows = table.findElements(By.tagName("tr"));
List<WebElement> allcols = table.findElements(By.tagName("td"));
for (WebElement row: allrows) {
List<WebElement> Cells = row.findElements(By.tagName("td"));
for (WebElement Cell:Cells) {
if (Cell.getText().contains(basePrem)) {
System.out.print("Base Premium = "+ basePrem + " ");
}
else if (!Cell.getText().contains(basePrem))
{
System.out.print("Base Premium = " + basePrem + " ");
break;
}
}
}
}
catch (Exception e) {
errorMessage = "Value discrepancy";
System.out.println(errorMessage + " - " + e.getMessage());
driver.close();
}
Also, inputString is where I put the value I use for comparison (I use a separate Excel file for testing).
Since control never reaches the nested loop, I probably have some logical error?
You can rewrite the code as below and then validate whether your inputString is present in the table; it is not necessary to use nested for loops. (Your loops never run because your XPath selects the cell next to 'Base Premium', not a <table>, so findElements(By.tagName("tr")) inside it returns an empty list.)
Code:
String inputString = basePrem;
WebElement table = driver.findElement(By.xpath(".//table"));
//Extract all the Cell Data Element
List<WebElement> dataElementList=table.findElements(By.xpath(".//td"));
for(WebElement dataElement : dataElementList){
if(dataElement.getText().contains(inputString)){
System.out.print("Base Premium = "+ basePrem + " ");
break;
}
}
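Alternatively, if the value always sits in the cell right after the "Base Premium" label, here is a hedged sketch that reuses the question's own following-sibling idea (the locator is an assumption about the page structure):
// Locate the cell immediately after the "Base Premium" label and compare it directly
WebElement valueCell = driver.findElement(
        By.xpath(".//td[text()='Base Premium']/following-sibling::td[1]"));
if (valueCell.getText().contains(inputString)) {
    System.out.println("Base Premium = " + inputString);
} else {
    System.out.println("Value discrepancy: expected " + inputString + " but found " + valueCell.getText());
}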
I am experimenting with Jsoup, and I cannot get my second go-around with my Scanner to work. It skips directly to my catch statement.
Here is a description of the program:
I take a Google search term as user input (a String). Next, I ask for the number of query results the user wishes to see and read an integer.
I loop through each element that is returned and add it to an ArrayList. The String displayed on the console consists of an index, Link Text, and a hyperlink.
I then want to ask the user which index they would like to enter to open a browser window leading to that link. This is done by concatenating the href string with the Linux terminal command "xdg-open " using the Runtime class.
It works great up until it's time to ask which index will be chosen.
Here is my code:
/**
* Created by christopher on 4/26/16.
*/
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GoogleSearchJava {
static int index;
static String linkHref;
static Scanner input;
public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";
public static void main(String[] args) throws IOException {
//GET INPUT FOR SEARCH TERM
input = new Scanner(System.in);
System.out.print("Search: ");
String searchTerm = input.nextLine();
System.out.print("Enter number of query results: ");
int num = input.nextInt();
String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num;
//NEED TO DEFINE USER AGENT TO PREVENT 403 ERROR.
Document document = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
//OPTION TO DISPLAY HTML FILE IN BROWSER. DON'T KNOW YET.
//System.out.println(doc.html());
//If the Google search results HTML changes <h3 class="r"> to <h3 class="r1">,
//the selector below needs to change accordingly
Elements results = document.select("h3.r > a");
index = 0;
String news = "News";
ArrayList<String> displayResults = new ArrayList<>();
for (Element result : results) {
index++;
linkHref = result.attr("href");
String linkText = result.text();
String pingResult = index + ": " + linkText + ", URL:: " + linkHref.substring(6, linkHref.indexOf("&")) + "\n";
if (pingResult.contains(news)) {
System.out.println("FOUND " + "\"" + linkText + "\"" + "NO HYPERTEXT FOR NEWS QUERY RESULTS AT THIS TIME. SKIPPED INDEX.");
System.out.println();
} else {
displayResults.add(pingResult);
}
}
for(String urlString : displayResults) {
System.out.println(urlString);
}
System.out.println();
goToURL(input, displayResults);
}
public static int goToURL(Scanner input, ArrayList<String> resultList) {
int newIndex = 0;
try {
System.out.print("Enter Index (i.e. 1, 2, etc) you wish to visit, 0 to exit: ");
newIndex = input.nextInt();
input.nextLine();
for (String string : resultList) {
if(string.startsWith(String.valueOf(newIndex))) {
Process process = Runtime.getRuntime().exec("xdg-open " + string.substring(6, string.indexOf("&")));
process.waitFor();
}
}
} catch (Exception e) {
System.out.println("ERROR while parsing URL");
}
return newIndex;
}
}
Here is the output. Notice how it stops after I enter "1". (No, I haven't handled entering "0" yet.)
Search: Oracle
Enter number of query results: 3
1: Oracle | Integrated Cloud Applications and Platform Services, URL:: =http://www.oracle.com/
2: Oracle Corporation - Wikipedia, the free encyclopedia, URL:: =https://en.wikipedia.org/wiki/Oracle_Corporation
3: Oracle on the Forbes America's Best Employers List, URL:: =http://www.forbes.com/companies/oracle/
Enter Index (i.e. 1, 2, etc) you wish to visit, 0 to exit: 1
ERROR while parsing URL
Process finished with exit code 0
"ERROR while parsing URL" suggests that the error comes from:
try {
System.out.print("Enter Index (i.e. 1, 2, etc) you wish to visit, 0 to exit: ");
newIndex = input.nextInt();
input.nextLine();
for (String string : resultList) {
if(string.startsWith(String.valueOf(newIndex))) {
Process process = Runtime.getRuntime().exec("xdg-open " + string.substring(6, string.indexOf("&")));
process.waitFor();
}
}
} catch (Exception e) {
System.out.println("ERROR while parsing URL");
}
I am not working on Linux so I can't test it, but I suspect that your URL shouldn't start with = (you will notice that your console output contains URL:: =..., while your printing statement doesn't have this =, so it is part of the address you are trying to visit).
So in .substring(6, linkHref.indexOf("&")), change the 6 to 7.
Another problem is that the static linkHref will only hold the last result from Google you picked. You should probably create your own class that stores the proper href together with its description, or pass a list of Element objects representing the <a ...>..</a> elements you picked (also, you don't need to check elements in the list based on their "1: ..." format; simply use list.get(index - 1) if you want to map 1 to index 0, 2 to index 1, and so on).
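A rough sketch of that idea (the SearchResult class and field names are made up for illustration; the point is to keep the real href next to its text and look results up by index):
import java.util.ArrayList;
import java.util.List;
// Hypothetical holder for one result: keeps the cleaned href together with its link text
class SearchResult {
    final String text;
    final String href;
    SearchResult(String text, String href) {
        this.text = text;
        this.href = href;
    }
}
class SearchResultDemo {
    public static void main(String[] args) {
        List<SearchResult> results = new ArrayList<>();
        // In the real loop you would add: results.add(new SearchResult(result.text(), cleanedHref));
        results.add(new SearchResult("Oracle", "http://www.oracle.com/"));
        int newIndex = 1;                                // user picks "1"
        SearchResult chosen = results.get(newIndex - 1); // 1 maps to list index 0, 2 to index 1, ...
        System.out.println(chosen.text + " -> " + chosen.href);
    }
}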
One last piece of advice for now: you could make your code more OS-independent with the solution described in How to open the default webbrowser using java, rather than trying to execute xdg-open.
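For example, a minimal sketch of that cross-platform approach using java.awt.Desktop (available on most desktop JVMs; the fallback to xdg-open is kept for setups where Desktop browsing is unsupported):
import java.awt.Desktop;
import java.net.URI;
public class OpenInBrowser {
    // Opens the given URL in the user's default browser, falling back to xdg-open.
    public static void open(String url) {
        try {
            if (Desktop.isDesktopSupported()
                    && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
                Desktop.getDesktop().browse(new URI(url));
            } else {
                Runtime.getRuntime().exec("xdg-open " + url);
            }
        } catch (Exception e) {
            System.out.println("Could not open browser: " + e.getMessage());
        }
    }
}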
I am trying to get the values out of String[] value; into String lastName;, but I get an error: java.lang.ArrayIndexOutOfBoundsException: 2
at arduinojava.OpenFile.openCsv(OpenFile.java:51) (lastName = value[2];). Here is my code, but I am not sure whether it is going wrong at the split(), at declaring the variables, or at getting the data into another variable.
Also, I am calling input.next(); three times to skip the first row, because otherwise the "of study" part of "Field of study" would also be printed out.
The rows I am trying to read are in a .csv file:
University Firstname Lastname Field of study
Karlsruhe Jerone L Software Engineering
Amsterdam Shahin S Software Engineering
Mannheim Saman K Artificial Intelligence
Furtwangen Omid K Technical Computing
Esslingen Cherelle P Technical Computing
Here's my code:
// Declare Variable
JFileChooser fileChooser = new JFileChooser();
StringBuilder sb = new StringBuilder();
// StringBuilder data = new StringBuilder();
String data = "";
int rowCounter = 0;
String delimiter = ";";
String[] value;
String lastName = "";
/**
* Opencsv csv (comma-separated values) reader
*/
public void openCsv() throws Exception {
if (fileChooser.showOpenDialog(null) == JFileChooser.APPROVE_OPTION) {
// Get file
File file = fileChooser.getSelectedFile();
// Create a scanner for the file
Scanner input = new Scanner(file);
// Ignore first row
input.next();
input.next();
input.next();
// Read from input
while (input.hasNext()) {
// Gets whole row
// data.append(rowCounter + " " + input.nextLine() + "\n");
data = input.nextLine();
// Split row data
value = data.split(String.valueOf(delimiter));
lastName = value[2];
rowCounter++;
System.out.println(rowCounter + " " + data + "Lastname: " + lastName);
}
input.close();
} else {
sb.append("No file was selected");
}
}
Lines are separated by spaces, not by a semicolon, as per your sample. Split on one or more spaces instead:
data.split("\\s+");
Change the delimiter as shown below:
String delimiter = "\\s+";
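As a quick sanity check with a sample row from your file, index 2 then points at the last-name column:
// Sample row from the question, split on one or more whitespace characters
String data = "Karlsruhe Jerone L Software Engineering";
String[] value = data.split("\\s+");
String lastName = value[2]; // "L" -- third column, so no ArrayIndexOutOfBoundsException
System.out.println("Lastname: " + lastName);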
EDIT
The CSV file should be in this format: all the values should be enclosed in double quotes, and there should be a valid separator such as a comma, space, semicolon, etc.
"University" "Firstname" "Lastname" "Field of study"
"Karlsruhe" "Jerone" "L" "Software Engineering"
"Amsterdam" "Shahin" "S" "Software Engineering"
Please check whether your file uses ';' as the delimiter; if not, add it and try again. It should work!
Use the OpenCSV library to read CSV files. Here is a detailed example of reading/writing CSV files using Java, by Viral Patel.
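A minimal sketch of that approach, assuming OpenCSV 5.x is on the classpath (the file name students.csv and the ';' separator are placeholders; adjust the separator to whatever your file actually uses):
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import java.io.FileReader;
public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        // Build a reader that splits on ';' and skips the header row
        try (CSVReader reader = new CSVReaderBuilder(new FileReader("students.csv"))
                .withCSVParser(new CSVParserBuilder().withSeparator(';').build())
                .withSkipLines(1)
                .build()) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println("Lastname: " + row[2]);
            }
        }
    }
}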
How can I search text in an HTMLDocument and then return the first and last index of that word/sentence, ignoring tags when searching?
Searching: stackoverflow
html: <p class="red">stack<b>overflow</b></p>
This should return index 15 and 31.
Just like browsers do when searching within web pages.
If you want to do that in Java, here is a rough example using Jsoup. Of course, you should implement the details so that the code can parse properly for any given HTML.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";
String search = "stackoverflow";
Document doc = Jsoup.parse(html);
String pPlainText = doc.body().getElementsByTag("p").first().text(); // will return stackoverflow
if (pPlainText.contains(search)) {
System.out.println("text found in html");
String pElementString = doc.body().html(); // this will return <p class="red">stack<b>overflow</b></p>
String firstWord = doc.body().getElementsByTag("p").first().ownText(); // "stack"
String secondWord = doc.body().getElementsByTag("p").first().children().first().ownText(); // "overflow"
//search the text in pElementString
int start = pElementString.indexOf(firstWord); // 15
int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
System.out.println(start + " >> " + end);
}else{
System.out.println("cannot find searched text");
}