I'm using jsoup to extract some ads from a page and I need to check whether a class exists, but I'm not doing it right.
Here is the html:
I need to check whether an element with the classes .large-4.medium-5.large-text-right.medium-text-right.columns exists and, if so, extract the element inside it, but I'm stuck at checking whether that class exists.
Here is my code:
Elements pageSearchPrice = page2
.select("li[itemtype=https://schema.org/Offer] > div[class=listing-data]");
for(int j=0; j < pageSearchTitle.size(); j++) {
if(pageSearchPrice.get(j).hasClass(".large-4.medium-5.large-text-right.medium-text-right.columns")) {
String price = pageSearchPrice.get(j).select("strong[itemprop=price]").text();
list.get(index1).setPrice(price);
index1++;
}else {
list.get(index1).setPrice("No price");
index1++;
}
}
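A note on the check above: jsoup's Element.hasClass takes a single class name without the leading dot, so passing a whole compound selector such as ".large-4.medium-5..." will not match. One option is to test each class name separately; another is to run select with the compound selector on the element and check that the result is not empty. The sketch below shows only the matching logic on a raw class attribute string, using the standard library. The class name ClassCheck and the method hasAllClasses are hypothetical helpers for illustration, not part of jsoup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ClassCheck {

    // Returns true when every required class name occurs in the
    // whitespace-separated class attribute value.
    static boolean hasAllClasses(String classAttribute, String... required) {
        Set<String> present =
                new HashSet<>(Arrays.asList(classAttribute.trim().split("\\s+")));
        return present.containsAll(Arrays.asList(required));
    }

    public static void main(String[] args) {
        String attr = "large-4 medium-5 large-text-right medium-text-right columns";

        // Each class name is checked on its own, without the dot.
        System.out.println(hasAllClasses(attr,
                "large-4", "medium-5", "large-text-right", "medium-text-right", "columns"));

        // A compound selector string is not a class name, so this does not match.
        System.out.println(hasAllClasses(attr,
                ".large-4.medium-5.large-text-right.medium-text-right.columns"));
    }
}
```

Since the HTML is not shown, it is also worth checking whether these classes sit on the selected div itself or on one of its children; in the second case the check has to run on the child.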
URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
`<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table> `
My code:
public static void getData(String url, String Name, int ID) throws
IOException
{
String html = Jsoup.connect(url).execute().body();
html = html.replaceAll("<!---->", "");
html = html.replaceAll("<!--", "");
html = html.replaceAll("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
System.out.println("Elements " + tableElements);
for (Element tableElement : tableElements)
{
String tableId = tableElement.id();
if (tableId.isEmpty()) {
continue;
}
String fileName = "table" + Name + tableId + ID + ".csv";
System.out.println(fileName);
FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
//System.out.println(doc);
Elements tableRowElements = tableElement.select(":not(thead) tr td");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
writer.close();
}
}
The problem is that no elements are found. The same code works perfectly on another site which (seemingly) stores its data no differently.
Is there something about this website that prevents web scraping, or maybe a subtle difference?
Please note that the HTML provided is a shortened version.
As said in the comments, the data you are looking for is loaded dynamically, but you can fetch it with a simple GET request to this link:
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link I used the browser's developer tools and checked the XHR requests.
You can see that the link includes several parameters, among them PlayerID, which is identical to the number that appears in your initial link. By changing its value you can get the stats of other players.
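As a minimal sketch of swapping the PlayerID value, the helper below rebuilds the same query string with a different player id. It uses only the standard library (the Charset overload of URLEncoder.encode needs Java 10 or later). The class name NbaStatsUrl and the method forPlayer are hypothetical; all other parameter values are copied unchanged from the link above, and whether the endpoint accepts other combinations is not something this sketch verifies.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class NbaStatsUrl {

    private static final String ENDPOINT =
            "https://stats.nba.com/stats/playerdashptshotdefend";

    // Builds the request URL for one player; only PlayerID varies.
    static String forPlayer(int playerId) {
        // LinkedHashMap keeps the parameters in the order of the original link.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("DateFrom", "");
        params.put("DateTo", "");
        params.put("GameSegment", "");
        params.put("LastNGames", "0");
        params.put("LeagueID", "00");
        params.put("Location", "");
        params.put("Month", "0");
        params.put("OpponentTeamID", "0");
        params.put("Outcome", "");
        params.put("PORound", "0");
        params.put("PerMode", "PerGame");
        params.put("Period", "0");
        params.put("PlayerID", Integer.toString(playerId));
        params.put("Season", "2018-19");
        params.put("SeasonSegment", "");
        params.put("SeasonType", "Regular Season");
        params.put("TeamID", "0");
        params.put("VsConference", "");
        params.put("VsDivision", "");

        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            // URL-encode the value, e.g. the space in "Regular Season" becomes "+".
            query.append(entry.getKey())
                 .append('=')
                 .append(URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return ENDPOINT + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(forPlayer(1628381));
    }
}
```

The resulting string can then be passed to whatever HTTP client you already use, for example Jsoup.connect as in the question.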
I am trying to retrieve text page by page with a foreach loop. The flow is: it prints the text of each row on a page and, as soon as it finishes, goes to the next page and starts retrieving text again. The problem is that it retrieves the data of the first page multiple times, sometimes 2, 3, or 4 times. How can I make it run only once per page?
if (driver.findElement(By.xpath("//button[@ng-click='currentPage=currentPage+1']")).isEnabled()) {
int ilength = driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']")).size();
Outer: for (int i1 = ilength; i1 > 0;) {
List<WebElement> findData = driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']"));
for (WebElement webElement : findData) {
String printGroupName = webElement.getAttribute("value").toString();
System.out.println(printGroupName);
ilength--;
}
if (driver.findElement(By.xpath("//button[@ng-click='currentPage=currentPage+1']")).isEnabled()) {
action.moveToElement(driver.findElement(By.xpath("//button[@ng-click='currentPage=currentPage+1']"))).click().perform();
page.pagecallingUtility();
ilength = driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']")).size();
} else {
break Outer;
}
}
} else {
List<WebElement> findAllGroupName = driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']"));
for (WebElement webElement : findAllGroupName) {
String printGroupName = webElement.getAttribute("value").toString();
System.out.println(printGroupName);
}
}
Console data from which it retrieves the information:
Your loop can be simplified as below.
boolean newPageOpened = true;
while (newPageOpened) {
List<WebElement> findData = driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']"));
for (WebElement webElement : findData) {
if (webElement.isDisplayed()) {
String printGroupName = webElement.getAttribute("value").toString();
System.out.println(printGroupName);
}
}
WebElement nextButton = driver.findElement(By.xpath("//button[@ng-click='currentPage=currentPage+1']"));
if (nextButton.isEnabled()) {
action.moveToElement(nextButton).click().perform();
page.pagecallingUtility();
} else {
newPageOpened = false;
}
}
As for the contents of the first page printing again and again, I suspect that when you open the second page, the contents of the first page are simply hidden in the page. So when you use driver.findElements(By.xpath("//input[@ng-attr-id='{{item.attr}}']")), the hidden first-page elements are also found. The simple solution is to check whether the element is displayed before printing it.
I have been trying to write a simple program that navigates to a new page and fetches data from it, goes back in history, opens the next page and fetches its data, and so on until all the links have been visited and their data fetched.
After getting results on the site below, I am trying to loop through all the links in the first column, open them one by one, and extract text from each page. But the program below only visits the first link and then throws a StaleElementReferenceException. I have tried using Actions, but it didn't work, and I am not familiar with JavascriptExecutor. I also tried solutions posted on other SO questions, one of which was mine over here. I would like the mistake in the code below corrected, with working code.
public class Selenium {
private final static String CHROME_DRIVER = "C:\\Selenium\\chromedriver\\chromedriver.exe";
private static WebDriver driver = null;
private static WebDriverWait wait = null;
private void setConnection() {
try {
System.setProperty("webdriver.chrome.driver", CHROME_DRIVER);
driver = ChromeDriver.class.newInstance();
wait = new WebDriverWait(driver, 5);
driver.get("https://sanctionssearch.ofac.treas.gov");
this.search();
} catch (Exception e) {
e.printStackTrace();
}
}
private void search() {
try {
driver.findElement(By.id("ctl00_MainContent_txtLastName")).sendKeys("Dawood");
driver.findElement(By.id("ctl00_MainContent_btnSearch")).click();
this.extractText();
} catch (Exception e) {
e.printStackTrace();
}
}
private void extractText() {
try {
List<WebElement> rows = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr"));
List<WebElement> links = null;
for (int i = 1; i <= rows.size(); i++) {
links = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr/td[1]/a"));
for (int j = 0; j < links.size(); j++) {
System.out.println(links.get(j).getText() + ", ");
links.get(j).click();
System.out.println("After click");
driver.findElement(By.id("ctl00_MainContent_btnBack")).click();
this.search();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] ar) {
Selenium object = new Selenium();
object.setConnection();
}
}
Generally, you get a StaleElementReferenceException when the element's attributes or something else changed after the WebElement was created. For example, if you click an element, the page refreshes, and you then try to click the same element again, you get a stale element exception.
To overcome this, create a fresh WebElement whenever the page has changed or refreshed. The code below should give you the idea.
Example:
WebElement element = driver.findElement(By.xpath("//*[@id='StackOverflow']"));
element.click();
// page is refreshed
element.click(); // this throws StaleElementReferenceException
To overcome this, store the XPath in a string and use it to create a fresh WebElement as needed.
String xpath = "//*[@id='StackOverflow']";
driver.findElement(By.xpath(xpath)).click();
// page has been refreshed; now create a new element and work on it
driver.findElement(By.xpath(xpath)).click(); // this works
In your case, you collect a group of WebElements and iterate over them to get the text. It seems the page changes after the elements are collected, so getText throws a stale element exception. You can use a loop that creates each element on the go and reads its text.
for(int i = 0; i<5; i++)
{
String value = driver.findElement(By.xpath("//.....["+i+"]")).getText();
System.out.println(value);
}
Hope this helps you. Thanks.
The reason you get a StaleElementReferenceException is usually that you stored the element(s) in a variable, then performed some action that changed the page (for example, through an AJAX response), so the stored element became stale.
The best solution in such a case is not to store the element in a variable.
This should work.
links = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr/td[1]/a"));
for (int j = 0; j < links.size(); j++) {
// Re-locate the link on every iteration; references found before the previous click are stale.
WebElement link = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr/td[1]/a")).get(j);
System.out.println(link.getText() + ", ");
link.click();
System.out.println("After click");
driver.findElement(By.id("ctl00_MainContent_btnBack")).click();
this.search();
}
Please check this code
private void extractText() {
try {
List<WebElement> rows = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr"));
List<WebElement> links = null;
System.out.println(rows.size());
for (int i = 0; i < rows.size(); i++) {
links = driver.findElements(By.xpath("//*[@id='gvSearchResults']/tbody/tr/td[1]/a"));
WebElement ele= links.get(0);
System.out.println(ele.getText() + ", ");
ele.click();
System.out.println("After click");
driver.findElement(By.id("ctl00_MainContent_btnBack")).click();
}
} catch (Exception e) {
e.printStackTrace();
}
}
I have already generated an ont.owl file using Jena. First, I need to collect all the classes that the ontology contains into an array list. Second, I will supply some other classes (terms) in my code and check whether the generated ontology contains them. The following is my code so far.
m.read("http://localhost/myontofile/ont.owl");
ExtendedIterator<OntClass> classes = m.listClasses();
while (classes.hasNext()) {
OntClass takeclasses = (OntClass) classes.next();
String ontcls = takeclasses.getLocalName().toString();
ArrayList<String> listiter = new ArrayList<String>();
listiter.add(ontcls);
System.out.println("classes: " + listiter); ----------????
///////////////////////////////////////
ArrayList<String> tempTerms = new ArrayList<String>();
for(int i=0; i < terms.size(); i++) {
String aTerm = terms.get(i) ;
tempTerms.add(aTerm);
}
terms.add("Information");
terms.add("Video Information");
terms.add("Video Price Information");
terms.add("Video Maximum Price Information");
terms.add("Action Video Price Information");
for(int i=0; i < terms.size(); i++) {
if (listiter.equals(terms.get(i))==true) {
System.out.println("ok");
}
else {
System.out.println("no");
}
}
}
The result always falls into the else ("no") branch, and "classes" outputs only one class. What changes do I need to make?
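Two things in the code above would explain both symptoms. The ArrayList is created inside the while loop, so it never holds more than the one name from the current iteration. And listiter.equals(terms.get(i)) compares a list with a String, which is always false. A minimal sketch of the comparison step, assuming the local names have already been collected into a single list declared before the loop (for example by calling names.add(takeclasses.getLocalName()) on each iteration); TermLookup and checkTerms are hypothetical names, and the match is an exact string comparison:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TermLookup {

    // For each term, records whether it appears among the class names.
    static Map<String, Boolean> checkTerms(Collection<String> classNames,
                                           List<String> terms) {
        Set<String> known = new HashSet<>(classNames);
        // LinkedHashMap keeps the terms in the order they were given.
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String term : terms) {
            result.put(term, known.contains(term));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the names collected from m.listClasses().
        List<String> classNames = Arrays.asList("Information", "Video Information");
        List<String> terms = Arrays.asList(
                "Information",
                "Video Information",
                "Video Price Information");

        for (Map.Entry<String, Boolean> e : checkTerms(classNames, terms).entrySet()) {
            System.out.println(e.getKey() + ": " + (e.getValue() ? "ok" : "no"));
        }
    }
}
```

If the ontology's local names are written differently from the terms (for instance without spaces), the terms would need to be normalized the same way before comparing.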
I am working on a project where I am trying to fetch financial statements from the internet and use them in a Java application to automatically create ratios and charts.
The site I am using requires a login and password to get to the tables.
The tag is TBODY, but there are 2 other TBODYs in the HTML.
How can I use Java to print my table to a txt file that I can then use in my application?
What would be the best way to go about this, and what should I read up on?
If this were my project, I'd look into using an HTML parser, something like jsoup (although others are available). The jsoup site has a tutorial, and after playing with it a while, you'll likely find it pretty easy to use.
For example, for an HTML table like so:
jsoup could parse it like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TableEg {
public static void main(String[] args) {
String html = "http://publib.boulder.ibm.com/infocenter/iadthelp/v7r1/topic/" +
"com.ibm.etools.iseries.toolbox.doc/htmtblex.htm";
try {
Document doc = Jsoup.connect(html).get();
Elements tableElements = doc.select("table");
Elements tableHeaderEles = tableElements.select("thead tr th");
System.out.println("headers");
for (int i = 0; i < tableHeaderEles.size(); i++) {
System.out.println(tableHeaderEles.get(i).text());
}
System.out.println();
Elements tableRowElements = tableElements.select(":not(thead) tr");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
System.out.println("row");
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
System.out.println(rowItems.get(j).text());
}
System.out.println();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Resulting in the following output:
headers
ACCOUNT
NAME
BALANCE
row
0000001
Customer1
100.00
row
0000002
Customer2
200.00
row
0000003
Customer3
550.00
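The example above prints to the console, while the question asked for a text file. A minimal sketch of that last step, assuming the cell texts of each row have already been collected into lists of strings (for instance from rowItems.get(j).text() in the loop above). The class name TableTextWriter, the tab separator, and the temporary output file are choices made for illustration only:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TableTextWriter {

    // Turns each row into one line, with cells separated by tabs.
    static List<String> toLines(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(String.join("\t", row));
        }
        return lines;
    }

    // Writes the rows to the target file, one row per line, as UTF-8.
    static void write(Path target, List<List<String>> rows) throws IOException {
        Files.write(target, toLines(rows), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample rows taken from the output shown above.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("ACCOUNT", "NAME", "BALANCE"),
                Arrays.asList("0000001", "Customer1", "100.00"),
                Arrays.asList("0000002", "Customer2", "200.00"),
                Arrays.asList("0000003", "Customer3", "550.00"));

        Path out = Files.createTempFile("table", ".txt");
        write(out, rows);
        System.out.println("Wrote " + out);
    }
}
```

A tab-separated file can be read back with String.split("\t") on each line; a comma would work as well if none of the cells can contain one.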