Issue using Sibling Index in JSoup 1.8.2


New to Java. Trying to learn/practice JSoup.
Goal: Extract the "Close" column from the Historical Prices off Yahoo Finance.
The code below returns both the "Close" prices and the "Adj Close" prices.
"S" below is any ticker in the SP 500.
Found another post which I've been using as a model. (here: How to parse the cells of the 3rd column of a table?).
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MovingAverage200 {
    private String ticker;
    private String movingAverageURL;
    double movingAverage200;

    public MovingAverage200(String s) {
        ticker = s;
        movingAverageURL = ("https://finance.yahoo.com/q/hp?s=" + ticker + "+Historical+Prices");
    }

    public void setMovingAverage() {
        try {
            Document document = Jsoup.connect(movingAverageURL).get();
            // :eq(4) selects cells whose sibling index within their row is 4
            Elements prices = document.select("td.yfnc_tabledata1:eq(4)");
            for (Element price : prices) {
                System.out.println(price.text());
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

Hint posted by alkis:
If you are using a previous release (1.8.2) then just use the latest. There were some performance tweaks that introduced bugs regarding siblings.

Related

Webpage collector using google bot

I'm continuing a project that has been running for a few years at my university. One of its tasks is to collect web pages using the Google bot.
Due to a problem I cannot understand, the project no longer gets past this step. I have already researched a lot about what might be happening, for example whether some part of the code is outdated.
The code is in Java and uses Maven for project management.
I've tried updating some entries in Maven's "pom".
I also tried changing the part of the code that uses the bot, but nothing works.
I'm posting the part of the code that isn't working as it should:
private List<JSONObject> querySearch(int numSeeds, String query) {
List<JSONObject> result = new ArrayList<>();
start=0;
do {
String url = SEARCH_URL + query.replaceAll(" ", "+") + FILE_TYPE + "html" + START + start;
Connection conn = Jsoup.connect(url).userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").timeout(5000);
try {
Document doc = conn.get();
result.addAll(formatter(doc));
} catch (IOException e) {
System.err.println("Could not search for seed pages in IO.");
System.err.println(e);
} catch (ParseException e) {
System.err.println("Could not search for seed pages in Parse.");
System.err.println(e);
}
start += 10;
} while (result.size() < numSeeds);
return result;
}
What some of the variables are:
private static final String SEARCH_URL = "https://www.google.com/search?q=";
private static final String FILE_TYPE = "&fileType=";
private static final String START = "&start=";
private QueryBuilder queryBuilder;
public GoogleAjaxSearch() {
this.queryBuilder = new QueryBuilder();
}
Up to this point everything is fine: it connects with the bot user agent and gets HTML back from Google. The problem is separating out what was found and taking only the link, which should be matched by ("h3.r > a").
It does that in this part, via result.addAll(formatter(doc)):
public List<JSONObject> formatter(Document doc) throws ParseException {
List<JSONObject> entries = new ArrayList<>();
Elements results = doc.select("h3.r > a");
for (Element result : results) {
//System.out.println(result.toString());
JSONObject entry = new JSONObject();
entry.put("url", (result.attr("href").substring(6, result.attr("href").indexOf("&")).substring(1)));
entry.put("anchor", result.text());
So when it reaches Elements results = doc.select("h3.r > a"), it probably finds no h3, so the "results" list stays empty and the for loop is never entered. Control then returns to querySearch, which tries again without the result list having grown. The outcome is an infinite loop that keeps requesting data and never finds any.
If anyone here can help me, I'd appreciate it; I've been trying for a while and I don't know what else to do. Thanks in advance.
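One way to keep the loop described above from spinning forever is to cap the number of pages and stop as soon as a page yields no entries. The following is only a sketch of that idea, not the project's code: MAX_PAGES, BoundedSearch, and the fetchPage function (standing in for "download one page and run formatter(doc)") are hypothetical names, and plain strings replace the JSONObject entries so the example runs without any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class BoundedSearch {
    // Hypothetical cap on how many result pages to request before giving up.
    static final int MAX_PAGES = 5;

    // fetchPage is a stand-in for "download one page and run formatter(doc)":
    // it receives the start offset and returns the entries found on that page.
    static List<String> collect(int numSeeds, IntFunction<List<String>> fetchPage) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int page = 0; page < MAX_PAGES && result.size() < numSeeds; page++) {
            List<String> entries = fetchPage.apply(start);
            if (entries.isEmpty()) {
                // The selector matched nothing: stop instead of retrying forever.
                System.err.println("No results at start=" + start + "; check the selector.");
                break;
            }
            result.addAll(entries);
            start += 10;
        }
        return result;
    }

    public static void main(String[] args) {
        // A page source that never matches anything, like a stale selector.
        List<String> none = collect(20, start -> new ArrayList<>());
        System.out.println("Collected " + none.size() + " entries");
    }
}
```

With this shape, an empty page ends the search with a message instead of repeating the same request, which also makes it easier to see that the selector, rather than the connection, is what fails.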

Can I update my TableView items one at a time without blocking UI thread?

I have two methods. The first retrieves a list of results from a search method in another class.
/* 2 - Retrieve list of results */
qmitResultsList = QMITSearchUtil.execute(URL, keyword);
/* 3 - Show results */
populateTable(qmitResultsList, tableView);
The second, populateTable() adds all the items to the table at once by calling:
ObservableList<QMITResult> dataPriority = FXCollections.observableArrayList(
qmitResultsList
);
tableView.setItems(dataPriority);
My goal is to add each new element to the TableView as it is being processed in real-time. For example, instead of processing and returning the entire list in the first method, QMITSearchUtil.execute(), I would like to update the UI with each result that is returned, one at a time. How can this be accomplished? I've tried a few ways, using a Platform.runLater() hack for example, with no success...

I discovered the answer to my question. I first define the ObservableList for my TableView:
ObservableList<QMITResult> dataPriority = FXCollections.observableArrayList();
Then I pass that into the execute() method that runs the background thread:
private void execute(String URL, String keyword, ObservableList<QMITResult> dataPriority) throws Exception {
/* Download HTML page and create list of URLs from relevant links */
Elements links = getLinkList(URL);
List<QMITResult> qmitResults = new ArrayList<>();
new Thread(() -> {
for (Element link : links) {
try {
/* Create a list of formatted URLs to loop through */
String linkText = link.toString();
String titleText = link.text();
String formattedLink = StringUtils.substringBetween(linkText, "<a href=\"", "\"").replace("\\", "/");
System.out.println(titleText);
System.out.println(formattedLink);
/* Create Word Document for each link and parse for keyword */
QMITResult qmitResultNode = null;
try {
qmitResultNode = parseDocument(keyword, formattedLink, titleText);
} catch (Exception e) {
e.printStackTrace();
}
qmitResults.add(qmitResultNode);
dataPriority.add(qmitResultNode);
Thread.sleep(200);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}).start();
tableView.setItems(dataPriority);
}
The result is that while the list is being formed each TableView item is being individually published without blocking the main UI thread. They come one at a time.

AngularJs page issue with selecting an element and clicking it

I have a problem selecting and clicking an element so that the drop-down opens. Here is what I have tried up till now:
String csspath = "html body.ng-scope f:view form#wdesk.ng-pristine.ng-valid div.container div.ng-scope md-content.md-padding._md md-tabs.ng-isolate-scope.md-dynamic-height md-tabs-content-wrapper._md md-tab-content#tab-content-7._md.ng-scope.md-active.md-no-scroll div.ng-scope.ng-isolate-scope ng-include.ng-scope div.ng-scope accordion div.accordion div.accordion-group.ng-isolate-scope div.accordion-heading a.accordion-toggle.ng-binding span.ng-scope b.ng-binding";
String uxpath = "//html//body//f:view//form//div//div[2]//md-content//md-tabs//md-tabs-content-wrapper//md-tab-content[1]//div//ng-include//div//accordion//div//div[1]//div[1]//a";
String xpath2 = "/html/body/pre/span[202]/a";
xpath = "/html/body/f:view/form/div/div[2]/md-content/md-tabs/md-tabs-content-wrapper/md-tab-content[1]/div/ng-include/div/accordion/div/div[1]/div[1]/a/span/b";
try {
element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(csspath)));
locator = By.cssSelector(csspath);
driver.findElement(locator).click();
} catch (Exception e) {
System.out.println("Not found csspath");
}
try {
element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(xpath)));
locator = By.xpath(xpath);
driver.findElement(locator).click();
} catch (Exception e) {
System.out.println("Not found xpath");
}
try {
element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(uxpath)));
locator = By.xpath(uxpath);
driver.findElement(locator).click();
} catch (Exception e) {
System.out.println("Not found uxpath");
}
try {
element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(xpath2)));
locator = By.xpath(xpath2);
driver.findElement(locator).click();
} catch (Exception e) {
System.out.println("Not found xpath2");
}
However, nothing has worked so far. I want to select the responsibility code and give it values.
I would really appreciate any insight.
Thanks in advance.

The first issue (as already pointed out in the comments) is the absolute selectors you are using. For example, try to refactor your XPath selectors and make them relative.
The next issue is related to the AngularJS page itself. Look at Protractor, the testing framework for Angular built on WebDriverJS: it provides additional WebDriver-like functionality for testing Angular-based websites. Put simply, your code needs extra functionality that knows when Angular elements are available for interaction.
Here is how to port some of the most useful Protractor functions to Java (and Python):

Quote retweets Twitter4j

I have a problem when using twitter4j. When I get the timeline using this code:
try {
ResponseList<Status> tweets = twitter.getHomeTimeline();
for(Status s : tweets){
Tweet temp = new Tweet(new URL(s.getUser().getProfileImageURL()),s.getUser().getName(),"#"+s.getUser().getScreenName() , s.getText());
tweetsPanel.add(temp);
}
} catch (TwitterException | MalformedURLException e) {
e.printStackTrace();
}
(Tweet is local class) everything is OK except the retweets in the timeline are displayed as "Quote Retweet":
RT #SOMEONE : the tweet.
I want it like the website, just a normal retweet.

In twitter4j, retweets are shown in the format
RT #user : tweet
because that is the actual form a retweet takes in text. If you post a tweet on Twitter in this same format, Twitter itself will parse it as a normal retweet.
The only way to edit this out is to parse the text and remove the first part manually.
Try something like:
String tweetText = "";
String[] splitted = s.getText().split(":");
if (splitted.length > 2) {
    for (int i = 1; i < splitted.length - 1; i++) {
        tweetText += splitted[i] + ":";
    }
}
tweetText += splitted[splitted.length - 1];
return tweetText;
Starting the for loop at i=1 skips the first piece, which contains the "RT #user" part. Appending splitted[i]+":" puts back any other ":" present in the tweet that split would otherwise remove. Of course you don't want to introduce a ":" that was not there, so the last piece of splitted is appended outside the loop, without the +":".
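One caveat with splitting on ":" is that a tweet which is not a retweet but contains a colon also loses its first piece. A slightly narrower alternative, sketched here as an assumption rather than as the approach above, removes the marker only when the text really starts with "RT", a user handle, and a colon. RetweetText and stripRetweetPrefix are hypothetical names; the pattern accepts both "@user" and the "#user" spelling used in this thread. twitter4j's Status also exposes isRetweet() and getRetweetedStatus(), which may make text parsing unnecessary.

```java
import java.util.regex.Pattern;

class RetweetText {
    // Matches a leading "RT @user:" or "RT #user :" marker, with optional spaces.
    private static final Pattern RT_PREFIX =
            Pattern.compile("^RT\\s+[@#]\\w+\\s*:\\s*");

    // Returns the text without the retweet marker; other text is left untouched.
    static String stripRetweetPrefix(String text) {
        return RT_PREFIX.matcher(text).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripRetweetPrefix("RT #SOMEONE : the tweet.")); // the tweet.
        System.out.println(stripRetweetPrefix("Meeting at 10:30"));         // Meeting at 10:30
    }
}
```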

How to get h2 Tag of a table using Jsoup

I need some help scraping a webpage with Jsoup. I want to parse player profiles from the hcfactions webpage and gather their kills and deaths. The problem I'm running into is that each profile page is dynamically created and will only have said tables if the player has kills or deaths. So in order to tell which table I'm parsing, I need to get the header text that's set after the call.
example web page: http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox.
Below is a html segment from the web page I'm scraping:
<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>
<tr><td>Date</td><td>Reason</td><td>Details</td></tr><tr><td>Dec 11 5:27pm CST</td>.....
I have this code that pulls the tables and counts entries, but it won't pull the h2 tags with it for me to select.
public void getPlayerDetails(String name) {
String data = "";
Avatar temp = _db.getPlayer(name);
playerUrl = "http://www.hcfactions.net/index.php?action=playersearch&player=" + name;
try {
// data = Jsoup.connect(url)
// .url(url).get().html();
playerDoc = Jsoup.connect(playerUrl).get();
} catch (IOException ex) {
Logger.getLogger(JParser.class.getName()).log(Level.SEVERE, null, ex);
}
if (playerDoc.select("table").size() == 1) {
return;
} else if (playerDoc.select("table").size() >= 2) {
for (int x = 1; x < playerDoc.select("table").size(); x++) {
System.out.println("deaths");
Element table = playerDoc.select("table").get(x);
Iterator<Element> ite = table.select("tr").iterator();
int count = 0;
while (ite.hasNext()) {
data = ite.next().text();
count++;
}
if (count > 0) {
temp.setDeaths(count - 1);
}
}
}
}

The <h2> tag is in an invalid position. I think that's why JSoup cannot find it. You have to extract it yourself with regular expressions. You can get the content of the <h2> with the following code:
String tableToString = "<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>" + "<tr>" + "<td>Date</td>" + "<td>Reason</td>" + "<td>Details</td>" + "</tr>" + "</table>";
String regex = "<h2.*>(.*)?</h2>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(tableToString);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
You can init tableToString with table.toString() from your code.
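If the string holds more than one header on the same line (for example a "Kills" table followed by a "Deaths" table), the greedy ".*" in the pattern above would run to the last <h2> and report only that one. A possible variant, offered as a sketch and not as part of the original answer, restricts the opening tag with [^>]* and uses a reluctant group so that every header is found in order. HeaderRegex and headers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderRegex {
    // [^>]* stays inside one opening tag; (.*?) stops at the first </h2>.
    private static final Pattern H2 =
            Pattern.compile("<h2[^>]*>(.*?)</h2>", Pattern.DOTALL);

    // Collects the text of every <h2> element in document order.
    static List<String> headers(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = H2.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<h2 style='text-align:center'>Kills</h2><table></table>"
                + "<h2 style='text-align:center'>Deaths</h2><table></table>";
        System.out.println(headers(html)); // [Kills, Deaths]
    }
}
```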

As ka3ak says, the <h2> is mispositioned. But you don't have to abandon your parser and resort to regex for that. Assuming JSoup is a decent HTML parser (I have never used it myself), the <h2> element should be the element immediately preceding the <table> element. Get your 'select' statement to look for it there.

Elements headers=playerDoc.select("div.span10.offset1 h2");
IMHO your selections seem to be a little overcomplicated, but maybe they have to be like that. Anyway, the snippet above will get you every H2 tag present in the proper container.
Later on you can select the required tables like this: Elements tables=playerDoc.select("div.span10.offset1 table"); and apply the appropriate data digging to them. The headers will be in the same order as the tables, of course. I think my job is done here :)
