Get data from table using jSoup - java

I am looking to get data from the table on http://www.sportinglife.com/greyhounds/abc-guide using jSoup. I would like to put this data into some kind of table within my java program that I can then use in my code.
I'm not too sure how to do this. I have been playing around with jSoup and currently am able to get each cell from the table to print out using a while loop - but obviously can't use this always as the number of cells in the table will change.
Document doc = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
int n = 0;
while (n < 100){
Element tableHeader = doc.select("td").get(n);
for( Element element : tableHeader.children() )
{
// Here you can do something with each element
System.out.println(element.text());
}
n++;
}
Any idea of how I could do this?

Only a few steps are needed to achieve your goal. Take a look at this Groovy script - https://gist.github.com/wololock/568b9cc402ea661de546 - and let's walk through what it does.
List<Element> rows = document.select('table[id=ABC Guide] > tbody > tr')
Here we're specifying that we are interested in every row tr that is immediate child of tbody which is immediate child of table with id ABC Guide. In return you receive a list of Element objects that describes those tr rows.
Map<String, String> data = new HashMap<>()
We will store our result in a simple hash map for further evaluation e.g. putting those scraped data into the database.
for (Element row : rows) {
String dog = row.select('td:eq(0)').text()
String race = row.select('td:eq(1)').text()
data.put(dog, race)
}
Now we iterate over every Element and we select content as a text from the first cell: String dog = row.select('td:eq(0)').text() and we repeat this step to retrieve the content as a text from the second cell: String race = row.select('td:eq(1)').text(). Then we just simply put those data into the hash map. That's all.
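One thing to watch with a hash map keyed by dog name: it keeps only one race per dog and does not preserve the order of the rows. If either matters, a list of small row objects may fit better. Below is a minimal sketch using only the standard library. It assumes the cell texts have already been extracted; the hard-coded values and the names ScrapedRows, Entry and toEntries are illustrative and do not come from the gist.

```java
import java.util.ArrayList;
import java.util.List;

class ScrapedRows {
    // Hypothetical holder for one table row: first cell and second cell.
    static final class Entry {
        final String dog;
        final String race;

        Entry(String dog, String race) {
            this.dog = dog;
            this.race = race;
        }
    }

    // Converts raw cell texts (one String[] per row) into entries, keeping
    // page order and skipping rows with fewer than two cells.
    static List<Entry> toEntries(List<String[]> rawRows) {
        List<Entry> entries = new ArrayList<>();
        for (String[] cells : rawRows) {
            if (cells.length < 2) {
                continue; // e.g. a header or spacer row
            }
            entries.add(new Entry(cells[0].trim(), cells[1].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String[]> raw = new ArrayList<>();
        raw.add(new String[] {"Dog A", "Race 1"}); // made-up sample values
        raw.add(new String[] {"Dog A", "Race 2"}); // the same dog twice is kept
        raw.add(new String[] {"header only"});     // skipped: fewer than two cells
        for (Entry e : toEntries(raw)) {
            System.out.println(e.dog + " -> " + e.race);
        }
    }
}
```

In the scraping loop, each String[] would be replaced by the two text() results read from the row.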
I hope this example with provided description will help you with developing your application.
EDIT:
Java code sample - https://gist.github.com/wololock/8ccbc6bbec56ef57fc9e

Related

Finding all cells within the table instead of cells within current row in foreach loop in Java

My code is like below.
private List<WebElement> reports;
public List<WebElement> getReports(){
return Common.returnElementList(DriverFactory.getDriver(), reportsMenu, reports);
}
public Map<String, String> getReportDesc() {
Map<String, String> temp = new HashMap<>();
for(WebElement item: getReports()){
List<WebElement> cols = item.findElements(By.xpath("/child::td[@role='gridcell']"));
String key = Common.getElementText(DriverFactory.getDriver(), cols.get(0));
String desc = Common.getElementText(DriverFactory.getDriver(), cols.get(1));
temp.put(key, desc);
}
return temp;
}
With item.findElements(By.xpath("/child::td[@role='gridcell']")); I am trying to get the cells of that specific row, but instead I am getting all the cells in that table.
How Can I get the specific row columns?
You need to use relative XPath locator.
Since you didn't share a link to the page containing that table we can only guess. So, I guess:
Instead of
item.findElements(By.xpath("/child::td[@role='gridcell']"));
try this:
item.findElements(By.xpath(".//td[@role='gridcell']"));
The dot . in front of the XPath makes it a relative XPath: we want to locate elements matching //td[@role='gridcell'] inside the current node item.
Otherwise the driver will search for all elements matching the /child::td[@role='gridcell'] expression from the first, top element on the page.
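The difference between a leading // and a leading .// can be checked without a browser. The sketch below uses the JDK's own javax.xml.xpath on a small made-up XML table rather than Selenium, so it only illustrates the XPath rule; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Made-up table: two rows with two grid cells each.
    static final String XML =
        "<table>"
        + "<tr><td role='gridcell'>A</td><td role='gridcell'>first</td></tr>"
        + "<tr><td role='gridcell'>B</td><td role='gridcell'>second</td></tr>"
        + "</table>";

    // Counts the nodes matched by the expression when the first <tr>
    // is used as the context node.
    static int countFromFirstRow(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node firstRow = (Node) xpath.evaluate("/table/tr[1]", doc, XPathConstants.NODE);
            NodeList cells = (NodeList) xpath.evaluate(expression, firstRow, XPathConstants.NODESET);
            return cells.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Relative: only the cells inside the context row.
        System.out.println(countFromFirstRow(".//td[@role='gridcell']")); // 2
        // Absolute: starts at the document root, so every row is searched.
        System.out.println(countFromFirstRow("//td[@role='gridcell']"));  // 4
    }
}
```

If only direct children of the row are wanted, ./td[@role='gridcell'] is the narrower relative form.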

Grabbing all data values from a web table

I need to go through all the table values and collect them into an ArrayList (or some other structure you would suggest).
First-row xpath list
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[1]/th[1]
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[1]/td[1]
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[1]/td[2]
.
.
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[1]/td[5]
2nd row, few xpaths
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[2]/th[1]
/html[1]/body[1]/div[4]/div[1]/main[1]/div[1]/div[3]/div[2]/div[1]/div[1]/div[4]/div[2]/table[1]/tbody[1]/tr[2]/td[1]
Please suggest some custom keyword logic to capture those values in an easy way.
DOM
We can start with a relative XPath for the table's tbody, then use the tagName method of the By class with the HTML tag names 'tr' and 'td' to fetch the row and column elements.
We can then save the values to an ArrayList, as shown in the code below.
Note - the first call, which gets the table, is findElement; the remaining calls are findElements because we want all elements with the tr and td tag names.
@Test
public void testWebTable() {
WebElement simpleTable = driver.findElement(By.xpath("//table[1]/tbody[1]"));
// Get all rows
List<WebElement> rows = simpleTable.findElements(By.tagName("tr"));
List<String> webTableData = new ArrayList<String>();
// Print/Save data from each row
for (WebElement row : rows) {
List<WebElement> cols = row.findElements(By.tagName("td"));
for (WebElement col : cols) {
webTableData.add(col.getText());
System.out.print(col.getText() + "\t");
}
System.out.println();
}
}
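Two details may be worth adjusting in the approach above. Judging by the XPaths in the question, each row starts with a th cell, which a lookup by the td tag name does not match, and a single flat list loses the row boundaries. The following sketch keeps one list per row and accepts both th and td. It uses the JDK's DOM parser on a made-up XML table as a stand-in for the WebDriver calls, and the names TableGrid and readRows are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableGrid {
    // Parses a small XML table and returns one list of cell texts per <tr>.
    static List<List<String>> readRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<List<String>> grid = new ArrayList<>();
            NodeList rows = doc.getElementsByTagName("tr");
            for (int r = 0; r < rows.getLength(); r++) {
                List<String> cells = new ArrayList<>();
                NodeList children = rows.item(r).getChildNodes();
                for (int c = 0; c < children.getLength(); c++) {
                    Node child = children.item(c);
                    if (!(child instanceof Element)) {
                        continue; // skip whitespace text nodes
                    }
                    String tag = child.getNodeName();
                    if (tag.equals("th") || tag.equals("td")) {
                        cells.add(child.getTextContent().trim());
                    }
                }
                grid.add(cells);
            }
            return grid;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><tbody>"
            + "<tr><th>r1</th><td>a</td><td>b</td></tr>"
            + "<tr><th>r2</th><td>c</td></tr>"
            + "</tbody></table>";
        System.out.println(readRows(xml)); // [[r1, a, b], [r2, c]]
    }
}
```

With Selenium, the same effect should be achievable by selecting the cells of each row with a CSS selector such as "th, td" instead of the td tag name.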
You can use PHP Simple HTML Dom parser library to parse those table data easily. Check out https://simplehtmldom.sourceforge.io/

How to get just the desired field from an array of sub documents in Mongodb using Java

I have just started using MongoDB. Below is my data structure.
It has an array of skillIDs, each of which has an array of activeCampaigns, and each activeCampaign has an array of callsByTimeZone.
What I am looking for in SQL terms is :
Select activeCampaigns.callsByTimeZone.label,
activeCampaigns.callsByTimeZone.loaded
from X
where skillID=50296 and activeCampaigns.campaign_id= 11371940
and activeCampaigns.callsByTimeZone='PT'
The output I am expecting is
{"label":"PT", "loaded":1 }
The command I used is
db.cd.find({ "skillID" : 50296 , "activeCampaigns.campaignId" : 11371940,
"activeCampaigns.callsByTimeZone.label" :"PT" },
{ "activeCampaigns.callsByTimeZone.label" : 1 ,
"activeCampaigns.callsByTimeZone.loaded" : 1 ,"_id" : 0})
The output I am getting is everything under activeCampaigns.callsByTimeZone, while I am expecting just the PT entry.
DataStructure :
{
"skillID":50296,
"clientID":7419,
"voiceID":1,
"otherResults":7,
"activeCampaigns":
[{
"campaignId":11371940,
"campaignFileName":"Aaron.name.121.csv",
"loaded":259,
"callsByTimeZone":
[{
"label":"CT",
"loaded":6
},
{
"label":"ET",
"loaded":241
},
{
"label":"PT",
"loaded":1
}]
}]
}
I tried the same in Java.
QueryBuilder query = QueryBuilder.start().and("skillID").is(50296)
.and("activeCampaigns.campaignId").is(11371940)
.and("activeCampaigns.callsByTimeZone.label").is("PT");
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1).append("_id", 0);
DBCursor cursor = coll.find(query.get(), fields);
String campaignJson = null;
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.toString();
System.out.println(campaignJson);
}
The value obtained is everything under the callsByTimeZone array. I am currently parsing the JSON obtained and getting only the PT values. Is there a way to query just the PT fields inside activeCampaigns.callsByTimeZone?
Thanks in advance. Sorry if this question has already been raised in the forum; I have searched a lot and failed to find a proper solution.
There are several ways of doing it, but you should not be using String manipulation (i.e. indexOf), the performance could be horrible.
The results in the cursor are nested Maps, representing the document in the database - a Map is a good Java-representation of key-value pairs. So you can navigate to the place you need in the document, instead of having to parse it as a String. I've tested the following and it works on your test data, but you might need to tweak it if your data is not all exactly like the example:
while (cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
List callsByTimezone = (List) ((DBObject) ((List) campaignDBO.get("activeCampaigns")).get(0)).get("callsByTimeZone");
DBObject valuesThatIWant = null; // stays null if no "PT" entry is found
for (Object o : callsByTimezone) {
DBObject call = (DBObject) o;
if ("PT".equals(call.get("label"))) { // constant first, so a missing label cannot throw
valuesThatIWant = call;
}
}
}
Depending upon your data, you might want to add protection against null values as well.
The thing you were looking for ({"label":"PT", "loaded":1 }) is in the variable valuesThatIWant. Note that this, too, is a DBObject, i.e. a Map, so if you want to see what's inside it you need to use get:
valuesThatIWant.get("label"); // will return "PT"
valuesThatIWant.get("loaded"); // will return 1
Because DBObject is effectively a Map of String to Object (i.e. Map<String, Object>) you need to cast the values that come out of it (hence the ugliness in the first bit of code in my answer) - with numbers, it will depend on how the data was loaded into the database, it might come out as an int or as a double:
String theValueOfLabel = (String) valuesThatIWant.get("label"); // will return "PT"
double theValueOfLoaded = (Double) valuesThatIWant.get("loaded"); // will return 1.0
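Because the stored type may be an int, a long or a double, a direct cast to Double can fail with a ClassCastException when the value is actually an Integer. One way around this is to go through Number. A minimal sketch, with a plain java.util.Map standing in for the DBObject and with the hypothetical names NumberCasts and asDouble:

```java
import java.util.HashMap;
import java.util.Map;

class NumberCasts {
    // Reads a numeric field whether it was stored as Integer, Long or Double.
    static double asDouble(Map<String, Object> values, String key) {
        Object raw = values.get(key);
        if (!(raw instanceof Number)) {
            throw new IllegalArgumentException(key + " is not numeric: " + raw);
        }
        return ((Number) raw).doubleValue();
    }

    public static void main(String[] args) {
        Map<String, Object> values = new HashMap<>();
        values.put("label", "PT");
        values.put("loaded", 1); // autoboxed to Integer, not Double
        System.out.println(asDouble(values, "loaded")); // 1.0
    }
}
```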
I'd also like to point out the following from my answer:
((List) campaignDBO.get("activeCampaigns")).get(0)
This assumes that "activeCampaigns" is a) a list and in this case b) only has one entry (I'm doing get(0)).
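If those assumptions are too strong for your data, the navigation can check each level and scan every campaign instead of only the first. The sketch below uses plain java.util maps and lists as stand-ins for the driver's objects, so it runs without MongoDB; the names TimeZoneLookup, findByLabel and sampleDocument are hypothetical, and the sample mirrors the structure shown in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TimeZoneLookup {
    // Returns the first callsByTimeZone entry with the given label, or null.
    // Every level is type-checked, and all campaigns are scanned.
    static Map<?, ?> findByLabel(Map<String, Object> document, String label) {
        Object campaigns = document.get("activeCampaigns");
        if (!(campaigns instanceof List)) {
            return null; // field missing or not a list
        }
        for (Object campaign : (List<?>) campaigns) {
            if (!(campaign instanceof Map)) {
                continue;
            }
            Object zones = ((Map<?, ?>) campaign).get("callsByTimeZone");
            if (!(zones instanceof List)) {
                continue;
            }
            for (Object zone : (List<?>) zones) {
                if (zone instanceof Map && label.equals(((Map<?, ?>) zone).get("label"))) {
                    return (Map<?, ?>) zone;
                }
            }
        }
        return null;
    }

    static Map<String, Object> zone(String label, int loaded) {
        Map<String, Object> z = new HashMap<>();
        z.put("label", label);
        z.put("loaded", loaded);
        return z;
    }

    // Builds a document shaped like the one in the question.
    static Map<String, Object> sampleDocument() {
        List<Object> zones = new ArrayList<>();
        zones.add(zone("CT", 6));
        zones.add(zone("ET", 241));
        zones.add(zone("PT", 1));
        Map<String, Object> campaign = new HashMap<>();
        campaign.put("campaignId", 11371940);
        campaign.put("callsByTimeZone", zones);
        List<Object> campaigns = new ArrayList<>();
        campaigns.add(campaign);
        Map<String, Object> document = new HashMap<>();
        document.put("skillID", 50296);
        document.put("activeCampaigns", campaigns);
        return document;
    }

    public static void main(String[] args) {
        Map<?, ?> pt = findByLabel(sampleDocument(), "PT");
        System.out.println(pt.get("label") + " " + pt.get("loaded")); // PT 1
    }
}
```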
You will also have noticed that the fields values you've set are almost entirely being ignored, and the result is most of the document, not just the fields you asked for. I'm pretty sure you can only define the top-level fields you want the query to return, so your code:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1)
.append("_id", 0);
is actually exactly the same as:
BasicDBObject fields = new BasicDBObject("activeCampaigns", 1).append("_id", 0);
I think some of the points that will help you to work with Java & MongoDB are:
When you query the database, it will return you the whole document of the thing that matches your query, i.e. everything from "skillID" downwards. If you want to select the fields to return, I think those will only be top-level fields. See the documentation for more detail.
To navigate the results, you need to know that DBObjects are returned, and that these are effectively a Map<String, Object> in Java - you can use get to navigate to the correct node, but you will need to cast the values into the correct shape.
Replacing the while loop in your Java code with the one below seems to give "PT" as output.
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.get("activeCampaigns").toString();
int labelInt = campaignJson.indexOf("PT", -1);
String label = campaignJson.substring(labelInt, labelInt+2);
System.out.println(label);
}

Retrieving Reviews from Amazon using JSoup

I'm using JSoup to retrieve reviews from a particular webpage on Amazon and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of html which has the reviews but I want only the text without all the tags div etc. I want to then write all this information into a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll want to first use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
List<Element> revList = new ArrayList<Element>();
Elements eles = reviews.select("div[style=margin-left:0.5em;]");
for (Element element : eles) {
revList.add(element);
}
return revList;
}
If you analyze each element, you should see how amazon further subdivides the information held including the title of the review, the date of the review and the body of the text it holds.
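For the second half of the question, writing the extracted text to a file, the standard library is enough once the review texts are available as strings. A minimal sketch, assuming each review has already been reduced to one string (for example via text() on each element); the names ReviewWriter, writeReviews and roundTrip are hypothetical, and the sample strings are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ReviewWriter {
    // Writes one review per line as UTF-8, replacing any existing file.
    static void writeReviews(Path target, List<String> reviewTexts) {
        try {
            Files.write(target, reviewTexts, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the texts to a temporary file and reads them back,
    // mainly to show that the file contains one line per review.
    static List<String> roundTrip(List<String> reviewTexts) {
        try {
            Path tmp = Files.createTempFile("reviews", ".txt");
            writeReviews(tmp, reviewTexts);
            List<String> lines = Files.readAllLines(tmp, StandardCharsets.UTF_8);
            Files.delete(tmp);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up placeholder texts standing in for the scraped reviews.
        List<String> reviews = Arrays.asList("First review text", "Second review text");
        System.out.println(roundTrip(reviews));
    }
}
```

The one-line-per-review layout only holds if the individual texts contain no line breaks of their own.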

How we get the List objects in backward direction?

Hi, I am getting a List that contains POJO objects for the rows of a table, and I have to show the table data in reverse order.
For example, when I add rows to a table in the database, the most recently added row is stored last. I have to show the whole content of the table in my JSP page in reverse order, meaning that what I inserted most recently is displayed as the first row.
My code looks like this:
List lst = tabledate.getAllData();//return List<Table> Object
Iterator it = lst.iterator();
MyTable mt = new MyTable();//pojo class
while(it.hasNext())
{
mt=(MyTable)it.next();
//getting data from getters.
System.out.println(mt.getxxx());
System.out.println(mt.getxxx());
System.out.println(mt.getxxx());
System.out.println(mt.getxxx());
}
Use a ListIterator to iterate through the list using hasPrevious() and previous():
ListIterator it = lst.listIterator(lst.size());
while(it.hasPrevious()) {
System.out.println(it.previous());
}
You cannot use a plain Iterator in this case, because it only moves forward. One option is index-based access:
int size = lst.size();
for (int i=size - 1; i >= 0; i --)
{
MyTable mt = (MyTable)lst.get(i);
....
}
Btw: there is no need to create a new MyTable() before the loop. This is an instance that will be thrown away immediately and serves no purpose.
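Both suggestions, the ListIterator and the index-based loop, produce the same order. The sketch below puts them side by side with generics so that no casts are needed; the class and method names are hypothetical, and strings stand in for the MyTable objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

class ReverseOrder {
    // Walks the list backwards with a ListIterator; the input is not modified.
    static <T> List<T> viaListIterator(List<T> source) {
        List<T> out = new ArrayList<>();
        ListIterator<T> it = source.listIterator(source.size());
        while (it.hasPrevious()) {
            out.add(it.previous());
        }
        return out;
    }

    // Walks the list backwards by index; the input is not modified.
    static <T> List<T> viaIndex(List<T> source) {
        List<T> out = new ArrayList<>();
        for (int i = source.size() - 1; i >= 0; i--) {
            out.add(source.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("oldest", "middle", "newest");
        System.out.println(viaListIterator(rows)); // [newest, middle, oldest]
        System.out.println(viaIndex(rows));        // [newest, middle, oldest]
    }
}
```

For an ArrayList the two are equivalent in cost; for a LinkedList the ListIterator is the better choice, because get(i) has to walk the list on every call.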
