parsing a table with jsoup


I'm trying to extract the email address and the phone number from a LinkedIn profile using jsoup. Each of these values sits in its own table. I have written code to extract them, but it doesn't work, and it needs to work on any LinkedIn profile. Any help or guidance would be much appreciated.
public static void main(String[] args) {
try {
String url = "https://fr.linkedin.com/";
// fetch the document over HTTP
Document doc = Jsoup.connect(url).get();
// get the page title
String title = doc.title();
System.out.println("Last & first name: " + title);
// first method
Elements table = doc.select("div[class=more-info defer-load]").select("table");
Iterator < Element > iterator = table.select("ul li a").iterator();
while (iterator.hasNext()) {
System.out.println(iterator.next().text());
}
// second method
for (Element tablee: doc.select("div[class=more-info defer-load]").select("table")) {
for (Element row: tablee.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 0) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
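} catch (IOException e) {
// Jsoup.connect(...).get() throws IOException on network or HTTP errors
e.printStackTrace();
}
}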
Here is an example of the HTML I'm trying to extract from (taken from a LinkedIn profile):
<table summary="Coordonnées en ligne">
<tr>
<th>E-mail</th>
<td>
<div id="email">
<div id="email-view">
<ul>
<li>
adam1adam#gmail.com
</li>
</ul>
</div>
</div>
</td>
</tr>
<tr class="no-contact-info-data">
<th>Messagerie instantanée</th>
<td>
<div id="im" class="editable-item">
</div>
</td>
</tr>
<tr class="address-book">
<th>Carnet d’adresses</th>
<td>
<span class="address-book">
<a title="Une nouvelle fenêtre s’ouvrira" class="address-book-edit" href="/editContact?editContact=&contactMemberID=368674763">Ajouter</a> des coordonnées.
</span>
</td>
</tr>
</table>
<table summary="Coordonnées">
<tr>
<th>Téléphone</th>
<td>
<div id="phone" class="editable-item">
<div id="phone-view">
<ul>
<li>0021653191431 (Mobile)</li>
</ul>
</div>
</div>
</td>
</tr>
<tr class="no-contact-info-data">
<th>Adresse</th>
<td>
<div id="address" class="editable-item">
<div id="address-view">
<ul>
</ul>
</div>
</div>
</td>
</tr>
</table>

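To scrape the email address and phone number, use CSS selectors that target the id attributes of the enclosing elements (div#email-view and div#phone-view). The first selector assumes the address is wrapped in a mailto link inside the list item, which the sample above does not show; if the address is plain text, select the li and call text(), as for the phone number.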
String email = doc.select("div#email-view > ul > li > a").attr("href");
System.out.println(email);
String phone = doc.select("div#phone-view > ul > li").text();
System.out.println(phone);
See CSS Selectors for more information.
Output
mailto:adam1adam#gmail.com
0021653191431 (Mobile)

Related

Multiple items in table cell

I am using Thymeleaf to develop a website, and it has worked fine so far. However, when I want to present multiple items in a single table cell, it instead adds extra separate cells to the rows where the values exist. I have no solution to that problem so far; if anyone can see what I'm missing, I'd greatly appreciate it.
Thanks in advance!
Code(EDITED)
Here I first check whether there are handlers (in the database), then iterate over them with th:each, but I have found no way to put all the items in a single cell (that is, a single table data element, represented here by the <td> tag). I've put the logic inside the tag, which worked somewhat well; however, rows with no data get extra cells, like in the picture below.
<td th:if="${place.getHandler} != null" th:each="handlerList : ${place.getHandler}" th:text="${handlerList.name}"></td>
New Code
<td> <span th:if="${place.getHandler} != null" th:each="handlerList : ${place.getHandler}" th:text="${handlerList.name}"></span></td>
The entire code
<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org" class="has-navbar-fixed-top">
<head th:replace="common_fragments/header :: header">
<meta charset="utf-8">
<link rel="stylesheet" href="../../../../public/css/font-awesome.min.css"/>
<link rel="stylesheet" href="../../../../public/css/bulma_custom.min.css"/>
</head>
<body>
<div id="navbar-top">
<nav th:replace="logged_in/admin/fragments/navbar :: nav"></nav>
</div>
<main>
<section class="section">
<div class="container">
<h1 class="title">
Matched students
</h1>
<hr>
<div class="content is-medium">
<table id="table" class="table is-bordered is-narrow is-hoverable">
<thead>
<tr>
<th>Student</th>
<th>Student email</th>
<th>Student mobile number</th>
<th>Unit</th>
<th>Supervisor</th>
<th>Supervisor email</th>
<th>Supervisor mobile number</th>
</tr>
</thead>
<tbody>
<tr th:each="place : ${places}">
<td th:text="${place.student.studentData.name}"></td>
<td th:text="${place.student.studentData.email}"></td>
<td th:if="${place.student.studentData.phoneNumber} != ''" th:text="${place.student.studentData.phoneNumber}"></td>
<td th:if="${place.student.studentData.phoneNumber} == ''" >
<p class="icon has-text-danger">
<i class="fa fa-lg fa-times"></i>
</p>
</td>
<td th:text="|${place.unit.name} (Regions: ${place.unit.municipality.getRegionNamesString}, Municipalities: ${place.unit.municipality.name})|"></td>
<td> <span th:if="${place.getHandledare} != null" th:each="handledareList : ${place.getHandledare}" th:text="${handledareList.name}"></span></td>
<td th:if="${place.getHandledare} != null" th:text="${place.getHandledare[0].email}"></td>
<td th:if="${place.getHandledare} == null">
<p class="icon has-text-danger">
<i class="fa fa-lg fa-times"></i>
</p>
</td>
<td th:if="${place.getHandledare} == null">
<p class="icon has-text-danger">
<i class="fa fa-lg fa-times"></i>
</p>
</td>
<div th:if="${place.getHandledare} != null">
<td th:if="${place.getHandledare[0].phoneNumber} != ''" th:text="${place.getHandledare[0].phoneNumber}"></td>
</div>
<div th:if="${place.getHandledare} == null">
<td >
<p class="icon has-text-danger">
<i class="fa fa-lg fa-times"></i>
</p>
</td>
</div>
<div th:if="${place.getHandledare} != null">
<td th:if="${place.getHandledare[0].phoneNumber} == '' ">
<p class="icon has-text-danger">
<i class="fa fa-lg fa-times"></i>
</p>
</td>
</td>
</div>
</tr>
</tbody>
</table>
<br>
<button class="button is-large is-success" id="download-button">Download matching results</button>
<br>
<br>
</div>
</div>
</section>
</main>
<footer th:replace="common_fragments/footer :: footer"></footer>
<script>
function htmlToCSV(html, filename) {
var data = [];
var rows = document.querySelectorAll("table tr");
for (var i = 0; i < rows.length; i++) {
var row = [], cols = rows[i].querySelectorAll("td, th");
for (var j = 0; j < cols.length; j++) {
row.push(cols[j].innerText);
}
data.push(row.join(";"));
}
downloadCSVFile(data.join("\n"), filename);
}
</script>
<script>
function downloadCSVFile(csv, filename) {
var csv_file, download_link;
csv_file = new Blob(["\uFEFF"+csv], {type: "text/csv"});
download_link = document.createElement("a");
download_link.download = filename;
download_link.href = window.URL.createObjectURL(csv_file);
download_link.style.display = "none";
document.body.appendChild(download_link);
download_link.click();
}
</script>
<script>
document.getElementById("download-button").addEventListener("click", function () {
var html = document.querySelector("table").outerHTML;
htmlToCSV(html, "matchning.csv");
});
</script>
</body>
</html>

You can move your Thymeleaf logic from the <td> tag into a tag inside the <td> tag - for example, a <span>:
<td>
<span th:if="${place.getHandler} != null"
th:each="handlerList : ${place.getHandler}"
th:text="${handlerList.name}"></span>
</td>
From there you can add whatever CSS you may need to format the spans.
If you have extra <td> cells you need to suppress, then move the th:if expression to inside the <td> tag:
<td th:if="${place.getHandler} != null">
<span th:each="handlerList : ${place.getHandler}"
th:text="${handlerList.name}"></span>
</td>
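An alternative to repeating a <span> per handler is to build the cell text on the Java side and bind it with a single th:text. The sketch below is only an illustration under assumptions: HandlerNames and joined are hypothetical names that do not appear in the question, and the handler names are passed in as plain strings. One possible benefit is that the names get an explicit separator, so the CSV export that reads each cell's innerText receives one tidy value per cell.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not from the question): turns a list of handler names
// into the text for one table cell.
class HandlerNames {

    // Returns the names joined with ", ", or an empty string when there are none.
    static String joined(List<String> names) {
        if (names == null) {
            return "";
        }
        return names.stream()
                .filter(name -> name != null && !name.isEmpty()) // skip blank entries
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(joined(Arrays.asList("Anna", "Bo")));
        System.out.println(joined(null).isEmpty());
    }
}
```

A getter on the place object could return this string, and the template would then need only one <td> with a th:text per row; whether that fits depends on how the model is built.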

JSP paging problem: the search keyword isn't applied when I go to the next page

When I search for a specific word, only the first page is filtered: it shows the page links and the matching posts correctly.
But when I go to page 2 or any later page, the search keyword is no longer applied.
Is this a problem with the URL?
I suspect the SQL or Paging.java, because when I log the page number in BDAO it correctly shows the page I clicked.
I also don't know how to pass keyWord and keyField along to the next page.
I use Oracle DB.
<%
String keyWord = (String)request.getParameter("keyWord");
String keyField = (String)request.getParameter("keyField");
%>
<script>
function searchCheck(frm){
// search
if(frm.keyWord.value ==""){
alert("Please enter a search term.");
frm.keyWord.focus();
return;
}
frm.submit();
}
function PageMove(page){
var keyWord = '<%=keyWord%>'
var keyField = '<%=keyField%>'
console.log(keyWord);
if(keyWord !=''){
location.href = "list.do?page="+page+"&keyWord=" + keyWord + "&keyField=" + keyField;
}
location.href = "list.do?page="+page;
}
</script>
</head>
<body>
<table width="800" cellpadding="0" cellspacing="0" border="1">
<tr>
<td>No.</td>
<td>Name</td>
<td>Title</td>
<td>Date</td>
<td>Hits</td>
</tr>
<c:forEach items="${list}" var="dto">
<tr>
<td>${dto.bId}</td>
<td>${dto.bName}</td>
<td>
<c:forEach begin="1" end="${dto.bIndent}">-</c:forEach>
${dto.bTitle}</td>
<td>${dto.bDate}</td>
<td>${dto.bHit}</td>
</tr>
</c:forEach>
<tr>
<td colspan="5">
<form action="list.do" method="post" name="search">
<select name="keyField">
<option value="bTitle">Post title</option>
<option value="bContent">Post content</option>
<option value="bName">Author</option>
</select>
<input type="text" name="keyWord">
<input type="button" value="Search" onclick="searchCheck(form)">
</form>
</td>
</tr>
<tr>
<td colspan="5"> Write post </td>
</tr>
</table>
<div class="toolbar-bottom">
<div class="toolbar mt-lg">
<div class="sorter">
<ul class="pagination">
<li>First</li>
<li>Previous</li>
<c:forEach var="i" begin="${paging.startPageNo}" end="${paging.endPageNo}" step="1">
<c:choose>
<c:when test="${i eq paging.pageNo}">
<li class="active">${i}</li>
</c:when>
<c:otherwise>
<li>${i}</li>
</c:otherwise>
</c:choose>
</c:forEach>
<li>Next</li>
<li>Last</li>
</ul>
</div>
</div>
</div>

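PageMove() only receives the page number, so the keyword and key field never reach the next request. You can look their values up inside the function instead of passing them as parameters. The search inputs in the question have name attributes but no id, so getElementById would not find them; the version below reads them through the form named search and URL-encodes them. It assumes the search box still holds the keyword when a page link is clicked:
function PageMove(page){
// the inputs have name attributes only, so go through the form
var form = document.forms["search"];
var keyWord = encodeURIComponent(form.keyWord.value);
var keyField = encodeURIComponent(form.keyField.value);
location.href = "list.do?page=" + page + "&keyWord=" + keyWord + "&keyField=" + keyField;
}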
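The same idea can be applied on the server when the pagination links are rendered, so that every page link already carries the search parameters. A minimal Java sketch follows, assuming a hypothetical helper class PageLinks (not part of the question's code) that the controller or the JSP could call; it URL-encodes the values so that spaces or non-ASCII keywords survive the round trip. Two details in the question's script are worth checking as well: <%=keyWord%> renders the text null when the parameter is absent, so the keyWord != '' test passes anyway, and the second location.href assignment runs unconditionally after the first, so the last assignment is likely the one that takes effect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (an assumption, not from the question): builds the href
// for one pagination link and keeps the search parameters when a search is active.
class PageLinks {

    static String pageUrl(int page, String keyField, String keyWord) {
        StringBuilder url = new StringBuilder("list.do?page=").append(page);
        // Only append the search parameters when a keyword was actually entered.
        if (keyWord != null && !keyWord.isEmpty()) {
            url.append("&keyField=").append(encode(keyField))
               .append("&keyWord=").append(encode(keyWord));
        }
        return url.toString();
    }

    // URL-encodes a value as UTF-8; null is treated as an empty string.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value == null ? "" : value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageUrl(2, "bTitle", "java tips"));
        System.out.println(pageUrl(3, null, null));
    }
}
```

On the receiving side, list.do would read keyField and keyWord with request.getParameter exactly as the question already does, and apply them to the query for every page, not only the first.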

First value empty while reading cells of a html table's row using Selenium/java

I have a very strange problem with reading data from a table row. This particular row has a few cells; the first two are date-times (03/27/2017 08:30). The HTML:
<div class="container-fluid container-full" id="OUTPUTSECTION">
<div class="row">
<div class="col-xs-12">
<div class="row">
<div class="col-xs-12">
<div class="row">
<div class="col-xs-12">
<table border="0" cellpadding="2" cellspacing="1" width="1080" class="table table-condensed-extra table-striped table-hover">
<tbody><tr><td colspan="50" style="text-align: center" class="tableHead">Document History Report</td></tr><tr bgcolor="#ffffff">
<td class="tableSubhead">
<span class="table-header">Start Time</span></td>
<td class="tableSubhead">
<span class="table-header">End Time</span></td>
<td class="tableSubhead">
<span class="table-header">Machine</span></td>
<td class="tableSubhead">
<span class="table-header">Site</span></td>
<td class="tableSubhead">
<span class="table-header">Operator</span></td>
<td class="tableSubhead">
<span class="table-header">Disposition</span></td>
<td class="tableSubhead">
<span class="table-header">Status</span></td>
<td class="tableSubhead">
<span class="table-header">Result</span></td>
<td class="tableSubhead">
<span class="table-header">Data Source</span></td>
</tr>
<tr class="tableTextWhite">
<td align="CENTER">03/27/2017 08:30</td>
<td align="CENTER">03/27/2017 08:30</td>
<td align="CENTER">TMX_01</td>
<td align="CENTER">Techmex</td>
<td align="CENTER">Anne</td>
<td align="CENTER">Completed</td>
<td align="CENTER">Good</td>
<td align="CENTER">Finished</td>
<td align="CENTER">D:\TMX_01\WORKING\27003001.txt/1</td>
</tr>
</tbody></table>
</div>
</div>
</div>
</div>
<script type="text/javascript">
parent.endTimems = new Date().getTime();
if( parent.startTimems )
{
parent.timeTakenms = parent.endTimems - parent.startTimems;
parent.startTimems = null;
if (parent.debugdiv && parent.timeTakenms )
parent.debugdiv.innerHTML = parent.timeTakenms/1000 + ' sec';
}
</script>
</div>
</div>
</div>
A basic table really... Here is my method, which reads that code using xpath.
public String[] getTimesFromDocumentHistoryReportPage() {
String XPATH_DETAILS_BASE = "//div['OUTPUTSECTION']/table/tbody/tr[3]/td";
String[] data = new String[2];
for (int i = 0; i < 2; i++) {
String XPATH_DETAILS = XPATH_DETAILS_BASE + "[" + (i + 1) + "]";
data[i] = getElement(By.xpath(XPATH_DETAILS)).getText();
}
return data;
}
For data[0] I am getting an empty value, even though pasting the same HTML and XPath into one of the online testers (videlibri.sourceforge.net/cgi-bin/xidelcgi) returns a valid result. The next pass returns data[1], which has the correct value. What am I missing here?

OK, I'm not sure what happened, but since you can successfully get the data from the page source, I suggest passing it to an HTML parser (I use jsoup) and extracting the data from there.
public String[] getTimesFromDocumentHistoryReportPage() {
Document document = Jsoup.parse(driver.getPageSource());
Elements elements = document.select("#OUTPUTSECTION .tableTextWhite > td");
String[] data = new String[2];
for (int i = 0; i < 2; i++) {
data[i] = elements.get(i).text();
}
return data;
}
I'm not at my computer right now and haven't run this code, so there may be syntax errors; let me know and I'll fix them.
You can download jsoup here: https://jsoup.org/download
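One detail in the question's locator may be worth checking independently of the parser used: in XPath, div['OUTPUTSECTION'] is a predicate containing a non-empty string literal, which is always true, so it does not filter by id at all, whereas div[@id='OUTPUTSECTION'] does. The sketch below shows the difference with the JDK's built-in XPath engine on a small well-formed stand-in document. The markup and the class name XPathPredicateDemo are made up for illustration, and this does not by itself explain the empty first cell.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: compares a string-literal predicate with an id predicate.
class XPathPredicateDemo {

    // Well-formed stand-in for the page: an unrelated div plus the report container.
    static final String XML =
          "<root>"
        + "<div id='OTHER'><table><tbody><tr><td>unrelated</td></tr></tbody></table></div>"
        + "<div id='OUTPUTSECTION'><div><table><tbody>"
        + "<tr><td>Document History Report</td></tr>"
        + "<tr><td>Start Time</td><td>End Time</td></tr>"
        + "<tr><td>03/27/2017 08:30</td><td>03/27/2017 08:30</td></tr>"
        + "</tbody></table></div></div>"
        + "</root>";

    private static InputSource source() {
        return new InputSource(new StringReader(XML));
    }

    // Number of nodes selected by the given node-set expression.
    static double count(String nodeSetExpression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                    "count(" + nodeSetExpression + ")", source(), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // String value of the first node selected by the expression.
    static String text(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, source());
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The string-literal predicate keeps every div; the id predicate keeps one.
        System.out.println(count("//div['OUTPUTSECTION']"));
        System.out.println(count("//div[@id='OUTPUTSECTION']"));
        System.out.println(text("//div[@id='OUTPUTSECTION']//table/tbody/tr[3]/td[1]"));
    }
}
```

Note the // before table in the last expression: in the question's HTML the table is nested several divs below the OUTPUTSECTION container, not a direct child of it.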

Given the piece of code I was struggling with:
<tr class="tableTextWhite">
<td align="CENTER">03/27/2017 08:30</td>
<td align="CENTER">03/27/2017 08:30</td>
<td align="CENTER">TMX_01</td>
<td align="CENTER">Techmex</td>
<td align="CENTER">Anne</td>
<td align="CENTER">Completed</td>
<td align="CENTER">Good</td>
<td align="CENTER">Finished</td>
<td align="CENTER">D:\TMX_01\WORKING\27003001.txt/1</td>
</tr>
This is my final, working solution as proposed by @SDBot:
public String[] getTimesFromDocumentHistoryReportPage() {
String htmlSource = driver.getPageSource();
final Pattern pattern = Pattern.compile("<td align=\"CENTER\">(.+?)</td>");
final String[] tagValues = new String[2];
final Matcher matcher = pattern.matcher(htmlSource);
for (int i = 0; i < 2; i++) {
matcher.find();
tagValues[i] = matcher.group(1);
}
return tagValues;
}
The method searches the given htmlSource and finds each piece of data (.+?) located between the <td align="CENTER"> and </td> HTML tags. This was enough in my case.
Since I am interested in the first two cell values, it is sufficient to make two iterations and return the result. Test passed. Thank you!
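The loop above calls matcher.find() without checking its result, so matcher.group(1) throws IllegalStateException if the page contains fewer than two matching cells. A slightly more defensive variant, sketched here with a hypothetical class name CellRegexDemo and a shortened copy of the row as input, collects values only while find() succeeds. Like the original, it relies on the exact <td align="CENTER"> markup and is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: same regex as the answer, but with a guarded find() loop.
class CellRegexDemo {

    private static final Pattern CELL =
            Pattern.compile("<td align=\"CENTER\">(.+?)</td>");

    // Returns up to 'limit' cell values, in document order.
    static List<String> firstCells(String html, int limit) {
        List<String> values = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        // Stop when enough values were collected or no further cell matches.
        while (values.size() < limit && matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<tr class=\"tableTextWhite\">"
                + "<td align=\"CENTER\">03/27/2017 08:30</td>"
                + "<td align=\"CENTER\">03/27/2017 08:30</td>"
                + "<td align=\"CENTER\">TMX_01</td></tr>";
        System.out.println(firstCells(html, 2));
    }
}
```

The caller can then check the size of the returned list and fail the test with a clear message when a cell is missing, instead of getting an exception from the matcher.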

Handling multiple tables using selenium webdriver

I am checking the folder hierarchy on a webpage, depending on the type of user. User1 has a set of permissions which enable him to see the folder structure like this :
Main Folder
- First Child
-First Grandchild
-Second Grandchild
- Second Child
- Third Child
Each branch of the tree is a table consisting of 1 row. But the number of columns varies depending on the generation.
The "Main Folder" parent has only 1 column. The cell content is the string "Main Folder".
The children branches have 2 columns, the first cell containing blank space, and the next cell containing the name of the branch ("First Child", "Second Child").
The grandchildren branches have 3 columns, the first and second cells containing blank space, and the third cell containing the name of the branch ("First Grandchild", "Second Grandchild").
HTML code :
<div id = 0>
<div id = 1>
<table id = 1>
<tbody>
<tr>
<td id="content1"
<a id="label1"
<span id="treeNode1"
Main Folder
</span>
</a>
</td>
</tr>
</tbody>
</table>
<div id = 2>
<table id = 2>
<tbody>
<tr>
<td>
<td id="content2"
<a id="label2"
<span id="treeNode2"
First Child
</span>
</a>
</td>
</td>
</tr>
</tbody>
</table>
<div id = 5>
<table id = 5>
<tbody>
<tr>
<td>
<td>
<td id="content5"
<a id="label5"
<span id="treeNode5"
First GrandChild
</span>
</a>
</td>
</td>
</td>
</tr>
</tbody>
</table>
</div>
<div id = 6>
<table id = 6>
<tbody>
<tr>
<td>
<td>
<td id="content6"
<a id="label6"
<span id="treeNode6"
Second GrandChild
</span>
</a>
</td>
</td>
</td>
</tr>
</tbody>
</table>
</div>
</div> /* End of division 2 */
<div id = 3>
<table id = 3>
<tbody>
<tr>
<td>
<td id="content3"
<a id="label3"
<span id="treeNode3"
Second Child
</span>
</a>
</td>
</td>
</tr>
</tbody>
</table>
</div>
<div id = 4>
<table id = 4>
<tbody>
<tr>
<td>
<td id="content4"
<a id="label4"
<span id="treeNode4"
Third Child
</span>
</a>
</td>
</td>
</tr>
</tbody>
</table>
</div>
</div> /*End of division 1 */
</div> /* End of division 0 */
User2 has a different set of permissions, which enable him to see the folder structure like this :
Main Folder
- First Child
-First Grandchild
- Second Child
- Third Child
The corresponding table is absent in the html code for this user.
My test case is to check User2 doesn't have access to the second grandchild. This means I need to ensure that particular table doesn't exist on the webpage.
How can I check this in selenium ? I am using JUnit for my test cases. I want to do an "assert" to ensure the second grandchild is not present.

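You'll want to check that the element is either not present or not visible. Wrapping a call to isElementVisible() in assertFalse should do the trick, for example assertFalse(isElementVisible(By.id("treeNode6"))) for the second grandchild in the HTML above. Just build a By locator for each element you want to check, and make sure the exception you catch is org.openqa.selenium.NoSuchElementException rather than java.util.NoSuchElementException.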
private boolean isElementVisible(By by)
{
try
{
return driver.findElement(by).isDisplayed();
}
catch(NoSuchElementException e)
{
return false;
}
}

How to parse a nested Div into a table structure using Jsoup

I have a div structure like this:
<div class="DivClass-1"> Div One
<div class="DivClass-A"> Div A </div>
</div>
<div class="DivClass-2"> Div Two
<div class="DivClass-A"> Div B </div>
</div>
<div class="DivClass-3"> Div Three
<div class="DivClass-A"> Div C </div>
</div>
<div class="DivClass-4"> Div Four
<div class="DivClass-A"> Div D </div>
</div>
I want to parse it and convert this div structure into a table structure.
Can anybody give me an idea of how to achieve this?

Use replaceAll() to replace all the div tags.
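As a rough sketch of that suggestion, the snippet below uses String.replaceAll with regular expressions on the structure from the question. It assumes exactly this flat shape (one outer div per row, one inner DivClass-A div per row, class as the only attribute) and would break on anything more varied; the class name DivToTableRegex and the choice to put each row's own text into a first cell are assumptions made for illustration.

```java
// Illustration only: regex-based rewrite of the question's fixed div layout.
class DivToTableRegex {

    static String toTable(String html) {
        // 1. Each inner DivClass-A div becomes a cell.
        String rows = html.replaceAll(
                "(?s)<div class=\"DivClass-A\">\\s*(.*?)\\s*</div>", "<td>$1</td>");
        // 2. Each outer opening tag plus its leading text becomes a row start and a first cell.
        rows = rows.replaceAll(
                "(?s)<div class=\"DivClass-\\d+\">\\s*(.*?)\\s*<td>", "<tr><td>$1</td><td>");
        // 3. The closing div tags that remain belong to the outer divs and close the rows.
        rows = rows.replace("</div>", "</tr>");
        return "<table>" + rows + "</table>";
    }

    public static void main(String[] args) {
        String html = "<div class=\"DivClass-1\"> Div One "
                + "<div class=\"DivClass-A\"> Div A </div></div>";
        System.out.println(toTable(html));
    }
}
```

For real pages a parser such as jsoup is the safer route, since regular expressions cannot track nesting in general.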

It is not clear to me which <div> tags you want to convert to <tr> and which to <td>.
I assume DivClass-1, DivClass-2, DivClass-3 and DivClass-4 are converted to <tr> tags, and the others to <td> tags.
I hope the following code gives you an idea.
StringBuffer myHTML = new StringBuffer();
myHTML.append("<div class=\"DivClass-1\"> Div One <div class=\"DivClass-A\"> Div A </div> </div>" +
"<div class=\"DivClass-2\"> Div Two<div class=\"DivClass-A\"> Div B </div></div>" +
"<div class=\"DivClass-3\"> Div Three<div class=\"DivClass-A\"> Div C </div></div>" +
"<div class=\"DivClass-4\"> Div Four <div class=\"DivClass-A\"> Div D </div></div>");
Document myDoc = Jsoup.parse(myHTML.toString());
//get DivClass-1, DivClass-2, etc.
Elements DivClass = myDoc.select("div").not("div.DivClass-A");
Elements DivClass_A = myDoc.select("div.DivClass-A");
//rename the tag <div class="DivClass-1"> to <tr class="DivClass-1">
DivClass.tagName("tr");
//renamed the tag <div class="DivClass-A"> to <td class="DivClass-A">
DivClass_A.tagName("td");
System.out.println(myDoc.toString());
Here's the printout:
<tr class="DivClass-1">
Div One
<td class="DivClass-A"> Div A </td>
</tr>
<tr class="DivClass-2">
Div Two
<td class="DivClass-A"> Div B </td>
</tr>
<tr class="DivClass-3">
Div Three
<td class="DivClass-A"> Div C </td>
</tr>
<tr class="DivClass-4">
Div Four
<td class="DivClass-A"> Div D </td>
</tr>
