Parse the html code or use regex with java?


I'm trying to extract the values of this piece of html code:
<ul id="tree-dotlrn_class_instance">
<li>
**2011-12 Ampl.Arquit.Computadors Gr.A (13000)**
<ul>
<li>
**2011-12 Entorns d'Usuari Gr.A Sgr.T00 (13022)**
</li>
<li>
**2011-12 Eng.Serv.Telemàtics Gr.A Sgr.T00 (13036)**
</li>
</ul>
</li>
<li>
**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A (13038)**
</li>
<li>
**2011-12 Processad.Llenguatge Gr.A (13048)**
<ul>
<li>
**2011-12 Processad.Llenguatge Gr.A Sgr.L01 (13048)**
</li>
<li>
**2011-12 Processad.Llenguatge Gr.A Sgr.T00 (13048)**
</li>
<li>
**2011-12 Sist.Basats Microprocessadors Gr.A Sgr.L02 (13052)**
</li>
</ul>
</li>
<li>
**2011-12 Sist.Informàtics Gr.AA (13055)**
</li>
<li>
**2011-12 Administrac. Gestió de Xarxes Gr.A (14009)**
</li>
<li>
**2011-12 Transmissió de Dades Gr.A** (15656)
</li>
</ul>
I want to put everything shown in bold (between the ** markers), together with its href value, into a HashMap. First I tried the Jericho HTML Parser, but I found it too complicated. Then I tried a regex, but I don't know exactly how to write it.
Can you help me?
Thanks!
Update: I'm trying this, but it doesn't give me what I want.
Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for(int j = 0; j < tam1; j++){
Element e1 = Form1.get(j);
if("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))){
List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
for (int k = 0; k < L1.size(); k++){
Element e2 = L1.get(k);
System.out.println("Elemento de la lista L1: "+e2.getContent());
List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
for(int m = 0; m < L2.size(); m++){
Element e3 = L2.get(m);
System.out.println("Elemento de la lista L2: "+e3.getContent());
asignaturas.add(e3.getContent().toString());
System.out.println("Lista de asignaturas "+m+" "+asignaturas.get(0));
}
}
}
}

Take a look at JSoup's selector syntax.
If you are looking for all a elements with an href attribute, you can find them like this:
String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]");
From there, you should be able to extract the text of the element and the value of the href attribute to create your HashMap.

Regex:
\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>
Java String:
"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"
You probably won't find it reliable but the 1st matching group will be your desired string if the pages match what you supplied.
Here is a place to test Java regular expressions
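As a rough sketch of the regex route, the snippet below collects link text and href into a map using only java.util.regex. It is a variant of the pattern above, not the same expression: it captures the href in group 1 and the link text in group 2, and it uses non-greedy matching so one match cannot run across several links. The class name, method name and sample hrefs are made up for illustration, since the original markup with its a tags is not shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorRegexSketch {
    // Group 1: href value, group 2: link text.
    // [^"']* and (.*?) keep each match inside a single <a> element.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns link text -> href, in document order. */
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.put(m.group(2).trim(), m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical hrefs; the real ones are not shown in the question.
        String html = "<ul><li><a href=\"/dotlrn/classes/example-a\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A (13000)</a></li>"
                + "<li><a href='/dotlrn/classes/example-b'>"
                + "2011-12 Processad.Llenguatge Gr.A (13048)</a></li></ul>";
        extractLinks(html).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```

This only works while href is the first attribute of the a tag and the links contain no nested markup, which is why a real parser is usually the safer choice.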

Why not use the DOM API? You can get attributes and values fairly trivially with it.

You can surely try using XML Pull Parsing or DOM, given that the input HTML is well formed.
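A minimal sketch of the DOM route, assuming the page (or at least the list fragment) is well-formed XML. It uses only the JDK's built-in parser; the class name, method name and sample hrefs are invented for the example. Real-world HTML is often not well formed, in which case the parser throws and this approach does not apply.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomLinksSketch {
    /** Returns link text -> href for every <a> element in a well-formed fragment. */
    static Map<String, String> extractLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            Map<String, String> links = new LinkedHashMap<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                links.put(a.getTextContent().trim(), a.getAttribute("href"));
            }
            return links;
        } catch (Exception e) {
            // Thrown, among other cases, when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical hrefs; the real ones are not shown in the question.
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\">"
                + "<li><a href=\"/dotlrn/classes/example-a\">2011-12 Ampl.Arquit.Computadors Gr.A (13000)</a>"
                + "<ul><li><a href=\"/dotlrn/classes/example-b\">2011-12 Entorns d'Usuari Gr.A Sgr.T00 (13022)</a></li></ul>"
                + "</li></ul>";
        extractLinks(xhtml).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```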

Related

Restrict number of table pages in Thymeleaf + Java

I followed this tutorial:
https://www.youtube.com/watch?v=Aie8n12EFQc
Pagination works. How can I restrict the number of page links shown below the table? If there are too many records, there are too many page numbers.
The numbers are below the closing /table tag, not inside the table (as in the tutorial).
I know there are some solutions on Stack Overflow, but they are mostly PHP code. Can it be done server-side in Java or with Thymeleaf? Or can you point me to an existing, well-answered question on this topic?
My thymeleaf:
<div th:if = "${totalPages > 1}">
<nav aria-label="Page navigation" class="paging">
<ul class="pagination">
<li class="page-item">
<a class="page-link" th:href="@{/customer_list?pageNo=1}">First</a>
</li>
<li class="page-item">
<a class="page-link" th:href="@{/customer_list?pageNo={currentPage}(currentPage=${currentPage-1})}">Previous</a>
</li>
<li th:each="i: ${#numbers.sequence(1, totalPages)}" th:classappend="${i==currentPage} ? 'page-item active' : 'page-item'">
<a class="page-link" th:href="@{/customer_list?pageNo={i}(i=${i})}">[[${i}]]</a>
</li>
<li class="page-item">
<a class="page-link" th:href="@{/customer_list?pageNo={currentPage}(currentPage=${currentPage+1})}">Next</a>
</li>
<li class="page-item">
<a class="page-link" th:href="@{/customer_list?pageNo={totalPages}(totalPages=${totalPages})}">Last</a>
</li>
</ul>
</nav>
</div>
Controller method in Java: (pageSize only 2, for experimenting)
@GetMapping("/customer_list")
public String customerList(@RequestParam(name = "pageNo", required = false) Integer pageNo, Model model)
{
if(pageNo == null)
pageNo = 1;
int pageSize = 2;
try
{
Page<CustomerDTO> pageOfCustomers = customerService.findPaginated(pageNo, pageSize);
List<CustomerDTO> listOfCustomers = pageOfCustomers.getContent();
model.addAttribute("listOfCustomers", listOfCustomers);
model.addAttribute("currentPage", pageNo);
model.addAttribute("totalPages", pageOfCustomers.getTotalPages());
model.addAttribute("totalItems", pageOfCustomers.getTotalElements());
} catch (Exception e)
{
log.error("Error in retrieving the list of customers!", e);
e.printStackTrace();
}
return "customer_list";
}

Have a look at this post: pagination
It should answer your question.
In a nutshell, you add to your template an array whose length equals the number of pages.
Then, with a few conditions, you display either dots or the link to the page.
It will give something like:
<< < ... 3456789 ... > >>
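The same idea can also be computed server-side. The following is only a sketch under assumptions: the class name, the method name, the radius parameter and the use of 0 as a marker for a "..." gap are all made up for illustration and do not come from the linked post.

```java
import java.util.ArrayList;
import java.util.List;

class PageWindowSketch {
    /**
     * Returns the page numbers to render around currentPage.
     * The value 0 marks a gap that the template should show as "...".
     */
    static List<Integer> pageWindow(int currentPage, int totalPages, int radius) {
        List<Integer> items = new ArrayList<>();
        if (totalPages < 1) {
            return items;
        }
        // Clamp the current page into the valid range.
        int current = Math.max(1, Math.min(currentPage, totalPages));
        int first = Math.max(1, current - radius);
        int last = Math.min(totalPages, current + radius);
        if (first > 1) {
            items.add(0); // pages were skipped before the window
        }
        for (int p = first; p <= last; p++) {
            items.add(p);
        }
        if (last < totalPages) {
            items.add(0); // pages were skipped after the window
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(pageWindow(6, 20, 3));
    }
}
```

In the controller, the result could be added to the model under a name of your choice and iterated with th:each in place of #numbers.sequence(1, totalPages), rendering dots for 0 and a page link otherwise.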

Can't get text from dropdown options

driver.findElement(By.xpath("//*[@id=\"_desktop_currency_selector\"]/div")).click();
List<WebElement> list = driver.findElements(By.xpath("//*[@id=\"_desktop_currency_selector\"]/div/ul//li"));
System.out.println(list.size());
for (int i=0; i<list.size(); i++){
System.out.println(list.get(i).getText());
}
Output:
3
EUR €
I need to get the text of the items in the dropdown. My code finds all the li elements and prints how many there are, but when I try to print their visible text I only get the text of the first option.
I would be very grateful for a hint.
Part of page source:
<div id="_desktop_currency_selector">
<div class="currency-selector dropdown js-dropdown">
<span>Currency:</span>
<span class="expand-more _gray-darker hidden-sm-down" data-toggle="dropdown">UAH ₴</span>
<a data-target="#" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false" class="hidden-sm-down">
<i class="material-icons expand-more"></i>
</a>
<ul class="dropdown-menu hidden-sm-down" aria-labelledby="dLabel">
<li >
<a title="European EURO" rel="nofollow" href="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=2" class="dropdown-item">EUR €</a>
</li>
<li class="current" >
<a title="Ukrainian UAH" rel="nofollow" href="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=1" class="dropdown-item">UAH ₴</a>
</li>
<li >
<a title="Dollar USA" rel="nofollow" href="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=3" class="dropdown-item">USD $</a>
</li>
</ul>
<select class="link hidden-md-up">
<option value="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=2">EUR €</option>
<option value="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=1" selected="selected">UAH ₴</option>
<option value="http://wasd.com.ua/ru/search?order=product.price.desc&s=dress&SubmitCurrency=1&id_currency=3">USD $</option>
</select>
</div>

I added an a tag at the end of the XPath, which seems to work:
List<WebElement> list = driver.findElements(By.xpath("//*[@id=\"_desktop_currency_selector\"]/div/ul/li/a"));
Also, if you just want to print the text inside the <option> tags, use this:
List<WebElement> list = driver.findElements(By.xpath("//*[@id=\"_desktop_currency_selector\"]/div/select/option"));
Both produce the same result.
Output:
3
EUR €
UAH ₴
USD $

// prepare an empty list
List<String> texts = new ArrayList<String>();
// get the dropdown element; By.className does not accept several class names
// separated by a space, so a CSS selector is used instead
WebElement dropDown = driver.findElement(By.cssSelector("select.link.hidden-md-up"));
// get the dropdown options
List<WebElement> options = dropDown.findElements(By.tagName("option"));
// collect the texts
for (WebElement option : options) {
texts.add(option.getText());
}

how to display span class field

I am trying to print the two "text text-pass" values from the HTML shown in my Chrome browser to the console, but it does not work. Any advice, please?
The HTML in my browser:
<a href="/abc/123" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
<a href="/abc/1234" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
My code
String text123 = driver.findElement(By.xpath("//*[@id=\"js-app\"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(text123);
String text1234 = driver.findElement(By.xpath("//*[@id=\"js-app\"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(text1234);

You can use .findElements to get multiple elements matching the same pattern; it returns a list.
UPDATE
Regarding your comment: you need to put the strings into a list and then check it with the Collection.contains() method:
List<String> results = new ArrayList<>();
List<WebElement> elements = driver.findElements(By.xpath("//div[@class='sidebar-text']//span"));
for(WebElement element: elements) {
String attr = element.getAttribute("class");
results.add(attr);
System.out.println(attr);
}
if(results.contains("text text-fail")) {
System.out.println("this is list contains 'text text-fail'");
}

Try this code:
String pass = driver.findElement(By.xpath("//*[@class='sidebar-text']/span")).getAttribute("class");
System.out.println(pass);

Build <ul> list recursively

So, I have this site structure
Page 1
Page 1.1
Page 2
Page 2.1
Page 2.1.1
Page 2.2
Page 3
Page 3.1
Page 3.2
Page 4
and I want to build a <ul> list using a recursive function. My function looks like this:
public String getMenu(Page rootPage, boolean base){
final Logger log = LoggerFactory.getLogger(this.getClass());
Iterator<Page> subPages = rootPage.listChildren();
StringBuilder output = new StringBuilder("<ul");
output.append(" id=\"drop-menu\"");
output.append(" class=\"popup-menu\">");
if(!base){
output.append("<li><a href=\"").append(rootPage.getPath()).append(".html\" class=\"showSubPage\" rel=\"").append(rootPage.getPath()).append("\">");
String title = rootPage.getPageTitle() == null ? rootPage.getTitle() : rootPage.getPageTitle();
output.append(title);
output.append("</a>");
output.append("</li>");
output.append("</ul>");
}
while(subPages.hasNext()){
output.append("<ul>");
log.info("som subpages here!");
Page curPage = subPages.next();
output.append("<li><a href=\"").append(curPage.getPath()).append(".html\" class=\"showSubPage\" rel=\"").append(curPage.getPath()).append("\">");
String title = curPage.getPageTitle() == null ? curPage.getTitle() : curPage.getPageTitle();
output.append(title);
output.append("</a>");
Iterator<Page> subSub = curPage.listChildren();
int tmpCtr = 0;
while(subSub.hasNext()){
tmpCtr++;
output.append(getMenu(subSub.next(), false));
}
output.append("</li>");
output.append("</ul>");
}
return output.toString();
}
and the output looks like this
<ul id="drop-menu" class="popup-menu">
<ul>
<li><a href="/menu-hier/afsafa.html" class="showSubPage" >Page 1</a>
<ul id="drop-menu" class="popup-menu">
<li>Page 1.1
</li>
</ul>
</li>
</ul>
<ul>
<li>Page 2
<ul id="drop-menu" class="popup-menu">
<li>Page 2.1
</li>
</ul>
<ul>
<li>Page 2.1.1
</li>
</ul>
<ul id="drop-menu" class="popup-menu">
<li>Page 2.2
</li>
</ul>
</li>
</ul>
<ul>
<li>Page 3
<ul id="drop-menu" class="popup-menu">
<li>Page 3.1
</li>
</ul>
<ul id="drop-menu" class="popup-menu">
<li>Page 3.2
</li>
</ul>
<ul id="drop-menu" class="popup-menu">
<li>Page 3.3
</li>
</ul>
</li>
</ul>
<ul>
<li>Page 4
</li>
</ul>
So the problem is, that the level 3 pages aren't placed properly. For example the Page 2.1.1 isn't under the Page 2.1 section.
Thanks for any help!

I'm not sure what you want your HTML to look like, but:
1) In your code, the <ul> tag is inserted twice for sub-pages (once in the while loop, then again in the recursive call to getMenu()).
2) I think you are missing the <li> tags below the <ul> tags.
3) Your code looks quite redundant and complex. Can't it be done more simply, like this (not tested):
public String getMenu(Page page, boolean isRoot) {
StringBuilder output = new StringBuilder();
if (isRoot) {
output.append("<ul id=\"drop-menu\"");
output.append(" class=\"popup-menu\">");
}
else {
output.append("<ul>");
}
output.append("<li><a href=\"")
.append(page.getPath())
.append(".html\" class=\"showSubPage\" rel=\"")
.append(page.getPath()).append("\">");
String title = page.getPageTitle() == null ? page.getTitle() : page.getPageTitle();
output.append(title);
output.append("</a>");
Iterator<Page> subPages = page.listChildren();
while(subPages.hasNext()){
output.append(getMenu(subPages.next(), false));
}
output.append("</li>");
output.append("</ul>");
return output.toString();
}
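For comparison, here is a self-contained variant that opens one <ul> per level of children instead of one per page, so Page 2.1.1 ends up inside the <li> of Page 2.1. It is only a sketch: Node is a hypothetical stand-in for the Page type from the question, renderChildren is an invented name, the rel and class attributes are left out for brevity, and titles are not HTML-escaped.

```java
import java.util.ArrayList;
import java.util.List;

class MenuSketch {
    /** Hypothetical stand-in for the Page type used in the question. */
    static final class Node {
        final String title;
        final String path;
        final List<Node> children = new ArrayList<>();

        Node(String title, String path) {
            this.title = title;
            this.path = path;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Renders the children of parent as a single <ul>, recursing inside each <li>. */
    static String renderChildren(Node parent, boolean isRoot) {
        if (parent.children.isEmpty()) {
            return ""; // no empty <ul> for leaf pages
        }
        StringBuilder out = new StringBuilder(
                isRoot ? "<ul id=\"drop-menu\" class=\"popup-menu\">" : "<ul>");
        for (Node child : parent.children) {
            out.append("<li><a href=\"").append(child.path).append(".html\">")
               .append(child.title).append("</a>")
               .append(renderChildren(child, false)) // nested list stays inside this <li>
               .append("</li>");
        }
        return out.append("</ul>").toString();
    }

    public static void main(String[] args) {
        Node root = new Node("root", "/root")
                .add(new Node("Page 1", "/p1").add(new Node("Page 1.1", "/p1/p11")))
                .add(new Node("Page 2", "/p2")
                        .add(new Node("Page 2.1", "/p2/p21").add(new Node("Page 2.1.1", "/p2/p21/p211")))
                        .add(new Node("Page 2.2", "/p2/p22")));
        System.out.println(renderChildren(root, true));
    }
}
```

Because the id is written only for the root call, it appears once in the output rather than on every nested list.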

Unable to parse value from HTML using jsoup

I'm relatively new to using jsoup, and I can't seem to find the correct query to parse out the value I'm looking for. The HTML is as follows.
<img src='http://rootzwiki.com/public/style_images/ginger/t_unread.png' alt='New Replies' /><br />
</a>
</td>
<td class='col_f_content '>
<h4><a id="tid-link-12251" href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='View topic, started 17 December 2011 - 09:32 AM' class='topic_title'>[ROM][LTE] RootzBoat 4.0.3 V6.1</a></h4>
<br />
<span class='desc lighter blend_links'>
Started by <a hovercard-ref="member" hovercard-id="5" class="_hovertrigger url fn " href='http://rootzwiki.com/user/5-birdman/'>birdman</a>, 17 Dec 2011
</span>
<ul class='mini_pagination'>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='Go to page 1'>1</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__10" title='Go to page 2'>2</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__20" title='Go to page 3'>3</a></li>
<li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__1990" title='Go to page 200'>200 →</a></li>
</ul>
</td>
<td class='col_f_preview __topic_preview'>
<a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/' class='expander closed' title='Preview this topic'> </a>
</td>
<td class='col_f_views desc blend_links'>
<ul>
<li>
<span class='ipsBadge ipsBadge_orange'>Hot</span>
1,999 replies
</li>
<li class='views desc'>180,213 views</li>
</ul>
</td>
<td class='col_f_post'>
<a href='http://rootzwiki.com/user/49940-jakeday/' class='ipsUserPhotoLink left'>
<img src='http://rootzwiki.com/uploads/profile/photo-thumb-49940.jpg' class='ipsUserPhoto ipsUserPhoto_mini' />
</a>
<ul class='last_post ipsType_small'>
<li><a hovercard-ref="member" hovercard-id="49940" class="_hovertrigger url fn " href='http://rootzwiki.com/user/49940-jakeday/'>jakeday</a></li>
<li>
<a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__view__getlastpost' title='Go to last post'>Today, 04:20 AM</a>
</li>
</ul>
</td>
I need to parse out birdman from there. I know that once I've defined the element, I can get "birdman" out with author.text();, but I can't figure out how to define the author element. I thought the following block of code might work, but as I mentioned, I'm pretty new to jsoup and HTML, and it obviously didn't. There's nothing wrong with the connection, and jsoup is working for the other values I parsed out.
TitleResults titleArray = new TitleResults();
Document doc = null;
try {
doc = Jsoup.connect(Constants.FORUM).get();
} catch (IOException e) {
e.printStackTrace();
}
Elements threads = doc.select(".topic_title");
for (Element thread : threads) {
titleArray = new TitleResults();
//Thread title
threadTitle = thread.text();
titleArray.setItemName(threadTitle);
//Thread link
String threadStr = thread.attr("abs:href");
String endTag = "/page__view__getnewpost"; //trim link
threadStr = new String(threadStr.replace(endTag, ""));
threadArray.add(threadStr);
titleArray.setAuthorDate("Author/Date");
results.add(titleArray);
}
Elements authors = doc.select("a[hovercard-ref]");
for (Element author : authors) {
if (author.attr("abs:href").contains("/user/")){
Log.d("POC", "SUCCESS " + author.attr("abs:href"));
} else {
Log.d("POC", "FAILURE " + author.text());
}
}
}

I think you're thinking too hard ;)
To get the birdman portion of the link, just use the following:
Elements authors = doc.select("a");
for (Element author : authors) {
Log.d("POC", author.text());
}
The "a" retrieves all links. After that you can just use the .text() like you said to retrieve the value.

Selvin answered it in the comments. I wasn't getting the page source correctly, and that was causing the errors.
http://pastebin.com/xfUQkGw0
