A page usually has many internal links. I want to parse an HTML file so that I get the headings of the page and their corresponding content in a map.
Steps I did:
1) Got all the internal reference elements.
2) Searched the document for id = XXX, where XXX is the target of an <a href="#XXX"> element.
3) That takes me to <span id="XXX">little text here </span> <some tags here too ><p> actual text here </p> <p> here too </p>
4) How do I go from the <span> to the <p> elements?
5) I tried going to the parent of the span, reasoning that one of its children is the <p> I want. That is true, but it also picks up the <p> elements belonging to the other internal links.
EDIT: added a sample portion of the HTML file:
<li class="toclevel-1 tocsection-1"><a href="#Enforcing_mutual_exclusion">
<span class="tocnumber">1</span> <span class="toctext">Enforcing mutual exclusion</span> </a><ul>
<li class="toclevel-2 tocsection-2"><a href="#Hardware_solutions">
<span class="tocnumber">1.1</span> <span class="toctext">Hardware solutions</span>
</a></li>
<li class="toclevel-2 tocsection-3"><a href="#Software_solutions">
<h2><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&amp;action=edit&amp;section=1" title="Edit section: Enforcing mutual exclusion">
edit</a>]</span> <span class="mw-headline" id="Enforcing_mutual_exclusion">
<!-- see the id above, "Enforcing_mutual_exclusion", which is the same as the first
internal link. Jsoup takes me to this span element. I want to access every <p> element
after this <span> tag and before another <span> tag whose id is any of the internal links. -->
Enforcing mutual exclusion</span></h2>
<p>There are both software and hardware solutions for enforcing mutual exclusion.
The different solutions are shown below.</p>
<h3><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&amp;action=edit&amp;section=2" title="Edit section: Hardware solutions">
edit</a>]</span> <span class="mw-headline" id="Hardware_solutions">Hardware
solutions</span></h3>
<p>On a <a href="/wiki/Uniprocessor" title="Uniprocessor" class="mw-
redirect">uniprocessor</a> system a common way to achieve mutual exclusion inside
kernels is
disable <a href="/wiki/Interrupt" title="Interrupt">
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public final class Website {
private URL websiteURL ;
private Document httpDoc ;
LinkedHashMap<String, ArrayList<String>> internalLinks =
new LinkedHashMap<String, ArrayList<String>>();
public Website(URL websiteURL) throws MalformedURLException, IOException{
if(websiteURL == null)
throw new IllegalArgumentException("websiteURL must not be null");
this.websiteURL = websiteURL;
httpDoc = Jsoup.parse(connect());
System.out.println("Parsed the http file to Document");
}
/* Here is my function: I first get all the internal links into internalLinksElements.
I then take the href of each <a> tag so that I can look up its id in the document.
*/
public void getDataWithHeadingsTogether(){
Elements internalLinksElements;
internalLinksElements = httpDoc.select("a[href^=#]");
for(Element element : internalLinksElements){
// some inline links were bad; I keep only those having a span as their child.
Elements spanElements = element.select("span");
if(!spanElements.isEmpty()){
System.out.println("Text(): " + element.text()); // this alone cannot give what I want
// the href value, minus the leading '#', is the id to look up
String href = element.attr("href") ;
href = href.replace("#", "");
System.out.println(href);
// selecting the element that has that id.
Element data = httpDoc.getElementById(href);
// got the span
if(data == null)
continue;
Elements children = new Elements();
// the problem is here.
while(children.isEmpty()){
// walking up the parents until we find some <p> data.
data = data.parent();
System.out.println(data);
children = data.select("p");
}
// this ends up giving me all the data in the file. That's bad.
System.out.println(children.text());
}
}
}
/**
*
* @return String the raw HTML of the page.
* @throws MalformedURLException
* @throws IOException
*/
@SuppressWarnings("CallToThreadDumpStack")
public String connect() throws MalformedURLException, IOException{
// Is this thread safe ? url.openStream();
BufferedReader reader = null;
try{
reader = new BufferedReader( new InputStreamReader(websiteURL.openStream()));
System.out.println("Got the reader");
} catch(Exception e){
e.printStackTrace();
System.out.println("Bye");
String html = "<html><h1>Heading 1</h1><body><h2>Heading 2</h2><p>hello</p></body></html>";
return html;
}
String inputLine;
StringBuilder result = new StringBuilder();
while((inputLine = reader.readLine()) != null){
result.append(inputLine);
}
reader.close();
System.out.println("Made the html file");
return result.toString();
}
/**
*
* @param argv all the command line parameters.
* @throws MalformedURLException
* @throws IOException
*/
public static void main(String[] argv) throws MalformedURLException, IOException, Exception{
System.setProperty("http.proxyHost", "172.16.0.3");
System.setProperty("http.proxyPort","8383");
System.out.println("Sending url");
// a html file or any url place here ------------------------------------
URL url = new URL("put a html file here ");
Website website = new Website(url);
System.out.println(url.toString());
System.out.println("++++++++++++++++++++++++++++++++++++++++++++++++");
website.getDataWithHeadingsTogether();
}
}
I think you need to understand that the <span>s that you are locating are children of header elements, and that the data you want to store is made up of siblings of that header.
Therefore, you need to grab the <span>'s parent, and then use nextSibling to collect nodes that are your data for that <span>. You need to stop collecting data when you run out of siblings, or you encounter another header element, because another header indicates the start of the next item's data.
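A minimal sketch of that sibling walk, assuming the Wikipedia-style markup shown above (each mw-headline span sits inside an h2/h3, and the section's <p> elements are siblings of that header):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class SectionWalk {

    // Map each heading id to the text of the <p> siblings before the next header.
    static LinkedHashMap<String, ArrayList<String>> sections(Document doc) {
        LinkedHashMap<String, ArrayList<String>> map = new LinkedHashMap<>();
        for (Element span : doc.select("span.mw-headline[id]")) {
            Element header = span.parent();              // the enclosing <h2>/<h3>
            ArrayList<String> paragraphs = new ArrayList<>();
            Node node = header.nextSibling();
            while (node != null) {
                if (node instanceof Element) {
                    Element el = (Element) node;
                    if (el.tagName().matches("h[1-6]")) break; // next section starts
                    if (el.tagName().equals("p")) paragraphs.add(el.text());
                }
                node = node.nextSibling();               // text nodes are skipped
            }
            map.put(span.id(), paragraphs);
        }
        return map;
    }

    public static void main(String[] args) {
        String html = "<h2><span class=\"mw-headline\" id=\"A\">Heading A</span></h2>"
                + "<p>first</p><p>second</p>"
                + "<h2><span class=\"mw-headline\" id=\"B\">Heading B</span></h2>"
                + "<p>third</p>";
        System.out.println(sections(Jsoup.parse(html)));
    }
}
```

For the snippet in main this prints {A=[first, second], B=[third]}: each heading id mapped to its own paragraphs only, which is the map the question asks for.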
Related
I have a homepage where we want to track whether a location is empty/free. The website is provided by an external source as a service.
I have already done the login on the homepage from my application, and when I inspect the whole Document doc, it shows everything, including what I want to have.
Right now I have tried to select the data I want (UID and status), but I don't know how to do it.
I have no idea what to choose in
Elements data = doc.select("a");
I have tried using div.tiles and div.tableauDeBoard, but neither worked.
Following is the expected output:
2652 free
2653 free
and so on.
I hope I did everything right; it's my first time posting here.
String URL = "..." // URL omitted to shorten the code
Document doc = Jsoup.connect(URL).get();
Elements data = doc.select("a");
System.out.println(data.outerHTML());
<div id="tableauDeBoard" class="porlet-body" style="min-height: 40px;">
<div class="tiles">
<a href="..."
class="tile-v2 undefined popovers" data-content="Status : free<hr>time : 4h22<hr>last change : 05.11.2019 at 18:46<hr>UID : 2652<hr> Typ : P<hr>connection OK" data-html="true" data-placement="auto" data-container="body" data-trigger="hover" data-original-title="" title="">
<div class="tile-code-ville"></div>
<div class="tile-id-automate">2652</div>
<div class="tile-infos"><span class="tile-icon-transmission">
<img src="....png">
</span><span class="tile-icon-jauge"></span></div></a>
The same markup repeats seven times with different UID/time/status values.
</div>
You can try the following approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TestJsoup{
public static void main(String[] args) throws InterruptedException {
StringBuilder html = new StringBuilder();
html.append("<div id=\"tableauDeBoard\" class=\"porlet-body\" style=\"min-height: 40px;\">");
html.append(" <div class=\"tiles\">");
html.append(" <a href=\"...\"");
html.append(
"class=\"tile-v2 undefined popovers\" data-content=\"Status : free<hr>time : 4h22<hr>last change : 05.11.2019 at 18:46<hr>UID : 2652<hr> Typ : P<hr>connection OK\" data-html=\"true\" data-placement=\"auto\" data-container=\"body\" data-trigger=\"hover\" data-original-title=\"\" title=\"\">\r\n"
+ "");
html.append("<div class=\"tile-code-ville\"></div>");
html.append("<div class=\"tile-id-automate\">2652</div>\r\n");
html.append(" <div class=\"tile-infos\"><span class=\"tile-icon-transmission\">");
html.append("<img src=\"....png\">\r\n");
html.append("</span><span class=\"tile-icon-jauge\"></span></div></a>\r\n");
html.append("document.write('<style type='text/css'>div,iframe { top: 0; position:absolute; }</style>');");
html.append("</div>");
html.append("</head><body></body> </html>");
Document doc = Jsoup.parse(html.toString());
Elements allClassElements = doc.getElementsByClass("tiles"); //fetching the elements of the class "tiles"
for (Element ele : allClassElements) {
Elements links = ele.getElementsByTag("a"); // Finding the anchor tag which contains the required data
for (Element link : links) {
String str = link.attr("data-content"); // to get the status value
//Without Regex
String oo = str.substring(str.indexOf(":") + 1, str.indexOf("<hr>"));
System.out.println(link.text() + " " + oo.replaceAll("\\s+", "")); //link.text contains id value
// Using Regex
Pattern p = Pattern.compile("\\:.*?\\<");
Matcher m = p.matcher(str);
if (m.find())
System.out.println(link.text() + " "
+ m.group().subSequence(1, m.group().length() - 1).toString().replaceAll("\\s+", ""));
}
}
}
}
Console Output:
2652 free
2652 free
For more details on extracting data with Jsoup, see the Jsoup cookbook at
https://jsoup.org/cookbook/extracting-data/attributes-text-html
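Since the data-content attribute is really just "key : value" pairs separated by <hr>, another option (a sketch, independent of Jsoup; the field names come from the sample markup above) is to split it into a map once and read any field from it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DataContentSplit {

    // Turn "Status : free<hr>time : 4h22<hr>..." into an ordered key/value map.
    static Map<String, String> parse(String dataContent) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : dataContent.split("<hr>")) {
            String[] kv = part.split(":", 2);  // split on the first colon only
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String dc = "Status : free<hr>time : 4h22<hr>last change : 05.11.2019 at 18:46"
                + "<hr>UID : 2652<hr> Typ : P<hr>connection OK";
        Map<String, String> f = parse(dc);
        System.out.println(f.get("UID") + " " + f.get("Status")); // prints: 2652 free
    }
}
```

Splitting on the first colon only matters for values like "05.11.2019 at 18:46" that contain colons themselves; parts without a colon ("connection OK") are simply skipped.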
Simply, this is what I am trying to do (I want to use Jsoup):
pass only one URL to parse
search for date(s) mentioned inside the contents of the web page
extract at least one date from each page's contents
convert that date into a standard format
So, for point #1, here is what I have now:
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
Now here I want to understand what kind of object Document is. Is it already parsed from the HTML, or does it hold the page in some other format?
Then, for point #2, here is what I have now:
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
Here I am trying to match a date regex against the page to find dates and store one in a string for later use (point #3), but I am sure I am nowhere near it; I need help here.
I have done point #4.
So please, can anyone help me understand this and point me in the right direction for achieving the 4 points mentioned above?
Thanks in advance!
Updated:
Here is how I want it:
public static void main(String[] args){
try {
// using USER AGENT for giving information to the server that I am a browser not a bot
final String USER_AGENT =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
// My only one url which I want to parse
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
// Creating a jsoup.Connection to connect the url with USER AGENT
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
// retrieving the parsed document
Document htmlDocument = connection.get();
/* Up to this point, I have a parsed document of the URL's page. Is it in plain-text format?
* If not, in which type or format is it stored in the variable 'htmlDocument'?
* */
/* Now, If 'htmlDocument' holds the text format of the web page
* Why do i need elements to find dates, because dates can be normal text in a web page,
* So, how I am going to find an element tag for that?
* As an example, If i wanted to collect text from <p> paragraph tag,
* I would use this :
*/
// I am not sure whether this is correct or not
//***************************************************/
Elements paragraph = htmlDocument.getElementsByTag("p");
for(Element src: paragraph){
System.out.println("text: " + src.text());
}
//***************************************************//
/* But I do not want to find any elements to gather the dates on the page;
* I just want to search the whole text of the document for dates.
* So, I need a regex-formatted date string that will be passed as input to a search method.
* This search mechanism should run on the text of the page as parsed into 'htmlDocument'.
*/
// At the end we will use only one date from our search result and format it in a standard form
/*
* That is it.
*/
/*
* I was trying something like this
*/
//final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = htmlDocument.getElementsMatchingOwnText(p);
for(Element e: elements){
System.out.println("element = [" + e + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
}
Here is one possible solution I found:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
* Created by ruben.alfarodiaz on 21/12/2016.
*/
@RunWith(JUnit4.class)
public class StackTest {
@Test
public void findDates() {
final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
try {
String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
//with this pattern we can find all dates with regex dd/mm/yyyy; to cover extra formats we would create N more patterns
Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
//Here we find all document elements that contain an element matching the searched pattern
Elements elements = htmlDocument.getElementsMatchingText(pattern);
//in this loop we filter the original elements to keep only the leaf elements
List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
finalElements.stream().forEach(elem ->
System.out.println("Node: " + elem.html())
);
}catch(Exception ex){
ex.printStackTrace();
}
}
//Method to decide whether the current element is a leaf or contains other dates inside
private boolean isLastElem(Element elem, Pattern pattern) {
return elem.getElementsMatchingText(pattern).size() <= 1;
}
}
The point is that you should add as many patterns as needed, because I think it would be complex to find a single pattern that matches all possibilities.
Edit: Most importantly, the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:
<html>
<body>
<div>
20/11/2017
</div>
</body>
</html>
If we search for the pattern dd/mm/yyyy, the library will return 3 elements:
html, body and div, but we are only interested in div.
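That ancestor behaviour is easy to verify offline; note how getElementsMatchingOwnText (used in the question) already restricts the match to the element that directly owns the text:

```java
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MatchingTextDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body><div>20/11/2017</div></body></html>");
        Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

        // Matches html, body and div: each element's combined text contains the date.
        Elements all = doc.getElementsMatchingText(pattern);
        // Matches only div: the one element that directly owns the text.
        Elements own = doc.getElementsMatchingOwnText(pattern);

        System.out.println(all.size() + " vs " + own.size()); // prints: 3 vs 1
    }
}
```

So the leaf-filtering step above is only needed with getElementsMatchingText; switching to getElementsMatchingOwnText sidesteps it for simple cases.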
I am using Jsoup to parse an HTML document. I need the value of the <p> tag immediately after the <span> tag that contains an id attribute.
I am trying the following code:
Elements spanList = body.select("span");
if (spanList != null) {
for (Element element1 : spanList) {
if (element1.attr("id").contains("midArticle")) {
Element element = element1.after("<p>"); // This line is wrong: after() inserts markup, it does not navigate
if (element != null) {
String text = element.text();
if (text != null && !text.isEmpty()) {
out.println(text);
}
}
}
}
}
The html sample code
<span id="midArticle_9"></span>
<p>"The Director owes it to the American people to immediately provide the full details of what he is now examining," Podesta said in a statement. "We are confident this will not produce any conclusions different from the one the FBI reached in July." </p>
<span id="midArticle_10"></span>
<p>Clinton has repeatedly apologized for using the private email server in her home instead of a government email account for her work as secretary of state from 2009 to 2013. She has said she did not knowingly send or receive classified information.</p>
I hope this resolves your issue:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public static void main(String[] args) {
String html = "<span id=\"midArticle_9\"></span><p>\"The Director owes it to the American people to immediately provide the full details of what he is now examining,\" Podesta said in a statement. \"We are confident this will not produce any conclusions different from the one the FBI reached in July.\" </p><span id=\"midArticle_10\"></span><p>Clinton has repeatedly apologized for using the private email server in her home instead of a government email account for her work as secretary of state from 2009 to 2013. She has said she did not knowingly send or receive classified information.</p>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("span");
for (Element element : elements) {
Element next = element.nextElementSibling();
if (next != null) {
System.out.println(next.text());
}
}
}
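If you prefer a single selector over sibling navigation, Jsoup also supports the CSS adjacent-sibling combinator (E + F), so the paragraphs can be selected directly; this sketch reuses the midArticle id prefix from the sample above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AdjacentSelect {
    public static void main(String[] args) {
        String html = "<span id=\"midArticle_9\"></span><p>first paragraph</p>"
                + "<span id=\"midArticle_10\"></span><p>second paragraph</p>";
        Document doc = Jsoup.parse(html);
        // every <p> that immediately follows a span whose id starts with "midArticle"
        for (Element p : doc.select("span[id^=midArticle] + p")) {
            System.out.println(p.text());
        }
    }
}
```

This also avoids the null check: spans without a following <p> simply don't match the selector.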
The row contains
row number -- name surname -- instructor name-- E
</tr>
<tr height=20 style='height:15.0pt'>
<td height=20 class=xl6429100 align=right width=28 style='height:15.0pt;
border-top:none;width:21pt'>row number</td>
<td class=xl8629100 width=19 style='border-top:none;border-left:none;
width:14pt'> </td>
<td class=xl6529100 width=137 style='border-top:none;border-left:none;
width:103pt'>name</td>
<td class=xl6529100 width=92 style='border-top:none;border-left:none;
width:69pt'>surname</td>
<td class=xl7929100 style='border-top:none;border-left:none'>instructor name</td>
<td class=xl8129100 style='border-top:none'>grade</td>
I want to retrieve only one row from this HTML file to check my own grade. I can get the source of the HTML using Java, but how can I reach the row that I want? I will find the surname first. Within this part of the table, how can I reach the grade column?
Here is my code:
import java.net.*;
import java.io.*;
public class staj {
public static void main(String[] args) throws Exception {
URL staj = new URL("http://www.cs.bilkent.edu.tr/~sekreter/SummerTraining/2014G/CS399.htm");
BufferedReader in = new BufferedReader(new InputStreamReader(staj.openStream()));
String inputLine;
String grade;
String mysurname = "..."; // my surname goes here
while ((inputLine = in.readLine()) != null){
if(inputLine.contains(mysurname)){
//grade = WHAT?
}
}
in.close();
}
}
Also, is Java efficient and appropriate for this aim? Which language would be better?
You should definitely use the Jsoup library to extract what you need from an HTML document: http://jsoup.org/
I've created a sample code that demonstrates an example of extracting data from the table you provided in a description: https://gist.github.com/wololock/15f511fd9d7da9770f1d
public static void main(String[] args) throws IOException {
String url = "http://www.cs.bilkent.edu.tr/~sekreter/SummerTraining/2014G/CS399.htm";
String username = "Samet";
Document document = Jsoup.connect(url).get();
Elements rows = document.select("tr:contains("+username+")");
for (Element row : rows) {
System.out.println("---------------");
System.out.printf("No: %s\n", row.select("td:eq(0)").text());
System.out.printf("Evaluator: %s\n", row.select("td:eq(4)").text());
System.out.printf("Status: %s\n", row.select("td:eq(5)").text());
}
}
Take a look on this:
document.select("tr:contains("+username+")");
Jsoup allows you to use jQuery-like methods and selectors to extract data from HTML documents. This example selector extracts only those tr elements that contain the given username in nested elements. Once you have the list of those rows, you can simply extract the data. Here we use:
row.select("td:eq(n)")
where :eq(n) means select the n-th (0-based) td element nested in the tr. Here is the output:
---------------
No: 85
Evaluator: Buğra Gedik
Status: E
---------------
No: 105
Evaluator: Çiğdem Gündüz Demir
Status: E
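The same selectors can be exercised offline against a stripped-down table (a sketch; the real page's markup is messier, but the column positions match the row layout described above):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RowSelect {
    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>85</td><td></td><td>Samet</td><td>Surname</td><td>Instructor</td><td>E</td></tr>"
                + "<tr><td>86</td><td></td><td>Other</td><td>Person</td><td>Someone</td><td>F</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        for (Element row : doc.select("tr:contains(Samet)")) {
            // :eq(n) picks the td at 0-based sibling index n
            System.out.println(row.select("td:eq(0)").text() + " -> " + row.select("td:eq(5)").text());
        }
    }
}
```

Only the first row matches :contains(Samet), so this prints the row number and grade for that one student.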
I'm writing an application in Java using import org.jdom.*;
My XML is valid, but sometimes it contains HTML tags. For example, something like this:
<program-title>Anatomy &amp; Physiology</program-title>
<overview>
<content>
For more info click here
<p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies. Online studies options are available.</p>
</content>
</overview>
<key-information>
<category>Health &amp; Human Services</category>
So my problem is with the <p> tags inside the overview.content node.
I was hoping that this code would work :
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
System.out.println(content.getText());
but it returns blank.
How do I return all the text ( nested tags and all ) from the overview.content node ?
Thanks
content.getText() gives the immediate text, which is only useful with leaf elements that have text content.
The trick is to use org.jdom.output.XMLOutputter (with text mode CompactFormat):
import java.io.StringWriter;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;
public static void main(String[] args) throws Exception {
SAXBuilder builder = new SAXBuilder();
String xmlFileName = "a.xml";
Document doc = builder.build(xmlFileName);
Element root = doc.getRootElement();
Element overview = root.getChild("overview");
Element content = overview.getChild("content");
XMLOutputter outp = new XMLOutputter();
outp.setFormat(Format.getCompactFormat());
//outp.setFormat(Format.getRawFormat());
//outp.setFormat(Format.getPrettyFormat());
//outp.getFormat().setTextMode(Format.TextMode.PRESERVE);
StringWriter sw = new StringWriter();
outp.output(content.getContent(), sw);
StringBuffer sb = sw.getBuffer();
System.out.println(sb.toString());
}
Output
For more info click here<p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies. Online studies options are available.</p>
Do explore other formatting options and modify above code to your need.
"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization). "
You could try using the method getValue() for the closest approximation, but what this does is concatenate all text within the element and its descendants. This won't give you the <p> tag in any form: if that tag appears in your XML as shown, it has become part of the XML markup. It would need to be escaped as &lt;p&gt; or embedded in a CDATA section to be treated as text.
Alternatively, if you know all the elements that may or may not appear in your XML, you could apply an XSLT transformation that turns anything not intended as markup into plain text.
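A quick self-contained sketch of that getValue() behaviour, using a trimmed copy of the XML above; note how the <p> markup itself disappears and only its text survives:

```java
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class GetValueDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<overview><content>For more info click here"
                + "<p>Learn more about the human body.</p></content></overview>";
        Document doc = new SAXBuilder().build(new StringReader(xml));
        Element content = doc.getRootElement().getChild("content");
        // getValue() concatenates all descendant text, dropping the markup
        System.out.println(content.getValue());
    }
}
```

This prints the two text runs joined together with no tags between them, which is why getValue() is only an approximation of what the question asks for.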
Well, maybe that's what you need:
import java.io.StringReader;
import java.util.List;
import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.Content;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;
public class HowToGetNodeContentsJDOM extends XMLTestCase
{
private static final String XML = "<root>\n" +
" <program-title>Anatomy &amp; Physiology</program-title>\n" +
" <overview>\n" +
" <content>\n" +
" For more info click here\n" +
" <p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies. Online studies options are available.</p>\n" +
" </content>\n" +
" </overview>\n" +
" <key-information>\n" +
" <category>Health &amp; Human Services</category>\n" +
" </key-information>\n" +
"</root>";
private static final String EXPECTED = "For more info click here\n" +
"<p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies. Online studies options are available.</p>";
@Test
public void test() throws Exception
{
XMLUnit.setIgnoreWhitespace(true);
Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
String out = new XMLOutputter().outputString(content);
assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
}
}
Output:
PASSED: test on instance null(HowToGetNodeContentsJDOM)
===============================================
Default test
Tests run: 1, Failures: 0, Skips: 0
===============================================
I am using JDOM with generics: http://www.junlu.com/list/25/883674.html
Edit: Actually that's not much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...
If you're also generating the XML file, you should be able to encapsulate your HTML data in <![CDATA[ ... ]]> so that it isn't parsed by the XML parser.
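For example, if the generator wraps the HTML in a CDATA section like this (a sketch of the generated XML, reusing the sample content from the question):

```xml
<overview>
  <content><![CDATA[For more info click here
<p>Learn more about the human body.</p>]]></content>
</overview>
```

With that in place, content.getText() returns the whole string, <p> tags included, because the parser treats everything inside the CDATA section as plain character data.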
The problem is that not all of the text is a direct child of the <content> node; part of it sits inside a <p> child that happens to contain text.
Try this:
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());
If you want all the immediate child nodes, call p.getChildren(). If you want ALL the descendant nodes, you'll have to call it recursively.
Not particularly pretty, but it works fine (using the JDOM API):
public static String getRawText(Element element) {
if (element.getContent().size() == 0) {
return "";
}
StringBuffer text = new StringBuffer();
for (int i = 0; i < element.getContent().size(); i++) {
final Object obj = element.getContent().get(i);
if (obj instanceof Text) {
text.append( ((Text) obj).getText() );
} else if (obj instanceof Element) {
Element e = (Element) obj;
text.append( "<" ).append( e.getName() );
// dump all attributes
for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
}
text.append(">");
text.append( getRawText( e )).append("</").append(e.getName()).append(">");
}
}
return text.toString();
}
Prashant Bhate's solution is nicer, though!
If you want to output the content of some JDOM node, just use
System.out.println(new XMLOutputter().outputString(node))