JSoup parsing a text file containing a html table with Java - java

I am really unsure how I can get the information I need to place into a database, the code below just prints the whole file.
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
System.out.println(doc.toString());
My HTML is here from line 61 and I am needing to get the items under the column headings but also grab the MMSI number which is not under a column heading but in the href tag. I haven't used JSoup other than to get the HTML from the web page. I can only really see tutorials to use php and I'd rather not use it.

To get those information, the best way is to use Jsoup's selector API. Using selectors, your code will look something like this (pseudeocode!):
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements matches = doc.select("<your selector here>");
for( Element element : matches )
{
// do something with found elements
}
There's a good documentation available here: Use selector-syntax to find elements. If you get stuck nevertheless, please describe your problem.
Here are some hints for that selector, you can use:
// Select the table with class 'shipinfo'
Elements tables = doc.select("table.shipinfo");
// Iterate over all tables found (since it's only one, you can use first() instead
for( Element element : tables )
{
// Select all 'td' tags of that table
Elements tdTags = element.select("td");
// Iterate over all 'td' tags found
for( Element td : tdTags )
{
// Print it's text if not empty
final String text = td.text();
if( text.isEmpty() == false )
{
System.out.println(td.text());
}
}
}

Related

How to get data from a URL in new lines using JSOUP?

I'm scrapping IMDB chart of 250 movies. I want to store each movie name in an array, but I don't know why it puts all the movie names into the first index, i.e Array[0].
Below is my code.
Can anyone please help me out. I've to complete another project and this is the main thing that is needed.
If you can direct me any website or tutorial I'll be very thankful to you.
try {
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").userAgent("Mozilla").get();
int counterVariable = 0;
for (Element el : doc.select(".lister-list")) {
mString[counterVariable] = el.select(".titleColumn").text();
totalNumberOfLines++;
counterVariable++;
}
} catch (Exception e) {
System.out.println("Sorry website couldn't be opened");
System.out.println(e);
}
System.out.println(mString[0]);// It's putting all the names into this index
The problem is that you have only one element matching selector .lister-list, so iterating over it does not make much sense. When you call el.select(".titleColumn").text(); Jsoup concatenates text from all matching elements. This is why you get all results in one element. Instead you can try to select all td tags with class tittleColumn that are children of tr element that are child of .lister-list
for (Element el : doc.select(".lister-list > tr > td.titleColumn")) {
mString[counterVariable] = el.text();
totalNumberOfLines++;
counterVariable++;
}
More about jsoup css selectors you can learn here.

JSoup - How to parse nested texts?

I'm parsing html of a website with JSoup. I want to parse this part:
<td class="lastpost">
This is a text 1<br>
Website Page - 1
</td>
I want like this:
String text = "This is a text 1";
String textNo = "Website Page - 1";
String link = "post/13594";
How can I get the parts like this?
Your code would only get all the text that is in the td elements that you are selecting. If you want to store the text in separate variables, you should grab the parts separately like the following code. Extra comments added so you can understand how/why it is getting each piece.
// Get the first td element that has class="lastpost"
Element lastPost = document.select("td.lastpost").first();
// Get the first a element that is a child of the td
Element linkElement = lastPost.getElementsByTag("a").first();
// This text is the first child node of td, get that node and call toString
String text = lastPost.childNode(0).toString();
// This is the text within the a (link) element
String textNo = linkElement.text();
// This text is the href attribute value of the a (link) element
String link = linkElement.attr("href");

JSOUP Element.html("<th>test</th>") ignore th tags

I work on a html templating engine based on jsoup.
JSOUP ignore th and td flags if element is not inside table;
To deal with this, I change parser to :
final Document docToWrite = Jsoup.parse(docToRead.outerHtml(),"", Parser.xmlParser());
But I didn't find any solution to fill an Element with html that contain a td or a th:
element.html("<th>test</th>");
return only test, because JSOUP is cleaning html by removing unused tags
How can I solve this?
Thank you
If you element is 'th', then calling:
element.html("<th>test</th>") // th.innerHTML = "<th>test</th>"
should produce dirty html:
<th><th>test</th></th>
which is correctly cleared up by JSoup to:
<th>test</th> // th.innerHTML == "test"
To fill element with innerHTML == "<th>test</th>" your element has to be a <tr> tag.
// Given
String s = "<th>test</th>";
assert element.tag() == "tr";
// When
element.html(s);
// Then
assert element.html().equals(s);

Sanitize HTML string

I have an HTML sting like:
<p dir="ltr"><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>bold</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>all</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>in</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u>one</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></p>
I want to sanitize the html like <b><i><u> bold all in one </b></i></u>
I tried this method: webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
But it is of no use. The html remains clumsy. What should I do to mend the same? Is there any other Regex or JSON way?
But it is of no use. The html remains clumsy. What should I do to mend
the same? Is there any other Regex or JSON way?
Regex may help here, but in general they serve not very well as Html parser if things get complex. Jsoup is a great Html library, and i really can recommend it.
Unfortunately your html is still valid html, so the solution is tricky.
Best you start with the Jsoup documentation, especially the one of it's Selector syntax.
Here's something for starting:
final String html = ... // your html from above
// Parse the html string into a document
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
/*
* Select all elements, which ...
*
* (a) have a text (= at least not empty)
* (b) has no childs it's own
*
* Iterate over those found and print them.
*/
for( Element element : doc.select("*:matches(^..+?$):not(:has(*))") )
{
System.out.println(element);
}
Result:
<u>bold</u>
<u>all</u>
<u>in</u>
<u>one</u>
If you need literally <b><i><u> bold all in one </b></i></u>:
final String html = ... // your html from above
// As above
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// All text of the document
String text = doc.text();
// Create an element and it's childs
Element element = new Element(Tag.valueOf("b"), "");
element.appendElement("i").appendElement("u").text(text);
System.out.println(element);
Result:
<b><i><u>bold all in one</u></i></b>
You could try below method to remove unwanted html tags:
public String stripHtml(String html)
{
return Html.fromHtml(html).toString();
}

How to extract full URLs from all paragraphs in a webpage using jsoup

How do I extract full URL's from all paragraphs on a web page using jsoup? I am able to extract only the relative URL's.
Expected:
http://fr.wikipedia.org/wiki/Husni_al-Zaim
Actual: /Husni_al-Zaim
My Code:
Elements links = doc.select("p");
Elements linkss = links.select("a");
for (Element link : linkss) {
if (link.text().matches("^[A-Z].+") == true) {
list.add(new NamedLink(link.attr("href"), link.text()));
}
}
Use .absUrl("href") instead of .attr("href"). This only works when you get the document from a webpage or parse the full file from disk (and thus do not massage portions from HTML to text and back as in your example).
Document document = Jsoup.connect("http://stackoverflow.com").get();
Elements paragraphLinks = document.select("p a");
for (Element paragraphLink : paragraphLinks) {
String absUrl = paragraphLink.absUrl("href");
// ...
}

Categories