Search Function in HTML - java

How can I search for text in an HTMLDocument and get the start and end index of the matched word or sentence, ignoring tags while searching?
Searching: stackoverflow
html: <p class="red">stack<b>overflow</b></p>
This should return indices 15 and 31,
just like a browser does when you search within a web page.

If you want to do that in Java, here is a rough example using Jsoup. It only handles this exact markup; you would need to fill in the details so that it works for arbitrary HTML.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";
String search = "stackoverflow";
Document doc = Jsoup.parse(html);
Element p = doc.body().getElementsByTag("p").first();
String pPlainText = p.text(); // "stackoverflow"
// contains() is a literal match; matches() would treat its argument as a regex
if (pPlainText.contains(search)) {
    System.out.println("text found in html");
    String pElementString = doc.body().html(); // <p class="red">stack<b>overflow</b></p>
    String firstWord = p.ownText(); // "stack"
    String secondWord = p.children().first().ownText(); // "overflow"
    // search the text in pElementString
    int start = pElementString.indexOf(firstWord); // 15
    int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
    System.out.println(start + " >> " + end);
} else {
    System.out.println("cannot find searched text");
}
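For comparison, here is a minimal sketch of the general idea without Jsoup: walk the raw HTML once, skip everything between `<` and `>`, and remember which raw offset each visible character came from. The class and method names (`TagAwareSearch`, `find`) are made up for illustration, and the sketch assumes simple markup: no entities, comments or scripts, and no `>` inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

class TagAwareSearch {

    /**
     * Returns {start, end} offsets into the raw HTML for the first match of
     * query in the visible text, or null if there is no match.
     * start is inclusive, end is exclusive.
     */
    static int[] find(String html, String query) {
        if (query.isEmpty()) {
            return null;
        }
        StringBuilder visible = new StringBuilder();
        List<Integer> sourceIndex = new ArrayList<>(); // visible char -> raw offset
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // assumption: '<' always opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                visible.append(c);
                sourceIndex.add(i);
            }
        }
        int hit = visible.indexOf(query);
        if (hit < 0) {
            return null;
        }
        int start = sourceIndex.get(hit);
        int end = sourceIndex.get(hit + query.length() - 1) + 1;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int[] range = find("<p class=\"red\">stack<b>overflow</b></p>", "stackoverflow");
        System.out.println(range[0] + " >> " + range[1]); // 15 >> 31
    }
}
```

Because the offsets come from a per-character map rather than from searching for individual words, the match can span any number of tags. For real-world pages a parser is still the safer route.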

Related

Parsing html in Jsoup

I am trying to parse HTML tags using Jsoup, and I am new to it. Basically I need to get the text inside each tag and apply the style named in its class attribute.
I am building a SpannableStringBuilder so that I can create substrings, apply styles to them, and append them together with the text that has no style.
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
SpannableStringBuilder text = new SpannableStringBuilder();
if (value.contains("</span>")) {
Document document = Jsoup.parse(value);
Elements elements = document.getElementsByTag("span");
if (elements != null) {
int i = 0;
int start = 0;
for (Element ele : elements) {
String styleName = type + "." + ele.attr("class");
text.append(ele.text());
int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
text.append(ele.nextSibling().toString());
start = text.length();
i++;
}
}
return text;
}
I am not sure how to get the strings that are not inside any tag, such as "There are" and "workers from the".
I need output such as:
- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>
You can get the text outside the tags by calling childNodes(), which returns a List<Node>. Note that I'm selecting body because your HTML fragment doesn't have a parent element, and Jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is a TextNode and you can get its content using toString().
Otherwise you can cast it to Element and get the text using element.text().
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
    Node node = childNodes.get(i);
    if (node instanceof TextNode) {
        // plain text between the tags
        System.out.println(i + " -> " + node.toString());
    } else {
        // a <span>: print only its text content
        Element element = (Element) node;
        System.out.println(i + " -> " + element.text());
    }
}
output:
0 ->
There are
1 -> two
2 -> workers from the
3 -> Front of House
By the way: I don't know how to get rid of the first line break before There are.
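If pulling in a parser is not an option, the same alternating text/span split can be roughly sketched with a regular expression. This is a swapped-in technique, not what Jsoup does internally, and it assumes flat, non-nested spans written exactly as in the question (single-quoted class attribute, nothing else on the tag). The names `SpanSegments` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanSegments {

    // Matches one flat <span class='...'>...</span>; nested spans are not handled.
    private static final Pattern SPAN =
            Pattern.compile("<span class='([^']*)'>(.*?)</span>");

    /** Splits the input into {styleClassOrNull, text} pairs in document order. */
    static List<String[]> split(String html) {
        List<String[]> segments = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                // unstyled text between the previous span and this one
                segments.add(new String[] {null, html.substring(last, m.start())});
            }
            segments.add(new String[] {m.group(1), m.group(2)});
            last = m.end();
        }
        if (last < html.length()) {
            // trailing unstyled text after the final span
            segments.add(new String[] {null, html.substring(last)});
        }
        return segments;
    }

    public static void main(String[] args) {
        String str = "There are <span class='newStyle'> two </span> workers from the "
                + "<span class='oldStyle'>Front of House</span>";
        for (String[] segment : split(str)) {
            System.out.println(segment[0] + " -> [" + segment[1] + "]");
        }
    }
}
```

Each pair carries the class name (or null for unstyled text), which is the information needed to decide whether to call setSpan on the appended range.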

Get text snippets from google results

I want to extract the snippets from the Google results. I'm using the following code, which parses the Google results page:
Scanner scanner = new Scanner(System.in);
System.out.println("Please enter the search term.");
String searchTerm = scanner.nextLine();
System.out.println("Please enter the number of results. Example: 5 10 20");
int num = scanner.nextInt();
scanner.close();
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num;
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
Elements results = doc.select("//div//div//span[contains(@class, 'st')]/text()");
for (Element result : results) {
String linkText = result.text();
System.out.println("Text::" + linkText);
}
It extracts the result URL and the caption. The problem is that the snippets are in HTML tags at a "lower level", like in the attached image.
So how can I extract them?
With an XPath query:
'//em[.="Stack Overflow"]/following-sibling::text()'
or
'//em[text()="Stack Overflow"]/following-sibling::text()'
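To see what the following-sibling::text() step selects, here is a small sketch using the JDK's built-in javax.xml.xpath API. It runs against a hand-written, well-formed fragment that stands in for one result; real result pages are generally not well-formed XML, so this only illustrates the expression. The class name `SnippetXPath`, the method `textAfterEm` and the sample markup are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SnippetXPath {

    /** Concatenates the text nodes that follow <em>emText</em> within the same parent. */
    static String textAfterEm(String xml, String emText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // emText is pasted into the expression, so it must not contain a double quote.
            String expr = "//em[.=\"" + emText + "\"]/following-sibling::text()";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, well-formed fragment standing in for one search result.
        String xml = "<span class=\"st\"><em>Stack Overflow</em> is a question and answer site</span>";
        System.out.println(textAfterEm(xml, "Stack Overflow"));
    }
}
```

The expression returns only text nodes that are siblings of the matched em, so text nested inside further child elements would need an extra step.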

JSOUP parsing for multiple rows

I am trying to parse information from a particular website using Jsoup.
So far I can parse and display a single row. The website has a lot of HTML and I am quite new to this, so I was wondering whether there is a way to parse all table rows on the page whose id contains the word "fixturerow".
Here is my parser code:
Document doc = Jsoup.connect("http://www.irishrugby.ie/club/ulsterbankleagueandcup/fixtures.php").get();
Elements kelime = doc.select("tr#fixturerow0");
for(Element sectd:kelime){
Elements tds = sectd.select("td");
String result = tds.get(0).text();
String result1 = tds.get(1).text();
String result2 = tds.get(2).text();
String result3 = tds.get(3).text();
String result4 = tds.get(4).text();
String result5 = tds.get(5).text();
String result6 = tds.get(6).text();
String result7 = tds.get(7).text();
System.out.println("Date: " + result);
System.out.println("Time: " + result1);
System.out.println("League: " + result2);
System.out.println("Home Team: " + result3);
System.out.println("Score: " + result4);
System.out.println("Away Team: " + result5);
System.out.println("Venue: " + result6);
System.out.println("Ref: " + result7);
}
Thanks for your time!
You can use the ^= (starts-with) selector:
Elements kelime = doc.select("tr[id^=fixturerow]");
This will return all elements with an id that starts with fixturerow.
You may have better luck if you use a selector that looks for ids that start with the text of interest. So try changing
Elements kelime = doc.select("tr#fixturerow0");
to
Elements kelime = doc.select("tr[id^=fixturerow]");
where ^= means that the attribute value (here the id) starts with the text that follows.
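The same prefix match can also be written in XPath as starts-with(@id, 'fixturerow'). The sketch below is an analogue using the JDK's javax.xml.xpath API, not Jsoup's selector engine, and it runs on a made-up, well-formed table. The names `FixtureRows` and `idsStartingWith` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FixtureRows {

    /** Returns the id of every <tr> whose id attribute starts with prefix. */
    static List<String> idsStartingWith(String xml, String prefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // prefix is pasted into the expression, so it must not contain a single quote.
            String expr = "//tr[starts-with(@id, '" + prefix + "')]";
            NodeList rows = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                ids.add(((Element) rows.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical table: one header row and two fixture rows.
        String xml = "<table>"
                + "<tr id=\"header\"><td>Date</td></tr>"
                + "<tr id=\"fixturerow0\"><td>Sat</td></tr>"
                + "<tr id=\"fixturerow1\"><td>Sun</td></tr>"
                + "</table>";
        System.out.println(idsStartingWith(xml, "fixturerow"));
    }
}
```

In both notations the header row is skipped because its id does not begin with the prefix, while every numbered fixture row is matched.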

How to use JSoup to get hyperlink href?

I have the following jsFiddle
http://jsfiddle.net/B5zvV/
I am trying to use JSoup to obtain the value of the hyperlink's href string on Line 238:
<a href="/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450">
Hence, the desired result would be to obtain a String with a value of:
/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
Here's my code:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Elements elems = doc.getElementsByAttributeValueContaining("href", "repositoryId");
When I run this, the value of elems is empty: why, and what do I need to do to get the desired String?
The getElementsByAttributeValueContaining() method will return multiple elements in this case, because many hrefs contain repositoryId. If you specifically want the link on line 238, that a element is enclosed in an li with the classes item and item-default. There is only one such li, and it has two a tags inside it. Just take the first one, like this:
String html = "<li class=\"item item-default\" data-item-id=\"28049450\" id=\"item-28049450\">"
+ "<a href=\"/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450\">"
+ "<h3 class=\"item-title\">MCAppRepo <span class=\"item-default-marker grey\">(default)</span></h3>"
+ "</a>"
+ "<a href=\"/chain/admin/config/confirmDeleteRepository.action?planKey=AB-CSD&repositoryId=28049450\" class=\"delete\" title=\"Remove repository\">"
+ "<span class=\"assistive\">Delete</span>"
+ "</a>"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.item.item-default > a");
System.out.println(elems.first().attr("href"));

how to remove anchor tag and make it text

String k= <html>
<a target="_blank" href="http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&amp;id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&amp;aa=">number S.O.I93(E), dated the 30th March, 1999
</html>
I get this HTML as a String and I want to remove the anchor tag so that the content is no longer a link.
I just want to display it as text, not as a link.
How can I do this? I have tried many things without success, so example code would help.
I am building an Android app, and the problem shows up in a WebView.
Use Jsoup and Jsoup.parse().
You can use the following example (I don't remember where I found it, but it works), which uses a replace method to modify the string before showing it:
k = replace ( k, "<a target=\"_blank\" href=", "");
String replace(String _text, String _searchStr, String _replacementStr) {
    // Builder that accumulates the result
    StringBuilder sb = new StringBuilder();
    // Find the first occurrence of the search string
    int searchStringPos = _text.indexOf(_searchStr);
    int startPos = 0;
    int searchStringLength = _searchStr.length();
    // Copy the text before each occurrence, then the replacement
    while (searchStringPos != -1) {
        sb.append(_text.substring(startPos, searchStringPos)).append(_replacementStr);
        startPos = searchStringPos + searchStringLength;
        searchStringPos = _text.indexOf(_searchStr, startPos);
    }
    // Append whatever follows the last occurrence
    sb.append(_text.substring(startPos, _text.length()));
    return sb.toString();
}
To replace the whole opening tag with an empty string:
k = replace ( k, "<a target=\"_blank\" href=\"http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&aa=\">", "");
No escaping is needed for the slashes.
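If the exact href is not known in advance, a regular expression can drop any opening or closing a tag while keeping the text between them. This is only a sketch for simple, trusted markup like the string above: regexes are fragile on general HTML, and a parser such as Jsoup is the more robust choice. The class name `AnchorStripper`, the method `stripAnchors` and the sample URL are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorStripper {

    // Matches <a ...> and </a>. The \b keeps tags such as <abbr> or <article> intact.
    private static final Pattern ANCHOR_TAG =
            Pattern.compile("</?a\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes anchor tags but leaves the text they wrapped. */
    static String stripAnchors(String html) {
        return ANCHOR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input modelled on the question; the URL is a placeholder.
        String k = "<html><a target=\"_blank\" href=\"http://example.com/doc?id=1&amp;aa=\">"
                + "number S.O. 193(E), dated the 30th March, 1999</a></html>";
        System.out.println(stripAnchors(k));
        // <html>number S.O. 193(E), dated the 30th March, 1999</html>
    }
}
```

The result can then be handed to the WebView as before; only the link markup is gone.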
