How to check if html document contains string

What would be a fast way to check whether the page at a URL contains a given string? I tried jsoup and pattern matching, but is there a faster way?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest {
public static void main(String[] args) throws Exception {
String url = "https://en.wikipedia.org/wiki/Hawaii";
Document doc = Jsoup.connect(url).get();
String html = doc.html();
Pattern pattern = Pattern.compile("<h2>Contents</h2>");
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
System.out.println("Found it");
}
}
}

It depends. If your pattern is really just a simple substring that must appear exactly in the page content, then both methods you suggest are overkill. In that case you should fetch the page without parsing it. You can still use jsoup to fetch the page; just don't start the parser:
Connection con = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii");
Response res = con.execute();
String rawPageStr = res.body();
if (rawPageStr.contains("<h2>Contents</h2>")){
//do whatever you need to do
}
If the pattern is indeed a regular expression, use this:
Pattern pattern = Pattern.compile("<h2>\\s*Contents\\s*</h2>");
Matcher matcher = pattern.matcher(rawPageStr);
This only makes sense if you do not need to parse much more of the page. However, if you actually want to perform a structured search of the DOM via CSS selectors, jsoup is not a bad choice, although a SAX-based approach such as TagSoup could probably be a bit faster.
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii").get();
Elements h2s = doc.select("h2");
for (Element h2 : h2s){
if (h2.text().equals("Contents")){
//do whatever & more
}
}
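To make the difference between the first two options concrete, here is a minimal sketch that uses only the standard library. The class and method names are made up for illustration, and the hard-coded string stands in for the raw body returned by res.body(); it is not a benchmark.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: searches an already-downloaded page body.
final class PageTextSearch {

    // Compiled once and reused; tolerates whitespace around the heading text.
    private static final Pattern CONTENTS_HEADING =
            Pattern.compile("<h2>\\s*Contents\\s*</h2>");

    // Exact substring check: no regex engine and no HTML parser involved.
    static boolean containsLiteral(String body, String needle) {
        return body.contains(needle);
    }

    // Regex check for when the markup may vary slightly.
    static boolean containsContentsHeading(String body) {
        return CONTENTS_HEADING.matcher(body).find();
    }

    public static void main(String[] args) {
        // Stand-in for the raw page body; note the spaces inside the heading.
        String body = "<html><body><h2> Contents </h2></body></html>";
        System.out.println(containsLiteral(body, "<h2>Contents</h2>")); // false
        System.out.println(containsContentsHeading(body));              // true
    }
}
```

The literal check is the cheapest but misses any variation in whitespace, which is the case where the regex version earns its cost.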

Related

Regex get text between tags

I am trying to get the text between tags in Java.
`
<td colspan="2" style="font-weight:bold;">HELLO TOTO</td>
<td>Function :</td>
`
I would like to use a regex to extract "HELLO TOTO" but not "Function :".
I already tried something like this:
`
String btwTags = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n" + "<td>Function :</td>";
Pattern pattern = Pattern.compile("<td(.*?)>(.*?)</td>");
Matcher matcher = pattern.matcher(btwTags);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
`
but the result is the same as the input.
Any ideas?
I also tried the regex (?<=<td>)(.*?)(?=</td>), but it only catches "Function :".
I don't know how to allow for attributes after the opening <td ...>.
Thanks in advance.

Don't use regex to parse HTML; it's a very bad idea.
To see why, check this link:
RegEx match open tags except XHTML self-contained tags
You can use jsoup to achieve this:
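// Your HTML. It is wrapped in a table here because jsoup's HTML parser discards <td> tags found outside one.
String html = "<table><tr><td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td><td>Function :</td></tr></table>";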
Document doc = Jsoup.parse(html);
System.out.println(doc.select("td[colspan=2]").text());

You can use a regex for very basic HTML parsing. Here's the simplest Java regex I could find:
"(?i)<td[^>]+>([^<]+)<\\/td>"
It matches the first td tag with attributes and a value. "HELLO TOTO" is in group 1.
Here's an example.
For anything more complex, a parser like Jsoup would be better.
But even a parser could fail if the HTML isn't valid or if the structure for which you wrote the code has been changed.
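As a small sketch of how that pattern could be applied to the input from the question (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern quoted above.
final class TdTextExtractor {

    // Requires at least one character between "<td" and ">", so a bare <td> is skipped.
    private static final Pattern TD_WITH_ATTRIBUTES =
            Pattern.compile("(?i)<td[^>]+>([^<]+)<\\/td>");

    // Returns the text of the first <td> that has attributes, or null if none matches.
    static String firstAttributedCellText(String html) {
        Matcher matcher = TD_WITH_ATTRIBUTES.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n"
                + "<td>Function :</td>";
        System.out.println(firstAttributedCellText(html)); // HELLO TOTO
    }
}
```

The key difference from the code in the question is matcher.group(1), which returns only the captured text, whereas matcher.group() returns the whole match including the tags.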

Here is a solution that does not use Pattern and Matcher. I hope it helps.
public class Solution{
public static void main(String ...args){
String str = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td><td>Function :</td>";
String [] garray = str.split(">|</td>");
for(int i = 1;i < garray.length;i+=2){
System.out.println(garray[i]);
}
}
}
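Output:
HELLO TOTO
Function :
This just uses String.split to cut the input at the given delimiters. I find Pattern and Matcher code often confusing; note, though, that split itself takes a regular expression, so regex is still involved here.
Cheers, happy coding.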

RSS Feed - Parse/Extract src image tag inside Description tag in JAVA

This extends the question
How to extract an image src from RSS feed
to Java. That question was answered for iOS, but I could not find a working solution for Java.
I know how to parse a top-level tag in an RSS feed, but parsing a tag nested inside another tag, as below, is more complicated:
<description>
<![CDATA[
<img width="745" height="410" src="http://example.com/image.png" class="attachment-large wp-post-image" alt="alt tag" style="margin-bottom: 15px;" />description text
]]>
</description>
How can I extract just the src attribute?

Take a look at jsoup. I think it's what you need.
EDIT:
private String extractImageUrl(String description) {
Document document = Jsoup.parse(description);
Elements imgs = document.select("img");
for (Element img : imgs) {
if (img.hasAttr("src")) {
return img.attr("src");
}
}
// no image URL
return "";
}

You could use a regular expression to get the value.
Take a look at this small example; I hope it helps.
You can find more information about regular expressions here:
http://www.tutorialspoint.com/java/java_regular_expressions.htm
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test{
public static void main(String []args){
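// Capture everything up to the closing quote, so the match does not depend on a class attribute following src.
String regularExpression = "src=\"([^\"]*)\"";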
String html = "<description> <![CDATA[ <img width=\"745\" height=\"410\" src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" alt=\"alt tag\" style=\"margin-bottom: 15px;\" />description text ]]> </description>";
// Create a Pattern object
Pattern pattern = Pattern.compile(regularExpression);
// Now create matcher object.
Matcher matcher = pattern.matcher(html);
if (matcher.find( )) {
System.out.println("Found value: " + matcher.group(1) );
// Prints: Found value: http://example.com/image.png
}
}
}

Mysql query to remove html tags while inserting into or selecting from a table

I have some data enclosed in HTML tags. I want to insert this data into a database table as plain text, without the HTML tags in it.
Could you please provide any MySQL code that will remove the HTML tags, either when inserting the data, or when retrieving the data.
The insertion and retrieval of the data occurs from a JSP page.

It is usually good practice to sanitize HTML only when you display it, and otherwise to store the HTML as is in the database.
To remove HTML tags, you can use HtmlCleaner.

This is much more easily done in the Java code, before the data reaches the database. For instance, something like the following, which removes the opening and closing <html> tags:
public static String removeHtmlTag(String input) {
if (input == null) {
return input;
}
input = replaceRegexp(input, "<[ \\r\\t\\n]*/?html([ \\r\\t\\n][^>]*>|>)", "");
return input;
}
private static String replaceRegexp(String input, String pattern, String replacement) {
Pattern regexp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
Matcher matcher = regexp.matcher(input);
if (matcher.find()) {
return matcher.replaceAll(replacement);
}
return input;
}
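The method above only targets the <html> tag itself. A naive sketch of the same idea extended to any tag might look like the following. This is an assumption about what "plain text" means here, not a robust sanitizer: it does not decode entities, and it can be fooled by a ">" inside an attribute value, by comments, or by script content, so a real HTML parser is the safer choice for untrusted input. The class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips anything that looks like a tag before the value is stored.
final class TagStripper {

    // Naive pattern: "<", then any run of characters other than ">", then ">".
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String input) {
        if (input == null) {
            return null;
        }
        return ANY_TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world
    }
}
```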

Java Matcher Class

I need a pattern matcher to get the page id value from the text below, which comes from an HTTP response body.
<meta name="ajs-page-id" content="262250">
What I'm after is the content value from this line, which will always be present in the response body.

Pattern pat = Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");
That is obviously a very literal pattern, but group(1) should return the number as a string.
I haven't tested it.
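For completeness, a minimal sketch of how that pattern could be used with a Matcher. The class and method names are made up; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the literal pattern from this answer.
final class PageIdExtractor {

    private static final Pattern PAGE_ID =
            Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");

    // Returns the digits from the content attribute, or null if the tag is not found.
    static String extractPageId(String responseBody) {
        Matcher matcher = PAGE_ID.matcher(responseBody);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String body = "<head><meta name=\"ajs-page-id\" content=\"262250\"></head>";
        System.out.println(extractPageId(body)); // 262250
    }
}
```

Because the pattern is so literal, it will stop matching if the attribute order or spacing in the generated tag ever changes.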

Use an HTML parser like jsoup to parse and search for the part. You should not be using regular expressions for this.
e.g.,
String htmlStr = "<meta name=\"ajs-page-id\" content=\"262250\">";
Document doc = Jsoup.parse(htmlStr);
Element meta = doc.select("meta[name=ajs-page-id]").first();
if (meta != null)
{
System.out.println(meta.attr("content"));
}

Java : replacing text URL with clickable HTML link

I am trying to convert a String containing a URL into one where the URL is a clickable HTML link.
My initial String looks like this:
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String like this:
"hello, i'm some text with an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and I need to have an hypertext link !"
I can match the URL with this line of code:
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
The regular expression may need some correction, but it works well enough for now; I still need to test it further.
So the question is how to keep the text matched by the regex and just add what is needed around it to create the link: <a href="matched string">matched string</a>
Thanks in advance for your interest and responses!

Try using:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "<a href=\"$1\">$1</a>");
I didn't check your regex.
By using () you create groups, and $1 refers to a group by its index.
Here $1 is replaced by the matched URL.
I asked a similar question: my question
Some examples: Capturing Text in a Group in a regular expression
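A runnable sketch of the group-reference idea follows. Note that Java's regex syntax has no [:space:] or [:alnum:] bracket classes (it uses \s, \p{Space} and \p{Alnum} instead), so this sketch swaps in a simplified URL pattern of its own; that pattern and the class and method names are assumptions, not the asker's regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper: wraps http/https URLs in an anchor tag using a group reference.
final class UrlLinker {

    // Assumed, simplified pattern: scheme, then anything up to whitespace or an angle bracket.
    private static final Pattern URL = Pattern.compile("(https?://[^<>\\s]+)");

    static String linkify(String text) {
        // $1 in the replacement refers to the text captured by the first group.
        return URL.matcher(text).replaceAll("<a href=\"$1\">$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("an url like http://www.the-url.com/ and more"));
        // an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and more
    }
}
```

This simplified pattern also swallows punctuation that directly follows a URL, such as a trailing full stop, and it does not escape the surrounding text.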

public static String textToHtmlConvertingURLsToLinks(String text) {
if (text == null) {
return text;
}
String escapedText = HtmlUtils.htmlEscape(text);
return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
"$1<a href=\"$2\">$2</a>$4");
}
There may be better regexes out there, but this does the trick as long as there is whitespace after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.

For anybody looking for a more robust solution, I can suggest the Twitter Text libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)

The code below replaces links starting with "http" or "https", then links starting with just "www.", and finally email addresses.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^#]]+#([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
String textWithHttpLinksEnabled =
"ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";
if (Objects.nonNull(textWithHttpLinksEnabled)) {
Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("$0");
final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("$0");
final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("$0");
System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda

Assuming your regex captures the correct text, you can use group references in your replacement string. See the Java regex tutorial.
In that case, you'd do
myString.replaceAll(....., "$1")

For multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is a full example from my code, where I need to show users' posts with URLs in them:
private static final Pattern urlPattern = Pattern.compile(
"(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");
String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1<a href='$2'>$2</a>$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);
