Java Matcher Class - java

I need a pattern matcher to get the page id value in the below text which is coming from a http response body.
<meta name="ajs-page-id" content="262250">
What i'm after is to get the content value from this line that will always be generated in responsebody.

Pattern pat = Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");
That is obviously a very literal pattern... but group(1) should return the number as a string.
Haven't tested.

Use an HTML parser like jsoup to parse and search for the part. You should not be using regular expressions for this.
e.g.,
String htmlStr = "<meta name=\"ajs-page-id\" content=\"262250\">";
Document doc = Jsoup.parse(htmlStr);
Element meta = doc.select("meta[name=ajs-page-id]").first();
if (meta != null)
{
System.out.println(meta.attr("content"));
}

Related

Regex get text between tags

I try to get the text between a tag in JAVA.
`
<td colspan="2" style="font-weight:bold;">HELLO TOTO</td>
<td>Function :</td>
`
I would like to use a regex to extract "HELLO TOTO" but not "Function :"
I already tried something like this
`
String btwTags = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n" + "<td>Function :</td>";
Pattern pattern = Pattern.compile("<td(.*?)>(.*?)</td>");
Matcher matcher = pattern.matcher(btwTags);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
`
but the result is the same as the input.
Any ideas ?
I tried this regex (?<=<td>)(.*?)(?=</td>) too but it only catch "Function:"
I don't know of to set that he could be something after the open <td ...>
Already thanks in advance
Don't use RegEx to parse HTML, its a very bad idea...
to know why check this link:
RegEx match open tags except XHTML self-contained tags
you can use Jsoup to achieve this :
String html; // your html code
Document doc = Jsoup.parse(html);
System.out.println(doc.select("td[colspan=2]").text());
You can use a Regex for very basic HTML parsing. Here's the easiest Java regex I could find :
"(?i)<td[^>]+>([^<]+)<\\/td>"
It matches the first td tag with attributes and a value. "HELLO TOTO" is in group 1.
Here's an example.
For anything more complex, a parser like Jsoup would be better.
But even a parser could fail if the HTML isn't valid or if the structure for which you wrote the code has been changed.
I had provided solution without using REGEX Hope that would be helpful..
public class Solution{
public static void main(String ...args){
String str = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td><td>Function :</td>";
String [] garray = str.split(">|</td>");
for(int i = 1;i < garray.length;i+=2){
System.out.println(garray[i]);
}
}
}
Output :: HELLO TOTO
Function :
I am just using split function to delimit at given substrings .Regex is slow and often confuse.
cheers happy coding...

extract element from xml string

<media:thumbnail url="http:// mysite.com/wp-content/uploads/2013/11/mes1-300x186.png" width="320" length="125399" type="image/jpg"/>
How to extract the URL from this xml ? If I have the above as string ?
Supposing you have the XML string stored in String str, one lazy and simple way to retrieve the URL (if you always expect the same input text format) could be:
String url = str.split("\"")[1];
Some times, as a quick "one time solution" you can use regular expressions.
String xml="<media:thumbnail url=\"http:// mysite.com/wp-content/uploads/2013/11/ mes1-300x186.png\" width=\"320\" length=\"125399\" type=\"image/jpg\"/>";
Pattern pattern=Pattern.compile("url\\s*=\\s*\\\"(.*?)\\\"");
Matcher m=pattern.matcher(xml);
if(m.find()){
String urlValue=m.group(1);
System.out.println(urlValue);
}

Search for a substring between certain characters at unknown index

I have a String such as:
<div class="photo-box biz-photo-box pb-60s">
<a href="/biz/the-kerry-piper-willowbrook">
<img class="photo-img" alt="" height="60" src="http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg" width="60">
</a>
How can I retrieve the url
http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg
from this String?
I thought about string.indexOf() but the number of characters before and after url can vary therefore I don't know at which index this substring starts and this could be messy. Any best approach?
Use Jsoup to scrape/parse HTML from a URL, file, or string and use its jQuery like selector syntax.
String htmlStr="<div class=\"photo-box biz-photo-box pb-60s\">"
+ "<a href=\"/biz/the-kerry-piper-willowbrook\">"
+ "<img class=\"photo-img\" alt=\"\" height=\"60\" src=\"http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg\" width=\"60\">"
+ "</a>";
org.jsoup.nodes.Document doc=org.jsoup.Jsoup.parse(htmlStr);
String src=doc.select("img").attr("src");
System.out.println(src);
If you don't want to use an HTML parser, you could construct a regular expression and used the regex package to match the only the data you need.
Something like,
Pattern pattern = Pattern.compile("<img.*?src=\"([^\"]+)\"",Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(data);
while(m.find()) {
srcUrl = m.group(1));
}

Mysql query to remove html tags while inserting into or selecting from a table

I have some data enclosed in HTML tags. I want to insert this data into a database table as plain text, without the HTML tags in it.
Could you please provide any MySQL code that will remove the HTML tags, either when inserting the data, or when retrieving the data.
The insertion and retrieval of the data occurs from a JSP page.
It's usually a recommendable practice to only sanitize HTML when you need to view it, otherwise store the HTML "as is" in the database.
For removing html tags, you can use htmlCleaner
This is something that's much more easily done in the Java code, before you get to the actual database. For instance, something like:
public static String removeHtmlTag(String input) {
if (input == null) {
return input;
}
input = replaceRegexp(input, "<[ \\r\\t\\n]*/?html([ \\r\\t\\n][^>]*>|>)", "");
return input;
}
private static String replaceRegexp(String input, String pattern, String replacement) {
Pattern regexp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
Matcher matcher = regexp.matcher(input);
if (matcher.find()) {
return matcher.replaceAll(replacement);
}
return input;
}

Java : replacing text URL with clickable HTML link

I am trying to do some stuff with replacing String containing some URL to a browser compatible linked URL.
My initial String looks like this :
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String looking like :
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
I can catch URL with this code line :
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
Maybe the regexp expression needs some correction, but it's working fine, need to test in further time.
So the question is how to keep the expression catched by the regexp and just add a what's needed to create the link : catched string
Thanks in advance for your interest and responses !
Try to use:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "HereWasAnURL");
I didn't check your regex.
By using () you can create groups. The $1 indicates the group index.
$1 will replace the url.
I asked a simalir question: my question
Some exemples: Capturing Text in a Group in a regular expression
public static String textToHtmlConvertingURLsToLinks(String text) {
if (text == null) {
return text;
}
String escapedText = HtmlUtils.htmlEscape(text);
return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
"$1$2$4");
}
There may be better REGEXs out there, but this does the trick as long as there is white space after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.
For anybody who is searching a more robust solution I can suggest the Twitter Text Libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)
Belows code replaces links starting with "http" or "https", links starting just with "www." and finally replaces also email links.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^#]]+#([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
String textWithHttpLinksEnabled =
"ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";
if (Objects.nonNull(textWithHttpLinksEnabled)) {
Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("$0");
final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("$0");
final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("$0");
System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda
Assuming your regex works to capture the correct info, you can use backreferences in your substitution. See the Java regexp tutorial.
In that case, you'd do
myString.replaceAll(....., "\1")
In case of multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is full example of my code where I need to show user's posts with urls in it:
private static final Pattern urlPattern = Pattern.compile(
"(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");
String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1$2$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);

Categories