This website contains many different URLs, but I want my application to visit only the URLs that contain a specific keyword such as "drugs". For example,
if the URLs are
http://website.com/countryname/drug/info/A
http://website.com/countryname/Browse/Alphabet/D?cat=company
it should visit only the first URL. How can I match a specific keyword such as drug in a URL? I know it can be done with a regexp, but I am new to regular expressions.
I am using Java here.
You can check whether a string contains a word with the contains() method.
if(myString.contains("drugs"))
If you need only URLs containing /drug/ try to do something like this:
Pattern p = Pattern.compile("/drug(/|$)");
Matcher m = p.matcher(myURLString);
if (m.find()) {
    // something_to_do
}
(/|$) means that /drug can be followed only by a slash (/) or by nothing at all (the dollar sign matches the end of the input). So this regex matches strings like .../drug/... or .../drug
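A minimal sketch of the regex check above, wrapped in a reusable helper. The class name DrugUrlFilter and the method hasDrugSegment are made up for illustration; the two URLs are the ones from the question.

```java
import java.util.regex.Pattern;

final class DrugUrlFilter {
    // Compiled once and reused: "/drug" followed by a slash or by the end of the input.
    private static final Pattern DRUG_SEGMENT = Pattern.compile("/drug(/|$)");

    // Returns true when the URL contains "drug" as a path segment.
    static boolean hasDrugSegment(String url) {
        return DRUG_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://website.com/countryname/drug/info/A",
            "http://website.com/countryname/Browse/Alphabet/D?cat=company"
        };
        for (String url : urls) {
            System.out.println(hasDrugSegment(url) + "  " + url);
        }
    }
}
```

Note that a URL such as .../drug?x=1 would not match this pattern, because the character after /drug is neither a slash nor the end of the input.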
Use split() as such:
final String[] words = input.replaceFirst("https?://", "").split("/+");
for (final String word : words) {
    if ("whatyouwant".equals(word)) {
        // do what is necessary since the word matches
    }
}
If your code is called very often, you may want to make Patterns out of https?:// and /+ and use Matchers.
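The precompiled variant mentioned above could look roughly like this. It is only a sketch: the class name PathWordMatcher and the method containsWord are hypothetical, and the two patterns are the same regexes passed to replaceFirst() and split() in the snippet above.

```java
import java.util.regex.Pattern;

final class PathWordMatcher {
    // Precompiled equivalents of the regexes used by replaceFirst() and split().
    private static final Pattern SCHEME = Pattern.compile("https?://");
    private static final Pattern SLASHES = Pattern.compile("/+");

    // Returns true when one of the slash-separated parts equals the wanted word.
    static boolean containsWord(String input, String wanted) {
        String withoutScheme = SCHEME.matcher(input).replaceFirst("");
        for (String word : SLASHES.split(withoutScheme)) {
            if (wanted.equals(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWord("http://website.com/countryname/drug/info/A", "drug"));
    }
}
```

Keep in mind that the host name and a trailing part with a query string (for example D?cat=company) are also treated as words by this approach.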
Let's say I have a link like below along with a bunch of other links
http://testttt.com/met?tag1=x&tag2=y&tag3=z%20a
I would like to extract the entire link if it starts with http://testttt.com/met
I tried doing the following but it didn't work
Pattern pattern = Pattern.compile("http://testttt.com/met?[a-zA-Z][0-9]");
Matcher match = pattern.matcher("http://testttt.com/met?tag1=x&tag2=y&tag3=z%20a");
if (match.find()) {
System.out.println("match found");
}
Why not just use
if (str.startsWith("http://testttt.com/met")) {
...
}
If your string only contains the url, use the answer proposed by Reimeus. If you're trying to extract the url from a bunch of text, you can use this pattern:
Pattern pattern = Pattern.compile("http://testttt\\.com/met\\??[^\\s]*");
It contains all the necessary escapes and matches everything up to the next whitespace.
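As a rough sketch of how that pattern could be used to pull every matching link out of a larger piece of text (the class name MetLinkExtractor, the method extract and the sample sentence are assumptions made for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetLinkExtractor {
    // Same pattern as above: the fixed prefix, an optional '?', then everything up to whitespace.
    private static final Pattern MET_LINK =
            Pattern.compile("http://testttt\\.com/met\\??[^\\s]*");

    // Collects every substring of the text that matches the pattern.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = MET_LINK.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "see http://testttt.com/met?tag1=x&tag2=y&tag3=z%20a and http://other.com/met?x=1 now";
        for (String link : extract(text)) {
            System.out.println(link);
        }
    }
}
```

Because the pattern only fixes the prefix, a link such as http://testttt.com/metrics would be picked up as well.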
I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was that, even with the right regex, the String I was feeding to it was always going to contain two "\" characters (as in \"MY_HEADER\") that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah, that's a scary number of backslashes. matcher.group(1) should return MY_HEADER. The group starts right after the \" and matches everything until the next \ (which I assume here will be at \")%>).
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String sCurrentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
Given a String that begins with the symbols {" and ends with "}. There are other punctuation characters in between as well, such as , ' or "". How can I use the Java regex utilities to find out whether the given String starts with {"? I am trying to return the Boolean value by using:
Pattern.matches(begin, string)
where
begin = "[\\p{Punct}&&[{]]"
and
string = {"name":"Aman"},{"surname":"Gupta"}.
(Please suggest a regex option rather than JSON.) I want to do this using regex only. Please suggest a way to achieve this.
You should try something like this:
// The braces must be escaped, and each backslash doubled inside a Java string literal.
Pattern p = Pattern.compile("\\{.*?\\}");
Matcher m = p.matcher(/*your string here*/);
while (m.find()) {
    String substringInBraces = m.group();
    /*do something with your substring*/
}
This will give you a substring of anything that might be between two nearest curly braces.
You might be interested in reading this and this
Pattern.compile("^\\{").matcher(string).find()
I don't know why you insist on using \\p{Punct}, it's totally unnecessary here.
Note that Pattern.matches() wants to match the entire string, so it is not useful when you only want to match something at the start of a string.
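The difference between matches() and find() described above can be sketched as follows. The class name StartsWithBraceQuote and its helper method are invented for the example; the sample string is the one from the question, and the brace is escaped because an unescaped { is rejected by java.util.regex.

```java
import java.util.regex.Pattern;

final class StartsWithBraceQuote {
    // Anchored pattern for the two characters {" at the very start of the input.
    private static final Pattern OPENING = Pattern.compile("^\\{\"");

    // find() succeeds as soon as the anchored prefix is present.
    static boolean startsWithBraceQuote(String s) {
        return OPENING.matcher(s).find();
    }

    public static void main(String[] args) {
        String string = "{\"name\":\"Aman\"},{\"surname\":\"Gupta\"}";
        // find(): only the anchored prefix has to match.
        System.out.println(startsWithBraceQuote(string));
        // matches(): the pattern must cover the whole string, so the bare prefix is not enough.
        System.out.println(Pattern.matches("\\{\"", string));
        // matches() works once the rest of the string is consumed by .*
        System.out.println(Pattern.matches("\\{\".*", string));
        // No regex at all.
        System.out.println(string.startsWith("{\""));
    }
}
```

If a regex is not a hard requirement, String.startsWith() is probably the simplest of these options.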
What would be the most efficient way, covering all cases, to retrieve folder1/folder22
from:
http://localhost:8080/folder1/folder22/file.jpg
or
http://domain.com/folder1/folder22/file.jpg
or
http://127.0.0.0.1:8080/folder1/folder22/file.jpg
There may be one or more folders/sub-folders. Basically, I would like to strip the domain name, the port if present, and the file name at the end.
Thanks for your time.
What about the URL class and getPath()?
Maybe it's not the most efficient way, but one of the simplest I think:
String[] urls = {
"http://localhost:8080/folder1/folder22/file.jpg",
"http://domain.com/folder1/folder22/file.jpg",
"http://127.0.0.0.1:8080/folder1/folder22/file.jpg" };
for (String url : urls)
System.out.println(new File(new URL(url).getPath()).getParent());
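A variation on the same idea, sketched with java.net.URI and plain string handling instead of java.io.File, so the result does not depend on the platform's file separator. The class name ParentPath and the method parentPath are hypothetical, and the leading slash is dropped here to get exactly folder1/folder22.

```java
import java.net.URI;

final class ParentPath {
    // Returns the directory part of the URL path without the leading slash,
    // or an empty string when there is no directory part.
    static String parentPath(String url) {
        String path = URI.create(url).getPath(); // e.g. "/folder1/folder22/file.jpg"
        if (path == null) {
            return "";
        }
        int lastSlash = path.lastIndexOf('/');
        String parent = lastSlash > 0 ? path.substring(0, lastSlash) : "";
        return parent.startsWith("/") ? parent.substring(1) : parent;
    }

    public static void main(String[] args) {
        System.out.println(parentPath("http://localhost:8080/folder1/folder22/file.jpg"));
        System.out.println(parentPath("http://domain.com/folder1/folder22/file.jpg"));
    }
}
```

URI.create() throws IllegalArgumentException for strings that are not valid URIs, so untrusted input would need a try/catch around the call.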
You should probably use Java's URL parser for this, but if it has to be a regex:
\b(?=/).*(?=/[^/\r\n]*)
will match /folder1/folder22 in all your examples.
try {
    Pattern regex = Pattern.compile("\\b(?=/).*(?=/[^/\r\n]*)");
    Matcher regexMatcher = regex.matcher(subjectString);
    if (regexMatcher.find()) {
        ResultString = regexMatcher.group();
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
Explanation:
\b: Assert position at a word boundary (this will work before a single slash, but not between slashes or after a :)
(?=/): Assert that the next character is a slash.
.*: Match anything until...
(?=/[^/\r\n]*): ...exactly one last / (and anything else except slashes or newlines) follows.
^.+/([^/]+/[^/]+)/[^/]+$
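For illustration, here is one way the regex above might be used from Java. The class name LastTwoFolders and the method lastTwoFolders are assumptions made for this sketch; the captured group is the last two path segments before the file name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastTwoFolders {
    // Group 1 captures the two segments that precede the final segment (the file name).
    private static final Pattern LAST_TWO = Pattern.compile("^.+/([^/]+/[^/]+)/[^/]+$");

    // Returns the captured segments, or null when the input does not match.
    static String lastTwoFolders(String url) {
        Matcher m = LAST_TWO.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastTwoFolders("http://localhost:8080/folder1/folder22/file.jpg"));
        System.out.println(lastTwoFolders("http://domain.com/folder1/folder22/file.jpg"));
    }
}
```

This always yields exactly two segments: with deeper paths the earlier folders are dropped, and with only one folder the host name ends up in the capture (domain.com/folder1), so it fits the question only when the depth is fixed at two.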
The best way to get the last two directories from a url is the following:
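preg_match("/\/([^\/]+\/[^\/]+)\/[^\/]+$/", $path, $matches);
If it matches, $matches[1] will contain what you want, whether $path is just a path with a filename or a full URL. (A repeated group such as ([^\/]+\/){2} would keep only its last repetition, so both directories are captured in a single group here.)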
How do I match a URL string like this:
img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"
where only the domain name and the file extension (jpg) are fixed while the other parts vary?
The following code does not seem to work:
Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
String s = m.toString();
}
There were a couple of issues with the regex matching the sample string you gave. You were close, though. Here's your code fixed to make it work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TCPChat {
static public void main(String[] args) {
String url = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
Pattern p = Pattern.compile("<img src=\"http://stackoverflow.com/.*jpg\">");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
String s = m.group(); // the matched text; m.toString() would describe the Matcher itself
System.out.println(s);
}
}
}
First, I would use the group() method to retrieve the matched text, not toString(). But it's probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
Second, I wouldn't assume src was the first attribute in the <img> tag. On SO, for example, it's usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can't match beyond the end of the tag. [^<>]+ will probably suffice.
Third, I would use something more restrictive than .* to match the unknown part of the path. There's always a chance that you'll find two URLs on one line, like this:
<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">
In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
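The bridging effect can be seen by counting matches on that sample line with both variants. This is only a sketch: the class name GreedyBridgeDemo and the helper countMatches are invented for the demonstration, and the two patterns are simplified versions of the ones discussed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyBridgeDemo {
    // Counts how many separate matches the regex finds in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "<img src=\"http://so.com/foo.jpg\"> blah <img src=\"http://so.com/bar.jpg\">";
        // Greedy .* runs to the last "jpg" on the line and swallows both tags in one match.
        System.out.println(countMatches("<img src=\"http://so\\.com/.*jpg", line));
        // [^<>]* cannot cross the closing '>' of the first tag, so each tag matches on its own.
        System.out.println(countMatches("<img src=\"http://so\\.com/[^<>]*jpg", line));
    }
}
```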
There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?
...and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it's essential that you understand their limitations.
Here's my revised version of your regex (as a Java string literal):
"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"