java regexp parse partial title tag - java

Ok, quick question. I'm a bit of a newbie at Java, and I have an assignment in which I have to get the name of a person from the title tag of a page. I know my regex, but I can't figure out how to escape some characters.
Example
<title>Mr. Somebody | Department in which he's in</title>
So, basically I need a regexp that would get me the "Mr. Somebody". I've tried :
Pattern pat = Pattern.compile("<title>(.+?)|");
Matcher mat = pat.matcher(data);
boolean found = false;
while (!found && mat.find()) {
name = mat.group(0);
found = true;
}
System.out.println("Found a name : " + name);
My problem is that no matter what I've tried, the most I could get was the first character. Do you think a simpler approach with indexOf and substrings would be better, or is a regexp still viable?
I know that usually regexps are not suitable for parsing html tags, but I'm considering this search more of a string search, because I'm not interested in the whole tag (or other tags that might be contained within).
Any kind of help is greatly appreciated :)

You need to escape the pipe because it's a character with a special meaning in regex. Try:
<title>(.+?)\\|
| means "or", so the regex will try to match either <title>(.+?) or nothing (there's nothing after the |).
When it tries to match with <title>(.+?), it will get only the first character because .+? is lazy (it matches as little as possible).
Alternatively, you can use a negated class:
<title>([^\\|]+)
[^\\|]+ will match one or more characters that are not a pipe, so it stops right before the first |.
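Both suggestions can be tried side by side. The following is only a minimal sketch: the class and method names are made up for illustration, and it reads group(1) rather than group(0) so that the <title> prefix is not part of the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleNameDemo {
    // "\\|" in Java source is the regex \| (a literal pipe).
    private static final Pattern ESCAPED = Pattern.compile("<title>(.+?)\\|");
    // Inside a character class the pipe is already literal, so no escape is needed.
    private static final Pattern NEGATED = Pattern.compile("<title>([^|]+)");

    private static String extract(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        // group(1) is the capture; trim() drops the space left before the pipe.
        return m.find() ? m.group(1).trim() : null;
    }

    static String withEscapedPipe(String html) {
        return extract(ESCAPED, html);
    }

    static String withNegatedClass(String html) {
        return extract(NEGATED, html);
    }

    public static void main(String[] args) {
        String data = "<title>Mr. Somebody | Department in which he's in</title>";
        System.out.println("Found a name : " + withEscapedPipe(data));
        System.out.println("Found a name : " + withNegatedClass(data));
    }
}
```

One difference to keep in mind: if the title contains no pipe at all, the escaped version finds nothing, while the negated class keeps going and also captures the closing </title>.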

It should work
Pattern pat = Pattern.compile("<title>(.*?)\\|");
and use
mat.group(1) instead of mat.group(0);

Here's a way to do it that will avoid using Pattern and Matcher, if you want:
String name = "<title>Mr. Somebody | Department in which he's in</title>";
name = name.substring(7).replaceAll("\\|.*", "");
The substring(7) will remove the first tag, then replaceAll will remove everything from the pipe character onwards (replace with empty string).
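Since the question also asks about indexOf and substrings, here is a rough sketch of that route. The helper name nameFromTitle is hypothetical, and the fallback to the closing tag when there is no pipe is an assumption about what would be wanted, not something the question specifies.

```java
class TitleSubstringDemo {
    // Returns the text between <title> and the first pipe (or the closing tag
    // if there is no pipe), or null when there is no <title> at all.
    static String nameFromTitle(String html) {
        String open = "<title>";
        int start = html.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();

        int pipe = html.indexOf('|', start);
        int close = html.indexOf("</title>", start);

        int end;
        if (pipe >= 0 && (close < 0 || pipe < close)) {
            end = pipe;          // pipe comes first: stop there
        } else if (close >= 0) {
            end = close;         // no pipe inside the title: stop at </title>
        } else {
            end = html.length(); // neither found: take the rest
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String data = "<title>Mr. Somebody | Department in which he's in</title>";
        System.out.println(nameFromTitle(data));
    }
}
```

Unlike substring(7), this does not assume the string starts with <title>, and the trim() removes the space that would otherwise be left in front of the pipe.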

Maybe this is what you want:
(?<=<title>)(.+?(?=[|].+?))(?=.+?</title>)
It returns Mr. Somebody (including the space that precedes the pipe, so you may want to trim the result).

Here is a way :
<\s*title[^>]*>\s*([^\|]+)
Takes away leading white space.
Handles any possible weird attributes that someone may add to a title tag, i.e. <title data-cookies="I hide cookies here :P">I like titles</title>
Handles any whitespace added before title, i.e. < title > is still valid.

Related

Regex to find XML tag in multiline string

Here is a simple function I wrote to get the value from a tag.
public static String getTagAValue(String xmlAsString) {
Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
Matcher matcher = pattern.matcher(xmlAsString);
if (matcher.find()) {
return matcher.group(1);
} else {
return null;
}
}
It is not finding a match and returning null.
XML Sample
<xml>
<sample>
<TagA>result</TagA>
</sample>
</xml>
Note, here I used 4 spaces for tabs, but the real string would contain tabs.
Don't use regular expressions to parse XML: it's the wrong tool for the job.
Classic answer here: RegEx match open tags except XHTML self-contained tags
The answer you have accepted gives wrong answers, for example:
It doesn't accept whitespace in places where whitespace is allowed, such as before ">"
It will match a commented-out element, or one that appears in a CDATA section
It does a greedy match, so it will find the LAST matching end tag, not the first one.
However hard you try, you will never get it 100% right.
And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.
To do the job properly and professionally, use an XML parser.
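As a rough idea of what that could look like with the DOM parser that ships with the JDK, here is a sketch. The class name is made up, and wrapping parse failures in an IllegalArgumentException is just one possible choice for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagAParserDemo {
    // Returns the text of the first <TagA> element, or null if there is none.
    static String getTagAValue(String xmlAsString) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                    builder.parse(new InputSource(new StringReader(xmlAsString)));
            // Only real elements are returned; comments and CDATA are not searched.
            NodeList nodes = doc.getElementsByTagName("TagA");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            // Malformed XML, parser configuration problems, etc.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Line breaks and tabs around the element make no difference here, and a commented-out <TagA> is skipped because the parser never reports it as an element.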
You probably want the regex to match across line breaks:
Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);
Documentation explains the parameter Pattern.DOTALL:
Enables dotall mode. In dotall mode, the expression . matches any
character, including a line terminator. By default this expression
does not match line terminators.
Edit: While this works in this particular case, please refer to Michael Kay's answer if you want to solve such problems professionally, efficiently, and correctly.

How to find optional group with some prefix using Regex

This is my pattern regex:
"subcategory.html?.*id=(.*?)&.*title=(.+)?"
for below input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale
I want to capture the groups below:
group one (id) : 3000080292
group two (title) : BabySale
For which it is working fine. The problem is I want to make second group i.e. value of title to be optional, so that even if title is not present, regex should match and get me value of group 1(id). But for input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&
Regex match is failing even if group one is present. So my question is how to make second group optional here?
Maybe make the entire substring optional?
Try subcategory.html?.*id=(.*?)&.*(?:title=(.+)?)?
Also note that your (and my) regex might be matching too much. For example, the dot here should probably be escaped: subcategory\.html instead of subcategory.html or you will match subcategory€html, too. Your question mark says the l of html is optional; you are probably saved by the .* ("match anything"), that follows.
Last but not least, the final .* means that even this will match (which you probably don't want to match):
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale&Lorem Ipsum Sit Atem http://&%$
It's usually a bad idea to match .* as it will nearly always match too much. Consider using character classes instead of the dot, and anchoring the beginning (^) and end ($) of the string... :)
One of the possible ways is to use something like:
subcategory\.html\?.*id=(.*?)&(.*title=(.+)?)?
(.*title=(.+)?)? is optional now.
please see an example here.
As suggested by @Christian, it is better to make .*title a non-capturing group so that it won't be part of the result.
subcategory\.html\?.*id=(.*?)&(?:.*title=(.+)?)?
If you know that parameter id comes before optional title then you can use this regex to capture id and optional title parameters:
subcategory\.html\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?
RegEx Demo
In Java use this regex:
final String regex = "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?";
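To see how the optional group behaves, the regex can be wrapped in a small helper. This is only a sketch: the class and method names are invented for the example, and it uses find() so the pattern may match anywhere in the URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubcategoryParamsDemo {
    private static final Pattern PARAMS = Pattern.compile(
            "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?");

    // Returns {id, title}; title is null when the optional group did not take part.
    static String[] idAndTitle(String url) {
        Matcher m = PARAMS.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] a = idAndTitle(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale");
        System.out.println(a[0] + " / " + a[1]);
        String[] b = idAndTitle(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&");
        System.out.println(b[0] + " / " + b[1]);
    }
}
```

Note that group(2) comes back as null rather than an empty string when title is missing. Also, because (?:.*&)? is greedy, title is only captured when it is the last parameter; with something like ...&title=X&foo=bar the match still succeeds but group(2) is null.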

Finding whole word only in Java string search

I'm running into the problem of finding a searched pattern within a larger pattern in my Java program. For example, I'll try and find all for loops, but will stumble upon formula. Most of the suggestions I've found talk about using regular expression searches like
String regex = "\\b"+keyword+"\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(searchString);
or some variant of this. The issue I'm running into is that I'm crawling through code, not a book-like text where there are spaces on either side of every word. For example, this will miss for(, which I would like to find. Is there another clever way to find whole words only?
Edit: Thanks for the suggestions. How about cases in which the keyword starts at the very beginning of the string? For example,
class Vec {
public:
...
};
where I'm searching for class (or alternatively public). The patterns suggested by Thanga, Austin Lee, npinti, and Kai Iskratsch do not work in this case. Any ideas?
In your case, the issue is that the \b flag will look for punctuation marks, white spaces and the beginning or end of the string. An opening bracket does not fall within any of these categories, and is thus omitted.
The easiest way to fix this would be to replace "\\b"+keyword+"\\b" with "[\\b(]"+keyword+"[\\b)]".
In regex syntax, the square brackets denote a set of which the regex engine will attempt to match any character it contains.
As per this previous SO question, it would seem that \b and [\b] are not the same. Whilst \b represents a word boundary, [\b] represents a backspace character. To fix this, simply replace "\\b"+keyword+"\\b" with "(\\b|\\()"+keyword+"(\\b|\\))" (note that the backslashes have to be doubled inside a Java string literal).
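It may be worth checking the plain \b version before adding alternatives. As far as Java's Pattern is concerned, \b sits between a word character (letter, digit, underscore) and anything that is not one, which includes ( and the start of the string. The sketch below uses a made-up helper, findWholeWord, and Pattern.quote in case the keyword contains regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordDemo {
    // Returns the start index of every whole-word occurrence of keyword in text.
    static List<Integer> findWholeWord(String keyword, String text) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "for(" is found at index 0; "formula" is skipped.
        System.out.println(findWholeWord("for", "for(int i = 0; ;) { formula(); }"));
        // Keywords at the start of the string or of a line are found too.
        String code = "class Vec {\npublic:\n};";
        System.out.println(findWholeWord("class", code));
        System.out.println(findWholeWord("public", code));
    }
}
```

This only behaves as expected when the keyword itself starts and ends with a word character; for something like "->" the \b anchors would have to go.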
Regex should match 0 or more chars. The below code change will fix the issue
String regex = ".*("+keyword+").*";
You could modify your regex to search for multiple characters afterwards, for example
[^\w]+"for"+[^\w] using the Pattern class in Java.
For your reference:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Basically you will have to adapt your regex to all the possible patterns it can find. But considering you're actually dealing with code, you are better off building a parser/tokenizer for that language, or using one that already exists. Then all you have to do is run through the tokens to find the ones you want.

Java // No match with RegExp and square brackets

I have a string like
Berlin -> Munich [label="590"]
and now I'm searching a regular expression in Java that checks if a given line (like above) is valid or not.
Currently, my RegExp looks like "\\w\\s*->\\s*\\w\\s*\\[label=\"\\d\"\\]"
However, it doesn't work and I've found out that \\w\\s*->\\s*\\w\\s* still works but when adding \\[ it can't find the occurrence (\\w\\s*->\\s*\\w\\s*\\[).
What I also found out is that when '->' is removed it works (\\w\\s*\\s*\\w\\s*\\[)
Is the arrow the problem? Can hardly imagine that.
I really need some help on this.
Thank you in advance
This is the correct regular expression:
"\\w+\\s*->\\s*\\w+\\s*\\[label=\"\\d+\"\\]"
What you report about matches and non-matches of partial regular expressions is very unlikely, not possible with the Berlin/Munich string.
Also, if you are really into German city names, you might have to consider names like Castrop-Rauxel (which some wit has called the Latin name of Wanne-Eickel ;-) )
Try this
String message = "Berlin -> Munich [label=\"590\"]";
Pattern p = Pattern.compile("\\w+\\s*->\\s*\\w+\\s*\\[label=\"\\d+\"\\]");
Matcher matcher = p.matcher(message);
while(matcher.find()) {
System.out.println(matcher.group());
}
You need to match more than one word character and more than one digit, hence the + quantifiers.
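Since the goal is to check whether a whole line is valid, matches() may be a better fit than find(), because it requires the entire line to fit the pattern. The sketch below also allows hyphens in names, as hinted at with Castrop-Rauxel. The class name and the [\w-]+ choice are assumptions made for this example.

```java
import java.util.regex.Pattern;

class EdgeLineValidator {
    // name -> name [label="digits"], where a name is word characters and hyphens.
    private static final Pattern EDGE = Pattern.compile(
            "[\\w-]+\\s*->\\s*[\\w-]+\\s*\\[label=\"\\d+\"\\]");

    // matches() succeeds only if the whole line fits the pattern.
    static boolean isValid(String line) {
        return EDGE.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Berlin -> Munich [label=\"590\"]"));
        System.out.println(isValid("Castrop-Rauxel -> Wanne-Eickel [label=\"12\"]"));
        System.out.println(isValid("Berlin -> Munich"));
    }
}
```

By default \w only covers ASCII letters, digits and the underscore, so a name with an umlaut such as München would still be rejected by this version.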

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example - I wish to get all the patterns where I have matches for "faces" but not "facebook". I can accept a space character before or after the word, so if in a string I have a match like " face" or " faces" or "face " or "faces ", I can accept that too. However, I cannot accept "duckface" or "duckface " etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is my list of keywords, but I am not getting the desired results. Can you read my description and please suggest what might be issue and how I can fix it?
Also if a pointer to a really good regex for Java page is shared I would appreciate that as well.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
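A quick way to check this against the examples from the question is sketched below. The helper name is made up, and Pattern.quote is added on the assumption that keywords might contain characters that are special in a regex.

```java
import java.util.regex.Pattern;

class KeywordMatchDemo {
    // True if text contains keyword (optionally followed by "s") as a whole word.
    static boolean containsKeyword(String keyword, String text) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { " face", " faces ", "face ", "facebook", "duckface " };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + containsKeyword("face", s));
        }
    }
}
```

With a long input and many keywords it may be worth compiling each pattern once and reusing it, rather than recompiling on every call as this sketch does.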
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
, where keyword is not plural. \\b means that keyword must be as a complete word in searched string. s? means that keyword's value may end with s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because there are examples and explanations.
