GWT RegExp - multiple matches - java

I want to find all the "code" matches in my input string with GWT RegExp. When I call regExp.exec(inputStr), it only returns the first match, even when I call it multiple times:
String inputStr = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
String patternStr = "<code[^>]*>(.+?)</code\\s*>";
// Compile and use regular expression
RegExp regExp = RegExp.compile(patternStr);
MatchResult matcher = regExp.exec(inputStr);
boolean matchFound = (matcher != null); // equivalent to regExp.test(inputStr);
if (matchFound) {
// Get all groups for this match
for (int i=0; i<matcher.getGroupCount(); i++) {
String groupStr = matcher.getGroup(i);
System.out.println(groupStr);
}
}
How can I get all the matches?
Edit: As greedybuddha noted, a regex is not really suited to parsing (X)HTML. I gave jsoup a try and it is much more convenient than a regex. My code with jsoup now looks like this. I rename all code tags to pre and add a CSS class to them:
String input = "ff<code>myCode</code>ff<code>myCode2</code>";
Document doc = Jsoup.parse(input);
Elements links = doc.select("code"); // all <code> elements
for(Element link : links){
System.out.println(link.html());
link.tagName("pre");
link.addClass("prettify");
}
System.out.println(doc);

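Compile the regular expression with the "g" flag, for global matching. With that flag each call to exec resumes after the previous match, so you can call it in a loop until it returns null: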
RegExp regExp = RegExp.compile(patternStr,"g");
I think you will also want "m" for multiline matching, "gm".
That being said, for HTML/XML parsing you should consider using JSoup or another alternative.
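For comparison, here is a minimal sketch of the same "collect every match" loop written against the standard JDK classes (java.util.regex) instead of GWT's RegExp. It only illustrates the idea: in GWT you would compile with the "g" flag and call exec until it returns null, whereas here Matcher.find() plays that role. The class and method names (AllCodeMatches, findAll) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AllCodeMatches {
    // Same pattern as in the question
    private static final Pattern CODE = Pattern.compile("<code[^>]*>(.+?)</code\\s*>");

    // Returns the content of every <code> element found in the input (hypothetical helper)
    static List<String> findAll(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CODE.matcher(input);
        // find() resumes after the previous match, like exec() with the "g" flag
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
        System.out.println(findAll(input));
    }
}
```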

Related

Extract YouTube ID with or without RegEx

How can I extract a YouTube ID without resorting to a regular expression?
With the method below, the following URLs did not work:
http://www.youtube.com/e/dQw4w9WgXcQ
http://www.youtube.com/watch?feature=player_embedded&v=dQw4w9WgXcQ
public static String extractYTId(String youtubeUrl) {
String video_id = "";
try {
if(youtubeUrl != null && youtubeUrl.trim().length() > 0 && youtubeUrl.startsWith("http")) {
String expression = "^.*((youtu.be" + "\\/)" + "|(v\\/)|(\\/u\\/w\\/)|(embed\\/)|(watch\\?))\\??v?=?([^#\\&\\?]*).*"; // var regExp = /^.*((youtu.be\/)|(v\/)|(\/u\/\w\/)|(embed\/)|(watch\?))\??v?=?([^#\&\?]*).*/;
//String expression = "^.*(?:youtu.be\\/|v\\/|e\\/|u\\/\\w+\\/|embed\\/|v=)([^#\\&\\?]*).*";
CharSequence input = youtubeUrl;
Pattern pattern = Pattern.compile(expression, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(input);
if(matcher.matches()) {
String groupIndex1 = matcher.group(7);
if(groupIndex1 != null && groupIndex1.length() == 11)
video_id = groupIndex1;
}
}
} catch(Exception e) {
Log.e("YoutubeActivity", "extractYTId " + e.getMessage());
}
return video_id;
}
Other links work fine:
http://www.youtube.com/v/0zM3nApSvMg?fs=1&hl=en_US&rel=0
http://www.youtube.com/embed/0zM3nApSvMg?rel=0
http://www.youtube.com/watch?v=0zM3nApSvMg&feature=feedrec_grec_index
http://www.youtube.com/watch?v=0zM3nApSvMg
http://youtu.be/0zM3nApSvMg
http://www.youtube.com/watch?v=0zM3nApSvMg#t=0m10s
http://youtu.be/dQw4w9WgXcQ
http://www.youtube.com/embed/dQw4w9WgXcQ
http://www.youtube.com/v/dQw4w9WgXcQ
http://www.youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube-nocookie.com/v/6L3ZvIMwZFM?version=3&hl=en_US&rel=0
You can use the following regex:
^(?:(?:https?:\/\/)?(?:www\.)?)?(youtube(?:-nocookie)?\.com|youtu\.be)\/.*?(?:embed|e|v|watch\?.*?v=)?\/?([a-z0-9]+)
Regex breakdown:
^: Start-of-line anchor
(?:(?:https?:\/\/)?(?:www\.)?)?: Optional scheme and www. prefix
  (?:https?:\/\/)?: Match http:// or https:// optionally
  (?:www\.)?: Match www. zero or one time
(youtube(?:-nocookie)?\.com|youtu\.be)\/: Match youtube.com, youtube-nocookie.com or youtu.be, followed by /
.*?: Lazy match. Match as little as possible until the next part of the pattern matches.
(?:embed|e|v|watch\?.*?v=)?\/?:
  (?:embed|e|v|watch\?.*?v=)?: Match embed, e, v, or everything from watch? to v=, or nothing
  \/?: Match / zero or one time
([a-z0-9]+): Match one or more alphanumeric characters and capture them in the second group. The class lists lowercase letters only, so the case-insensitive flag (i) is needed to match IDs such as dQw4w9WgXcQ.
Live demo using JavaScript:
var regex = /^(?:(?:https?:\/\/)?(?:www\.)?)?(youtube(?:-nocookie)?\.com|youtu\.be)\/.*?(?:embed|e|v|watch\?.*?v=)?\/?([a-z0-9]+)/i;
// An array of all the youtube URLs
var youtubeLinks = [
'http://www.youtube.com/e/dQw4w9WgXcQ',
'http://www.youtube.com/watch?feature=player_embedded&v=dQw4w9WgXcQ',
'http://www.youtube.com/v/0zM3nApSvMg?fs=1&hl=en_US&rel=0',
'http://www.youtube.com/embed/0zM3nApSvMg?rel=0',
'http://www.youtube.com/watch?v=0zM3nApSvMg&feature=feedrec_grec_index',
'http://www.youtube.com/watch?v=0zM3nApSvMg',
'http://youtu.be/0zM3nApSvMg',
'http://www.youtube.com/watch?v=0zM3nApSvMg#t=0m10s',
'http://youtu.be/dQw4w9WgXcQ',
'http://www.youtube.com/embed/dQw4w9WgXcQ',
'http://www.youtube.com/v/dQw4w9WgXcQ',
'http://www.youtube.com/watch?v=dQw4w9WgXcQ',
'http://www.youtube-nocookie.com/v/6L3ZvIMwZFM?version=3&hl=en_US&rel=0'
];
// An object to store the results
var youtubeIds = {};
// Iterate over the youtube URLs
youtubeLinks.forEach(function(url) {
// Get the value of second captured group to extract youtube ID
var id = "<span class='youtubeId'>" + (url.match(regex) || [0, 0, 'No ID present'])[2] + "</span>";
// Add the URL and the extracted ID in the result object
youtubeIds[url] = id;
});
// Log the object in the browser console
console.log(youtubeIds);
// To show the result on the page
document.getElementById('output').innerHTML = JSON.stringify(youtubeIds, 0, 4);
.youtubeId {
color: green;
font-weight: bold;
}
<pre id="output"></pre>
Your regex is designed for the youtu.be domain, so of course it doesn't work with the youtube.com one.
Construct a java.net.URL (https://docs.oracle.com/javase/7/docs/api/java/net/URL.html) from your URL string.
Use URL#getQuery() to get the query part.
See "Parse a URI String into Name-Value Collection" for ways to decode the query part into a name-value map, and get the value for the name 'v'.
If there is no query part (as in http://www.youtube.com/e/dQw4w9WgXcQ), use URL#getPath() (which will give you /e/dQw4w9WgXcQ) and parse the video ID from it, e.g. by skipping the first 3 characters: url.getPath().substring(3)
Update. Why not regex? Because the standard JDK URL parser is much more robust. It is tested by the whole Java community, while a regex-based reinvented wheel is only tested by your own code.
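The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: it uses java.net.URI rather than java.net.URL (same getQuery/getPath idea, no checked exception), it takes the last path segment instead of a fixed substring(3) so that /e/, /v/, /embed/ and youtu.be links are all covered, and it keeps the 11-character length check from the question's code. The names YouTubeIdFromUrl and extractId are invented for the example.

```java
import java.net.URI;

final class YouTubeIdFromUrl {
    // Length check borrowed from the question's code (assumption: IDs are 11 characters)
    private static final int ID_LENGTH = 11;

    // Returns the video ID, or null when none can be found (hypothetical helper)
    static String extractId(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException for malformed input
        String candidate = null;

        // 1. Look for the "v" parameter in the query part
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.startsWith("v=")) {
                    candidate = pair.substring(2);
                    break;
                }
            }
        }

        // 2. Otherwise fall back to the last segment of the path
        if (candidate == null && uri.getPath() != null) {
            String[] segments = uri.getPath().split("/");
            if (segments.length > 0) {
                candidate = segments[segments.length - 1];
            }
        }

        return (candidate != null && candidate.length() == ID_LENGTH) ? candidate : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("http://www.youtube.com/e/dQw4w9WgXcQ"));
        System.out.println(extractId("http://www.youtube.com/watch?feature=player_embedded&v=dQw4w9WgXcQ"));
    }
}
```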
I like to use this function for all YouTube video ids. I pass through the url and return only the id. Check the fiddle below.
var ytSrc = function( url ){
var regExp = /^.*((youtu.be\/)|(v\/)|(\/u\/\w\/)|(embed\/)|(watch\?))\??v?=?([^#\&\?]*).*/;
var match = url.match(regExp);
if (match&&match[7].length==11){
return match[7];
}else{
alert("Incorrect URL");
}
}
https://jsfiddle.net/keinchy/tL4thwd7/1/

Need help to form a regex in java

I want to build a regex and count its occurrences in the page source using Java. The value I am trying to search for is given in the program below.
There might be one or more spaces between tags. I am not able to form a regex for this value. Can someone please help me find the regex for this value?
My program, which checks the regex, is given below:
String regx=""<img height=""1"" width=""1"" style=""border-style:none;"" alt="""" src=""//api.adsymptotic.com/api/s/trackconversion?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel""/>";
WebDriver driver = new FirefoxDriver();
driver.navigate().to("abc.xom");
int count=0, found=0;
source = driver.getPageSource();
source = source.replaceAll("\\s+", " ").trim();
pattern = Pattern.compile(regx);
matcher = pattern.matcher(source);
while(matcher.find())
{
count++;
found=1;
}
if(found==0)
{
System.out.println("Maximiser not found");
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Fail";
}
else
{
System.out.println("Maximiser is found" + count);
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Pass";
}
count=0; found=0;
It is hard to tell without the original text and expected result, but your Pattern clearly won't compile as is.
You should single-escape double quotes (\") and double-escape regex special characters (e.g. \\?) for your code and your Pattern to compile.
Something along the lines of:
String regx="<img height=\"1\" width=\"1\" style=\"border-style:none;\" " +
"alt=\"\" src=\"//api.adsymptotic.com/api/s/trackconversion" +
"\\?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b" +
"&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel\"/>";
Also consider scraping markup with an appropriate framework (e.g. jsoup for HTML) instead of regex.
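If the tag is meant to be matched literally, one way to sidestep the manual escaping of regex special characters is Pattern.quote. The sketch below is only an illustration: the shortened tag, the example.invalid host and the names LiteralTagCounter and countOccurrences are made up, and it assumes the literal tag itself uses single spaces, since the page source is whitespace-normalized the same way the question's code does it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralTagCounter {
    // Counts how often literalTag occurs in source (hypothetical helper)
    static int countOccurrences(String source, String literalTag) {
        // Collapse runs of whitespace to one space, as in the question's code
        String normalized = source.replaceAll("\\s+", " ").trim();
        // Pattern.quote makes the whole tag a literal, so '?' and '.' need no escaping
        Matcher matcher = Pattern.compile(Pattern.quote(literalTag)).matcher(normalized);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened, made-up tracking pixel; Java string escaping of quotes is still required
        String tag = "<img height=\"1\" src=\"//example.invalid/track?_pid=1&_aid=2\"/>";
        String page = "<p>a</p>  " + tag + "\n<p>b</p> " + tag;
        System.out.println("Occurrences: " + countOccurrences(page, tag));
    }
}
```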

Unable to parse Multiple lined XML Message using Java "Pattern" and "Matcher"

I am unable to parse a multi-line XML message payload using Pattern.compile(regex). However, if I put the same message on a single line, it gives me the expected result. For example, if I parse
<Document> <RGOrdCust50K50F> AccName AccNo AccAddress </RGOrdCust50K50F> </Document>
it gives me the <RGOrdCust50K50F> tag value as: AccName AccNo AccAddress. But if I use multiple lines, like
<Document> <RGOrdCust50K50F>AccNo
AccName
AccAddress </RGOrdCust50K50F></Document>
it throws java.lang.IllegalStateException: No match found
The test case code I am using is below:
public class ParseXMLMessage {
public static void main(String[] args) {
String fldName = "RGOrdCust50K50F";
String message = "<?xml version=1.0 encoding=UTF-8?> <Document><RGOrdCust50K50F>1234\n"
+ "ABCD\n"
+ "LONDON,UK </RGOrdCust50K50F></Document>";
String fldValue = getTagValue(fldName, message);
System.out.println("fldValue:"+fldValue);
}
private static String getTagValue(String tagName, String message) {
String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
System.out.println("regex:"+regex);
Pattern pattern = Pattern.compile(regex);
System.out.println("pattern:"+pattern);
Matcher matcher = pattern.matcher(message);
System.out.println("matcher:"+matcher);
matcher.find(0);
String tagValue = null;
try {
tagValue = matcher.group();
} catch (IllegalStateException isex) {
System.out.println("No Tag/Match found " + isex.getMessage());
}
return tagValue;
}
}
As a business requirement I need the message to span multiple lines, but when I make it multi-line I get the exception.
I am unable to fix this issue. Is there a problem with the regex I am using? Do I need to use '\n' in the regex to resolve it? Kindly assist.
If you are parsing XML, use an XML parser to do it. Your regex will get increasingly complex and fragile as you find more and more situations that it can't handle adequately.
There are a large number of mature and stable XML processing libraries. I tend to stick with what I know, and JDOM has a very shallow learning curve and will handle this sort of processing very easily.
The issue is caused by the '.' metacharacter. See http://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html
. Any character (may or may not match line terminators)
By default '.' does not match line terminators; Pattern.DOTALL makes it match them. Try the following code:
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE| Pattern.DOTALL);
See also the following topic: java regex string matches and multiline delimited with new line
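To make the effect of the flag concrete, here is a small sketch based on the getTagValue method from the question, with Pattern.DOTALL added. It assumes the tag name contains no regex metacharacters; the class name DotAllTagValue is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllTagValue {
    // Returns the text between <tagName> and </tagName>, or null if there is no match
    static String getTagValue(String tagName, String message) {
        String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
        // DOTALL lets '.' match line terminators, so the value may span several lines
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(message);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String message = "<Document><RGOrdCust50K50F>1234\nABCD\nLONDON,UK </RGOrdCust50K50F></Document>";
        System.out.println("fldValue:" + getTagValue("RGOrdCust50K50F", message));
    }
}
```

Checking the result of find() before calling group() also avoids the IllegalStateException from the question when nothing matches.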

InnerHTML for Java

I would like JavaScript style innerHTML in Java. For instance, I want to get 'TRUE' from the string below:
String control = "<div class='myclass'>TRUE</div>";
But my pattern seems to be off as find() returns false. Ideas anyone?
Pattern pattern = Pattern.compile(">(.*?)<");
Matcher matcher = pattern.matcher(control);
if(matcher.find()) {
result = matcher.group(1);
}
Get rid of the question mark:
public static void main(String[] args) {
String control = "<div class='myclass'>TRUE</div>";
Pattern pattern = Pattern.compile(">(.*)<");
Matcher matcher = pattern.matcher(control);
String result = null;
if(matcher.find()) {
result = matcher.group(1);
}
System.out.print(result);
}
BTW it would be better to learn how to use Java's DOM objects and XPath classes.
Either use jQuery or, if you really insist on doing it in Java, try using jsoup to strip out the HTML and return only the safe content.
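As a rough sketch of the DOM suggestion, the JDK's built-in XML parser can read the element's text directly. This assumes the markup is well-formed XML, which real-world HTML often is not (that is where a library such as jsoup helps). Note that getTextContent returns the text only, not nested markup as innerHTML would. The names InnerTextWithDom and innerText are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class InnerTextWithDom {
    // Parses a well-formed fragment and returns the text of its root element (hypothetical helper)
    static String innerText(String wellFormedMarkup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String control = "<div class='myclass'>TRUE</div>";
        System.out.println(innerText(control));
    }
}
```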

substring between two delimiters

I have a string such as: "This is a URL http://www.google.com/MyDoc.pdf which should be used"
I just need to extract the URL that starts with http and ends with pdf:
http://www.google.com/MyDoc.pdf
String sLeftDelimiter = "http://";
String[] tempURL = sValueFromAddAtt.split(sLeftDelimiter );
String sRequiredURL = sLeftDelimiter + tempURL[1];
This gives me the output as "http://www.google.com/MyDoc.pdf which should be used"
Need help on this.
This kind of problem is what regular expressions were made for:
Pattern findUrl = Pattern.compile("\\bhttp.*?\\.pdf\\b");
Matcher matcher = findUrl.matcher("This is a URL http://www.google.com/MyDoc.pdf which should be used");
while (matcher.find()) {
System.out.println(matcher.group());
}
The regular expression explained:
\b before the "http" there is a word boundary (i.e. xhttp does not match)
http the string "http" (be aware that this also matches "https" and "httpsomething")
.*? any character (.) any number of times (*), but try to use the least amount of characters (?)
\.pdf the literal string ".pdf"
\b after the ".pdf" there is a word boundary (i.e. .pdfoo does not match)
If you would like to match only http and https, try to use this instead of http in your string:
https?\: - this matches the string http, then an optional "s" (indicated by the ? after the s) and then a colon.
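Putting the stricter https? variant together, a small sketch might look like this. Two things differ from the pattern above and are my own assumptions: the scheme is followed by // explicitly, and \S is used instead of . so that a match cannot run across whitespace. The names PdfUrlFinder and findPdfUrls are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfUrlFinder {
    // http or https, then the shortest run of non-whitespace characters ending in ".pdf"
    private static final Pattern PDF_URL = Pattern.compile("\\bhttps?://\\S*?\\.pdf\\b");

    // Returns every matching URL in the text, in order of appearance (hypothetical helper)
    static List<String> findPdfUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = PDF_URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "This is a URL http://www.google.com/MyDoc.pdf which should be used";
        System.out.println(findPdfUrls(text));
    }
}
```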
Why don't you use the startsWith("http://") and endsWith(".pdf") methods of the String class?
Both methods return a boolean value; if both return true, your condition is satisfied, otherwise it is not.
Try this
String StringName="This is a URL http://www.google.com/MyDoc.pdf which should be used";
StringName=StringName.substring(StringName.indexOf("http:"),StringName.indexOf("which"));
You can use the power of regular expressions here.
First find the URL in the original string, then remove everything else.
The following code shows my suggestion:
String regex = "\\b(http|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
String str = "This is a URL http://www.google.com/MyDoc.pdf which should be used";
String[] splited = str.split(regex);
for(String current_part : splited)
{
str = str.replace(current_part, "");
}
System.out.println(str);
This snippet can retrieve any URL from any string.
You can add custom protocols such as https to the protocol part of the above regular expression.
I hope my answer helps you ;)
public static String getStringBetweenStrings(String aString, String aPattern1, String aPattern2) {
// Locate the left delimiter first; indexOf returns -1 when it is absent
int start = aString.indexOf(aPattern1);
if (start < 0) {
return null;
}
int pos1 = start + aPattern1.length();
// Search for the right delimiter only after the end of the left one
int pos2 = aString.indexOf(aPattern2, pos1);
if (pos2 < 0) {
return null;
}
return aString.substring(pos1, pos2);
}
You can use String.replaceAll with a capturing group and back reference for a very concise solution:
String input = "This is a URL http://www.google.com/MyDoc.pdf which should be used";
System.out.println(input.replaceAll(".*(http.*?\\.pdf).*", "$1"));
Here's a breakdown for the regex: https://regexr.com/3qmus
