Unable to parse a multi-line XML message using Java "Pattern" and "Matcher" - java

I am unable to parse a multi-line XML message payload using Pattern.compile(regex). However, if I put the same message on a single line, I get the expected result. For example, if I parse
<Document> <RGOrdCust50K50F> AccName AccNo AccAddress </RGOrdCust50K50F> </Document>
I get the <RGOrdCust50K50F> tag value "AccName AccNo AccAddress", but if I spread the message over multiple lines like
<Document> <RGOrdCust50K50F>AccNo
AccName
AccAddress </RGOrdCust50K50F></Document>
it throws java.lang.IllegalStateException: No match found
The test code I am using is below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseXMLMessage {
    public static void main(String[] args) {
        String fldName = "RGOrdCust50K50F";
        // Multi-line payload: the tag value spans three lines
        String message = "<?xml version=1.0 encoding=UTF-8?> <Document><RGOrdCust50K50F>1234\n"
                + "ABCD\n"
                + "LONDON,UK </RGOrdCust50K50F></Document>";
        String fldValue = getTagValue(fldName, message);
        System.out.println("fldValue:" + fldValue);
    }

    private static String getTagValue(String tagName, String message) {
        // Lookbehind for the opening tag, lazy match, lookahead for the closing tag
        String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
        System.out.println("regex:" + regex);
        Pattern pattern = Pattern.compile(regex);
        System.out.println("pattern:" + pattern);
        Matcher matcher = pattern.matcher(message);
        System.out.println("matcher:" + matcher);
        matcher.find(0);
        String tagValue = null;
        try {
            // group() throws IllegalStateException when find() did not match
            tagValue = matcher.group();
        } catch (IllegalStateException isex) {
            System.out.println("No Tag/Match found " + isex.getMessage());
        }
        return tagValue;
    }
}
As a business requirement, I need the message to be multi-line, but when I make it multi-line I get the exception.
I am unable to fix this issue. Is there a problem with the regex I am using? Do I need to use '\n' in the regex to resolve it? Kindly assist.

If you are parsing XML, use an XML parser to do it: your regex will get increasingly complex and fragile as you find more and more situations that it can't handle adequately.
There are many mature and stable XML processing libraries. I tend to stick with what I know: JDOM has a very shallow learning curve and will handle this sort of processing very easily.
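As a rough illustration of the parser route, here is a minimal sketch that reads the tag value with the DOM parser bundled in the JDK (javax.xml.parsers) rather than JDOM, so it needs no extra dependency. The class name DomTagValueSketch is made up for this example, and the sample message quotes the attributes of the XML declaration, which is an assumption about the real payload: a parser will reject a declaration with unquoted attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTagValueSketch {
    // Returns the text of the first <tagName> element, or null if there is none.
    static String getTagValue(String tagName, String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here
            throw new IllegalArgumentException("Could not parse message", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample payload; the declaration attributes are quoted so that it is well-formed
        String message = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Document><RGOrdCust50K50F>1234\nABCD\nLONDON,UK </RGOrdCust50K50F></Document>";
        System.out.println("fldValue:" + getTagValue("RGOrdCust50K50F", message));
    }
}
```

Line breaks inside the element are simply part of its text content, so the multi-line case needs no special handling.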

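The issue is the '.' metacharacter. See http://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html
. Any character (may or may not match line terminators)
By default '.' does not match line terminators, so the match fails as soon as the tag value contains a newline. Pattern.DOTALL makes '.' match them as well. Pattern.MULTILINE only changes how ^ and $ behave, so it is harmless but not required for this regex. Try the following code:
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE| Pattern.DOTALL);
See also the following topic: java regex string matches and multiline delimited with new line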
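To make the effect of the flag concrete, here is a small sketch that reuses the regex from the question and compiles it with Pattern.DOTALL only. The class name DotallTagValueSketch is hypothetical, and the method returns null instead of throwing when nothing matches, which is a choice made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallTagValueSketch {
    // Returns the text between <tagName> and </tagName>, or null if the tag is absent.
    static String getTagValue(String tagName, String message) {
        String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
        // DOTALL lets '.' match line terminators, so the value may span several lines
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(message);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String message = "<Document><RGOrdCust50K50F>1234\nABCD\nLONDON,UK </RGOrdCust50K50F></Document>";
        System.out.println("fldValue:" + getTagValue("RGOrdCust50K50F", message));
    }
}
```

Checking the result of find() before calling group() also avoids the IllegalStateException when the tag is missing.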

Related

How to navigate to correct "URL" from outlook email content

I'm successfully reading Outlook email with JavaMail. But when I try to get the link in the email body, I don't get the exact URL; instead I get the URL with some extra characters like "=3D?*/". I tried the code below, but it didn't help.
public List<String> getUrlsFromMessage(Message message, String linkText) throws Exception {
String html = getMessageContent(message);
List<String> allMatches = new ArrayList<String>();
// (<a [^>]+>)
Matcher matcher = Pattern.compile(" (<a [^>]+>)" + linkText + "</a>").matcher(html);
while (matcher.find()) {
String aTag = matcher.group(1);
allMatches.add(aTag.substring(aTag.indexOf("http"), aTag.indexOf("\">")));
}
return allMatches;
}
Also I changed the pattern to
Pattern linkPattern = Pattern.compile(" <a\\b[^>]*href=\"([^\"]*)[^>]*>(.*?)</a>",
Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
But it still gives me the wrong URL.
Finally I found a way to retrieve the exact URL using StringBuilder: I removed the unwanted characters from the string until I got the correct URL. This may not be good coding practice, but it was the only workaround that worked for me.
StringBuilder build = new StringBuilder(link);
build.deleteCharAt(43); // Each deletion shifts the following characters one position toward the front.
build.deleteCharAt(51);
build.deleteCharAt(51);
driver.get(build.toString());
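The "=3D" in the extracted link looks like quoted-printable transfer encoding, where "=3D" stands for "=" and a "=" at the end of a line marks a soft line break. That is an assumption about this message, since the raw content is not shown. If it holds, decoding the body before applying the regex may be more robust than deleting characters at fixed positions. A minimal sketch, with hypothetical class and method names, assuming ASCII input whose decoded bytes are UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class QuotedPrintableSketch {
    // Decodes "=XX" escapes and drops soft line breaks ("=" directly before a line ending).
    static String decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = body.length();
        int i = 0;
        while (i < n) {
            char c = body.charAt(i);
            if (c == '=' && i + 1 < n && body.charAt(i + 1) == '\n') {
                i += 2; // soft break written as "=\n"
            } else if (c == '=' && i + 2 < n
                    && body.charAt(i + 1) == '\r' && body.charAt(i + 2) == '\n') {
                i += 3; // soft break written as "=\r\n"
            } else if (c == '=' && i + 2 < n
                    && isHex(body.charAt(i + 1)) && isHex(body.charAt(i + 2))) {
                out.write(Integer.parseInt(body.substring(i + 1, i + 3), 16));
                i += 3; // "=XX" becomes the byte with hex value XX
            } else {
                out.write(c); // plain ASCII character, copied as is
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isHex(char ch) {
        return Character.digit(ch, 16) >= 0;
    }

    public static void main(String[] args) {
        String raw = "<a href=3D\"http://example.com/page?id=3D42\">link</a>";
        System.out.println(decode(raw));
    }
}
```

If the mail library already returns a decoded body for the part, this step is unnecessary; the sketch only applies when the raw encoded text reaches the regex.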

Two separate patterns and matchers (java)

I'm working on a simple bot for Discord. The first pattern (the reading) works fine and I get the results I'm looking for, but the second one doesn't seem to work and I can't figure out why.
Any help would be appreciated.
public void onMessageReceived(MessageReceivedEvent event) {
if (event.getMessage().getContent().startsWith("!")) {
String output, newUrl;
String word, strippedWord;
String url = "http://jisho.org/api/v1/search/words?keyword=";
Pattern reading;
Matcher matcher;
word = event.getMessage().getContent();
strippedWord = word.replace("!", "");
newUrl = url + strippedWord;
//Output contains the raw text from jisho
output = getUrlContents(newUrl);
//Searching through the raw text to pull out the first "reading: "
reading = Pattern.compile("\"reading\":\"(.*?)\"");
matcher = reading.matcher(output);
//Searching through the raw text to pull out the first "english_definitions: "
Pattern def = Pattern.compile("\"english_definitions\":[\"(.*?)]");
Matcher matcher2 = def.matcher(output);
event.getTextChannel().sendMessage(matcher2.toString());
if (matcher.find() && matcher2.find()) {
event.getTextChannel().sendMessage("Reading: "+matcher.group(1)).queue();
event.getTextChannel().sendMessage("Definition: "+matcher2.group(1)).queue();
}
else {
event.getTextChannel().sendMessage("Word not found").queue();
}
}
}
You have to escape the [ character as \\[ (one backslash for the Java string and one for the regex). Unescaped, [ opens a character class, so the pattern has no capturing group at all. You also forgot the closing \".
The correct pattern looks like this:
Pattern def = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");
In the output, you might want to re-add the \" at the start and end:
event.getTextChannel().sendMessage("Definition: \""+matcher2.group(1) + "\"").queue();
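Here is a small standalone check of the corrected pattern. The JSON strings are made-up samples in the shape the question implies, not real jisho.org output, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefinitionPatternSketch {
    // \\[ matches a literal '[', and the group captures what lies between ["  and  "]
    static final Pattern DEF = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");

    // Returns the captured text of the first match, or null if there is none.
    static String firstDefinitions(String json) {
        Matcher m = DEF.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String one = "{\"reading\":\"neko\",\"english_definitions\":[\"cat\"]}";
        String two = "{\"reading\":\"neko\",\"english_definitions\":[\"cat\",\"feline\"]}";
        System.out.println(firstDefinitions(one));
        System.out.println(firstDefinitions(two));
    }
}
```

Note that the group runs up to the first "] in the input, so when the array holds several definitions the captured text contains all of them together with the quotes and commas between them. A JSON parser would be the sturdier choice if individual definitions are needed.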

Need help to form a regex in java

I want to build a regex and count its occurrences in the page source using Java. The value I am trying to search for is given in the program below.
There might be one or more spaces between tags. I am not able to form a regex for this value. Can someone please help me find one?
My program, which checks the regex, is given below:
String regx=""<img height=""1"" width=""1"" style=""border-style:none;"" alt="""" src=""//api.adsymptotic.com/api/s/trackconversion?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel""/>";
WebDriver driver = new FirefoxDriver();
driver.navigate().to("abc.xom");
int count=0, found=0;
source = driver.getPageSource();
source = source.replaceAll("\\s+", " ").trim();
pattern = Pattern.compile(regx);
matcher = pattern.matcher(source);
while(matcher.find())
{
count++;
found=1;
}
if(found==0)
{
System.out.println("Maximiser not found");
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Fail";
}
else
{
System.out.println("Maximiser is found" + count);
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Pass";
}
count=0; found=0;
It's hard to tell without the original text and expected result, but your Pattern clearly won't compile as is.
You should single-escape double quotes (\") and double-escape regex special characters (e.g. \\?) for your code and your Pattern to compile.
Something along the lines of:
String regx="<img height=\"1\" width=\"1\" style=\"border-style:none;\" " +
"alt=\"\" src=\"//api.adsymptotic.com/api/s/trackconversion" +
"\\?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b" +
"&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel\"/>";
Also consider scraping markup with an appropriate framework (e.g. jsoup for HTML) instead of regex.
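When the search value is a fixed literal, an alternative to escaping each special character by hand is Pattern.quote, which turns the whole string into a literal pattern. This is a different technique from the manual escaping shown above, sketched here with a short made-up tag instead of the real tracking pixel. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralCountSketch {
    // Counts non-overlapping occurrences of a literal string in the source.
    static int countOccurrences(String literal, String source) {
        // Pattern.quote makes characters such as '?' and '.' match themselves
        Matcher matcher = Pattern.compile(Pattern.quote(literal)).matcher(source);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical tag and page source, for illustration only
        String tag = "<img height=\"1\" width=\"1\" src=\"//example.com/track?_pid=1&_aid=2\"/>";
        String source = "<p>" + tag + "</p>   <div>" + tag + "</div>";
        // Collapse whitespace runs first, as the question's code does
        source = source.replaceAll("\\s+", " ").trim();
        System.out.println("Occurrences: " + countOccurrences(tag, source));
    }
}
```

Because the source is normalized to single spaces before matching, the literal only has to contain single spaces as well.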

GWT RegExp - multiple matches

I want to find all the "code" matches in my input string (With GWT RegExp). When I call the "regExp.exec(inputStr)" method it only returns the first match, even when I call it multiple times:
String inputStr = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
String patternStr = "<code[^>]*>(.+?)</code\\s*>";
// Compile and use regular expression
RegExp regExp = RegExp.compile(patternStr);
MatchResult matcher = regExp.exec(inputStr);
boolean matchFound = (matcher != null); // equivalent to regExp.test(inputStr);
if (matchFound) {
// Get all groups for this match
for (int i=0; i<matcher.getGroupCount(); i++) {
String groupStr = matcher.getGroup(i);
System.out.println(groupStr);
}
}
How can I get all the matches?
Edit: As greedybuddha noted, a regex is not really suited to parsing (X)HTML. I gave jsoup a try and it is much more convenient than a regex. My code with jsoup now looks like this. I rename all code tags and add a CSS class to them:
String input = "ff<code>myCode</code>ff<code>myCode2</code>";
Document doc = Jsoup.parse(input); // the second argument of parse(String, String) is a base URI, not a charset
Elements links = doc.select("code"); // all <code> elements
for(Element link : links){
System.out.println(link.html());
link.tagName("pre");
link.addClass("prettify");
}
System.out.println(doc);
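Compile the regular expression with the "g" flag, for global matching. With that flag set, each call to exec continues from the end of the previous match and returns null once there are no more matches.
RegExp regExp = RegExp.compile(patternStr,"g");
The "m" (multiline) flag only changes how ^ and $ match, so add it ("gm") only if your pattern uses those anchors.
That being said, for HTML/XML parsing you should consider using jsoup or another alternative.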

Replace String in Java with regex and replaceAll

Is there a simple way to rewrite a String using a regex in Java?
I have to adapt an HTML page. To do that, I have to rewrite several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The strings differ only in the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)
You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");
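A short runnable sketch of the second variant follows. The class name and the sample HTML are made up for illustration, and it uses [0-9]+ rather than [0-9]* so that an ID with no digits is not matched, which is a small deviation from the line above.

```java
class HrefRewriteSketch {
    // Rewrites href="/browse/PJBUGS-<id>" to href="PJBUGS-<id>.html"
    static String rewrite(String html) {
        // $1 in the replacement refers to the text captured by the first parenthesized group
        return html.replaceAll("href=\"/browse/(PJBUGS-[0-9]+)\"", "href=\"$1.html\"");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/browse/PJBUGS-911\">first</a> <a href=\"/browse/PJBUGS-34234\">second</a>";
        System.out.println(rewrite(html));
    }
}
```

replaceAll applies the substitution to every match, so all links in the page are rewritten in one call.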
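This does not use a regex, but it may still solve your problem, provided input holds exactly one attribute of the form href="/browse/PJBUGS-911":
output = "href=\"" + input.substring(input.lastIndexOf('/') + 1, input.lastIndexOf('"')) + ".html\"";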
This is how I would do it:
public static void main(String[] args)
{
String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
"blahblah href=\"/browse/PJBUGS-34234\"";
Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
Matcher mtchr = ptrn.matcher(text);
while(mtchr.find())
{
String match = mtchr.group(0);
String insMatch = mtchr.group(1);
String repl = "href=\"" + insMatch + ".html\"";
System.out.println("orig = <" + match + "> repl = <" + repl + ">");
}
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If you are only interested in replacing everything, you don't need the loop; I used it just for debugging and to show how the regex works.
