Match the nth occurrence in HTML with a Java regex - java

Hello, I need to find the second occurrence of a match in a string.
I have a string like this:
<span class="test">
example
</span>
<span class="test">
example1
</span>
<span class="test">
example2
</span>
I need to extract example1 from the content. I tried (?:<span class="test"){2}(.*?)</span> but it's not working.
Please don't tell me not to parse HTML with regex. I am aware of that, but I have no choice.

The following regex:
<span class="test">\s*(.*?)\s*</span>
Will produce the following captures:
[0] => example
[1] => example1
[2] => example2
You can reference whichever one you like.
But if for some reason you can't reference a specific capture (I can't imagine why not, so this is kind of academic), then the following will return the second one:
<span class="test">(?s).*?</span>\s*<span class="test">\s*(.*?)\s*</span>
Note the use of "single line mode", specified by (?s). This means the . will also match new-line characters. In Java this can be enabled by using the DOTALL option if you're using the .compile() approach.
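As a rough illustration of the DOTALL approach, the sketch below compiles the second pattern with Pattern.DOTALL (instead of the inline (?s)) and runs it against the multi-line input from the question. The class name SecondSpanDemo and the helper secondSpanText are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondSpanDemo {
    // Skips the first <span class="test">...</span>, then captures the trimmed
    // text of the next one. DOTALL lets '.' cross line breaks.
    private static final Pattern SECOND_SPAN = Pattern.compile(
        "<span class=\"test\">.*?</span>\\s*<span class=\"test\">\\s*(.*?)\\s*</span>",
        Pattern.DOTALL);

    // Hypothetical helper: returns the second span's text, or null if absent.
    static String secondSpanText(String html) {
        Matcher m = SECOND_SPAN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<span class=\"test\">\nexample\n</span>\n"
            + "<span class=\"test\">\nexample1\n</span>\n"
            + "<span class=\"test\">\nexample2\n</span>";
        System.out.println(secondSpanText(html)); // prints: example1
    }
}
```

Without the DOTALL flag the same pattern finds nothing in this input, because the span contents contain line breaks.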

Try this:
(?:<span class="test".*?</span>)\s*<span[^>]*>\s*(.*?)\s*</span>
The desired result is the only matched group. For this to work you need to use the DOTALL flag.

Try this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "<span class=\"test\"> example</span>\n<span class=\"test\"> example1</span>\n<span class=\"test\"> example2</span>";
Matcher m1 = Pattern.compile("<span class=\"test\">(.*?)</span>").matcher(text);
// Collect the trimmed text of every matching span.
List<String> matches = new ArrayList<>();
while (m1.find()) {
    matches.add(m1.group(1).trim());
}
System.out.println(matches.get(1)); // second match: example1

Related

Remove last character of string in Apache Velocity templating language not working

I am trying to convert markdown hyperlinks into html hyperlinks in the Apache Velocity template language (for Marketo). I am nearly there by splitting on ']' and then removing the first character from the remaining '[link text' piece, and the first and last characters from the remaining '(url)' piece.
It will let me remove the first character in each, but doesn't like my code for removing the last character. This is simple code so I don't know why it isn't working.
#set( $refArr = $reference.split(']',2) )
<li>
<a href=$refArr[1].substring(1,$refArr[1].length()-1)>$refArr[0].substring(1)</a>
</li>
It just doesn't like the '-1' part (see the error below). Velocity is supposed to have full Java method access, but it appears that it may be confusing Java with HTML.
Cannot get email content- <div>An error occurred when procesing the email Body! </div> <p>Encountered "-1" near</p>
I've also tried using regex with the replace method as well but that doesn't work either, whether with the '(' character escaped, double escaped, or not escaped.
Apparently you have to use the MathTool class for Velocity:
#set( $refArr = $reference.split(']',2) )
<li>
<a href=$refArr.get(1).substring(1,$math.sub($refArr.get(1).length(),1))>$refArr.get(0).substring(1)</a>
</li>
Your code should work in recent versions; otherwise, you can do it in two steps:
#set($len = $refArr.get(1).length() - 1)
<a href=$refArr[1].substring(1,$len)>$refArr[0].substring(1)</a>
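Since the template is calling ordinary java.lang.String methods, the index arithmetic can be checked in plain Java. This is a minimal sketch, assuming the reference has the form [link text](url). The class and method names are made up for illustration, and the href value is quoted here, unlike in the template.

```java
class MarkdownLinkDemo {
    // Hypothetical helper mirroring the template: split on the first ']',
    // drop the leading '[' from the text and the surrounding '(' ')' from the URL.
    static String toHtmlLink(String reference) {
        String[] refArr = reference.split("]", 2);
        String text = refArr[0].substring(1);
        String url = refArr[1].substring(1, refArr[1].length() - 1);
        return "<a href=\"" + url + "\">" + text + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(toHtmlLink("[Example](https://example.com)"));
        // prints: <a href="https://example.com">Example</a>
    }
}
```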

Regex pattern to identify attribute value

I have a source code file which I am trying to read using an automatic regex processor class in Java.
However, I am unable to form a correct regex pattern to get the values when they appear multiple times in a line.
The input text is:
<input name="id" type="radio" bgcolor="<bean:write name='color'/>" value="<bean:write name='nameProp' property='nameVal'/>" <logic:equal name="checkedStatus" value="0">checked</logic:equal>>
And I want the matcher.find to output following terms:
<bean:write name='color'/>
<bean:write name='nameProp' property='nameVal'/>
Kindly help to form the regex pattern for this scenario.
Use this regex to find those terms:
<bean:write[^\/]*\/>
It matches the literal text <bean:write, then any run of characters other than /, then the closing />.
Use it like this:
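import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// inputText holds the line to scan.
List<String> matches = new ArrayList<>();
Matcher m = Pattern.compile("<bean:write[^\\/]*\\/>").matcher(inputText);
while (m.find()) {
    matches.add(m.group());
}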
Tested on Regex101.
I caution you, though, against parsing HTML with regex. If you need anything more complicated than this, you should probably consider using an XML parser instead. See this famous answer.

Selenium Find By Xpath returning the wrong element

I am trying to select a channel from a series of channels that are displayed in an HTML table. I'm using the following Selenium code to select the link:
WebElement channel = driver.findElement(By.xpath("//span[contains(text(),Sales)]"));
channel.click();
However it's selecting the first channel in the list (Account Management) instead. I would expect that it would either select the correct channel or throw an error, rather than select the wrong one. The following is the full xpath of the channel I want:
/html/body/div[2]/div[2]/form/div/table/tbody[2]/tr/td/ul/li[2]/a/span
The list of channels is defined like this in the HTML code:
<form action="nextpage.do" method="post" name="selectChannelForm">
<div class="de">
<h2>Select channel</h2>
<table id="selectChannelForm">
<tbody id=""></tbody>
<tbody id="">
<tr rowtype="container">
<td class="desecond" colspan="3">
<ul>
<li>
<a id="selected_a" href="nextpage.do?selectedChannel=123">
<span>Account Management</span></a>
</li>
<li>
<a id="selected_a" href="nextpage.do?selectedChannel=456">
<span>Sales</span></a>
</li>
<li>
<a id="selected_a" href="nextpage.do?selectedChannel=789">
<span>Complaints</span></a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
<input type="hidden" value="selectChannelForm" name="formid">
</div>
First - your mistake!
You forgot the quotation marks around "Sales". Change your code as follows and it will work:
WebElement channel = driver.findElement(By.xpath("//span[contains(text(),'Sales')]"));
channel.click();
Second - XPath bug?
You're right that it is weird that you get no error message but instead the first span element.
This might actually be a bug in XPath. The contains function realizes that your second argument is not a string, but instead of returning false, it returns true.
It actually matches all three span elements. You only get the first one because you used the findElement function.
Try this and you will see the quirk:
System.out.println(driver.findElements(By.xpath("//span[contains(text(),Sales)]")).size());
Result will be:
3
Third - might be "as designed"
Having a look at the W3C definition, you will find the following line:
If the value of $arg2 is the zero-length string, then the function
returns true.
Then, in Microsoft's XPath documentation, you will find another piece of the puzzle:
If an argument is not of type string, it is first converted to a
string and then evaluated.
Putting all this information together, I guess XPath interprets your non-string, non-variable second parameter as an empty string and therefore returns true for all span elements, since you were searching for //span.
UPDATE
From @MichaelKay in the comments we learn that my "guess" was pretty close:
In XPath and XQuery, a bare name like "hello" means child::hello, and
if you're not using schema-awareness, then the system will just look
for children called hello, and if there aren't any, it will return an
empty node-set.
Conclusion: The behaviour the OP sees is as designed, even though it seems pretty non-intuitive.
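The effect can be reproduced without a browser. The sketch below uses the JDK's built-in XPath 1.0 engine (javax.xml.xpath), not Selenium, on a cut-down copy of the channel list. The class name and the count helper are made up for illustration, and a browser's XPath engine is only assumed to behave the same way.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathContainsDemo {
    // Cut-down version of the channel list from the question.
    static final String XML =
        "<ul>"
        + "<li><a><span>Account Management</span></a></li>"
        + "<li><a><span>Sales</span></a></li>"
        + "<li><a><span>Complaints</span></a></li>"
        + "</ul>";

    // Hypothetical helper: number of nodes the expression selects in XML.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Bare name: Sales is read as child::Sales, an empty node-set, i.e. "".
        System.out.println(count("//span[contains(text(),Sales)]"));   // prints: 3
        // Quoted: an actual string comparison.
        System.out.println(count("//span[contains(text(),'Sales')]")); // prints: 1
    }
}
```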
The XPath you are using is missing quotes around the Sales text. The contains() function takes string arguments, and a string literal must be written in quotes. Update your XPath as follows:
driver.findElement(By.xpath("//span[contains(text(),'Sales')]")).click();
Or, if you want to assign it to a WebElement, include the quotes in the same way:
WebElement channel = driver.findElement(By.xpath("//span[contains(text(),'Sales')]"));
channel.click();
If you want to nest double quotes inside the Java string, escape them with a backslash (\). Here's how:
WebElement channel = driver.findElement(By.xpath("//span[contains(text(),\"Sales\")]"));
channel.click();
Hope this helps.
The XPath needs to be changed to "//span[contains(text(),'Sales')]".
As the function definitions below show, contains compares two strings, so the second argument must also be a string.
Source: https://en.wikipedia.org/wiki/XPath
contains(s1, s2)
returns true if s1 contains s2.
text()
finds a node of type text

Regular expression to remove Some HTML tags but keep Span tag

Is there an expression which will get the value between two HTML tags? Also, if a span tag is present, I need to keep it as it is.
input
<table><tr>
<td>abc<td/> <span class="abc">Test</span>
</tr>
</table>
Output
abc <span class="abc">Test</span>
I tried the following solution, but it removes the span tag as well:
String input="<table><tr><td>abc<td/> <span>Test</span></tr></table>";
String newValue = input.replaceAll("<[^>]*>", "");
System.out.println(newValue);
Output of the above code:
abc Test
but the required output is:
abc <span class="abc">Test</span>
You can use a negative lookahead (?!...), which means "not followed by", to test the tag. Example in Java syntax:
<(?!/?span\\b)[^>]*>
I think this will do what you are looking for:
str.replaceAll("<(?!\\/?span)[^>]+>", "")
This looks for a <, then uses a negative lookahead to check that it is not followed by span or /span, matches everything up to the next >, and replaces all of that with nothing.
Example:
String str = "<table><tr><td>abc<td/> <span class=\"abc\">Test</span></tr></table>";
System.out.println(str.replaceAll("<(?!\\/?span)[^>]+>", ""));
// prints: abc <span class="abc">Test</span>
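The two patterns above differ in one detail: only the first ends the lookahead with \b. A small sketch of what that changes, using a made-up tag name (spanner) and a hypothetical helper stripExceptSpan:

```java
class StripTagsDemo {
    // Hypothetical helper: removes every tag except <span ...> and </span>.
    // The \b keeps names that merely start with "span" from being preserved.
    static String stripExceptSpan(String html) {
        return html.replaceAll("<(?!/?span\\b)[^>]*>", "");
    }

    public static void main(String[] args) {
        String input = "<table><tr><td>abc<td/> <span class=\"abc\">Test</span></tr></table>";
        System.out.println(stripExceptSpan(input));
        // prints: abc <span class="abc">Test</span>
        System.out.println(stripExceptSpan("<spanner>x</spanner>"));
        // prints: x
    }
}
```

Without the \b, the second call would leave the <spanner> tags in place, because the lookahead would treat them as span tags.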

Using regex, how to change the attribute values from one pattern to another

I have an HTML file in which the elements are named as follows:
<input type="text" name="HCFA_DETAIL_SUPPLEMENTAL" value="" size="64" />
My requirement is to rename the name attribute values to follow the Java naming convention, as follows:
<input type="text" name="hcfaDetailSupplemental" value="" size="64" />
Since there are a large number of such elements, I want to accomplish this using regex. Can anyone suggest how to achieve that?
Do not use regular expressions to go over HTML (why here). Using an appropriate framework such as HTML Parser should do the trick.
A series of samples to get you started are available here.
Use jQuery to get the name, then a regex to replace every _[a-z] occurrence with the uppercased letter:
$('input').each(function () {
    var s = $(this).attr('name').toLowerCase();
    // Turn each "_x" into "X" so the result is camelCase.
    s = s.replace(/_([a-z])/g, function (match, letter) {
        return letter.toUpperCase();
    });
    $(this).attr('name', s);
});
In most cases using regex with HTML is bad practice, but if you must use it, here is one solution.
First, find the text in the name="XXX" attribute using the regex (?<=name=")[a-zA-Z_]+(?="). Then replace each "_" with "" and lowercase the rest of the letters. Finally, replace the old value with the new one using the same regex.
This should do the trick:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "<input type=\"text\" name=\"HCFA_DETAIL_SUPPLEMENTAL\" value=\"\" size=\"64\"/>";
String reg = "(?<=name=\")[a-zA-Z_]+(?=\")";
Pattern pattern = Pattern.compile(reg);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
    String newName = matcher.group(0);
    // Lowercase the name and drop the underscores.
    newName = newName.toLowerCase().replaceAll("_", "");
    html = html.replaceFirst(reg, newName);
}
System.out.println(html);
// out -> <input type="text" name="hcfadetailsupplemental" value="" size="64"/>
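The snippet above lowercases the entire name and handles only the first match, while the question asks for hcfaDetailSupplemental. One possible way to get camelCase for every name attribute is sketched below. It reuses the same lookaround regex; the class and method names are made up for illustration, and it assumes every name="..." value in the input should be converted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseNames {
    private static final Pattern NAME_ATTR =
        Pattern.compile("(?<=name=\")[A-Za-z_]+(?=\")");

    // Hypothetical helper: HCFA_DETAIL_SUPPLEMENTAL -> hcfaDetailSupplemental.
    static String toCamelCase(String upperSnake) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : upperSnake.toCharArray()) {
            if (c == '_') {
                upperNext = true;               // drop '_' and capitalize what follows
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    // Rewrites the value of every name="..." attribute in the given HTML text.
    static String renameAll(String html) {
        Matcher m = NAME_ATTR.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(toCamelCase(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<input type=\"text\" name=\"HCFA_DETAIL_SUPPLEMENTAL\" value=\"\" size=\"64\" />";
        System.out.println(renameAll(html));
        // prints: <input type="text" name="hcfaDetailSupplemental" value="" size="64" />
    }
}
```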
