Modify large string - java

I have a large string in the following format -
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
I'd like to store all occurrences of the value that occurs before .html. So the HTML above becomes something like 12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
Do I need a regular expression, or some kind of replace method?
Thanks

You don't actually need a regular expression, but you could use the underlying Matcher class:
final String searchString = "12345.html";
final String txt =
"<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">\n"
+ "<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">";
final Matcher matcher = Pattern.compile(searchString, Pattern.LITERAL).matcher(txt);
final StringBuilder sb = new StringBuilder();
while(matcher.find()){
if(sb.length() > 0) sb.append(',');
sb.append(matcher.group());
}
System.out.println(sb.toString());
Output:
12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
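If the file names differ from link to link, a literal search no longer covers it. The following is only a minimal sketch, assuming every target is written as a double-quoted href ending in .html; the class and method names (HrefCollector, collectHtmlTargets) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefCollector {
    // Assumption: targets are double-quoted and end in ".html".
    private static final Pattern HTML_HREF = Pattern.compile("href=\"([^\"]+\\.html)\"");

    // Returns every captured href value, in the order it appears in the text.
    static List<String> collectHtmlTargets(String text) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = HTML_HREF.matcher(text);
        while (matcher.find()) {
            targets.add(matcher.group(1)); // group 1 is the part inside the quotes
        }
        return targets;
    }

    public static void main(String[] args) {
        String txt = "<a href=\"12345.html\"><a href=\"678.html\">\n<a href=\"90.html\">";
        System.out.println(String.join(",", collectHtmlTargets(txt)));
    }
}
```

String.join builds the comma-separated result, so the manual "append a comma unless first" check is not needed here.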

You can use an HTML parser like Jsoup:
Document doc = Jsoup.parse(yourString);
Elements els = doc.select("a");
for (Element el : els) {
    // Only needed if you want the number without the ".html" suffix;
    // otherwise use el.attr("href") directly.
    String href = el.attr("href");
    if (href.endsWith(".html")) {
        // Cut the suffix off directly; split(".html") would treat '.' as a regex wildcard.
        System.out.println(href.substring(0, href.length() - ".html".length()));
    }
}
Don't use regex to parse HTML.

If you are accessing this string inside the Java code, you can split the string on the "=' delimiter. This will result in a bunch of strings. One string will look like "
So the steps are:
1. Split the string, which will result in a string array.
2. Iterate over the resulting array and look for the pattern ">

Related

Jsoup parser remove words with '<' and '>'

I'm using Jsoup.parse() to remove HTML tags from a String. But my string also contains a word like <name>.
The problem is that Jsoup.parse() removes that too, because the text has < and >. I can't just remove < and > from the text either. How can I do this?
String s1 = Jsoup.parse("<p>Hello World</p>").text();
//s1 is "Hello World". Correct
String s2 = Jsoup.parse("<name>").text();
//s2 is "". But it should be <name> because <name> is not a html tag
I'm using the Jsoup.parse() to remove html tags from a String.
You want to use the Jsoup#clean method. You'll also need a little manual work afterwards, because Jsoup will still see <name> as an HTML tag.
// Define the list of words to preserve...
String[] myExceptions = new String[] { "name" };
int nbExceptions = myExceptions.length;
// Build a whitelist for Jsoup (newer jsoup releases name this class Safelist)...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);
// Let Jsoup remove any html tags...
String s2 = Jsoup.clean("<name>", myWhiteList);
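// Complete the initial html tags removal: collapse each preserved element,
// including an empty one such as <name></name>, back to its opening tag...
for (int i = 0; i < nbExceptions; i++) {
    s2 = s2.replaceAll("<" + myExceptions[i] + ">.*?</" + myExceptions[i] + ">", "<" + myExceptions[i] + ">");
}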
System.out.println(">>" + s2);
OUTPUT
>><name>
References
How to remove HTML tags from a string with Jsoup?
Whitelist javadoc
clean method javadoc

How to remove tag from a string

I have a string, e.g.:
String test = "<p> My company is best in world. I love my company </p>";
I have to remove both the tags <p> and </p>.
I tried using
String replacingPtag = test.replaceAll("<p>", "");
String r1 = replacingPtag.replaceAll("</p>", "");
This code removed the <p> tag but not </p>.
How can I remove both forms of the tag?
Try this regex:
String res = test.replaceAll("</?p>", "");
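In that pattern the ? makes the preceding / optional, so a single call removes both <p> and </p>. Here is a small sketch to check this. The second method, which also tolerates attributes and upper case, is a hypothetical extension and not part of the answer above; the class and method names are invented for the example:

```java
class StripParagraphTags {
    // Removes exactly "<p>" and "</p>"; the '?' makes the slash optional.
    static String stripPlain(String html) {
        return html.replaceAll("</?p>", "");
    }

    // Hypothetical extension: also accepts attributes and upper case, e.g. <P class="x">.
    // The optional group must start with whitespace, so <pre> is left alone.
    static String stripLenient(String html) {
        return html.replaceAll("(?i)</?p(\\s[^>]*)?>", "");
    }

    public static void main(String[] args) {
        String test = "<p> My company is best in world. I love my company </p>";
        System.out.println("[" + stripPlain(test) + "]");
        System.out.println("[" + stripLenient("<P class=\"x\">Hi</P>") + "]");
    }
}
```

For anything beyond a fixed, simple tag like this, an HTML parser is usually the safer choice.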

GWT RegExp - multiple matches

I want to find all the "code" matches in my input string (With GWT RegExp). When I call the "regExp.exec(inputStr)" method it only returns the first match, even when I call it multiple times:
String inputStr = "ff <code>myCode</code> ff <code>myCode2</code> dd <code>myCode3</code>";
String patternStr = "<code[^>]*>(.+?)</code\\s*>";
// Compile and use regular expression
RegExp regExp = RegExp.compile(patternStr);
MatchResult matcher = regExp.exec(inputStr);
boolean matchFound = (matcher != null); // equivalent to regExp.test(inputStr);
if (matchFound) {
// Get all groups for this match
for (int i=0; i<matcher.getGroupCount(); i++) {
String groupStr = matcher.getGroup(i);
System.out.println(groupStr);
}
}
How can I get all the matches?
Edit: As greedybuddha noted, a regex is not really suited to parsing (X)HTML. I gave jsoup a try and it is much more convenient than a regex. My code with jsoup now looks like this; I rename all code tags and give them a CSS class:
String input = "ff<code>myCode</code>ff<code>myCode2</code>";
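// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted here.
Document doc = Jsoup.parse(input);
Elements links = doc.select("code"); // all <code> elements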
for(Element link : links){
System.out.println(link.html());
link.tagName("pre");
link.addClass("prettify");
}
System.out.println(doc);
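Compile the regular expression with the "g" flag, for global matching:
RegExp regExp = RegExp.compile(patternStr, "g");
With that flag, each call to exec() resumes after the previous match and returns null once nothing is left, so call it in a loop until you get null. I think you will also want "m" for multiline matching, i.e. "gm".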
That being said, for HTML/XML parsing you should consider using JSoup or another alternative.

Replace String in Java with regex and replaceAll

Is there a simple solution to parse a String by using regex in Java?
I have to adapt an HTML page, so I have to parse several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The strings differ only in the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)
You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");
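To see the group reference at work on more than one link, here is a short self-contained check. The sample text and the class name BrowseLinkRewriter are invented for the example, and the sketch uses + instead of * so that an empty ID does not match:

```java
class BrowseLinkRewriter {
    // $1 in the replacement stands for the text matched by the first parenthesised group.
    static String rewrite(String html) {
        return html.replaceAll("href=\"/browse/(PJBUGS-[0-9]+)\"", "href=\"$1.html\"");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/browse/PJBUGS-911\">a</a> <a href=\"/browse/PJBUGS-12\">b</a>";
        System.out.println(rewrite(html));
    }
}
```

Because replaceAll visits every match, all links in the page are rewritten in one pass and each keeps its own ID.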
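This does not use regexp, but maybe it still solves your problem. It assumes input holds exactly one attribute, such as href="/browse/PJBUGS-911":
// Take the text between the last '/' and the closing quote.
output = "href=\"" + input.substring(input.lastIndexOf('/') + 1, input.lastIndexOf('"')) + ".html\"";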
This is how I would do it:
public static void main(String[] args)
{
String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
"blahblah href=\"/browse/PJBUGS-34234\"";
Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
Matcher mtchr = ptrn.matcher(text);
while(mtchr.find())
{
String match = mtchr.group(0);
String insMatch = mtchr.group(1);
// Build the replacement directly; using the match itself as a regex is unnecessary.
String repl = "href=\"" + insMatch + ".html\"";
System.out.println("orig = <" + match + "> repl = <" + repl + ">");
}
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If you are just interested in replacing everything, you don't need the loop; I used it only for debugging and to show how the regex works.

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java you always have to spell it out.
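Since Java 9 the loop can be shortened somewhat with Matcher.results(), which streams one MatchResult per match. This is a sketch under that version assumption; the class and method names (SrcScanner, allSrcValues) are made up for the example:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SrcScanner {
    // Same pattern as above: group 1 is the double-quoted src value.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Requires Java 9 or later, where Matcher.results() is available.
    static List<String> allSrcValues(String html) {
        return SRC.matcher(html)
                .results()
                .map(result -> result.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<img src=\"/adminpanel/userfiles/image/1.jpg\" />\n"
                + "<img src=\"/adminpanel/userfiles/image/2.jpg\" />";
        allSrcValues(data).forEach(value -> System.out.println("found: " + value));
    }
}
```

On older Java versions the explicit while (m.find()) loop remains the way to go.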
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part : parts) {
    if (part.trim().endsWith("src=")) {
        // The next part is the quoted value of this src attribute.
        nextIsSrcAttrib = true;
    } else if (nextIsSrcAttrib) {
        srcAttributes.add(part);
        nextIsSrcAttrib = false;
    }
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
Sorry for not coding it (short on time).
How about:
1. (Assuming that the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume that the resulting array is called strArr).
3. Loop over the resulting array of Strings and store strArr[i].substring(0,strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
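The steps above could be sketched roughly as follows. Two details are assumptions of this sketch and not part of the description: it skips the first array element (the text before the first src), and it cuts each value at the next double quote rather than at the exact sequence `" />`, so a src that is not the last attribute still works. The class name SrcSplitter is invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

class SrcSplitter {
    static List<String> sources(String html) {
        List<String> result = new ArrayList<>();
        String[] strArr = html.split("src=\"");
        // strArr[0] is whatever precedes the first src attribute, so start at 1.
        for (int i = 1; i < strArr.length; i++) {
            int end = strArr[i].indexOf('"'); // closing quote of the value
            if (end >= 0) {
                result.add(strArr[i].substring(0, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Test\n"
                + "<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n"
                + "Test\n"
                + "<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n"
                + "</p>";
        System.out.println(sources(html));
    }
}
```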
Since you've requested a regex implementation:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
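// [^\"]* stops at the closing quote; a greedy .* could run on to a later quote on the same line.
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");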
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double-quoted.
