Java Regex for Finding a Pattern and Getting Value in It? - java

I am working on a plugin that parses HTML files. I use a naming convention like this:
<!--$include="a.html" -->
or, equivalently,
<!--$include="a.html"-->
I want to search an HTML file for this pattern (it is similar to server-side includes).
The question is:
How do I find that pattern and get the value in it (a.html in my example; it varies)?
It should work like this:
while (notFinishedWholeFile) {
    fileName = findPatternFunc(htmlFile);
    replaceFunc(fileName, something);
}
PS: I don't know whether using a regex in Java or implementing it differently (for example with .indexOf()) is better. If regex performs well in this situation, I want to use it.
Any ideas?

You mean like this?
<!--\$include=\"(?<htmlName>[a-z_-]*)\.html\"\s?-->

Read a file into a string then
str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);
will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)
If you only want to replace filenames with the html extension, swap the [^\"]+ for [^.\"]+\\.html.
Using regex for this task is fine performance-wise, but see e.g.
How to use regular expressions to parse HTML in Java? and Java Regex performance etc.
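To get the file name out rather than only replace it, a capturing group with Pattern and Matcher is one option. The following is a minimal sketch, not a definitive implementation: the class and method names (IncludeFinder, findIncludes, replaceIncludes) are made up for illustration, and it assumes the whole file has already been read into a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeFinder {
    // Matches <!--$include="name"--> with optional whitespace before the closing -->.
    // Group 1 captures whatever sits between the double quotes.
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"([^\"]+)\"\\s*-->");

    /** Returns every included file name, in order of appearance. */
    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Replaces each whole include comment with the given literal text. */
    static String replaceIncludes(String html, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement from being read as group references.
        return INCLUDE.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!--$include=\"a.html\" --><p>b</p><!--$include=\"b.htm\"-->";
        System.out.println(findIncludes(html));
        System.out.println(replaceIncludes(html, "[included]"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every file, which is usually the main performance concern with replaceAll in a loop.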

I have used that pattern:
"<!--\\$include=\"(.+)(.)(html|htm)\"-->"

Related

How to search for a string and append to each occurrence in Java

I am working on a project and am trying to work out which String method is most appropriate, or how to approach this. I want to prepend a string to each occurrence of a specific string. For example, I am extracting HTML, and for each /img/image1.png I find, I want to put a URL in front of it.
However, some images already have one, for example www.anylink.com/img/image2.png, and those do not need changing, although they appear in the string I pulled. I looked at the replaceAll() method, but I am not sure whether it allows prepending in the replacement, or whether I need a regex to match only the instances that start with /img/ (no URL) rather than the full URL, since I only want to change locally hosted images. I am looking for suggestions, as I am not sure how to begin this code after my research.
Thank you.
I think that the method replaceAll() in String is enough for what you need.
You just need to write the correct regular expression.
If you write some examples, I can suggest the regex.
For example something like:
System.out.println("<div><img src=\"/test/this.png\" /></div>".replaceAll("src=\"(/[^\"]*)\"", "src=\"www.google.com$1\""));
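For the case described in the question, where only paths that start with /img/ should get the prefix, the pattern can be anchored on that path. This is a sketch under assumptions: the class name ImagePrefixer, the method name and the base URL are hypothetical, and it only handles src attributes written with double quotes.

```java
import java.util.regex.Matcher;

class ImagePrefixer {
    /** Prepends baseUrl to every src value that starts with /img/; other src values are left alone. */
    static String prefixLocalImages(String html, String baseUrl) {
        // $1 re-inserts the captured path; quoteReplacement protects any '$' or '\' in baseUrl.
        String replacement = "src=\"" + Matcher.quoteReplacement(baseUrl) + "$1\"";
        return html.replaceAll("src=\"(/img/[^\"]*)\"", replacement);
    }

    public static void main(String[] args) {
        String html = "<img src=\"/img/image1.png\"><img src=\"www.anylink.com/img/image2.png\">";
        System.out.println(prefixLocalImages(html, "http://www.example.com"));
    }
}
```

Because the pattern requires /img/ directly after the opening quote, a value such as www.anylink.com/img/image2.png does not match and stays unchanged.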

regex to replace the value of a key in a json file

I want to make a regex so I can do a "Search/Replace" over a json file with many object. Every object has a key named "resource" containing a URL.
Take a look at these examples:
"resource":"/designs/123/image.jpg"
"resource":"/designs/221/elephant.gif"
"resource":"/designs/icon.png"
I want to make a regex to replace the whole url with a string like this: localhost:8080/filepath.
This way, the result would be:
"resource":"localhost:8080/designs/123/image.jpg"
"resource":"localhost:8080/designs/221/elephant.gif"
"resource":"localhost:8080/designs/icon.png"
I'm just starting with regular expressions and I'm completely lost. I was thinking that one valid idea would be to write something starting with this pattern "resource":"
How could I write the regular expression?
The easiest method is probably just to replace "resource":"/ with "resource":"localhost:8080/. You don't even need a regex for this (but if you do you just have to escape some stuff).
With vim this would be
:%s/"resource":"\(.*\)"/"resource":"localhost:8080\1"
This should be easy to translate to Java.
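A Java version of the plain replacement might look like the sketch below. The class and method names are hypothetical, and it assumes the JSON has no whitespace around the colon, as in the examples in the question.

```java
class ResourcePrefixer {
    /** Inserts host in front of every resource value that starts with a slash. */
    static String prefixResources(String json, String host) {
        // String.replace works on literal text, so nothing here needs regex escaping.
        return json.replace("\"resource\":\"/", "\"resource\":\"" + host + "/");
    }

    public static void main(String[] args) {
        String json = "{\"resource\":\"/designs/123/image.jpg\"}";
        System.out.println(prefixResources(json, "localhost:8080"));
    }
}
```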

Escape quote when using StringBuilder

I'm using StringBuilder in Java to build an HTML page.
I want to know how to include quotes (") without writing a "\" every time.
For example, every time I append a string, I have to write:
StringBuilder a = new StringBuilder();
a.append("<div id = \"Name\" ...>");
I would like to write directly:
a.append("<div id = "Name" ..>");
Thanks.
Short answer: There is no way around this in Java
Long answer: Java does not have multiple ways to enclose Strings. You always do it with double quotes, so if you want to have double quotes in your String you have to escape them.
But if they really annoy you you can apply some trickery:
put your Strings in a text file and read them from there.
use a different character instead of the quote character and use replace to put in the proper quotes. Of course your replacement character must not appear anywhere else in the string.
Write the code in question in a different programming language like Groovy, which has different ways to delimit Strings.
Since you seem to generate HTML: use a proper templating engine, which really is option 1 on steroids.
When building a HTML template, the easiest solution is to use a text file.
You can do this as
a simple text file where you replace() tags with code you want to alter
use a properties file for the sections of text to inline.
use a library which has a fluent API for generating HTML
use velocity to perform the substitution for you.
use one of the other many web page formats like JSP.
However, there is no way to avoid escaping " in Java code. The only alternative is you use another character like ” (Alt-Graphic-B) which you replace at the end.
You can't, which is only one of the reasons it's a bad idea to fill a StringBuilder with HTML code by hand.
This exists in languages other than Java, but it is not possible in Java.
In CoffeeScript, for example, you can write:
html = """
<div id="Name" > ... </div>
"""
There's no proper way to do it, but you might be able to put a rarely used substitute character (a tilde or something) in your String and then call .replace() on it.
Ideally, you should be loading the data from a file if you want the raw string.
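The substitute-character idea from the answers above can be sketched as follows. The helper name q and the choice of the single quote as the stand-in are assumptions made for illustration; the trick only works if the stand-in character never appears in the real text.

```java
class QuoteTrick {
    /** Swaps every single quote for a double quote. Assumes the text contains no literal single quotes. */
    static String q(String s) {
        return s.replace('\'', '"');
    }

    public static void main(String[] args) {
        StringBuilder a = new StringBuilder();
        // Written with single quotes in the source, stored with double quotes.
        a.append(q("<div id='Name'>hello</div>"));
        System.out.println(a);
    }
}
```

On Java 15 and later, text blocks delimited by three double quotes allow unescaped double quotes inside the literal, which removes the need for this workaround on those versions.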

Remove HTML tags using StringTokenizer

Here is my string:
String str = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font>\n"
        + "<font\n"
        + "size=\"4\"><strong>";
I want to remove all HTML tags from a string using StringTokenizer, but I don't understand how to use StringTokenizer for this. When I use str.replaceAll("\\<.*?>",""), it does not remove all tags, because some tags continue on the next line, as in the string above. I want it to work for everything between < and >. How can I do it? (I want to achieve it using StringTokenizer.) Thanks.
As a general rule, you shouldn't parse HTML with anything except an HTML parsing library. Writing your own parser creates a security risk and exposes your applications to possible attack vectors like Cross Site Scripting and various other bugs. Again: don't parse HTML with regex or a simple tokenizer. An exception to this rule may be if you have a small set of known HTML data inputs and you will use your code on that data only. In this scenario, you can and should verify that your code is doing the correct thing for each input.
That said, your original regex is very close. The dot wildcard matches everything except newlines, and so if we add to your regex the possibility of newlines in addition to the dot wildcard, we get positive results on your test string.
String result = str.replaceAll("<(.|\r|\n|\f)*?>","");
DO NOT USE THIS CODE ON UNKNOWN INPUT! DO NOT USE IT IN PRODUCTION! IT IS NOT A SAFE OR CORRECT APPROACH TO PARSING HTML.
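An equivalent way to let the match cross line breaks is a negated character class, which avoids the alternation inside the repeated group. The sketch below uses a hypothetical class name and carries the same caveat as above: it is for small, known inputs only and is not an HTML parser.

```java
class TagStripper {
    /** Removes everything between < and >, including tags that span several lines. */
    static String stripTags(String html) {
        // [^>] matches any character except '>', line terminators included.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String str = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font>\n"
                + "<font\n"
                + "size=\"4\"><strong>";
        System.out.println(stripTags(str));
    }
}
```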
Trying to process HTML with regexes or StringTokenizer alone is... painful.
This answer is compulsory reading before you go any further.
If your HTML files are simple, you might get away with removing the newlines, then applying a regex, then reformatting the HTML - or try multiline regexes.
But you should really look at using a proper HTML parser. See this question (and probably many others...)
It is better to use an HTML parser library instead of StringTokenizer. Please have a look at the below demonstration:
Download jsoup-1.6.1.jar core library from http://jsoup.org/download.
Add this library to your classpath.
Play with your HTML as you like. Example below is the code for converting HTML content to text format:
import org.jsoup.Jsoup;

public class HtmlParser {
    // Parses the markup and returns only the text content, with all tags dropped.
    public static String removeAllHtml(String htmlContent) {
        return Jsoup.parse(htmlContent).text();
    }

    public static void main(String[] args) {
        String htmlContent = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font><font size=\"4\"><strong>";
        System.out.println(removeAllHtml(htmlContent));
    }
}

Need a little help on this regular expression

I have a Java string which looks like this, it is actually an XML tag:
"article-idref="527710" group="no" height="267" href="pc011018.pct" id="pc011018" idref="169419" print-rights="yes" product="wborc" rights="licensed" type="photo" width="322" "
Now I want to remove the article-idref="52770" segment by using regular expression, I came up with the following one:
trimedString.replaceAll("\\article-idref=.*?\"","");
but it doesn't seem to work. Could anybody tell me where I went wrong in my regular expression? I need this to be represented as a String in my Java class, so an HTML parser probably won't help me much here.
Thanks in advance!
Try this:
trimedString.replaceAll("article-idref=\"[^\"]*\" *","");
I corrected the regular expression by adding quotes and a word boundary (to prevent false matches). Also, in case you didn't, remember to reassign to your string after the replacement:
trimmedString = trimmedString.replaceAll("\\barticle-idref=\".*?\"", "");
See it working at ideone.
Also since this is from an XML document it might be better to use an XML parser to extract the correct attributes instead of a regular expression. This is because XML is quite a complex data format to parse correctly. The example in your question is simple enough. However a regular expression could break on a more complex case, such as a document that includes XML comments. This could be an issue if you are reading data from an untrusted source.
If you are sure that article-idref is always at the beginning, try this:
// removes everything from the beginning up to and including the first run of whitespace
trimedString = trimedString.replaceFirst("^\\S+\\s+","");
Be sure to assign the result to trimedString again, since replaceFirst does not modify the string itself but returns a new string.
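Putting the suggestions above together, a small helper that removes one attribute by name could look like this. It is a sketch with hypothetical names (AttributeRemover, removeAttribute), and it assumes attribute values are double-quoted and contain no embedded double quotes. A lookbehind is used instead of \b here because \b also matches after a hyphen, so removing idref would otherwise cut into article-idref as well.

```java
import java.util.regex.Pattern;

class AttributeRemover {
    /** Removes name="value" plus any whitespace that follows it. */
    static String removeAttribute(String tagText, String name) {
        // (?<![\w-]) rejects matches preceded by a word character or a hyphen.
        String regex = "(?<![\\w-])" + Pattern.quote(name) + "=\"[^\"]*\"\\s*";
        return tagText.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String tag = "article-idref=\"527710\" group=\"no\" idref=\"169419\" type=\"photo\"";
        System.out.println(removeAttribute(tag, "article-idref"));
        System.out.println(removeAttribute(tag, "idref"));
    }
}
```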
